Course Outline
Introduction to AIOps
- What is AIOps and why it matters
- Traditional monitoring vs. AIOps-driven observability
- AIOps architecture and key components
Collecting and Normalizing Operational Data
- Types of observability data: metrics, logs, and traces
- Ingesting data from multiple sources (servers, containers, cloud)
- Using agents and exporters (Prometheus, Beats, Fluentd)
Data Correlation and Anomaly Detection
- Time series correlation and statistical methods
- Using ML models for anomaly detection
- Detecting incidents across distributed systems
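As a small illustration of the statistical methods listed above, the following sketch flags points in a metric series that deviate sharply from a trailing window. It is one simple rolling z-score approach, not the method used in the course; the function name, the window size, the threshold, and the sample CPU values are all assumptions made for this example.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)  # sample standard deviation
        if sigma == 0:
            # Perfectly flat history: a z-score is undefined, so skip.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilisation samples (percent), one per scrape interval.
cpu_percent = [41.0, 43.0, 42.0, 44.0, 42.0, 43.0, 95.0, 42.0, 43.0]
print(rolling_zscore_anomalies(cpu_percent))
```

Note that a spike inflates the standard deviation of the windows that follow it, so the points right after an anomaly are judged leniently. Production systems often use robust statistics (median, MAD) or trained models for that reason.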
Alerting and Noise Reduction
- Designing intelligent alert rules and thresholds
- Suppression, deduplication, and alert grouping
- Integrating with Alertmanager, Slack, PagerDuty, or Opsgenie
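The deduplication and grouping ideas above can be sketched in a few lines. This is a simplified stand-in for what tools such as Alertmanager do, not their actual implementation; the alert fields (`service`, `alertname`, `host`, `timestamp`), the function name, and the 300-second window are assumptions chosen for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("service", "alertname"), dedup_window=300):
    """Drop repeats of the same alert seen within `dedup_window` seconds,
    then bucket the survivors by the labels in `group_by`."""
    groups = defaultdict(list)
    last_seen = {}
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        # An alert's identity is its full label set (hypothetical schema).
        fingerprint = (alert["service"], alert["alertname"], alert["host"])
        previous = last_seen.get(fingerprint)
        # Record every sighting, so a steadily repeating alert stays suppressed.
        last_seen[fingerprint] = alert["timestamp"]
        if previous is not None and alert["timestamp"] - previous < dedup_window:
            continue  # duplicate inside the window
        key = tuple(alert[label] for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alert stream; timestamps are seconds since an arbitrary start.
alerts = [
    {"service": "checkout", "alertname": "HighLatency", "host": "web-1", "timestamp": 0},
    {"service": "checkout", "alertname": "HighLatency", "host": "web-1", "timestamp": 60},
    {"service": "checkout", "alertname": "HighLatency", "host": "web-2", "timestamp": 90},
    {"service": "payments", "alertname": "DiskFull", "host": "db-1", "timestamp": 120},
]
grouped = group_alerts(alerts)
for key, members in grouped.items():
    print(key, [a["host"] for a in members])
```

Here four raw alerts collapse into two notifications: one for the checkout latency problem covering both hosts, and one for the payments disk alert.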
Root Cause Analysis and Visualization
- Using dashboards to visualize metrics and detect trends
- Exploring events and timelines for RCA
- Tracing issues across layers with distributed tracing tools
Automation and Remediation
- Triggering automated scripts or workflows from incidents
- Integrating with ITSM systems (ServiceNow, Jira)
- Use cases: self-healing, scaling, traffic rerouting
Open Source and Commercial AIOps Platforms
- Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace
- Evaluation criteria for selecting an AIOps platform
- Demo and hands-on with a selected stack
Summary and Next Steps
Requirements
- An understanding of IT operations and system monitoring concepts
- Experience with monitoring tools or dashboards
- Familiarity with basic log and metric formats
Audience
- Operations teams responsible for infrastructure and applications
- Site Reliability Engineers (SREs)
- IT monitoring and observability teams
14 hours