Observability in the AI-Native Era : Leveraging AIOps to Build, Observe, and Operate Resilient Systems.
- Format:
- Book
- Author/Creator:
- Lipsig, Hilliary.
- Language:
- English
- Subjects (All):
- Artificial intelligence.
- Distributed parameter systems.
- Physical Description:
- 1 online resource (420 pages)
- Edition:
- 1st ed.
- Place of Publication:
- Birmingham : Packt Publishing, Limited, 2026.
- Summary:
- Discover how AIOps is transforming the observability landscape for cloud-native and traditional systems. Learn how to build, monitor, and operate resilient services using AI-driven dynamic insights for smarter and more scalable operations. Key Features: Practical Integration of AI and Observability in Modern Engineering Workflows; Real-World Use.
- Contents:
- Intro
- Observability in the AI-Native Era
- Leveraging AIOps to build, observe, and operate resilient systems
- Foreword
- Contributors
- About the authors
- About the reviewers
- Table of Contents
- Preface
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Free benefits with your book
- How to Unlock
- Stay Sharp in Cloud and DevOps - Join 44,000+ Subscribers of CloudPro
- Share your thoughts
- Part 1
- From Monitoring via Observability to AIOps
- 1
- Observability: The Art of Turning Data into Insights
- What is observability?
- What is observability and how does it differ from monitoring?
- Let's ask ChatGPT!
- The early days: monitoring static systems
- The dawn of more complex and dynamic systems
- Cloud-native monitoring doesn't scale the way we need it to
- AIOps 2.0: observability ready for the cloud-native AI era
- Three pillars and beyond: use cases for logs, metrics, and traces
- What are metrics?
- Challenge: being careful with high cardinality and privacy in dimensional data
- What are logs?
- Challenge: tackling high-volume, excessive, and unstructured logs
- What are traces?
- Challenge: over-instrumentation, duplicated data, and sampling as challenges
- Beyond the three pillars: use cases based on events, profiling, and real users
- Use case: track your software development life cycle through events
- Use case: provide real-time business insights by extracting business events
- More use cases on existing observability signals
- Use cases: more to come as observability is evolving
- Emerging standards over the years
- OpenTelemetry
- Prometheus
- Visualization standards: Grafana and Perses
- Observability and distributed systems
- Highly distributed systems and the increased complexity
- Understanding distributed systems through observability
- Inventory: which components are part of the system we are responsible for?
- Dependencies: how are components connected and dependent on each other?
- Interfaces/APIs: what are the boundaries of our distributed system to the consumers?
- Health: are all components working as expected or is there abnormal behavior?
- SLA and root cause: are end-to-end critical transactions experiencing issues, and why?
- Shared infrastructure and how it impacts components
- Use case: identifying a noisy neighbor
- Use case: right-sizing infrastructure based on real needs
- Synchronous and asynchronous communication
- Network: metrics or eBPF
- Connections: pools on each side of the call
- Queues: the distributors of messages
- VMs, containers, and databases - oh my!
- Observing hypervisors and VMs
- Observing web and application servers
- Observing databases
- Observing containers
- Observing serverless
- Full stack: from networking to cloud to application observability
- What is full stack observability?
- The observability goal is 100% coverage: start with production infrastructure, then expand up and left
- Defining the focus of this book
- You will be able to prove the value of observability and AI!
- You will use exponential data growth as an opportunity!
- You will expand observability to the left!
- You will provide observability as a self-service!
- You will unleash the power of AIOps through AI-driven automation!
- You will see the journey of Financial One ACME
- Summary
- Further reading
- Get this book's PDF version and more
- 2
- The Elephant in the Room: Artificial Intelligence
- Technical requirements
- Why the hype around AI? What is AI good for?
- AI versus automation
- What is AI's unique value proposition?
- What is AI good for (right now)
- A value-adding abstraction layer
- Model Context Protocol
- MCP server components and features
- RAG versus CAG and how they relate to LLMs
- RAG versus CAG
- Choosing a language model
- What can and will go wrong (and what you can do about it)
- Incorrect user expectations
- Hallucination and errors
- Data poisoning
- Catastrophic forgetting
- Infinite loops
- Prompt engineering helps
- Why do AI projects fail, and how can you succeed?
- Join us on Discord
- 3
- From Observability to AIOps and the Use Cases it Solves Today
- When data on glass and static alerts fail
- Alternatives to static alerts on infrastructure metrics
- Static thresholds: where they make sense
- Baselining: where static thresholds are impractical
- Beyond CPU and memory: cloud-native golden signals
- Resource layer
- Orchestration layer
- Workload layer
- Platform service layer
- Service layer
- Application layer
- Observability layer
- Choosing the right metrics through proper load testing
- Step 1: setting up a test environment
- Step 2: defining realistic scenarios
- Step 3: expand-left observability
- Step 4: running tests
- Step 5: identifying critical indicators
- Use case: baseline alerting on Kubernetes health
- Observability-driven development
- Step 1: defining internal and external system health indicators
- Step 2: defining how to measure health indicators
- Step 3: providing easy access to this data for engineering
- Where was the data captured, and who needs it?
- Where and when does this data need to be made available, and to whom?
- Self-service use case 1: providing the top three database queries for team standup
- Step 4: refining, enriching, and automating toward production
- Self-service use case 2: right-sizing container recommendations as a Git pull request
- Context is king: the quality of observability data
- From pets to cattle: semantics for observability
- Enriching observability data across the stack
- Enriching with tags from your infrastructure
- Use case: logs only accessible for sales
- Use case: traces from development only kept for two sprints
- Enriching with tags, labels, and annotations from your deployment
- Use case: is version 4.0.3 good enough to keep, or do we need to roll it back?
- Use case: is the quality of customer-service-portal in preproduction good enough for production?
- Enriching observability data across the SDLC
- Tracking the SDLC
- Use case: automated real-time artifact inventory and software catalog
- Use case: tracking DORA and other DevEx efficiency metrics
- Use case: automated correlation of deployment changes with problems
- Observing your DevOps tools
- Quality of data: where to enrich it and what to sample
- How and where to enrich observability data with context
- Do we need all the data? Sampling strategies
- AIOps: reducing the noise with anomaly and root cause detection
- Step 1: detecting abnormal events
- Scope
- Dependencies
- Shared resources
- External events
- Step 2: connecting the dots to find the root cause
- Horizontal stack, or call chain
- Vertical stack, or runs on another component
- Network connectivity
- Cross-application or cross-stack
- Step 3: explaining all the evidence
- From ops to business: SLO-based impact analysis
- The right questions to ask!
- Moving from technical to business objectives
- Why we still drown in incidents
- Expanded scope
- Tool consolidation
- Gaps in end-to-end observability
- Missing ownership
- Lack of criticality and business impact
- A primer on SLOs: learnings from Google's SRE handbook
- Service-level indicator
- Service-level objective
- Error budget and burndown rate
- Service-level agreement
- From SLOs to business objectives
- Asking business impact questions
- Connecting business objectives with technical objectives
- Business impact analysis as part of incident response
- Incident analysis without SLO context
- Connecting the SLO with the incident
- How to start this journey
- How would Financial One ACME make this transition?
- 4
- ACME Financial Services: Implementing AIOps
- Our fictitious company
- After the great cloud migration
- ACME Financial Services' current state of observability
- How their observability practices became unmanageable
- An explosion of operating costs
- What about regulations?
- Controlled deployments
- The old deployment process
- From continuous delivery to continuous incident
- The strain from adding features
- The technology stack
- Deciding what to improve from a sea of options
- Tackling the issues
- Issues determined to be medium effort
- An eye toward aiding fraud prevention
- Tracking feature usage
- Plans to address alert fatigue
- New tools, new possibilities
- The tool selection game
- Build versus buy
- Time to investigate
- The traceability group
- The feature usage group
- The fraud group
- The alerts group
- The SLO mess
- Static alerts
- Siloed data
- Alerts from logs
- Bringing it all together
- Plan of action
- Having the "what" and determining the "how"
- It's finally done. What did we gain?
- Part 2
- Expanding Left: Moving AIOps into Platform Engineering
- 5
- Democratizing Observability: A Primer to Self-Service Platforms
- Technical requirements
- Notes:
- Description based on publisher supplied metadata and other sources.
- Part of the metadata in this record was created by AI, based on the text of the resource.
- ISBN:
- 1-80638-958-4
- 9781806389582
- OCLC:
- 1579267460