
Observability in the AI-Native Era: Leveraging AIOps to Build, Observe, and Operate Resilient Systems

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Lipsig, Hilliary.
Language:
English
Subjects (All):
Artificial intelligence.
Distributed parameter systems.
Physical Description:
1 online resource (420 pages)
Edition:
1st ed.
Place of Publication:
Birmingham : Packt Publishing, Limited, 2026.
Summary:
Discover how AIOps is transforming the observability landscape for cloud-native and traditional systems. Learn how to build, monitor, and operate resilient services using AI-driven dynamic insights for smarter and more scalable operations. Key Features: Practical Integration of AI and Observability in Modern Engineering Workflows; Real-World Use.
Contents:
Intro
Observability in the AI-Native Era
Leveraging AIOps to build, observe, and operate resilient systems
Foreword
Contributors
About the authors
About the reviewers
Table of Contents
Preface
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Free benefits with your book
How to Unlock
Stay Sharp in Cloud and DevOps - Join 44,000+ Subscribers of CloudPro
Share your thoughts
Part 1
From Monitoring via Observability to AIOps
1
Observability: The Art of Turning Data into Insights
What is observability?
What is observability and how does it differ from monitoring?
Let's ask ChatGPT!
The early days: monitoring static systems
The dawn of more complex and dynamic systems
Cloud-native monitoring doesn't scale the way we need it to
AIOps 2.0: observability ready for the cloud-native AI era
Three pillars and beyond: use cases for logs, metrics, and traces
What are metrics?
Challenge: being careful with high cardinality and privacy in dimensional data
What are logs?
Challenge: tackling high-volume, excessive, and unstructured logs
What are traces?
Challenge: over-instrumentation, duplicated data, and sampling as challenges
Beyond the three pillars: use cases based on events, profiling, and real users
Use case: track your software development life cycle through events
Use case: provide real-time business insights by extracting business events
More use cases on existing observability signals
Use cases: more to come as observability is evolving
Emerging standards over the years
OpenTelemetry
Prometheus
Visualization standards: Grafana and Perses
Observability and distributed systems
Highly distributed systems and the increased complexity
Understanding distributed systems through observability
Inventory: which components are part of the system we are responsible for?
Dependencies: how are components connected and dependent on each other?
Interfaces/APIs: what are the boundaries of our distributed system to the consumers?
Health: are all components working as expected or is there abnormal behavior?
SLA and root cause: are end-to-end critical transactions experiencing issues, and why?
Shared infrastructure and how it impacts components
Use case: identifying a noisy neighbor
Use case: right-sizing infrastructure based on real needs
Synchronous and asynchronous communication
Network: metrics or eBPF
Connections: pools on each side of the call
Queues: the distributors of messages
VMs, containers, and databases - oh my!
Observing hypervisors and VMs
Observing web and application servers
Observing databases
Observing containers
Observing serverless
Full stack: from networking to cloud to application observability
What is full stack observability?
The observability goal is 100% coverage: start with production infrastructure, then expand up and left
Defining the focus of this book
You will be able to prove the value of observability and AI!
You will use exponential data growth as an opportunity!
You will expand observability to the left!
You will provide observability as a self-service!
You will unleash the power of AIOps through AI-driven automation!
You will see the journey of Financial One ACME
Summary
Further reading
Get this book's PDF version and more
2
The Elephant in the Room: Artificial Intelligence
Technical requirements
Why the hype around AI? What is AI good for?
AI versus automation
What is AI's unique value proposition?
What is AI good for (right now)
A value-adding abstraction layer
Model Context Protocol
MCP server components and features
RAG versus CAG and how they relate to LLMs
RAG versus CAG
Choosing a language model
What can and will go wrong (and what you can do about it)
Incorrect user expectations
Hallucination and errors
Data poisoning
Catastrophic forgetting
Infinite loops
Prompt engineering helps
Why do AI projects fail, and how can you succeed?
Join us on Discord
3
From Observability to AIOps and the Use Cases it Solves Today
When data on glass and static alerts fail
Alternatives to static alerts on infrastructure metrics
Static thresholds: where they make sense
Baselining: where static thresholds are impractical
Beyond CPU and memory: cloud-native golden signals
Resource layer
Orchestration layer
Workload layer
Platform service layer
Service layer
Application layer
Observability layer
Choosing the right metrics through proper load testing
Step 1: setting up a test environment
Step 2: defining realistic scenarios
Step 3: expand-left observability
Step 4: running tests
Step 5: identifying critical indicators
Use case: baseline alerting on Kubernetes health
Observability-driven development
Step 1: defining internal and external system health indicators
Step 2: defining how to measure health indicators
Step 3: providing easy access to this data for engineering
Where was the data captured, and who needs it?
Where and when does this data need to be made available, and to whom?
Self-service use case 1: providing the top three database queries for team standup
Step 4: refining, enriching, and automating toward production
Self-service use case 2: right-sizing container recommendations as a Git pull request
Context is king: the quality of observability data
From pets to cattle: semantics for observability
Enriching observability data across the stack
Enriching with tags from your infrastructure
Use case: logs only accessible for sales
Use case: traces from development only kept for two sprints
Enriching with tags, labels, and annotations from your deployment
Use case: is version 4.0.3 good enough to keep, or do we need to roll it back?
Use case: is the quality of customer-service-portal in preproduction good enough for production?
Enriching observability data across the SDLC
Tracking the SDLC
Use case: automated real-time artifact inventory and software catalog
Use case: tracking DORA and other DevEx efficiency metrics
Use case: automated correlation of deployment changes with problems
Observing your DevOps tools
Quality of data: where to enrich it and what to sample
How and where to enrich observability data with context
Do we need all the data? Sampling strategies
AIOps: reducing the noise with anomaly and root cause detection
Step 1: detecting abnormal events
Scope
Dependencies
Shared resources
External events
Step 2: connecting the dots to find the root cause
Horizontal stack, or call chain
Vertical stack, or runs on another component
Network connectivity
Cross-application or cross-stack
Step 3: explaining all the evidence
From ops to business: SLO-based impact analysis
The right questions to ask!
Moving from technical to business objectives
Why we still drown in incidents
Expanded scope
Tool consolidation
Gaps in end-to-end observability
Missing ownership
Lack of criticality and business impact
A primer on SLOs: learnings from Google's SRE handbook
Service-level indicator
Service-level objective
Error budget and burndown rate
Service-level agreement
From SLOs to business objectives
Asking business impact questions
Connecting business objectives with technical objectives
Business impact analysis as part of incident response
Incident analysis without SLO context
Connecting the SLO with the incident
How to start this journey
How would Financial One ACME make this transition?
4
ACME Financial Services: Implementing AIOps
Our fictitious company
After the great cloud migration
ACME Financial Services' current state of observability
How their observability practices became unmanageable
An explosion of operating costs
What about regulations?
Controlled deployments
The old deployment process
From continuous delivery to continuous incident
The strain from adding features
The technology stack
Deciding what to improve from a sea of options
Tackling the issues
Issues determined to be medium effort
An eye toward aiding fraud prevention
Tracking feature usage
Plans to address alert fatigue
New tools, new possibilities
The tool selection game
Build versus buy
Time to investigate
The traceability group
The feature usage group
The fraud group
The alerts group
The SLO mess
Static alerts
Siloed data
Alerts from logs
Bringing it all together
Plan of action
Having the "what" and determining the "how"
It's finally done. What did we gain?
Part 2
Expanding Left: Moving AIOps into Platform Engineering
5
Democratizing Observability: A Primer to Self-Service Platforms
Technical requirements
Notes:
Description based on publisher supplied metadata and other sources.
Part of the metadata in this record was created by AI, based on the text of the resource.
ISBN:
1-80638-958-4
9781806389582
OCLC:
1579267460