Big Data on Kubernetes : A Practical Guide to Building Efficient and Scalable Data Solutions / Neylson Crepalde.

Ebook Central Academic Complete Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Crepalde, Neylson, author.
Language:
English
Subjects (All):
Kubernetes.
Application software--Development.
Application software.
Application program interfaces (Computer software).
Big data.
Physical Description:
1 online resource (297 pages)
Edition:
First edition.
Place of Publication:
Birmingham, England : Packt Publishing, [2024]
Biography/History:
Crepalde, Neylson: Neylson Crepalde is a Generative AI Strategist at AWS. Prior to that, he was CTO at A3Data, a consulting company focused on data, analytics, and artificial intelligence. He holds a PhD in Economic Sociology, a master's degree in Sociology of Culture, an MBA in Cultural Management, and a bachelor's degree in Orchestra Conducting. He has been working with data since 2015. He is committed to sharing knowledge with people of every professional level and helping data teams achieve their best. He holds multiple AWS certifications and is also Spark, Neo4j, and Airflow certified. Neylson has been teaching for more than 10 years in colleges and MBA programs, and he gives regular talks and lectures on Data Architecture, AI strategy, Data Governance, and Network Science.
Summary:
Gain hands-on experience in building efficient and scalable big data architecture on Kubernetes, utilizing leading technologies such as Spark, Airflow, Kafka, and Trino.

Key Features
Leverage Kubernetes in a cloud environment to integrate seamlessly with a variety of tools
Explore best practices for optimizing the performance of big data pipelines
Build end-to-end data pipelines and discover real-world use cases using popular tools like Spark, Airflow, and Kafka
Purchase of the print or Kindle book includes a free PDF eBook

Book Description
In today's data-driven world, organizations across different sectors need scalable and efficient solutions for processing large volumes of data. Kubernetes offers an open-source and cost-effective platform for deploying and managing big data tools and workloads, ensuring optimal resource utilization and minimizing operational overhead. If you want to master the art of building and deploying big data solutions using Kubernetes, then this book is for you. Written by an experienced data specialist, Big Data on Kubernetes takes you through the entire process of developing scalable and resilient data pipelines, with a focus on practical implementation. Starting with the basics, you'll progress toward learning how to install Docker and run your first containerized applications. You'll then explore Kubernetes architecture and understand its core components. This knowledge will pave the way for exploring a variety of essential tools for big data processing such as Apache Spark and Apache Airflow. You'll also learn how to install and configure these tools on Kubernetes clusters. Throughout the book, you'll gain hands-on experience building a complete big data stack on Kubernetes. By the end of this Kubernetes book, you'll be equipped with the skills and knowledge you need to tackle real-world big data challenges with confidence.

What you will learn
Install and use Docker to run containers and build concise images
Gain a deep understanding of Kubernetes architecture and its components
Deploy and manage Kubernetes clusters on different cloud platforms
Implement and manage data pipelines using Apache Spark and Apache Airflow
Deploy and configure Apache Kafka for real-time data ingestion and processing
Build and orchestrate a complete big data pipeline using open-source tools
Deploy Generative AI applications on a Kubernetes-based architecture

Who this book is for
If you're a data engineer, BI analyst, data team leader, data architect, or tech manager with a basic understanding of big data technologies, then this big data book is for you. Familiarity with the basics of Python programming, SQL queries, and YAML is required to understand the topics discussed in this book.
Contents:
Cover
Title page
Copyright and credits
Dedication
Contributors
Table of Contents
Preface
Part 1: Docker and Kubernetes
Chapter 1: Getting Started with Containers
Technical requirements
Container architecture
Installing Docker
Windows
macOS
Linux
Getting started with Docker images
hello-world
NGINX
Julia
Building your own image
Batch processing job
API service
Summary
Chapter 2: Kubernetes Architecture
Kubernetes architecture
Control plane
Node components
Pods
Deployments
StatefulSets
Jobs
Services
ClusterIP Service
NodePort Service
LoadBalancer Service
Ingress and Ingress Controller
Gateway
Persistent Volumes
StorageClasses
ConfigMaps and Secrets
ConfigMaps
Secrets
Chapter 3: Getting Hands-On with Kubernetes
Installing kubectl
Deploying a local cluster using Kind
Installing kind
Deploying the cluster
Deploying an AWS EKS cluster
Deploying a Google Cloud GKE cluster
Deploying an Azure AKS cluster
Running your API on Kubernetes
Creating the deployment
Creating a service
Using an ingress to access the API
Running a data processing job in Kubernetes
Part 2: Big Data Stack
Chapter 4: The Modern Data Stack
Data architectures
The Lambda architecture
The Kappa architecture
Comparing Lambda and Kappa
Data lake design for big data
Data warehouses
The rise of big data and data lakes
The rise of the data lakehouse
Implementing the lakehouse architecture
Batch ingestion
Storage
Batch processing
Orchestration
Batch serving
Data visualization
Real-time ingestion
Real-time processing
Real-time serving
Real-time data visualization
Summary
Chapter 5: Big Data Processing with Apache Spark
Getting started with Spark
Installing Spark locally
Spark architecture
Spark executors
Components of execution
Starting a Spark program
The DataFrame API and the Spark SQL API
Transformations
Actions
Lazy evaluation
Data partitioning
Narrow versus wide transformations
Analyzing the Titanic dataset
Working with real data
How Spark performs joins
Joining IMDb tables
Chapter 6: Building Pipelines with Apache Airflow
Getting started with Airflow
Installing Airflow with Astro
Airflow architecture
Airflow's distributed architecture
Building a data pipeline
Airflow integration with other tools
Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion
Getting started with Kafka
Exploring the Kafka architecture
The PubSub design
How Kafka delivers exactly-once semantics
First producer and consumer
Streaming from a database with Kafka Connect
Real-time data processing with Kafka and Spark
Part 3: Connecting It All Together
Chapter 8: Deploying the Big Data Stack on Kubernetes
Deploying Spark on Kubernetes
Deploying Airflow on Kubernetes
Deploying Kafka on Kubernetes
Chapter 9: Data Consumption Layer
Getting started with SQL query engines
The limitations of traditional data warehouses
The rise of SQL query engines
The architecture of SQL query engines
Deploying Trino in Kubernetes
Connecting DBeaver with Trino
Deploying Elasticsearch in Kubernetes
How Elasticsearch stores, indexes and manages data
Elasticsearch deployment
Chapter 10: Building a Big Data Pipeline on Kubernetes
Technical requirements
Checking the deployed tools
Building a batch pipeline
Building the Airflow DAG
Creating SparkApplication jobs
Creating a Glue crawler
Building a real-time pipeline
Deploying Kafka Connect and Elasticsearch
Real-time processing with Spark
Deploying the Elasticsearch sink connector
Chapter 11: Generative AI on Kubernetes
What generative AI is and what it is not
The power of large neural networks
Challenges and limitations
Using Amazon Bedrock to work with foundational models
Building a generative AI application on Kubernetes
Deploying the Streamlit app
Building RAG with Knowledge Bases for Amazon Bedrock
Adjusting the code for RAG retrieval
Building action models with agents
Creating a DynamoDB table
Configuring the agent
Deploying the application on Kubernetes
Chapter 12: Where to Go from Here
Important topics for big data in Kubernetes
Kubernetes monitoring and application monitoring
Building a service mesh
Security considerations
Automated scalability
GitOps and CI/CD for Kubernetes
Kubernetes cost control
What about team skills?
Key skills for monitoring
Skills for GitOps and CI/CD
Cost control skills
Index
Other Books You May Enjoy
Notes:
Description based on publisher supplied metadata and other sources.
Description based on print version record.
Other Format:
Print version: Crepalde, Neylson. Big Data on Kubernetes
ISBN:
9781835468999
OCLC:
1442730873
