Engineering Lakehouses with Open Table Formats : Build Scalable and Efficient Lakehouses with Apache Iceberg, Apache Hudi, and Delta Lake.

O'Reilly Online Learning: Academic/Public Library Edition (available online)

Format:
Book
Author/Creator:
Mazumdar, Dipankar.
Language:
English
Subjects (All):
Apache Hudi (Electronic resource).
Apache Iceberg (Electronic resource).
Delta Lake (Electronic resource).
Information storage and retrieval systems.
Open source software.
Physical Description:
1 online resource (414 pages)
Edition:
1st ed.
Publication:
Birmingham : Packt Publishing, Limited, 2025.
Summary:
Jump-start your journey toward mastering open data architectural patterns by learning the fundamentals and applications of open table formats.
Key Features:
Build lakehouses with open table formats using compute engines such as Apache Spark, Flink, Trino, and Python
Optimize lakehouses with techniques such as pruning, partitioning, and compaction.
Contents:
Cover
Title Page
Copyright Page
Contributors
Table of Contents
Preface
Free Benefits with Your Book
Chapter 1: Open Data Lakehouse: A New Architectural Paradigm
The evolution of data systems
OLTP: The transactional backbone
OLAP: Analyzing historical data
Data lakes: The centralized data storage for a new era
The emergence of the lakehouse architecture
An introduction to data lakehouses
Inside the lakehouse architecture
Lake storage
File formats
Table formats
Storage engine
Compute engine
Catalog
Attributes of an open data lakehouse
Open data architecture
Unification of batch and streaming
Cost efficiency
Improvements in query performance
Reliable transactions
Interoperability across diverse compute engines
Summary
Questions
Answers
Chapter 2: Transactional Capabilities of the Lakehouse
Understanding transactions and ACID properties
Deep dive into ACID properties
ACID properties in traditional databases
ACID properties in lakehouse architectures
Discovering conflict resolution mechanisms
Types of conflict resolutions
Pessimistic concurrency control
Optimistic concurrency control
Multi-version concurrency control
Locking granularity
Conflict resolution in distributed systems
Conflict resolution in lakehouse architectures
Understanding the storage engine
Components of a traditional database storage engine
How a lakehouse handles transactions
ACID guarantees
Table management services
Chapter 3: Apache Iceberg Deep Dive
Apache Iceberg architecture
What happens during writes?
What happens during reads?
Catalog considerations for production
Metadata
Metadata file
Manifest list
Manifest file
Data layer
Data files
Delete files
Apache Iceberg features
Critical capabilities
Schema evolution
Time travel and rollback
Row-level updates/deletes: Copy-on-Write (CoW) and Merge-on-Read (MoR) tables
Unique capabilities
Hidden partitioning
Partition evolution
Branching and tagging
Advanced statistics with Puffin files
Hands-on with Apache Iceberg and Apache Spark
Installation requirements
Write and read operations with Spark and Iceberg
Getting the required packages
Configuring catalogs
DDL statements
Adding a column
Renaming a column
Dropping a column
Adding a partition field
DML statements
Read queries
Iceberg procedures
Expire snapshot
Roll back to snapshot
Remove orphan files
Rewrite data files
Add files
Hands-on with Apache Iceberg and Apache Flink
Configuration (Flink, Iceberg, storage)
Setting up Flink SQL
Flink SQL Client
Catalog configuration
CREATE CATALOG
Key properties
CREATE DATABASE
CREATE TABLE
INSERT INTO
INSERT OVERWRITE
UPSERT
Batch reads
Streaming reads
PyIceberg
Installation
Connecting to a catalog
Create table and insert records
DuckDB
Read metadata
Daft
Write data
Chapter 4: Apache Hudi Deep Dive
Technical requirements
Architecture
Metadata layer
Core metadata files
The importance of metadata
Base files: The foundation
Log files: The change log
File slices and file groups: Organizing data
Design principles of Apache Hudi
Hudi's indexing mechanisms
Catalog integration
Hive sync tool
Apache Hudi features
Row-level updates and deletes
Hudi Streamer
Hands-on with Apache Hudi and Apache Spark
Write and read operations with Spark and Hudi
Syncing to catalogs
Hands-on with Apache Hudi and Apache Flink
Configuration (Flink, Hudi, storage)
Prerequisites
Steps to start the SQL CLI
Configuration for SQL operations
Syncing to a catalog
UPDATE
DELETE FROM
Setting writer/reader configs
Hudi table services
Compaction
Clustering
Clustering types
Inline clustering
Async clustering
Cleaning orphan files
Importance of cleaning
Cleaning retention policies
Triggering cleaning
Configuration
Rollback mechanism
Handling failed commits
File sizing
Auto-sizing during writes
Clustering after writes
Disaster recovery
Chapter 5: Delta Lake Deep Dive
Delta Lake architecture
Transaction log
commitInfo
protocol
metaData
add/remove
Log checkpoints
Transaction log protocol
How the transaction log protocol works
Protocol versions and features
Limitations of protocol versions
Table features
Reader and writer features in the transaction log
Enforcing ACID properties via the transaction protocol
Atomicity - all-or-nothing transactions
Consistency - validating transactions against rules
Isolation - preventing interference in concurrent operations
Durability - preserving changes across failures
Highlights
How read and write work in Delta Lake
Life cycle of a read query
Life cycle of a write query
Delta Lake features
Schema enforcement and evolution
Schema enforcement
Schema evolution
Time travel
Row-level upserts/deletes - copy-on-write and merge-on-read
Copy-on-Write (CoW)
Merge-on-Read (MoR)
Change data feed
How CDF works
Liquid clustering
How Liquid clustering works
Generated column
How generated columns work
Delta Kernel
Core functionality
Abstraction and simplification
Cross-engine compatibility
Extensibility and advanced integration
The Delta Kernel advantage
Delta Sharing
Hands-on with Delta Lake and Apache Spark
Write and read operations with Spark and Delta
Configuring catalogs for Delta Lake
Delta Lake commands and procedures
Hands-on with Delta Lake and Apache Flink
Bounded mode (batch reading)
Continuous mode (streaming)
Best practices
Chapter 6: Catalog and Metadata Management
The importance of catalogs in a lakehouse architecture
Introduction to catalogs
Why metadata management matters
Core features of a catalog
Role of catalogs across table formats
Challenges without a catalog
Iceberg REST catalog specification
What is the REST catalog?
Architecture of the REST catalog
Benefits of using the REST catalog
Implementing an Iceberg REST catalog
Server implementation and setup
Client implementation and configuration
Using the REST catalog
Creating a table
Loading a table
Performing a scan
Committing changes
Best practices for REST catalogs
Popular catalog options for lakehouses
Hive Metastore (HMS) - the established option
Modern relevance
Hands-on example
Unity Catalog - a unified governance solution
Governance in action
Hands-on example
Polaris - a next-generation metadata service
AWS Glue Catalog - a cloud-native option
Apache Gravitino - an emerging open source catalog
Other popular catalogs
Project Nessie
Azure Purview
Google Cloud Data Catalog
Chapter 7: Interoperability in Lakehouses
Need for interoperability
Apache XTable (incubating)
Apache XTable architecture
Inner workings of XTable sync
Full sync
Incremental sync
How to run translation with Apache XTable and Apache Spark
Run translation
XTable limitations
Delta UniForm
Inner workings of UniForm
Metadata generation in UniForm
How to use UniForm
Create a new Delta Lake table with UniForm
Enable UniForm on an existing Delta Lake table
UniForm limitations
Use cases for interoperability
Chapter 8: Performance Optimization and Tuning in a Lakehouse
Performance optimization
Optimization techniques in open table formats
Partition pruning
Apache Iceberg
Apache Hudi
Delta Lake
Balancing the cost of compaction
Cleaning
Practical Tuning Cheat Sheet
Delta Lake (managed runtimes such as Databricks or Fabric)
Putting it together
Query optimization techniques
Leveraging column statistics and metadata for efficient data pruning
Bloom filters and advanced indexing strategies
Delta Lake
Vectorized execution and hardware-accelerated processing
Notes:
Description based on publisher supplied metadata and other sources.
ISBN:
1-83620-722-0
9781836207221
OCLC:
1552154707
Publisher Number:
CIPO000306484
