Engineering Lakehouses with Open Table Formats : Build Scalable and Efficient Lakehouses with Apache Iceberg, Apache Hudi, and Delta Lake.
- Format:
- Book
- Author/Creator:
- Mazumdar, Dipankar.
- Language:
- English
- Subjects (All):
- Apache Hudi (Electronic resource).
- Apache Iceberg (Electronic resource).
- Delta Lake (Electronic resource).
- Information storage and retrieval systems.
- Open source software.
- Physical Description:
- 1 online resource (414 pages)
- Edition:
- 1st ed.
- Publication:
- Birmingham : Packt Publishing, Limited, 2025.
- Summary:
- Jump-start your journey toward mastering open data architectural patterns by learning the fundamentals and applications of open table formats. Key features include building lakehouses with open table formats using compute engines such as Apache Spark, Flink, Trino, and Python, and optimizing lakehouses with techniques such as pruning, partitioning, and compaction.
- Contents:
- Cover
- Title Page
- Copyright Page
- Contributors
- Table of Contents
- Preface
- Free Benefits with Your Book
- Chapter 1: Open Data Lakehouse: A New Architectural Paradigm
- The evolution of data systems
- OLTP: The transactional backbone
- OLAP: Analyzing historical data
- Data lakes: The centralized data storage for a new era
- The emergence of the lakehouse architecture
- An introduction to data lakehouses
- Inside the lakehouse architecture
- Lake storage
- File formats
- Table formats
- Storage engine
- Compute engine
- Catalog
- Attributes of an open data lakehouse
- Open data architecture
- Unification of batch and streaming
- Cost efficiency
- Improvements in query performance
- Reliable transactions
- Interoperability across diverse compute engines
- Summary
- Questions
- Answers
- Chapter 2: Transactional Capabilities of the Lakehouse
- Understanding transactions and ACID properties
- Deep dive into ACID properties
- ACID properties in traditional databases
- ACID properties in lakehouse architectures
- Discovering conflict resolution mechanisms
- Types of conflict resolutions
- Pessimistic concurrency control
- Optimistic concurrency control
- Multi-version concurrency control
- Locking granularity
- Conflict resolution in distributed systems
- Conflict resolution in lakehouse architectures
- Understanding the storage engine
- Components of a traditional database storage engine
- How a lakehouse handles transactions
- ACID guarantees
- Table management services
- Chapter 3: Apache Iceberg Deep Dive
- Apache Iceberg architecture
- What happens during writes?
- What happens during reads?
- Catalog considerations for production
- Metadata
- Metadata file
- Manifest list
- Manifest file
- Data layer
- Data files
- Delete files
- Apache Iceberg features
- Critical capabilities
- Schema evolution
- Time travel and rollback
- Row-level updates/deletes: Copy-on-Write (CoW) and Merge-on-Read (MoR) tables
- Unique capabilities
- Hidden partitioning
- Partition evolution
- Branching and tagging
- Advanced statistics with Puffin files
- Hands-on with Apache Iceberg and Apache Spark
- Installation requirements
- Write and read operations with Spark and Iceberg
- Getting the required packages
- Configuring catalogs
- DDL statements
- Adding a column
- Renaming a column
- Dropping a column
- Adding a partition field
- DML statements
- Read queries
- Iceberg procedures
- Expire snapshot
- Roll back to snapshot
- Remove orphan files
- Rewrite data files
- Add files
- Hands-on with Apache Iceberg and Apache Flink
- Configuration (Flink, Iceberg, storage)
- Setting up Flink SQL
- Flink SQL Client
- Catalog configuration
- CREATE CATALOG
- Key properties
- CREATE DATABASE
- CREATE TABLE
- INSERT INTO
- INSERT OVERWRITE
- UPSERT
- Batch reads
- Streaming reads
- PyIceberg
- Installation
- Connecting to a catalog
- Create table and insert records
- DuckDB
- Read metadata
- Daft
- Write data
- Chapter 4: Apache Hudi Deep Dive
- Technical requirements
- Architecture
- Metadata layer
- Core metadata files
- The importance of metadata
- Base files: The foundation
- Log files: The change log
- File slices and file groups: Organizing data
- Design principles of Apache Hudi
- Hudi's indexing mechanisms
- Catalog integration
- Hive sync tool
- Apache Hudi features
- Row-level updates and deletes
- Hudi Streamer
- Hands-on with Apache Hudi and Apache Spark
- Write and read operations with Spark and Hudi
- Syncing to catalogs
- Hands-on with Apache Hudi and Apache Flink
- Configuration (Flink, Hudi, storage)
- Prerequisites
- Steps to start the SQL CLI
- Configuration for SQL operations
- Syncing to a catalog
- UPDATE
- DELETE FROM
- Setting writer/reader configs
- Hudi table services
- Compaction
- Clustering
- Clustering types
- Inline clustering
- Async clustering
- Cleaning orphan files
- Importance of cleaning
- Cleaning retention policies
- Triggering cleaning
- Configuration
- Rollback mechanism
- Handling failed commits
- File sizing
- Auto-sizing during writes
- Clustering after writes
- Disaster recovery
- Chapter 5: Delta Lake Deep Dive
- Delta Lake architecture
- Transaction log
- commitInfo
- protocol
- metaData
- add/remove
- Log checkpoints
- Transaction log protocol
- How the transaction log protocol works
- Protocol versions and features
- Limitations of protocol versions
- Table features
- Reader and writer features in the transaction log
- Enforcing ACID properties via the transaction protocol
- Atomicity - all-or-nothing transactions
- Consistency - validating transactions against rules
- Isolation - preventing interference in concurrent operations
- Durability - preserving changes across failures
- Highlights
- How read and write work in Delta Lake
- Life cycle of a read query
- Life cycle of a write query
- Delta Lake features
- Schema enforcement and evolution
- Schema enforcement
- Schema evolution
- Time travel
- Row-level upserts/deletes - copy-on-write and merge-on-read
- Copy-on-Write (CoW)
- Merge-on-Read (MoR)
- Change data feed
- How CDF works
- Liquid clustering
- How liquid clustering works
- Generated columns
- How generated columns work
- Delta Kernel
- Core functionality
- Abstraction and simplification
- Cross-engine compatibility
- Extensibility and advanced integration
- The Delta Kernel advantage
- Delta Sharing
- Hands-on with Delta Lake and Apache Spark
- Write and read operations with Spark and Delta
- Configuring catalogs for Delta Lake
- Delta Lake commands and procedures
- Hands-on with Delta Lake and Apache Flink
- Bounded mode (batch reading)
- Continuous mode (streaming)
- Best practices
- Chapter 6: Catalog and Metadata Management
- The importance of catalogs in a lakehouse architecture
- Introduction to catalogs
- Why metadata management matters
- Core features of a catalog
- Role of catalogs across table formats
- Challenges without a catalog
- Iceberg REST catalog specification
- What is the REST catalog?
- Architecture of the REST catalog
- Benefits of using the REST catalog
- Implementing an Iceberg REST catalog
- Server implementation and setup
- Client implementation and configuration
- Using the REST catalog
- Creating a table
- Loading a table
- Performing a scan
- Committing changes
- Best practices for REST catalogs
- Popular catalog options for lakehouses
- Hive Metastore (HMS) - the established option
- Modern relevance
- Hands-on example
- Unity Catalog - a unified governance solution
- Governance in action
- Hands-on example
- Polaris - a next-generation metadata service
- AWS Glue Catalog - a cloud-native option
- Apache Gravitino - an emerging open source catalog
- Other popular catalogs
- Project Nessie
- Azure Purview
- Google Cloud Data Catalog
- Chapter 7: Interoperability in Lakehouses
- Need for interoperability
- Apache XTable (incubating)
- Apache XTable architecture
- Inner workings of XTable sync
- Full sync
- Incremental sync
- How to run translation with Apache XTable and Apache Spark
- Run translation
- XTable limitations
- Delta UniForm
- Inner workings of UniForm
- Metadata generation in UniForm
- How to use UniForm
- Create a new Delta Lake table with UniForm
- Enable UniForm on an existing Delta Lake table
- UniForm limitations
- Use cases for interoperability
- Chapter 8: Performance Optimization and Tuning in a Lakehouse
- Performance optimization
- Optimization techniques in open table formats
- Partition pruning
- Apache Iceberg
- Apache Hudi
- Delta Lake
- Balancing the cost of compaction
- Cleaning
- Practical tuning cheat sheet
- Delta Lake (managed runtimes such as Databricks or Fabric)
- Putting it together
- Query optimization techniques
- Leveraging column statistics and metadata for efficient data pruning
- Bloom filters and advanced indexing strategies
- Delta Lake
- Vectorized execution and hardware-accelerated processing
- Notes:
- Description based on publisher supplied metadata and other sources.
- ISBN:
- 1-83620-722-0
- 9781836207221
- OCLC:
- 1552154707
- Publisher Number:
- CIPO000306484