Delta Lake: up and running : modern data Lakehouse architectures with Delta Lake / Bennie Haelen and Dan Davis.
- Format:
- Book
- Author/Creator:
- Haelen, Bennie, author.
- Davis, Dan, author.
- Language:
- English
- Subjects (All):
- Storage area networks (Computer networks).
- Computer network architectures.
- Cloud computing.
- Physical Description:
- 1 online resource
- Edition:
- First edition.
- Place of Publication:
- Sebastopol, California : O'Reilly Media, Inc., 2023.
- Summary:
- With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS. This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights. You'll learn how to: use modern data management and data engineering techniques; understand how ACID transactions bring reliability to data lakes at scale; run streaming and batch jobs against your data lake concurrently; execute update, delete, and merge commands against your data lake; use time travel to roll back and examine previous data versions; and build a streaming data quality pipeline following the medallion architecture.
- Contents:
- Intro
- Copyright
- Table of Contents
- Preface
- How to Contact Us
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- Acknowledgment
- Chapter 1. The Evolution of Data Architectures
- A Brief History of Relational Databases
- Data Warehouses
- Data Warehouse Architecture
- Dimensional Modeling
- Data Warehouse Benefits and Challenges
- Introducing Data Lakes
- Data Lakehouse
- Data Lakehouse Benefits
- Implementing a Lakehouse
- Delta Lake
- The Medallion Architecture
- The Delta Ecosystem
- Delta Lake Storage
- Delta Sharing
- Delta Connectors
- Conclusion
- Chapter 2. Getting Started with Delta Lake
- Getting a Standard Spark Image
- Using Delta Lake with PySpark
- Running Delta Lake in the Spark Scala Shell
- Running Delta Lake on Databricks
- Creating and Running a Spark Program: helloDeltaLake
- The Delta Lake Format
- Parquet Files
- Writing a Delta Table
- The Delta Lake Transaction Log
- How the Transaction Log Implements Atomicity
- Breaking Down Transactions into Atomic Commits
- The Transaction Log at the File Level
- Scaling Massive Metadata
- Conclusion
- Chapter 3. Basic Operations on Delta Tables
- Creating a Delta Table
- Creating a Delta Table with SQL DDL
- The DESCRIBE Statement
- Creating Delta Tables with the DataFrameWriter API
- Creating a Delta Table with the DeltaTableBuilder API
- Generated Columns
- Reading a Delta Table
- Reading a Delta Table with SQL
- Reading a Table with PySpark
- Writing to a Delta Table
- Cleaning Out the YellowTaxis Table
- Inserting Data with SQL INSERT
- Appending a DataFrame to a Table
- Using the OverWrite Mode When Writing to a Delta Table
- Inserting Data with the SQL COPY INTO Command
- Partitions
- User-Defined Metadata
- Using SparkSession to Set Custom Metadata
- Using the DataFrameWriter to Set Custom Metadata
- Chapter 4. Table Deletes, Updates, and Merges
- Deleting Data from a Delta Table
- Table Creation and DESCRIBE HISTORY
- Performing the DELETE Operation
- DELETE Performance Tuning Tips
- Updating Data in a Table
- Use Case Description
- UPDATE Performance Tuning Tips
- Upsert Data Using the MERGE Operation
- The MERGE Dataset
- The MERGE Statement
- Analyzing the MERGE Operation with DESCRIBE HISTORY
- Inner Workings of the MERGE Operation
- Chapter 5. Performance Tuning
- Data Skipping
- Partitioning
- Partitioning Warnings and Considerations
- Compact Files
- Compaction
- OPTIMIZE
- ZORDER BY
- ZORDER BY Considerations
- Liquid Clustering
- Enabling Liquid Clustering
- Operations on Clustered Columns
- Liquid Clustering Warnings and Considerations
- Chapter 6. Using Time Travel
- Delta Lake Time Travel
- Restoring a Table
- Restoring via Timestamp
- Time Travel Under the Hood
- Notes:
- OCLC-licensed vendor bibliographic record.
- Includes index.
- ISBN:
- 9781098139711
- 1098139712
- OCLC:
- 1404818352