Delta Lake: up and running : modern data Lakehouse architectures with Delta Lake / Bennie Haelen and Dan Davis.

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Haelen, Bennie, author.
Davis, Dan, author.
Language:
English
Subjects (All):
Storage area networks (Computer networks).
Computer network architectures.
Cloud computing.
Physical Description:
1 online resource
Edition:
First edition.
Place of Publication:
Sebastopol, California : O'Reilly Media, Inc., 2023.
Summary:
With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS. This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights.
You'll learn how to:
Use modern data management and data engineering techniques
Understand how ACID transactions bring reliability to data lakes at scale
Run streaming and batch jobs against your data lake concurrently
Execute update, delete, and merge commands against your data lake
Use time travel to roll back and examine previous data versions
Build a streaming data quality pipeline following the medallion architecture
Contents:
Intro
Copyright
Table of Contents
Preface
How to Contact Us
Conventions Used in This Book
Using Code Examples
O'Reilly Online Learning
Acknowledgment
Chapter 1. The Evolution of Data Architectures
A Brief History of Relational Databases
Data Warehouses
Data Warehouse Architecture
Dimensional Modeling
Data Warehouse Benefits and Challenges
Introducing Data Lakes
Data Lakehouse
Data Lakehouse Benefits
Implementing a Lakehouse
Delta Lake
The Medallion Architecture
The Delta Ecosystem
Delta Lake Storage
Delta Sharing
Delta Connectors
Conclusion
Chapter 2. Getting Started with Delta Lake
Getting a Standard Spark Image
Using Delta Lake with PySpark
Running Delta Lake in the Spark Scala Shell
Running Delta Lake on Databricks
Creating and Running a Spark Program: helloDeltaLake
The Delta Lake Format
Parquet Files
Writing a Delta Table
The Delta Lake Transaction Log
How the Transaction Log Implements Atomicity
Breaking Down Transactions into Atomic Commits
The Transaction Log at the File Level
Scaling Massive Metadata
Conclusion
Chapter 3. Basic Operations on Delta Tables
Creating a Delta Table
Creating a Delta Table with SQL DDL
The DESCRIBE Statement
Creating Delta Tables with the DataFrameWriter API
Creating a Delta Table with the DeltaTableBuilder API
Generated Columns
Reading a Delta Table
Reading a Delta Table with SQL
Reading a Table with PySpark
Writing to a Delta Table
Cleaning Out the YellowTaxis Table
Inserting Data with SQL INSERT
Appending a DataFrame to a Table
Using the OverWrite Mode When Writing to a Delta Table
Inserting Data with the SQL COPY INTO Command
Partitions
User-Defined Metadata
Using SparkSession to Set Custom Metadata
Using the DataFrameWriter to Set Custom Metadata
Chapter 4. Table Deletes, Updates, and Merges
Deleting Data from a Delta Table
Table Creation and DESCRIBE HISTORY
Performing the DELETE Operation
DELETE Performance Tuning Tips
Updating Data in a Table
Use Case Description
UPDATE Performance Tuning Tips
Upsert Data Using the MERGE Operation
The MERGE Dataset
The MERGE Statement
Analyzing the MERGE Operation with DESCRIBE HISTORY
Inner Workings of the MERGE Operation
Chapter 5. Performance Tuning
Data Skipping
Partitioning
Partitioning Warnings and Considerations
Compact Files
Compaction
OPTIMIZE
ZORDER BY
ZORDER BY Considerations
Liquid Clustering
Enabling Liquid Clustering
Operations on Clustered Columns
Liquid Clustering Warnings and Considerations
Chapter 6. Using Time Travel
Delta Lake Time Travel
Restoring a Table
Restoring via Timestamp
Time Travel Under the Hood
Notes:
OCLC-licensed vendor bibliographic record.
Includes index.
ISBN:
9781098139711
1098139712
OCLC:
1404818352
