Modern data architectures with Python : a practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python / Brian Lipp.

EBSCOhost Academic eBook Collection (North America) Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Lipp, Brian, author.
Language:
English
Subjects (All):
Python (Computer program language).
Data structures (Computer science).
Big data.
Physical Description:
1 online resource (318 pages)
Edition:
First edition.
Place of Publication:
Birmingham, England : Packt Publishing Ltd., [2023]
System Details:
Mode of access: World Wide Web.
Summary:
Modern Data Architectures with Python will teach you how to seamlessly incorporate your machine learning and data science work streams into your open data platforms. You'll learn how to take your data and create open lakehouses that work with any technology using tried-and-true techniques, including the medallion architecture and Delta Lake. Starting with the fundamentals, this book will help you build pipelines on Databricks, an open data platform, using SQL and Python. You'll gain an understanding of notebooks and applications written in Python using standard software engineering tools such as Git, pre-commit, Jenkins, and GitHub.
Contents:
Cover
Title Page
Copyright and Credits
Dedications
Contributors
Table of Contents
Preface
Part 1: Fundamental Data Knowledge
Chapter 1: Modern Data Processing Architecture
Technical requirements
Databases, data warehouses, and data lakes
OLTP
OLAP
Data lakes
Event stores
File formats
Data platform architecture at a high level
Comparing the Lambda and Kappa architectures
Lambda architecture
Kappa architecture
Lakehouse and Delta architectures
Lakehouses
The seven central tenets
The medallion data pattern and the Delta architecture
Data mesh theory and practice
Defining terms
The four principles of data mesh
Summary
Practical lab
Solution
Chapter 2: Understanding Data Analytics
Setting up your environment
Python
venv
Graphviz
Workflow initialization
Cleaning and preparing your data
Duplicate values
Working with nulls
Using RegEx
Outlier identification
Casting columns
Fixing column names
Complex data types
Data documentation
diagrams
Data lineage graphs
Data modeling patterns
Relational
Dimensional modeling
Key terms
OBT
Loading the problem data
Part 2: Data Engineering Toolset
Chapter 3: Apache Spark Deep Dive
Python, AWS, and Databricks
Databricks CLI
Cloud data storage
Object storage
NoSQL
Spark architecture
Introduction to Apache Spark
Key components
Working with partitions
Shuffling partitions
Caching
Broadcasting
Job creation pipeline
Delta Lake
Transaction log
Grouping tables with databases
Table
Adding speed with Z-ordering
Bloom filters
Problem 1
Problem 2
Problem 3
Chapter 4: Batch and Stream Data Processing Using PySpark
Batch processing
Partitioning
Data skew
Reading data
Spark schemas
Making decisions
Removing unwanted columns
Working with data in groups
The UDF
Stream processing
Reading from disk
Debugging
Writing to disk
Batch stream hybrid
Delta streaming
Batch processing in a stream
Setup
Creating fake data
Problem 2
Problem 3
Solution 1
Solution 2
Solution 3
Chapter 5: Streaming Data with Kafka
Confluent Kafka
Signing up
Kafka architecture
Topics
Partitions
Brokers
Producers
Consumers
Schema Registry
Kafka Connect
Spark and Kafka
Part 3: Modernizing the Data Platform
Chapter 6: MLOps
Introduction to machine learning
Understanding data
The basics of feature engineering
Splitting up your data
Fitting your data
Cross-validation
Understanding hyperparameters and parameters
Training our model
Working together
AutoML
MLflow
MLOps benefits
Feature stores
Hyperopt
Create an MLflow project
Chapter 7: Data and Information Visualization
Principles of data visualization
Understanding your user
Validating your data
Data visualization using notebooks
Line charts
Bar charts
Histograms
Scatter plots
Pie charts
Bubble charts
A single line chart
A multiple line chart
A bar chart
A scatter plot
A histogram
A bubble chart
GUI data visualizations
Tips and tricks with Databricks notebooks
Magic
Markdown
Other languages
Terminal
Filesystem
Running other notebooks
Widgets
Databricks SQL analytics
Accessing SQL analytics
SQL Warehouses
SQL editors
Queries
Dashboards
Alerts
Query history
Connecting BI tools
Loading problem data
Chapter 8: Integrating Continuous Integration into Your Workflow
Databricks
The DBX CLI
Docker
Git
GitHub
Pre-commit
Terraform
Install Jenkins, container setup, and compose
CI tooling
Git and GitHub
Python wheels and packages
Anatomy of a package
DBX
Important commands
Testing code
Terraform - IaC
IaC
The CLI
HCL
Jenkins
Jenkinsfile
Chapter 9: Orchestrating Your Data Workflows
Orchestrating data workloads
Making life easier with Autoloader
Reading
Writing
Two modes
Useful options
Databricks Workflows
Failed runs
REST APIs
The Databricks API
Python code
Logging
Lambda code
Notebook code
Part 4: Hands-on Project
Chapter 10: Data Governance
The Databricks CLI
What is data governance?
Data standards
Data catalogs
Data lineage
Data security and privacy
Data quality
Great Expectations
Creating test data
Data context
Data source
Batch request
Validator
Adding tests
Saving the suite
Creating a checkpoint
Datadocs
Testing new data
Profiler
Databricks Unity
Chapter 11: Building out the Groundwork
pre-commit
PyPI
Creating GitHub repos
Terraform setup
Initial file setup
Schema repository
ML repository
Infrastructure repository
Chapter 12: Completing Our Project
Documentation
Schema diagram
C4 System Context diagram
Faking data with Mockaroo
Managing our schemas with code
Building our data pipeline application
Creating our machine learning application
Displaying our data with dashboards
Index
Other Books You May Enjoy
Notes:
Includes bibliographical references and index.
Description based on print version record.
ISBN:
9781801076418
1801076413
OCLC:
1398279353
