Modern data architectures with Python : a practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python / Brian Lipp.
- Format:
- Book
- Author/Creator:
- Lipp, Brian, author.
- Language:
- English
- Subjects (All):
- Python (Computer program language).
- Data structures (Computer science).
- Big data.
- Physical Description:
- 1 online resource (318 pages)
- Edition:
- First edition.
- Place of Publication:
- Birmingham, England : Packt Publishing Ltd., [2023]
- System Details:
- Mode of access: World Wide Web.
- Summary:
- Modern Data Architectures with Python will teach you how to seamlessly incorporate your machine learning and data science work streams into your open data platforms. You'll learn how to take your data and create open lakehouses that work with any technology using tried-and-true techniques, including the medallion architecture and Delta Lake. Starting with the fundamentals, this book will help you build pipelines on Databricks, an open data platform, using SQL and Python. You'll gain an understanding of notebooks and applications written in Python using standard software engineering tools such as Git, pre-commit, Jenkins, and GitHub.
- Contents:
- Cover
- Title Page
- Copyright and Credits
- Dedications
- Contributors
- Table of Contents
- Preface
- Part 1: Fundamental Data Knowledge
- Chapter 1: Modern Data Processing Architecture
- Technical requirements
- Databases, data warehouses, and data lakes
- OLTP
- OLAP
- Data lakes
- Event stores
- File formats
- Data platform architecture at a high level
- Comparing the Lambda and Kappa architectures
- Lambda architecture
- Kappa architecture
- Lakehouse and Delta architectures
- Lakehouses
- The seven central tenets
- The medallion data pattern and the Delta architecture
- Data mesh theory and practice
- Defining terms
- The four principles of data mesh
- Summary
- Practical lab
- Solution
- Chapter 2: Understanding Data Analytics
- Setting up your environment
- Python
- venv
- Graphviz
- Workflow initialization
- Cleaning and preparing your data
- Duplicate values
- Working with nulls
- Using RegEx
- Outlier identification
- Casting columns
- Fixing column names
- Complex data types
- Data documentation
- diagrams
- Data lineage graphs
- Data modeling patterns
- Relational
- Dimensional modeling
- Key terms
- OBT
- Loading the problem data
- Part 2: Data Engineering Toolset
- Chapter 3: Apache Spark Deep Dive
- Python, AWS, and Databricks
- Databricks CLI
- Cloud data storage
- Object storage
- NoSQL
- Spark architecture
- Introduction to Apache Spark
- Key components
- Working with partitions
- Shuffling partitions
- Caching
- Broadcasting
- Job creation pipeline
- Delta Lake
- Transaction log
- Grouping tables with databases
- Table
- Adding speed with Z-ordering
- Bloom filters
- Problem 1
- Problem 2
- Problem 3
- Chapter 4: Batch and Stream Data Processing Using PySpark
- Batch processing
- Partitioning
- Data skew
- Reading data
- Spark schemas
- Making decisions
- Removing unwanted columns
- Working with data in groups
- The UDF
- Stream processing
- Reading from disk
- Debugging
- Writing to disk
- Batch stream hybrid
- Delta streaming
- Batch processing in a stream
- Setup
- Creating fake data
- Problem 2
- Problem 3
- Solution 1
- Solution 2
- Solution 3
- Chapter 5: Streaming Data with Kafka
- Confluent Kafka
- Signing up
- Kafka architecture
- Topics
- Partitions
- Brokers
- Producers
- Consumers
- Schema Registry
- Kafka Connect
- Spark and Kafka
- Part 3: Modernizing the Data Platform
- Chapter 6: MLOps
- Introduction to machine learning
- Understanding data
- The basics of feature engineering
- Splitting up your data
- Fitting your data
- Cross-validation
- Understanding hyperparameters and parameters
- Training our model
- Working together
- AutoML
- MLflow
- MLOps benefits
- Feature stores
- Hyperopt
- Create an MLflow project
- Chapter 7: Data and Information Visualization
- Principles of data visualization
- Understanding your user
- Validating your data
- Data visualization using notebooks
- Line charts
- Bar charts
- Histograms
- Scatter plots
- Pie charts
- Bubble charts
- A single line chart
- A multiple line chart
- A bar chart
- A scatter plot
- A histogram
- A bubble chart
- GUI data visualizations
- Tips and tricks with Databricks notebooks
- Magic
- Markdown
- Other languages
- Terminal
- Filesystem
- Running other notebooks
- Widgets
- Databricks SQL analytics
- Accessing SQL analytics
- SQL Warehouses
- SQL editors
- Queries
- Dashboards
- Alerts
- Query history
- Connecting BI tools
- Loading problem data
- Chapter 8: Integrating Continuous Integration into Your Workflow
- Databricks
- The DBX CLI
- Docker
- Git
- GitHub
- Pre-commit
- Terraform
- Install Jenkins, container setup, and compose
- CI tooling
- Git and GitHub
- Python wheels and packages
- Anatomy of a package
- DBX
- Important commands
- Testing code
- Terraform - IaC
- IaC
- The CLI
- HCL
- Jenkins
- Jenkinsfile
- Chapter 9: Orchestrating Your Data Workflows
- Orchestrating data workloads
- Making life easier with Autoloader
- Reading
- Writing
- Two modes
- Useful options
- Databricks Workflows
- Failed runs
- REST APIs
- The Databricks API
- Python code
- Logging
- Lambda code
- Notebook code
- Part 4: Hands-on Project
- Chapter 10: Data Governance
- The Databricks CLI
- What is data governance?
- Data standards
- Data catalogs
- Data lineage
- Data security and privacy
- Data quality
- Great Expectations
- Creating test data
- Data context
- Data source
- Batch request
- Validator
- Adding tests
- Saving the suite
- Creating a checkpoint
- Datadocs
- Testing new data
- Profiler
- Databricks Unity
- Chapter 11: Building out the Groundwork
- pre-commit
- PyPI
- Creating GitHub repos
- Terraform setup
- Initial file setup
- Schema repository
- ML repository
- Infrastructure repository
- Chapter 12: Completing Our Project
- Documentation
- Schema diagram
- C4 System Context diagram
- Faking data with Mockaroo
- Managing our schemas with code
- Building our data pipeline application
- Creating our machine learning application
- Displaying our data with dashboards
- Index
- Other Books You May Enjoy
- Notes:
- Includes bibliographical references and index.
- Description based on print version record.
- ISBN:
- 9781801076418
- 1801076413
- OCLC:
- 1398279353