Building ETL Pipelines with Python : Create and Deploy Enterprise-Ready ETL Pipelines by Employing Modern Methods / Brij Kishore Pandey and Emily Ro Schoof.
- Format:
- Book
- Author/Creator:
- Pandey, Brij Kishore, author.
- Schoof, Emily Ro, author.
- Language:
- English
- Subjects (All):
- Data mining.
- Python (Computer program language).
- Big data.
- Electronic data processing.
- Physical Description:
- 1 online resource (0 pages)
- Edition:
- First edition.
- Place of Publication:
- Birmingham, England : Packt Publishing Ltd., [2023]
- Summary:
- Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases.
- Key Features: Understand how to set up a Python virtual environment with PyCharm. Learn functional and object-oriented approaches to create ETL pipelines. Create robust CI/CD processes for ETL pipelines. Purchase of the print or Kindle book includes a free PDF eBook.
- Book Description: Modern extract, transform, and load (ETL) pipelines for data engineering have favored Python for its broad range of uses and its large assortment of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as the undisputed choice for data processing. In this book, you'll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and establishing a Python development environment to create pipelines. Once you've explored ETL pipeline design principles and the ETL development process, you'll be equipped to design custom ETL pipelines. Next, you'll get to grips with the steps in the ETL process: extracting valuable data; performing transformations through cleaning and manipulation while ensuring data integrity; and ultimately loading the processed data into storage systems. You'll also review several ETL modules in Python, comparing their pros and cons when building data pipelines, and leverage cloud tools such as AWS to create scalable data pipelines. Lastly, you'll learn about the concept of test-driven development for ETL pipelines to ensure safe deployments. By the end of this book, you'll have worked through several hands-on examples, creating high-performance ETL pipelines and developing robust, scalable, and resilient environments using Python.
- What you will learn: Explore the available libraries and tools to create ETL pipelines using Python. Write clean and resilient ETL code in Python that can be extended and easily scaled. Understand the best practices and design principles for creating ETL pipelines. Orchestrate the ETL process and scale the ETL pipeline effectively. Discover tools and services available in AWS for ETL pipelines. Understand different testing strategies and implement them with the ETL process.
- Who this book is for: If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.
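The extract, transform, and load steps described above can be illustrated with a minimal sketch. This is not code from the book; the function names and sample data are hypothetical, and it assumes only pandas (covered in Chapter 3) and Python's built-in sqlite3 standing in for a real load destination such as PostgreSQL.

```python
# Minimal ETL sketch: extract raw records, cleanse/transform them,
# and load the result into a storage system (in-memory SQLite here).
import sqlite3

import pandas as pd


def extract() -> pd.DataFrame:
    # In practice this step would read from CSV, Parquet, an API, or a database.
    return pd.DataFrame(
        {"name": [" Alice ", "Bob", None], "amount": ["10", "20", "30"]}
    )


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Cleansing: drop incomplete rows, trim whitespace, enforce types.
    df = df.dropna(subset=["name"]).copy()
    df["name"] = df["name"].str.strip()
    df["amount"] = df["amount"].astype(int)
    return df


def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Loading: write the cleaned data to the destination table.
    df.to_sql("payments", conn, index=False, if_exists="replace")


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Keeping each stage as a separate function, as sketched here, is what lets the later chapters unit-test extraction, transformation, and loading independently.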
- Contents:
- Cover
- Title Page
- Copyright
- Dedication
- Contributors
- Table of Contents
- Preface
- Part 1: Introduction to ETL, Data Pipelines, and Design Principles
- Chapter 1: A Primer on Python and the Development Environment
- Introducing Python fundamentals
- An overview of Python data structures
- Python if…else conditions or conditional statements
- Python looping techniques
- Python functions
- Object-oriented programming with Python
- Working with files in Python
- Establishing a development environment
- Version control with Git tracking
- Documenting environment dependencies with requirements.txt
- Utilizing module management systems (MMSs)
- Configuring a Pipenv environment in PyCharm
- Summary
- Chapter 2: Understanding the ETL Process and Data Pipelines
- What is a data pipeline?
- How do we create a robust pipeline?
- Pre-work - understanding your data
- Design planning - planning your workflow
- Architecture development - developing your resources
- Putting it all together - project diagrams
- What is an ETL data pipeline?
- Batch processing
- Streaming method
- Cloud-native
- Automating ETL pipelines
- Exploring use cases for ETL pipelines
- References
- Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
- Technical requirements
- Understanding the design patterns for ETL
- Basic ETL design pattern
- ETL-P design pattern
- ETL-VP design pattern
- ELT two-phase pattern
- Preparing your local environment for installations
- Open source Python libraries for ETL pipelines
- Pandas
- NumPy
- Scaling for big data packages
- Dask
- Numba
- Part 2: Designing ETL Pipelines with Python
- Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
- What is data sourcing?
- Accessibility to data
- Types of data sources
- Getting started with data extraction
- CSV and Excel data files
- Parquet data files
- API connections
- Databases
- Data from web pages
- Creating a data extraction pipeline using Python
- Data extraction
- Logging
- Chapter 5: Data Cleansing and Transformation
- Scrubbing your data
- Data transformation
- Data cleansing and transformation in ETL pipelines
- Understanding the downstream applications of your data
- Strategies for data cleansing and transformation in Python
- Preliminary tasks - the importance of staging data
- Transformation activities in Python
- Creating data pipeline activity in Python
- Chapter 6: Loading Transformed Data
- Introduction to data loading
- Choosing the load destination
- Types of load destinations
- Best practices for data loading
- Optimizing data loading activities by controlling the data import method
- Creating demo data
- Full data loads
- Incremental data loads
- Precautions to consider
- Tutorial - preparing your local environment for data loading activities
- Downloading and installing PostgreSQL
- Creating data schemas in PostgreSQL
- Chapter 7: Tutorial - Building an End-to-End ETL Pipeline in Python
- Introducing the project
- The approach
- The data
- Creating tables in PostgreSQL
- Sourcing and extracting the data
- Transformation and data cleansing
- Loading data into PostgreSQL tables
- Making it deployable
- Chapter 8: Powerful ETL Libraries and Tools in Python
- Architecture of Python files
- Configuring your local environment
- config.ini
- config.yaml
- Part 1 - ETL tools in Python
- Bonobo
- Odo
- Mito ETL
- Riko
- pETL
- Luigi
- Part 2 - pipeline workflow management platforms in Python
- Airflow
- Part 3: Creating ETL Pipelines in AWS
- Chapter 9: A Primer on AWS Tools for ETL Processes
- Common data storage tools in AWS
- Amazon RDS
- Amazon Redshift
- Amazon S3
- Amazon EC2
- Discussion - Building flexible applications in AWS
- Leveraging S3 and EC2
- Computing and automation with AWS
- AWS Glue
- AWS Lambda
- AWS Step Functions
- AWS big data tools for ETL pipelines
- AWS Data Pipeline
- Amazon Kinesis
- Amazon EMR
- Walk-through - creating a Free Tier AWS account
- Prerequisites for running AWS from your device in AWS
- AWS CLI
- Docker
- LocalStack
- AWS SAM CLI
- Chapter 10: Tutorial - Creating an ETL Pipeline in AWS
- Creating a Python pipeline with Amazon S3, Lambda, and Step Functions
- Setting the stage with the AWS CLI
- Creating a "proof of concept" data pipeline in Python
- Using Boto3 and Amazon S3 to read data
- AWS Lambda functions
- An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS
- Configuring your AWS environment with EC2 and RDS
- Creating an RDS instance
- Creating an EC2 instance
- Creating a data pipeline locally with Bonobo
- Adding the pipeline to AWS
- Chapter 11: Building Robust Deployment Pipelines in AWS
- What is CI/CD and why is it important?
- The six key elements of CI/CD
- Essential steps for CI/CD adoption
- CI/CD is a continual process
- Creating a robust CI/CD process for ETL pipelines in AWS
- Creating a CI/CD pipeline
- Building an ETL pipeline using various AWS services
- Setting up a CodeCommit repository
- Orchestrating with AWS CodePipeline
- Testing the pipeline
- Part 4: Automating and Scaling ETL Pipelines
- Chapter 12: Orchestration and Scaling in ETL Pipelines
- Performance bottlenecks
- Inflexibility
- Limited scalability
- Operational overheads
- Exploring the types of scaling
- Vertical scaling
- Horizontal scaling
- Choose your scaling strategy
- Processing requirements
- Data volume
- Cost
- Complexity and skills
- Reliability and availability
- Data pipeline orchestration
- Task scheduling
- Error handling and recovery
- Resource management
- Monitoring and logging
- Putting it together with a practical example
- Chapter 13: Testing Strategies for ETL Pipelines
- Benefits of testing data pipeline code
- How to choose the right testing strategies for your ETL pipeline
- How often should you test your ETL pipeline?
- Creating tests for a simple ETL pipeline
- Unit testing
- Validation testing
- Integration testing
- End-to-end testing
- Performance testing
- Resilience testing
- Best practices for a testing environment for ETL pipelines
- Defining testing objectives
- Establishing a testing framework
- Automating ETL tests
- Monitoring ETL pipelines
- ETL testing challenges
- Data privacy and security
- Environment parity
- Top ETL testing tools
- Chapter 14: Best Practices for ETL Pipelines
- Data quality
- Poor scalability
- Lack of error-handling and recovery methods
- ETL logging in Python
- Debugging and issue resolution
- Auditing and compliance
- Performance monitoring
- Including contextual information
- Handling exceptions and errors
- The Goldilocks principle
- Implementing logging in Python
- Checkpoint for recovery
- Avoiding SPOFs
- Modularity and auditing
- Modularity
- Auditing
- Chapter 15: Use Cases and Further Reading
- Technical requirements
- New York Yellow Taxi data, ETL pipeline, and deployment
- Step 1 - configuration
- Step 2 - ETL pipeline script
- Step 3 - unit tests
- Building a robust ETL pipeline with US construction data in AWS
- Prerequisites
- Step 1 - data extraction
- Step 2 - data transformation
- Step 3 - data loading
- Running the ETL pipeline
- Bonus - deploying your ETL pipeline
- Further reading
- Index
- About Packt
- Other Books You May Enjoy
- Notes:
- Description based upon print version of record.
- Includes index.
- Includes bibliographical references and index.
- ISBN:
- 9781804615539
- 1804615536
- OCLC:
- 1402024983