Data engineering with AWS : build and implement complex data pipelines using AWS / Gareth Eagar.

EBSCOhost Academic eBook Collection (North America) Available online

Ebook Central College Complete Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Eagar, Gareth, author.
Language:
English
Subjects (All):
Amazon Web Services (Firm).
Cloud computing.
Big data.
Physical Description:
1 online resource (482 pages)
Edition:
1st edition.
Place of Publication:
Birmingham ; Mumbai : Packt Publishing, 2021.
Biography/History:
Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, then working in the United Kingdom, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services and deep expertise in building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data-related topics.
Summary:
Start your AWS data engineering journey with this easy-to-follow, hands-on guide and get to grips with foundational concepts through to building data engineering pipelines using AWS.

Key Features
Learn about common data architectures and modern approaches to generating value from big data
Explore AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines
Learn how to architect and implement data lakes and data lakehouses for big data analytics

Book Description
Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets - creating new value from the data in the process. Amazon Web Services (AWS) offers a range of tools to simplify a data engineer's job, making it the preferred platform for performing data engineering tasks. This book will take you through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. The book also teaches you about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently.

What you will learn
Understand data engineering concepts and emerging technologies
Ingest streaming data with Amazon Kinesis Data Firehose
Optimize, denormalize, and join datasets with AWS Glue Studio
Use Amazon S3 events to trigger a Lambda process to transform a file
Run complex SQL queries on data lake data using Amazon Athena
Load data into a Redshift data warehouse and run queries
Create a visualization of your data using Amazon QuickSight
Extract sentiment data from a dataset using Amazon Comprehend

Who this book is for
This book is for data engineers, data analysts, and data architects who are new to AWS and looki...
Contents:
Cover
Title page
Copyright and Credits
Contributors
Table of Contents
Preface
Section 1: AWS Data Engineering Concepts and Trends
Chapter 1: An Introduction to Data Engineering
Technical requirements
The rise of big data as a corporate asset
The challenges of ever-growing datasets
Data engineers - the big data enablers
Understanding the role of the data engineer
Understanding the role of the data scientist
Understanding the role of the data analyst
Understanding other common data-related roles
The benefits of the cloud when building big data analytic solutions
Hands-on - creating and accessing your AWS account
Creating a new AWS account
Accessing your AWS account
Summary
Chapter 2: Data Management Architectures for Analytics
The evolution of data management for analytics
Databases and data warehouses
Dealing with big, unstructured data
A lake on the cloud and a house on that lake
Understanding data warehouses and data marts - fountains of truth
Distributed storage and massively parallel processing
Columnar data storage and efficient data compression
Dimensional modeling in data warehouses
Understanding the role of data marts
Feeding data into the warehouse - ETL and ELT pipelines
Building data lakes to tame the variety and volume of big data
Data lake logical architecture
Bringing together the best of both worlds with the lake house architecture
Data lakehouse implementations
Building a data lakehouse on AWS
Hands-on - configuring the AWS Command Line Interface tool and creating an S3 bucket
Installing and configuring the AWS CLI
Creating a new Amazon S3 bucket
Chapter 3: The AWS Data Engineer's Toolkit
AWS services for ingesting data
Overview of Amazon Database Migration Service (DMS)
Overview of Amazon Kinesis for streaming data ingestion
Overview of Amazon MSK for streaming data ingestion
Overview of Amazon AppFlow for ingesting data from SaaS services
Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols
Overview of Amazon DataSync for ingesting from on-premises storage
Overview of the AWS Snow family of devices for large data transfers
AWS services for transforming data
Overview of AWS Lambda for light transformations
Overview of AWS Glue for serverless Spark processing
Overview of Amazon EMR for Hadoop ecosystem processing
AWS services for orchestrating big data pipelines
Overview of AWS Glue workflows for orchestrating Glue components
Overview of AWS Step Functions for complex workflows
Overview of Amazon managed workflows for Apache Airflow
AWS services for consuming data
Overview of Amazon Athena for SQL queries in the data lake
Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
Overview of Amazon QuickSight for visualizing data
Hands-on - triggering an AWS Lambda function when a new file arrives in an S3 bucket
Creating a Lambda layer containing the AWS Data Wrangler library
Creating new Amazon S3 buckets
Creating an IAM policy and role for your Lambda function
Creating a Lambda function
Configuring our Lambda function to be triggered by an S3 upload
Chapter 4: Data Cataloging, Security, and Governance
Getting data security and governance right
Common data regulatory requirements
Core data protection concepts
Personal data
Encryption
Anonymized data
Pseudonymized data/tokenization
Authentication
Authorization
Putting these concepts together
Cataloging your data to avoid the data swamp
How to avoid the data swamp
The AWS Glue/Lake Formation data catalog
AWS services for data encryption and security monitoring
AWS Key Management Service (KMS)
Amazon Macie
Amazon GuardDuty
AWS services for managing identity and permissions
AWS Identity and Access Management (IAM) service
Using AWS Lake Formation to manage data lake access
Hands-on - configuring Lake Formation permissions
Creating a new user with IAM permissions
Transitioning to managing fine-grained permissions with AWS Lake Formation
Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
Chapter 5: Architecting Data Engineering Pipelines
Approaching the data pipeline architecture
Architecting houses and architecting pipelines
Whiteboarding as an information-gathering tool
Conducting a whiteboarding session
Identifying data consumers and understanding their requirements
Identifying data sources and ingesting data
Identifying data transformations and optimizations
File format optimizations
Data standardization
Data quality checks
Data partitioning
Data denormalization
Data cataloging
Whiteboarding data transformation
Loading data into data marts
Wrapping up the whiteboarding session
Hands-on - architecting a sample pipeline
Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
Chapter 6: Ingesting Batch and Streaming Data
Understanding data sources
Data variety
Data volume
Data velocity
Data veracity
Data value
Questions to ask
Ingesting data from a relational database
AWS Database Migration Service (DMS)
AWS Glue
Other ways to ingest data from a database
Deciding on the best approach for ingesting from a database
Ingesting streaming data
Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK)
Hands-on - ingesting data with AWS DMS
Creating a new MySQL database instance
Loading the demo data using an Amazon EC2 instance
Creating an IAM policy and role for DMS
Configuring DMS settings and performing a full load from MySQL to S3
Querying data with Amazon Athena
Hands-on - ingesting streaming data
Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
Configuring Amazon Kinesis Data Generator (KDG)
Adding newly ingested data to the Glue Data Catalog
Querying the data with Amazon Athena
Chapter 7: Transforming Data to Optimize for Analytics
Transformations - making raw data more valuable
Cooking, baking, and data transformations
Transformations as part of a pipeline
Types of data transformation tools
Apache Spark
Hadoop and MapReduce
SQL
GUI-based tools
Data preparation transformations
Protecting PII data
Optimizing the file format
Optimizing with data partitioning
Data cleansing
Business use case transforms
Enriching data
Pre-aggregating data
Extracting metadata from unstructured data
Working with change data capture (CDC) data
Traditional approaches - data upserts and SQL views
Modern approaches - the transactional data lake
Hands-on - joining datasets with AWS Glue Studio
Creating a new data lake zone - the curated zone
Creating a new IAM role for the Glue job
Configuring a denormalization transform using AWS Glue Studio
Finalizing the denormalization transform job to write to S3
Create a transform job to join streaming and film data using AWS Glue Studio
Summary
Chapter 8: Identifying and Enabling Data Consumers
Understanding the impact of data democratization
A growing variety of data consumers
Meeting the needs of business users with data visualization
AWS tools for business users
Meeting the needs of data analysts with structured reporting
AWS tools for data analysts
Meeting the needs of data scientists and ML models
AWS tools used by data scientists to work with data
Hands-on - creating data transformations with AWS Glue DataBrew
Configuring new datasets for AWS Glue DataBrew
Creating a new Glue DataBrew project
Building your Glue DataBrew recipe
Creating a Glue DataBrew job
Chapter 9: Loading Data into a Data Mart
Extending analytics with data warehouses/data marts
Cold data
Warm data
Hot data
What not to do - anti-patterns for a data warehouse
Using a data warehouse as a transactional datastore
Using a data warehouse as a data lake
Using data warehouses for real-time, record-level use cases
Storing unstructured data
Redshift architecture review and storage deep dive
Data distribution across slices
Redshift Zone Maps and sorting data
Designing a high-performance data warehouse
Selecting the optimal Redshift node type
Selecting the optimal table distribution style and sort key
Selecting the right data type for columns
Selecting the optimal table type
Moving data between a data lake and Redshift
Optimizing data ingestion in Redshift
Exporting data from Redshift to the data lake
Hands-on - loading data into an Amazon Redshift cluster and running queries
Uploading our sample data to Amazon S3
IAM roles for Redshift
Creating a Redshift cluster
Creating external tables for querying data in S3
Creating a schema for a local Redshift table
Notes:
Includes index.
Description based on print version record.
ISBN:
9781800569041
1800569041
OCLC:
1312155958
Publisher Number:
9781800560413
