3 options

Data engineering with AWS : build and implement complex data pipelines using AWS / Gareth Eagar.

EBSCOhost Academic eBook Collection (North America) Available online

Ebook Central College Complete Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:: Book
Author/Creator:: Eagar, Gareth, author.
Language:: English
Subjects (All):: Amazon Web Services (Firm).; Cloud computing.; Big data.
Physical Description:: 1 online resource (482 pages)
Edition:: 1st edition.
Place of Publication:: Birmingham ; Mumbai : Packt Publishing, 2021.
Biography/History:: Eagar, Gareth: Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data-related topics.
Summary:: Start your AWS data engineering journey with this easy-to-follow, hands-on guide and get to grips with foundational concepts through to building data engineering pipelines using AWS. Key Features: Learn about common data architectures and modern approaches to generating value from big data; Explore AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines; Learn how to architect and implement data lakes and data lakehouses for big data analytics. Book Description: Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets, creating new value from the data in the process. Amazon Web Services (AWS) offers a range of tools to simplify a data engineer's job, making it the preferred platform for performing data engineering tasks. This book will take you through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. The book also teaches you about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently.
What you will learn: Understand data engineering concepts and emerging technologies; Ingest streaming data with Amazon Kinesis Data Firehose; Optimize, denormalize, and join datasets with AWS Glue Studio; Use Amazon S3 events to trigger a Lambda process to transform a file; Run complex SQL queries on data lake data using Amazon Athena; Load data into a Redshift data warehouse and run queries; Create a visualization of your data using Amazon QuickSight; Extract sentiment data from a dataset using Amazon Comprehend. Who this book is for: This book is for data engineers, data analysts, and data architects who are new to AWS and looki...
Contents:: Cover; Title page; Copyright and Credits; Contributors; Table of Contents; Preface; Section 1: AWS Data Engineering Concepts and Trends; Chapter 1: An Introduction to Data Engineering; Technical requirements; The rise of big data as a corporate asset; The challenges of ever-growing datasets; Data engineers - the big data enablers; Understanding the role of the data engineer; Understanding the role of the data scientist; Understanding the role of the data analyst; Understanding other common data-related roles; The benefits of the cloud when building big data analytic solutions; Hands-on - creating and accessing your AWS account; Creating a new AWS account; Accessing your AWS account; Summary; Chapter 2: Data Management Architectures for Analytics; The evolution of data management for analytics; Databases and data warehouses; Dealing with big, unstructured data; A lake on the cloud and a house on that lake; Understanding data warehouses and data marts - fountains of truth; Distributed storage and massively parallel processing; Columnar data storage and efficient data compression; Dimensional modeling in data warehouses; Understanding the role of data marts; Feeding data into the warehouse - ETL and ELT pipelines; Building data lakes to tame the variety and volume of big data; Data lake logical architecture; Bringing together the best of both worlds with the lake house architecture; Data lakehouse implementations; Building a data lakehouse on AWS; Hands-on - configuring the AWS Command Line Interface tool and creating an S3 bucket; Installing and configuring the AWS CLI; Creating a new Amazon S3 bucket; Chapter 3: The AWS Data Engineer's Toolkit; AWS services for ingesting data; Overview of Amazon Database Migration Service (DMS); Overview of Amazon Kinesis for streaming data ingestion; Overview of Amazon MSK for streaming data ingestion; Overview of Amazon AppFlow for ingesting data from SaaS services; Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols; Overview of Amazon DataSync for ingesting from on-premises storage; Overview of the AWS Snow family of devices for large data transfers; AWS services for transforming data; Overview of AWS Lambda for light transformations; Overview of AWS Glue for serverless Spark processing; Overview of Amazon EMR for Hadoop ecosystem processing; AWS services for orchestrating big data pipelines; Overview of AWS Glue workflows for orchestrating Glue components; Overview of AWS Step Functions for complex workflows; Overview of Amazon managed workflows for Apache Airflow; AWS services for consuming data; Overview of Amazon Athena for SQL queries in the data lake; Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures; Overview of Amazon QuickSight for visualizing data; Hands-on - triggering an AWS Lambda function when a new file arrives in an S3 bucket; Creating a Lambda layer containing the AWS Data Wrangler library; Creating new Amazon S3 buckets; Creating an IAM policy and role for your Lambda function; Creating a Lambda function; Configuring our Lambda function to be triggered by an S3 upload; Chapter 4: Data Cataloging, Security, and Governance; Getting data security and governance right; Common data regulatory requirements; Core data protection concepts; Personal data; Encryption; Anonymized data; Pseudonymized data/tokenization; Authentication; Authorization; Putting these concepts together; Cataloging your data to avoid the data swamp; How to avoid the data swamp; The AWS Glue/Lake Formation data catalog; AWS services for data encryption and security monitoring; AWS Key Management Service (KMS); Amazon Macie; Amazon GuardDuty; AWS services for managing identity and permissions; AWS Identity and Access Management (IAM) service; Using AWS Lake Formation to manage data lake access; Hands-on - configuring Lake Formation permissions; Creating a new user with IAM permissions; Transitioning to managing fine-grained permissions with AWS Lake Formation; Section 2: Architecting and Implementing Data Lakes and Data Lake Houses; Chapter 5: Architecting Data Engineering Pipelines; Approaching the data pipeline architecture; Architecting houses and architecting pipelines; Whiteboarding as an information-gathering tool; Conducting a whiteboarding session; Identifying data consumers and understanding their requirements; Identifying data sources and ingesting data; Identifying data transformations and optimizations; File format optimizations; Data standardization; Data quality checks; Data partitioning; Data denormalization; Data cataloging; Whiteboarding data transformation; Loading data into data marts; Wrapping up the whiteboarding session; Hands-on - architecting a sample pipeline; Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc; Chapter 6: Ingesting Batch and Streaming Data; Understanding data sources; Data variety; Data volume; Data velocity; Data veracity; Data value; Questions to ask; Ingesting data from a relational database; AWS Database Migration Service (DMS); AWS Glue; Other ways to ingest data from a database; Deciding on the best approach for ingesting from a database; Ingesting streaming data; Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK); Hands-on - ingesting data with AWS DMS; Creating a new MySQL database instance; Loading the demo data using an Amazon EC2 instance; Creating an IAM policy and role for DMS; Configuring DMS settings and performing a full load from MySQL to S3; Querying data with Amazon Athena; Hands-on - ingesting streaming data; Configuring Kinesis Data Firehose for streaming delivery to Amazon S3; Configuring Amazon Kinesis Data Generator (KDG); Adding newly ingested data to the Glue Data Catalog; Querying the data with Amazon Athena; Chapter 7: Transforming Data to Optimize for Analytics; Transformations - making raw data more valuable; Cooking, baking, and data transformations; Transformations as part of a pipeline; Types of data transformation tools; Apache Spark; Hadoop and MapReduce; SQL; GUI-based tools; Data preparation transformations; Protecting PII data; Optimizing the file format; Optimizing with data partitioning; Data cleansing; Business use case transforms; Enriching data; Pre-aggregating data; Extracting metadata from unstructured data; Working with change data capture (CDC) data; Traditional approaches - data upserts and SQL views; Modern approaches - the transactional data lake; Hands-on - joining datasets with AWS Glue Studio; Creating a new data lake zone - the curated zone; Creating a new IAM role for the Glue job; Configuring a denormalization transform using AWS Glue Studio; Finalizing the denormalization transform job to write to S3; Create a transform job to join streaming and film data using AWS Glue Studio; Summary; Chapter 8: Identifying and Enabling Data Consumers; Understanding the impact of data democratization; A growing variety of data consumers; Meeting the needs of business users with data visualization; AWS tools for business users; Meeting the needs of data analysts with structured reporting; AWS tools for data analysts; Meeting the needs of data scientists and ML models; AWS tools used by data scientists to work with data; Hands-on - creating data transformations with AWS Glue DataBrew; Configuring new datasets for AWS Glue DataBrew; Creating a new Glue DataBrew project; Building your Glue DataBrew recipe; Creating a Glue DataBrew job; Chapter 9: Loading Data into a Data Mart; Extending analytics with data warehouses/data marts; Cold data; Warm data; Hot data; What not to do - anti-patterns for a data warehouse; Using a data warehouse as a transactional datastore; Using a data warehouse as a data lake; Using data warehouses for real-time, record-level use cases; Storing unstructured data; Redshift architecture review and storage deep dive; Data distribution across slices; Redshift Zone Maps and sorting data; Designing a high-performance data warehouse; Selecting the optimal Redshift node type; Selecting the optimal table distribution style and sort key; Selecting the right data type for columns; Selecting the optimal table type; Moving data between a data lake and Redshift; Optimizing data ingestion in Redshift; Exporting data from Redshift to the data lake; Hands-on - loading data into an Amazon Redshift cluster and running queries; Uploading our sample data to Amazon S3; IAM roles for Redshift; Creating a Redshift cluster; Creating external tables for querying data in S3; Creating a schema for a local Redshift table.
Notes:: Includes index.; Description based on print version record.
ISBN:: 9781800569041; 1800569041
OCLC:: 1312155958
Publisher Number:: 9781800560413

