Data engineering for beginners / Chisom Nwokwu.
- Format:
- Book
- Author/Creator:
- Nwokwu, Chisom, author.
- Series:
- Tech Today Series
- Language:
- English
- Subjects (All):
- Big data.
- Database management.
- Physical Description:
- 1 online resource (387 pages)
- Edition:
- 1st ed.
- Place of Publication:
- Hoboken, New Jersey : John Wiley & Sons, Incorporated, 2026.
- Summary:
- A hands-on technical and industry roadmap for aspiring data engineers. In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering.
- Contents:
- Chapter 1 Understanding Data
- A Brief History of Data
- Data in 19,000 BCE: The Great Baboon and Abacus
- Data in the 1600s: Public Health Statistics
- Data in the 1800s: The U.S. Census
- Data in the 1900s: The Concept of Storage
- Data in the 1990s: Data and the Internet
- Types of Data
- Structured Data
- Unstructured Data
- Semi-structured Data
- Why Is Data Important?
- Healthcare
- Supply Chain
- Transportation and Logistics
- Artificial Intelligence
- Data and Information
- Summary
- Notes
- Chapter 2 Introduction to Data Engineering
- Data Engineering Explained Using an Oil Refinery Analogy
- An Overview of the Data Engineering Life Cycle
- Data Storage
- Data Ingestion
- Data Transformation
- Data Serving
- Navigating Project Requirements, Engaging Stakeholders, and Delivering Business Value
- Requirements Gathering
- Understanding Stakeholders
- Understanding System Requirements
- Delivering Business Value
- The Current State of Data Engineering
- The Importance of Data Engineering
- Chapter 3 Database Fundamentals
- Key Concepts of Databases
- Rows
- Columns
- Schema
- Keys
- Types of Databases
- Relational Databases
- NoSQL Databases
- Choosing Between Relational and NoSQL Databases
- Start with Your Data's Structure
- Think About the Relationships in Your Data
- How Fast Do You Need to Move?
- How Do You Need to Query Your Data?
- Scaling and Performance
- Transaction and Strong Consistency Needs
- Chapter 4 SQL Fundamentals
- Introduction to SQL
- Basic SQL Clauses
- Comparison Operators
- LIKE Statement
- IN Statement
- BETWEEN Statement
- AND Statement
- OR Statement
- NOT Statement
- IS NULL and IS NOT NULL Statements
- Sorting and Limiting
- Aggregate Functions
- SUM()
- AVG()
- MAX() and MIN()
- GROUP BY
- HAVING
- Understanding Joins
- INNER JOIN
- LEFT JOIN
- RIGHT JOIN
- FULL OUTER JOIN
- Subqueries
- Common Table Expressions (CTEs)
- Set Operations
- Window Functions
- Lab: Setting Up SQL Server and Running SQL Queries
- Best Practices for Writing Efficient SQL Queries
- Chapter 5 Database Design
- Data Modeling
- Why Do We Need to Model Data?
- Types of Data Modeling
- Normalization
- Rules of Normalization
- Downsides of Normalization
- Denormalization
- Data Modeling Best Practices
- Define the Grain
- Normalize Now, Denormalize Later
- Choose the Right Data Types
- Proper Naming Conventions
- Database Optimization
- Indexing
- Partitioning
- Sharding
- Views
- Chapter 6 Data Warehouses, Data Lakes, and Data Lakehouses
- Data Warehouses
- Extract, Transform, and Load (ETL)
- Schema Design
- Snowflake Schema
- Slowly Changing Dimensions
- Data Marts
- Benefits of a Data Mart
- Challenges with Data Marts
- Data Lakes
- How Do Data Lakes Work?
- Challenges of Data Lakes
- Data Lakehouse
- Features of a Data Lakehouse
- Data Lakehouse Architecture
- The Key Differences Between a Database, Data Warehouse, Data Lake, and Data Lakehouse
- Chapter 7 Data Pipelines
- Batch Pipelines
- Components of a Batch Pipeline
- ETL Pipelines vs. ELT Pipelines
- Stream Pipelines
- How Would This Work?
- Components of a Streaming Data Pipeline
- Lambda Architecture
- Components of the Lambda Architecture
- Advantages of the Lambda Architecture
- Challenges and Trade-offs
- Data Orchestration
- Directed Acyclic Graphs (DAGs)
- Scheduling and Automation
- Monitoring
- Alerts
- Lab: Building an ETL Pipeline and Automating with Apache Airflow
- Requirements
- Set Up Your Development Environment
- Extracting Data from CSV
- Transforming the Data
- Load the New CSV File into a Postgres Database Instance
- Schedule ETL Pipeline with Apache Airflow
- Chapter 8 Data Quality
- Bad Data
- Dimensions of Data Quality
- Accuracy
- Completeness
- Consistency
- Validity
- Uniqueness
- Timeliness
- Accessibility
- Relevance
- Data Quality Hierarchy
- Data Quality Best Practices
- Chapter 9 Data Security
- What Is Data Security?
- Common Threats to Data Security
- Core Principles of Data Security
- Confidentiality
- Integrity
- Availability
- Data Encryption
- Symmetric Encryption
- Asymmetric Encryption
- Data Masking
- Understanding Network Security
- Access Control
- Authentication
- Authorization
- The Principle of Least Privilege
- Access Levels
- Secrets Management
- Data Security and Data Privacy
- Chapter 10 Data Governance
- How to Think About Data Governance
- Data Governance Framework
- Policies
- Regulatory Compliance Policy
- Data Classification Policy
- Data Retention and Disposal Policy
- Data Sharing Policy
- Processes
- Metadata Management
- Data Lineage
- Incident Management
- Master Data Management
- Roles in the Data Governance Framework
- Data Owner
- Data Steward
- Data Custodian
- Chief Data Officer (CDO)
- Data Management and Data Governance
- Chapter 11 Big Data and Distributed Systems
- The Five V's of Big Data
- Volume
- Velocity
- Variety
- Veracity
- Value
- Distributed Systems
- Scalability
- Fault Tolerance
- Reliability
- Concurrency
- Resource Management
- Load Balancing
- Latency
- Distributed Data Processing
- Apache Hadoop
- Big Data File Types
- Avro
- Parquet
- Optimized Row Columnar (ORC)
- Choosing the File Type
- Chapter 12 Data Engineering on the Cloud
- Cloud Computing
- On-Premises
- Cloud
- Making the Right Choice
- Core Cloud Concepts
- Storage
- Compute
- Networking
- Cloud Service Models
- Infrastructure as a Service
- Platform as a Service
- Software as a Service
- Choosing Between IaaS, PaaS, and SaaS
- A Hybrid Approach
- Cloud Management Models
- Serverless
- Managed
- Self-Managed
- Putting It All Together
- Cost Optimization
- Understanding Cloud Pricing Models
- Rightsizing Resources
- Smart Job Scheduling
- Storage Optimization
- Shutting Down Idle Resources
- Use Serverless Where Possible
- Monitoring and Alerting
- Chapter 13 Building a Career in Data Engineering
- Types of Data Engineering Roles
- Types of Data Engineers
- Platform Data Engineer
- Analytics Data Engineer
- AI/ML Data Engineers
- Landing Your First Data Engineering Role
- A Typical Data Engineering Job Description
- How to Build a Winning Résumé
- Preparing for a Data Engineering Interview
- Thinking Like a Data Engineer
- Think in Systems
- Learn to Prioritize Data Quality
- Design for Failure
- Balance Business Context with Technical Choices
- Optimize for Clarity, Then Speed
- Think Beyond the Tool
- Master Automation
- Appendix: Sample Interview Questions
- SQL
- Data Pipelines
- Apache Spark
- System Design
- Data Engineering Glossary
- Notes:
- Includes index.
- Description based on publisher supplied metadata and other sources.
- ISBN:
- 1-394-32542-8
- 1-394-35257-3
- 9781394325429
- OCLC:
- 1546814862