Hadoop : the definitive guide / Tom White.

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
White, Tom (Tom E.), author.
Language:
English
Subjects (All):
Apache Hadoop.
Electronic data processing--Distributed processing.
Electronic data processing.
File organization (Computer science).
Physical Description:
1 online resource (xxv, 727 p.)
Edition:
4th ed.
Place of Publication:
Sebastopol, California : O'Reilly, 2015.
Language Note:
English
System Details:
text file
Summary:
Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare
Contents:
Cover
Copyright
Table of Contents
Foreword
Preface
Administrative Notes
What's New in the Fourth Edition?
What's New in the Third Edition?
What's New in the Second Edition?
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Part I. Hadoop Fundamentals
Chapter 1. Meet Hadoop
Data!
Data Storage and Analysis
Querying All Your Data
Beyond Batch
Comparison with Other Systems
Relational Database Management Systems
Grid Computing
Volunteer Computing
A Brief History of Apache Hadoop
What's in This Book?
Chapter 2. MapReduce
A Weather Dataset
Data Format
Analyzing the Data with Unix Tools
Analyzing the Data with Hadoop
Map and Reduce
Java MapReduce
Scaling Out
Data Flow
Combiner Functions
Running a Distributed MapReduce Job
Hadoop Streaming
Ruby
Python
Chapter 3. The Hadoop Distributed Filesystem
The Design of HDFS
HDFS Concepts
Blocks
Namenodes and Datanodes
Block Caching
HDFS Federation
HDFS High Availability
The Command-Line Interface
Basic Filesystem Operations
Hadoop Filesystems
Interfaces
The Java Interface
Reading Data from a Hadoop URL
Reading Data Using the FileSystem API
Writing Data
Directories
Querying the Filesystem
Deleting Data
Anatomy of a File Read
Anatomy of a File Write
Coherency Model
Parallel Copying with distcp
Keeping an HDFS Cluster Balanced
Chapter 4. YARN
Anatomy of a YARN Application Run
Resource Requests
Application Lifespan
Building YARN Applications
YARN Compared to MapReduce 1
Scheduling in YARN
Scheduler Options
Capacity Scheduler Configuration
Fair Scheduler Configuration
Delay Scheduling
Dominant Resource Fairness
Further Reading
Chapter 5. Hadoop I/O
Data Integrity
Data Integrity in HDFS
LocalFileSystem
ChecksumFileSystem
Compression
Codecs
Compression and Input Splits
Using Compression in MapReduce
Serialization
The Writable Interface
Writable Classes
Implementing a Custom Writable
Serialization Frameworks
File-Based Data Structures
SequenceFile
MapFile
Other File Formats and Column-Oriented Formats
Part II. MapReduce
Chapter 6. Developing a MapReduce Application
The Configuration API
Combining Resources
Variable Expansion
Setting Up the Development Environment
Managing Configuration
GenericOptionsParser, Tool, and ToolRunner
Writing a Unit Test with MRUnit
Mapper
Reducer
Running Locally on Test Data
Running a Job in a Local Job Runner
Testing the Driver
Running on a Cluster
Packaging a Job
Launching a Job
The MapReduce Web UI
Retrieving the Results
Debugging a Job
Hadoop Logs
Remote Debugging
Tuning a Job
Profiling Tasks
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs
JobControl
Apache Oozie
Chapter 7. How MapReduce Works
Anatomy of a MapReduce Job Run
Job Submission
Job Initialization
Task Assignment
Task Execution
Progress and Status Updates
Job Completion
Failures
Task Failure
Application Master Failure
Node Manager Failure
Resource Manager Failure
Shuffle and Sort
The Map Side
The Reduce Side
Configuration Tuning
The Task Execution Environment
Speculative Execution
Output Committers
Chapter 8. MapReduce Types and Formats
MapReduce Types
The Default MapReduce Job
Input Formats
Input Splits and Records
Text Input
Binary Input
Multiple Inputs
Database Input (and Output)
Output Formats
Text Output
Binary Output
Multiple Outputs
Lazy Output
Database Output
Chapter 9. MapReduce Features
Counters
Built-in Counters
User-Defined Java Counters
User-Defined Streaming Counters
Sorting
Preparation
Partial Sort
Total Sort
Secondary Sort
Joins
Map-Side Joins
Reduce-Side Joins
Side Data Distribution
Using the Job Configuration
Distributed Cache
MapReduce Library Classes
Part III. Hadoop Operations
Chapter 10. Setting Up a Hadoop Cluster
Cluster Specification
Cluster Sizing
Network Topology
Cluster Setup and Installation
Installing Java
Creating Unix User Accounts
Installing Hadoop
Configuring SSH
Configuring Hadoop
Formatting the HDFS Filesystem
Starting and Stopping the Daemons
Creating User Directories
Hadoop Configuration
Configuration Management
Environment Settings
Important Hadoop Daemon Properties
Hadoop Daemon Addresses and Ports
Other Hadoop Properties
Security
Kerberos and Hadoop
Delegation Tokens
Other Security Enhancements
Benchmarking a Hadoop Cluster
Hadoop Benchmarks
User Jobs
Chapter 11. Administering Hadoop
HDFS
Persistent Data Structures
Safe Mode
Audit Logging
Tools
Monitoring
Logging
Metrics and JMX
Maintenance
Routine Administration Procedures
Commissioning and Decommissioning Nodes
Upgrades
Part IV. Related Projects
Chapter 12. Avro
Avro Data Types and Schemas
In-Memory Serialization and Deserialization
The Specific API
Avro Datafiles
Interoperability
Python API
Avro Tools
Schema Resolution
Sort Order
Avro MapReduce
Sorting Using Avro MapReduce
Avro in Other Languages
Chapter 13. Parquet
Data Model
Nested Encoding
Parquet File Format
Parquet Configuration
Writing and Reading Parquet Files
Avro, Protocol Buffers, and Thrift
Parquet MapReduce
Chapter 14. Flume
Installing Flume
An Example
Transactions and Reliability
Batching
The HDFS Sink
Partitioning and Interceptors
File Formats
Fan Out
Delivery Guarantees
Replicating and Multiplexing Selectors
Distribution: Agent Tiers
Sink Groups
Integrating Flume with Applications
Component Catalog
Further Reading
Chapter 15. Sqoop
Getting Sqoop
Sqoop Connectors
A Sample Import
Text and Binary File Formats
Generated Code
Additional Serialization Systems
Imports: A Deeper Look
Controlling the Import
Imports and Consistency
Incremental Imports
Direct-Mode Imports
Working with Imported Data
Imported Data and Hive
Importing Large Objects
Performing an Export
Exports: A Deeper Look
Exports and Transactionality
Exports and SequenceFiles
Chapter 16. Pig
Installing and Running Pig
Execution Types
Running Pig Programs
Grunt
Pig Latin Editors
Generating Examples
Comparison with Databases
Pig Latin
Structure
Statements
Expressions
Types
Schemas
Functions
Macros
User-Defined Functions
A Filter UDF
An Eval UDF
A Load UDF
Data Processing Operators
Loading and Storing Data
Filtering Data
Grouping and Joining Data
Sorting Data
Combining and Splitting Data
Pig in Practice
Parallelism
Anonymous Relations
Parameter Substitution
Chapter 17. Hive
Installing Hive
The Hive Shell
Running Hive
Configuring Hive
Hive Services
The Metastore
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
Updates, Transactions, and Indexes
SQL-on-Hadoop Alternatives
HiveQL
Data Types
Operators and Functions
Tables
Managed Tables and External Tables
Partitions and Buckets
Storage Formats
Importing Data
Altering Tables
Dropping Tables
Querying Data
Sorting and Aggregating
MapReduce Scripts
Subqueries
Views
Writing a UDF
Writing a UDAF
Chapter 18. Crunch
The Core Crunch API
Primitive Operations
Sources and Targets
Materialization
Pipeline Execution
Running a Pipeline
Stopping a Pipeline
Inspecting a Crunch Plan
Iterative Algorithms
Checkpointing a Pipeline
Crunch Libraries
Chapter 19. Spark
Installing Spark
Spark Applications, Jobs, Stages, and Tasks
A Scala Standalone Application
A Java Example
A Python Example
Resilient Distributed Datasets
Creation
Transformations and Actions
Persistence
Shared Variables
Broadcast Variables
Accumulators
Anatomy of a Spark Job Run
DAG Construction
Task Scheduling
Executors and Cluster Managers
Spark on YARN
Chapter 20. HBase
HBasics
Backdrop
Concepts
Whirlwind Tour of the Data Model
Implementation
Installation
Test Drive
Clients
Java
MapReduce
REST and Thrift
Building an Online Query Application
Schema Design
Loading Data
Online Queries
HBase Versus RDBMS
Successful Service
HBase
Praxis
UI
Metrics
Chapter 21. ZooKeeper
Installing and Running ZooKeeper
Group Membership in ZooKeeper
Creating the Group
Joining a Group
Listing Members in a Group
Deleting a Group
The ZooKeeper Service
Operations
Implementation
Consistency
Notes:
Includes index.
Description based on online resource; title from PDF title page (ebrary, viewed April 11, 2015).
ISBN:
9781491901700
1491901705
9781491901687
1491901683
9781491901717
1491901713
9781491901694
1491901691
OCLC:
1024265037
