Hadoop : the definitive guide / Tom White.

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:: Book
Author/Creator:: White, Tom (Tom E.), author.
Language:: English
Subjects (All):: Apache Hadoop.; Electronic data processing--Distributed processing.; Electronic data processing.; File organization (Computer science).
Physical Description:: 1 online resource (xxv, 727 p.)
Edition:: 4th ed.
Place of Publication:: Sebastopol, California : O'Reilly, 2015.
Language Note:: English
System Details:: text file
Summary:: Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare
Contents:: Cover; Copyright; Table of Contents; Foreword; Preface; Administrative Notes; What's New in the Fourth Edition?; What's New in the Third Edition?; What's New in the Second Edition?; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments; Part I. Hadoop Fundamentals; Chapter 1. Meet Hadoop; Data!; Data Storage and Analysis; Querying All Your Data; Beyond Batch; Comparison with Other Systems; Relational Database Management Systems; Grid Computing; Volunteer Computing; A Brief History of Apache Hadoop; What's in This Book?; Chapter 2. MapReduce; A Weather Dataset; Data Format; Analyzing the Data with Unix Tools; Analyzing the Data with Hadoop; Map and Reduce; Java MapReduce; Scaling Out; Data Flow; Combiner Functions; Running a Distributed MapReduce Job; Hadoop Streaming; Ruby; Python; Chapter 3. The Hadoop Distributed Filesystem; The Design of HDFS; HDFS Concepts; Blocks; Namenodes and Datanodes; Block Caching; HDFS Federation; HDFS High Availability; The Command-Line Interface; Basic Filesystem Operations; Hadoop Filesystems; Interfaces; The Java Interface; Reading Data from a Hadoop URL; Reading Data Using the FileSystem API; Writing Data; Directories; Querying the Filesystem; Deleting Data; Anatomy of a File Read; Anatomy of a File Write; Coherency Model; Parallel Copying with distcp; Keeping an HDFS Cluster Balanced; Chapter 4. YARN; Anatomy of a YARN Application Run; Resource Requests; Application Lifespan; Building YARN Applications; YARN Compared to MapReduce 1; Scheduling in YARN; Scheduler Options; Capacity Scheduler Configuration; Fair Scheduler Configuration; Delay Scheduling; Dominant Resource Fairness; Further Reading;
Chapter 5. Hadoop I/O; Data Integrity; Data Integrity in HDFS; LocalFileSystem; ChecksumFileSystem; Compression; Codecs; Compression and Input Splits; Using Compression in MapReduce; Serialization; The Writable Interface; Writable Classes; Implementing a Custom Writable; Serialization Frameworks; File-Based Data Structures; SequenceFile; MapFile; Other File Formats and Column-Oriented Formats; Part II. MapReduce; Chapter 6. Developing a MapReduce Application; The Configuration API; Combining Resources; Variable Expansion; Setting Up the Development Environment; Managing Configuration; GenericOptionsParser, Tool, and ToolRunner; Writing a Unit Test with MRUnit; Mapper; Reducer; Running Locally on Test Data; Running a Job in a Local Job Runner; Testing the Driver; Running on a Cluster; Packaging a Job; Launching a Job; The MapReduce Web UI; Retrieving the Results; Debugging a Job; Hadoop Logs; Remote Debugging; Tuning a Job; Profiling Tasks; MapReduce Workflows; Decomposing a Problem into MapReduce Jobs; JobControl; Apache Oozie; Chapter 7. How MapReduce Works; Anatomy of a MapReduce Job Run; Job Submission; Job Initialization; Task Assignment; Task Execution; Progress and Status Updates; Job Completion; Failures; Task Failure; Application Master Failure; Node Manager Failure; Resource Manager Failure; Shuffle and Sort; The Map Side; The Reduce Side; Configuration Tuning; The Task Execution Environment; Speculative Execution; Output Committers; Chapter 8. MapReduce Types and Formats; MapReduce Types; The Default MapReduce Job; Input Formats; Input Splits and Records; Text Input; Binary Input; Multiple Inputs; Database Input (and Output); Output Formats; Text Output; Binary Output; Multiple Outputs; Lazy Output; Database Output;
Chapter 9. MapReduce Features; Counters; Built-in Counters; User-Defined Java Counters; User-Defined Streaming Counters; Sorting; Preparation; Partial Sort; Total Sort; Secondary Sort; Joins; Map-Side Joins; Reduce-Side Joins; Side Data Distribution; Using the Job Configuration; Distributed Cache; MapReduce Library Classes; Part III. Hadoop Operations; Chapter 10. Setting Up a Hadoop Cluster; Cluster Specification; Cluster Sizing; Network Topology; Cluster Setup and Installation; Installing Java; Creating Unix User Accounts; Installing Hadoop; Configuring SSH; Configuring Hadoop; Formatting the HDFS Filesystem; Starting and Stopping the Daemons; Creating User Directories; Hadoop Configuration; Configuration Management; Environment Settings; Important Hadoop Daemon Properties; Hadoop Daemon Addresses and Ports; Other Hadoop Properties; Security; Kerberos and Hadoop; Delegation Tokens; Other Security Enhancements; Benchmarking a Hadoop Cluster; Hadoop Benchmarks; User Jobs; Chapter 11. Administering Hadoop; HDFS; Persistent Data Structures; Safe Mode; Audit Logging; Tools; Monitoring; Logging; Metrics and JMX; Maintenance; Routine Administration Procedures; Commissioning and Decommissioning Nodes; Upgrades; Part IV. Related Projects; Chapter 12. Avro; Avro Data Types and Schemas; In-Memory Serialization and Deserialization; The Specific API; Avro Datafiles; Interoperability; Python API; Avro Tools; Schema Resolution; Sort Order; Avro MapReduce; Sorting Using Avro MapReduce; Avro in Other Languages; Chapter 13. Parquet; Data Model; Nested Encoding; Parquet File Format; Parquet Configuration; Writing and Reading Parquet Files; Avro, Protocol Buffers, and Thrift; Parquet MapReduce;
Chapter 14. Flume; Installing Flume; An Example; Transactions and Reliability; Batching; The HDFS Sink; Partitioning and Interceptors; File Formats; Fan Out; Delivery Guarantees; Replicating and Multiplexing Selectors; Distribution: Agent Tiers; Sink Groups; Integrating Flume with Applications; Component Catalog; Further Reading; Chapter 15. Sqoop; Getting Sqoop; Sqoop Connectors; A Sample Import; Text and Binary File Formats; Generated Code; Additional Serialization Systems; Imports: A Deeper Look; Controlling the Import; Imports and Consistency; Incremental Imports; Direct-Mode Imports; Working with Imported Data; Imported Data and Hive; Importing Large Objects; Performing an Export; Exports: A Deeper Look; Exports and Transactionality; Exports and SequenceFiles; Chapter 16. Pig; Installing and Running Pig; Execution Types; Running Pig Programs; Grunt; Pig Latin Editors; Generating Examples; Comparison with Databases; Pig Latin; Structure; Statements; Expressions; Types; Schemas; Functions; Macros; User-Defined Functions; A Filter UDF; An Eval UDF; A Load UDF; Data Processing Operators; Loading and Storing Data; Filtering Data; Grouping and Joining Data; Sorting Data; Combining and Splitting Data; Pig in Practice; Parallelism; Anonymous Relations; Parameter Substitution; Chapter 17. Hive; Installing Hive; The Hive Shell; Running Hive; Configuring Hive; Hive Services; The Metastore; Comparison with Traditional Databases; Schema on Read Versus Schema on Write; Updates, Transactions, and Indexes; SQL-on-Hadoop Alternatives; HiveQL; Data Types; Operators and Functions; Tables; Managed Tables and External Tables; Partitions and Buckets; Storage Formats; Importing Data; Altering Tables; Dropping Tables; Querying Data; Sorting and Aggregating; MapReduce Scripts; Subqueries; Views; Writing a UDF; Writing a UDAF;
Chapter 18. Crunch; The Core Crunch API; Primitive Operations; Sources and Targets; Materialization; Pipeline Execution; Running a Pipeline; Stopping a Pipeline; Inspecting a Crunch Plan; Iterative Algorithms; Checkpointing a Pipeline; Crunch Libraries; Chapter 19. Spark; Installing Spark; Spark Applications, Jobs, Stages, and Tasks; A Scala Standalone Application; A Java Example; A Python Example; Resilient Distributed Datasets; Creation; Transformations and Actions; Persistence; Shared Variables; Broadcast Variables; Accumulators; Anatomy of a Spark Job Run; DAG Construction; Task Scheduling; Executors and Cluster Managers; Spark on YARN; Chapter 20. HBase; HBasics; Backdrop; Concepts; Whirlwind Tour of the Data Model; Implementation; Installation; Test Drive; Clients; Java; MapReduce; REST and Thrift; Building an Online Query Application; Schema Design; Loading Data; Online Queries; HBase Versus RDBMS; Successful Service; HBase; Praxis; UI; Metrics; Chapter 21. ZooKeeper; Installing and Running ZooKeeper; Group Membership in ZooKeeper; Creating the Group; Joining a Group; Listing Members in a Group; Deleting a Group; The ZooKeeper Service; Operations; Implementation; Consistency.
Notes:: Includes index.; Description based on online resource; title from PDF title page (ebrary, viewed April 11, 2015).
ISBN:: 9781491901700; 1491901705; 9781491901687; 1491901683; 9781491901717; 1491901713; 9781491901694; 1491901691
OCLC:: 1024265037
