Data Lake for enterprises : leveraging Lambda architecture for building Enterprise Data Lake / Tomcy John, Pankaj Misra ; foreword by Thomas Benjamin.
- Format:
- Book
- Author/Creator:
- John, Tomcy, author.
- Misra, Pankaj, author.
- Language:
- English
- Subjects (All):
- Electronic data processing--Distributed processing--Management.
- Electronic data processing.
- Big data.
- Information storage and retrieval systems.
- Physical Description:
- 1 online resource (561 pages) : illustrations (some color)
- Edition:
- 1st edition
- Place of Publication:
- Birmingham, England : Packt, 2017.
- System Details:
- text file
- Biography/History:
- Mishra Vivek: Vivek Mishra is an IT professional with more than nine years of experience in various technologies such as Java, J2EE, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, REDIS, Hive, and Hadoop. He has been a contributor to open source projects such as Apache Cassandra and is lead committer for Kundera (a JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL datastores such as Cassandra, HBase, MongoDB, and REDIS). In his previous roles, Mr. Mishra enjoyed long-lasting partnerships with the most recognizable names in the SCM, banking, and finance industries, employing industry-standard full software life cycle methodologies such as Agile and SCRUM. He is currently employed with Impetus Infotech Pvt. Ltd. He has undertaken speaking engagements at cloud camp and the Nasscom Big data seminar, is an active blogger, and can be followed at mevivs.wordpress.com.
- John Tomcy: Tomcy John lives in Dubai (United Arab Emirates), hails from Kerala (India), and is an enterprise Java specialist with a degree in Engineering (B Tech) and over 14 years of experience in several industries. He's currently working as principal architect at Emirates Group IT, in their core architecture team. Prior to this, he worked with Oracle Corporation and Ernst & Young. His main specialization is in building enterprise-grade applications, and he acts as chief mentor and evangelist to facilitate incorporating new technologies as corporate standards in the organization. Outside of his work, Tomcy works very closely with young developers and engineers as a mentor and speaks at various forums as a technical evangelist on many topics ranging from web and middleware all the way to various persistence stores.
- Misra Pankaj: Pankaj Misra has been a technology evangelist, holding a bachelor's degree in engineering, with over 16 years of experience across multiple business domains and technologies. He has been working with Emirates Group IT since 2015, and has worked with various other organizations in the past. He specializes in architecting and building multi-stack solutions and implementations. He has also been a speaker at technology forums in India and has built products with scale-out architecture that support high-volume, near-real-time data processing and near-real-time analytics.
- Summary:
- A practical guide to implementing your enterprise data lake using Lambda Architecture as the base
- About This Book: Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base; delve into the big data technologies required to meet modern-day business strategies; a highly practical guide to implementing enterprise data lakes with lots of examples and real-world use cases.
- Who This Book Is For: Java developers and architects who would like to implement a data lake for their enterprise will find this book useful. If you want to get hands-on experience with the Lambda Architecture and big data technologies by implementing a practical solution using these technologies, this book will also help you.
- What You Will Learn: Build an enterprise-level data lake using the relevant big data technologies; understand the core of the Lambda architecture and how to apply it in an enterprise; learn the technical details around Sqoop and its functionalities; integrate Kafka with Hadoop components to acquire enterprise data; use Flume with streaming technologies for stream-based processing; understand stream-based processing with reference to Apache Spark Streaming; incorporate Hadoop components and know the advantages they provide for enterprise data lakes; build fast, streaming, and high-performance applications using ElasticSearch; make your data ingestion process consistent across various data formats with configurability; process your data to derive intelligence using machine learning algorithms.
- In Detail: The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can use it to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as one of the most prominent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable businesses to make critical decisions. This book tries to bring these two important aspects, data lake and Lambda architecture, together. This book is divided into three main sections. The first introduces you to the concept of data lakes and their importance in enterprises, and gets you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces yo...
- Contents:
- Cover
- Copyright
- Credits
- Foreword
- About the Authors
- About the Reviewers
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Part 1 - Overview
- Part 2 - Technical Building blocks of Data Lake
- Part 3 - Bringing It All Together
- Chapter 1: Introduction to Data
- Exploring data
- What is Enterprise Data?
- Enterprise Data Management
- Big data concepts
- Big data and 4Vs
- Relevance of data
- Quality of data
- Where does this data live in an enterprise?
- Intranet (within enterprise)
- Internet (external to enterprise)
- Business applications hosted in cloud
- Third-party cloud solutions
- Social data (structured and unstructured)
- Data stores or persistent stores (RDBMS or NoSQL)
- Traditional data warehouse
- File stores
- Enterprise's current state
- Enterprise digital transformation
- Enterprises embarking on this journey
- Some examples
- Data lake use case enlightenment
- Summary
- Chapter 2: Comprehensive Concepts of a Data Lake
- What is a Data Lake?
- Relevance to enterprises
- How does a Data Lake help enterprises?
- Data Lake benefits
- How Data Lake works?
- Differences between Data Lake and Data Warehouse
- Approaches to building a Data Lake
- Lambda Architecture-driven Data Lake
- Data ingestion layer - ingest for processing and storage
- Batch layer - batch processing of ingested data
- Speed layer - near real time data processing
- Data storage layer - store all data
- Serving layer - data delivery and exports
- Data acquisition layer - get data from source systems
- Messaging Layer - guaranteed data delivery
- Exploring the Data Ingestion Layer
- Exploring the Lambda layer
- Batch layer
- Speed layer
- Serving layer
- Data push
- Data pull
- Data storage layer
- Batch process layer
- Relational data stores
- Distributed data stores
- Chapter 3: Lambda Architecture as a Pattern for Data Lake
- What is Lambda Architecture?
- History of Lambda Architecture
- Principles of Lambda Architecture
- Fault-tolerant principle
- Immutable Data principle
- Re-computation principle
- Components of a Lambda Architecture
- CAP Theorem
- Eventual consistency
- Complete working of a Lambda Architecture
- Advantages of Lambda Architecture
- Disadvantages of Lambda Architectures
- Technology overview for Lambda Architecture
- Applied lambda
- Enterprise-level log analysis
- Capturing and analyzing sensor data
- Real-time mailing platform statistics
- Real-time sports analysis
- Recommendation engines
- Analyzing security threats
- Multi-channel consumer behaviour
- Working examples of Lambda Architecture
- Kappa architecture
- Chapter 4: Applied Lambda for Data Lake
- Knowing Hadoop distributions
- Selection factors for a big data stack for enterprises
- Technical capabilities
- Ease of deployment and maintenance
- Integration readiness
- Batch layer for data processing
- The NameNode server
- The secondary NameNode Server
- Yet Another Resource Negotiator (YARN)
- Data storage nodes (DataNode)
- Flume for data acquisition
- Source for event sourcing
- Interceptors for event interception
- Channels for event flow
- Sink as an event destination
- Spark Streaming
- DStreams
- Data Frames
- Checkpointing
- Apache Flink
- Data repository layer
- Relational databases
- Big data tables/views
- Data services with data indexes
- NoSQL databases
- Data access layer
- Data exports
- Data publishing
- Chapter 5: Data Acquisition of Batch Data using Apache Sqoop
- Context in data lake - data acquisition
- Data acquisition layer
- Data acquisition of batch data - technology mapping
- Why Apache Sqoop
- History of Sqoop
- Advantages of Sqoop
- Disadvantages of Sqoop
- Workings of Sqoop
- Sqoop 2 architecture
- Sqoop 1 versus Sqoop 2
- Ease of use
- Ease of extension
- Security
- When to use Sqoop 1 and Sqoop 2
- Functioning of Sqoop
- Data import using Sqoop
- Data export using Sqoop
- Sqoop connectors
- Types of Sqoop connectors
- Sqoop support for HDFS
- Sqoop working example
- Installation and Configuration
- Step 1 - Installing and verifying Java
- Step 2 - Installing and verifying Hadoop
- Step 3 - Installing and verifying Hue
- Step 4 - Installing and verifying Sqoop
- Step 5 - Installing and verifying PostgreSQL (RDBMS)
- Step 6 - Installing and verifying HBase (NoSQL)
- Configure data source (ingestion)
- Sqoop configuration (database drivers)
- Configuring HDFS as destination
- Sqoop Import
- Import complete database
- Import selected tables
- Import selected columns from a table
- Import into HBase
- Sqoop Export
- Sqoop Job
- Job command
- Create job
- List Job
- Run Job
- Create Job
- Sqoop 2
- Sqoop in purview of SCV use case
- When to use Sqoop
- When not to use Sqoop
- Real-time Sqooping: a possibility?
- Other options
- Native big data connectors
- Talend
- Pentaho's Kettle (PDI - Pentaho Data Integration)
- Chapter 6: Data Acquisition of Stream Data using Apache Flume
- Context in Data Lake: data acquisition
- What is Stream Data?
- Batch and stream data
- Data acquisition of stream data - technology mapping
- What is Flume?
- Sqoop and Flume
- Why Flume?
- History of Flume
- Advantages of Flume
- Disadvantages of Flume
- Flume architecture principles
- The Flume Architecture
- Distributed pipeline - Flume architecture
- Fan Out - Flume architecture
- Fan In - Flume architecture
- Three tier design - Flume architecture
- Advanced Flume architecture
- Flume reliability level
- Flume event - Stream Data
- Flume agent
- Flume agent configurations
- Flume source
- Custom Source
- Flume Channel
- Custom channel
- Flume sink
- Custom sink
- Flume configuration
- Flume transaction management
- Other flume components
- Channel processor
- Interceptor
- Channel Selector
- Sink Groups
- Sink Processor
- Event Serializers
- Context Routing
- Flume working example
- Step 1: Installing and verifying Flume
- Step 2: Configuring Flume
- Step 3: Start Flume
- Flume in purview of SCV use case
- Kafka Installation
- Example 1 - RDBMS to Kafka
- Example 2: Spool messages to Kafka
- Example 3: Interceptors
- Example 4 - Memory channel, file channel, and Kafka channel
- When to use Flume
- When not to use Flume
- Apache NiFi
- Chapter 7: Messaging Layer using Apache Kafka
- Context in Data Lake - messaging layer
- Messaging layer
- Messaging layer - technology mapping
- What is Apache Kafka?
- Why Apache Kafka
- History of Kafka
- Advantages of Kafka
- Disadvantages of Kafka
- Kafka architecture
- Core architecture principles of Kafka
- Data stream life cycle
- Working of Kafka
- Kafka message
- Kafka producer
- Persistence of data in Kafka using topics
- Partitions - Kafka topic division
- Kafka message broker
- Kafka consumer
- Consumer groups
- Other Kafka components
- Zookeeper
- MirrorMaker
- Kafka programming interface
- Kafka core API's
- Kafka REST interface
- Producer and consumer reliability
- Kafka security
- Kafka as message-oriented middleware
- Scale-out architecture with Kafka
- Kafka connect
- Kafka working example
- Installation
- Producer - putting messages into Kafka
- Kafka Connect
- Consumer - getting messages from Kafka
- Setting up multi-broker cluster
- Kafka in the purview of an SCV use case
- When to use Kafka
- When not to use Kafka
- RabbitMQ
- ZeroMQ
- Apache ActiveMQ
- Chapter 8: Data Processing using Apache Flink
- Context in a Data Lake - Data Ingestion Layer
- Data Ingestion Layer
- Data Ingestion Layer - technology mapping
- What is Apache Flink?
- Why Apache Flink?
- History of Flink
- Advantages of Flink
- Disadvantages of Flink
- Working of Flink
- Flink architecture
- Client
- Job Manager
- Task Manager
- Flink execution model
- Core architecture principles of Flink
- Flink Component Stack
- Checkpointing in Flink
- Savepoints in Flink
- Streaming window options in Flink
- Time window
- Count window
- Tumbling window configuration
- Sliding window configuration
- Memory management
- Flink API's
- DataStream API
- Flink DataStream API example
- Streaming connectors
- DataSet API
- Flink DataSet API example
- Table API
- Flink domain specific libraries
- Gelly - Flink Graph API
- FlinkML
- FlinkCEP
- Flink working example
- Installation
- Example - data processing with Flink
- Data generation
- Step 1 - Preparing streams
- Step 2 - Consuming Streams via Flink
- Step 3 - Streaming data into HDFS
- Flink in purview of SCV use cases
- User Log Data Generation
- Flume Setup
- Flink Processors
- When to use Flink
- When not to use Flink
- Apache Spark
- Apache Storm
- Apache Tez
- Chapter 9: Data Store Using Apache Hadoop
- Context for Data Lake - Data Storage and lambda Batch layer
- Data Storage and the Lambda Batch Layer
- Data Storage and Lambda Batch Layer - technology mapping
- What is Apache Hadoop?
- Why Hadoop?
- History of Hadoop
- Advantages of Hadoop
- Notes:
- Includes index.
- Description based on online resource; title from PDF title page (ebrary, viewed July 14, 2017).
- OCLC:
- 991530196