Data Lake for enterprises : leveraging Lambda architecture for building Enterprise Data Lake / Tomcy John, Pankaj Misra ; foreword by Thomas Benjamin.

EBSCOhost Ebook Business Collection Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
John, Tomcy, author.
Misra, Pankaj, author.
Contributor:
Benjamin, Thomas, writer of foreword.
Language:
English
Subjects (All):
Electronic data processing--Distributed processing--Management.
Electronic data processing.
Big data.
Information storage and retrieval systems.
Physical Description:
1 online resource (561 pages) : illustrations (some color)
Edition:
1st edition
Place of Publication:
Birmingham, England : Packt, 2017.
System Details:
text file
Biography/History:
Mishra Vivek: Vivek Mishra is an IT professional with more than nine years of experience in various technologies like Java, J2ee, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, REDIS, Hive, Hadoop. He has been a contributor for open source like Apache Cassandra and lead committer for Kundera (JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL Datastores like Cassandra, HBase, MongoDB and REDIS). Mr Mishra in his previous experience has enjoyed long lasting partnership with most recognizable names in SCM, Banking and finance industries, employing industry standard full software life cycle methodologies Agile and SCRUM. He is currently employed with Impetus infotech pvt. ltd. He has undertaken speaking engagements in cloud camp and Nasscom Big data seminar and is an active blogger and can be followed at mevivs.wordpress.com.
John Tomcy: Tomcy John lives in Dubai (United Arab Emirates), hailing from Kerala (India), and is an enterprise Java specialist with a degree in Engineering (B Tech) and over 14 years of experience in several industries. He's currently working as principal architect at Emirates Group IT, in their core architecture team. Prior to this, he worked with Oracle Corporation and Ernst & Young. His main specialization is in building enterprise-grade applications and he acts as chief mentor and evangelist to facilitate incorporating new technologies as corporate standards in the organization. Outside of his work, Tomcy works very closely with young developers and engineers as mentors and speaks at various forums as a technical evangelist on many topics ranging from web and middleware all the way to various persistence stores.
Misra Pankaj: Pankaj Misra has been a technology evangelist, holding a bachelor's degree in engineering, with over 16 years of experience across multiple business domains and technologies. He has been working with Emirates Group IT since 2015, and has worked with various other organizations in the past. He specializes in architecting and building multi-stack solutions and implementations. He has also been a speaker at technology forums in India and has built products with scale-out architecture that support high-volume, near-real-time data processing and near-real-time analytics.
Summary:
A practical guide to implementing your enterprise data lake using Lambda Architecture as the base.
About This Book:
Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base
Delve into the big data technologies required to meet modern day business strategies
A highly practical guide to implementing enterprise data lakes with lots of examples and real-world use-cases
Who This Book Is For:
Java developers and architects who would like to implement a data lake for their enterprise will find this book useful. If you want to get hands-on experience with the Lambda Architecture and big data technologies by implementing a practical solution using these technologies, this book will also help you.
What You Will Learn:
Build an enterprise-level data lake using the relevant big data technologies
Understand the core of the Lambda architecture and how to apply it in an enterprise
Learn the technical details around Sqoop and its functionalities
Integrate Kafka with Hadoop components to acquire enterprise data
Use Flume with streaming technologies for stream-based processing
Understand stream-based processing with reference to Apache Spark Streaming
Incorporate Hadoop components and know the advantages they provide for enterprise data lakes
Build fast, streaming, and high-performance applications using ElasticSearch
Make your data ingestion process consistent across various data formats with configurability
Process your data to derive intelligence using machine learning algorithms
In Detail:
The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects, data lake and Lambda architecture, together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces yo...
Contents:
Cover
Copyright
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Part 1 - Overview
Part 2 - Technical Building blocks of Data Lake
Part 3 - Bringing It All Together
Chapter 1: Introduction to Data
Exploring data
What is Enterprise Data?
Enterprise Data Management
Big data concepts
Big data and 4Vs
Relevance of data
Quality of data
Where does this data live in an enterprise?
Intranet (within enterprise)
Internet (external to enterprise)
Business applications hosted in cloud
Third-party cloud solutions
Social data (structured and unstructured)
Data stores or persistent stores (RDBMS or NoSQL)
Traditional data warehouse
File stores
Enterprise's current state
Enterprise digital transformation
Enterprises embarking on this journey
Some examples
Data lake use case enlightenment
Summary
Chapter 2: Comprehensive Concepts of a Data Lake
What is a Data Lake?
Relevance to enterprises
How does a Data Lake help enterprises?
Data Lake benefits
How Data Lake works?
Differences between Data Lake and Data Warehouse
Approaches to building a Data Lake
Lambda Architecture-driven Data Lake
Data ingestion layer - ingest for processing and storage
Batch layer - batch processing of ingested data
Speed layer - near real time data processing
Data storage layer - store all data
Serving layer - data delivery and exports
Data acquisition layer - get data from source systems
Messaging Layer - guaranteed data delivery
Exploring the Data Ingestion Layer
Exploring the Lambda layer
Batch layer
Speed layer
Serving layer
Data push
Data pull
Data storage layer
Batch process layer
Relational data stores
Distributed data stores
Chapter 3: Lambda Architecture as a Pattern for Data Lake
What is Lambda Architecture?
History of Lambda Architecture
Principles of Lambda Architecture
Fault-tolerant principle
Immutable Data principle
Re-computation principle
Components of a Lambda Architecture
CAP Theorem
Eventual consistency
Complete working of a Lambda Architecture
Advantages of Lambda Architecture
Disadvantages of Lambda Architectures
Technology overview for Lambda Architecture
Applied lambda
Enterprise-level log analysis
Capturing and analyzing sensor data
Real-time mailing platform statistics
Real-time sports analysis
Recommendation engines
Analyzing security threats
Multi-channel consumer behaviour
Working examples of Lambda Architecture
Kappa architecture
Chapter 4: Applied Lambda for Data Lake
Knowing Hadoop distributions
Selection factors for a big data stack for enterprises
Technical capabilities
Ease of deployment and maintenance
Integration readiness
Batch layer for data processing
The NameNode server
The secondary NameNode Server
Yet Another Resource Negotiator (YARN)
Data storage nodes (DataNode)
Flume for data acquisition
Source for event sourcing
Interceptors for event interception
Channels for event flow
Sink as an event destination
Spark Streaming
DStreams
Data Frames
Checkpointing
Apache Flink
Data repository layer
Relational databases
Big data tables/views
Data services with data indexes
NoSQL databases
Data access layer
Data exports
Data publishing
Chapter 5: Data Acquisition of Batch Data using Apache Sqoop
Context in data lake - data acquisition
Data acquisition layer
Data acquisition of batch data - technology mapping
Why Apache Sqoop
History of Sqoop
Advantages of Sqoop
Disadvantages of Sqoop
Workings of Sqoop
Sqoop 2 architecture
Sqoop 1 versus Sqoop 2
Ease of use
Ease of extension
Security
When to use Sqoop 1 and Sqoop 2
Functioning of Sqoop
Data import using Sqoop
Data export using Sqoop
Sqoop connectors
Types of Sqoop connectors
Sqoop support for HDFS
Sqoop working example
Installation and Configuration
Step 1 - Installing and verifying Java
Step 2 - Installing and verifying Hadoop
Step 3 - Installing and verifying Hue
Step 4 - Installing and verifying Sqoop
Step 5 - Installing and verifying PostgreSQL (RDBMS)
Step 6 - Installing and verifying HBase (NoSQL)
Configure data source (ingestion)
Sqoop configuration (database drivers)
Configuring HDFS as destination
Sqoop Import
Import complete database
Import selected tables
Import selected columns from a table
Import into HBase
Sqoop Export
Sqoop Job
Job command
Create job
List Job
Run Job
Create Job
Sqoop 2
Sqoop in purview of SCV use case
When to use Sqoop
When not to use Sqoop
Real-time Sqooping: a possibility?
Other options
Native big data connectors
Talend
Pentaho's Kettle (PDI - Pentaho Data Integration)
Chapter 6: Data Acquisition of Stream Data using Apache Flume
Context in Data Lake: data acquisition
What is Stream Data?
Batch and stream data
Data acquisition of stream data - technology mapping
What is Flume?
Sqoop and Flume
Why Flume?
History of Flume
Advantages of Flume
Disadvantages of Flume
Flume architecture principles
The Flume Architecture
Distributed pipeline - Flume architecture
Fan Out - Flume architecture
Fan In - Flume architecture
Three tier design - Flume architecture
Advanced Flume architecture
Flume reliability level
Flume event - Stream Data
Flume agent
Flume agent configurations
Flume source
Custom Source
Flume Channel
Custom channel
Flume sink
Custom sink
Flume configuration
Flume transaction management
Other flume components
Channel processor
Interceptor
Channel Selector
Sink Groups
Sink Processor
Event Serializers
Context Routing
Flume working example
Step 1: Installing and verifying Flume
Step 2: Configuring Flume
Step 3: Start Flume
Flume in purview of SCV use case
Kafka Installation
Example 1 - RDBMS to Kafka
Example 2: Spool messages to Kafka
Example 3: Interceptors
Example 4 - Memory channel, file channel, and Kafka channel
When to use Flume
When not to use Flume
Apache NiFi
Chapter 7: Messaging Layer using Apache Kafka
Context in Data Lake - messaging layer
Messaging layer
Messaging layer - technology mapping
What is Apache Kafka?
Why Apache Kafka
History of Kafka
Advantages of Kafka
Disadvantages of Kafka
Kafka architecture
Core architecture principles of Kafka
Data stream life cycle
Working of Kafka
Kafka message
Kafka producer
Persistence of data in Kafka using topics
Partitions - Kafka topic division
Kafka message broker
Kafka consumer
Consumer groups
Other Kafka components
Zookeeper
MirrorMaker
Kafka programming interface
Kafka core API's
Kafka REST interface
Producer and consumer reliability
Kafka security
Kafka as message-oriented middleware
Scale-out architecture with Kafka
Kafka connect
Kafka working example
Installation
Producer - putting messages into Kafka
Kafka Connect
Consumer - getting messages from Kafka
Setting up multi-broker cluster
Kafka in the purview of an SCV use case
When to use Kafka
When not to use Kafka
RabbitMQ
ZeroMQ
Apache ActiveMQ
Chapter 8: Data Processing using Apache Flink
Context in a Data Lake - Data Ingestion Layer
Data Ingestion Layer
Data Ingestion Layer - technology mapping
What is Apache Flink?
Why Apache Flink?
History of Flink
Advantages of Flink
Disadvantages of Flink
Working of Flink
Flink architecture
Client
Job Manager
Task Manager
Flink execution model
Core architecture principles of Flink
Flink Component Stack
Checkpointing in Flink
Savepoints in Flink
Streaming window options in Flink
Time window
Count window
Tumbling window configuration
Sliding window configuration
Memory management
Flink API's
DataStream API
Flink DataStream API example
Streaming connectors
DataSet API
Flink DataSet API example
Table API
Flink domain specific libraries
Gelly - Flink Graph API
FlinkML
FlinkCEP
Flink working example
Installation
Example - data processing with Flink
Data generation
Step 1 - Preparing streams
Step 2 - Consuming Streams via Flink
Step 3 - Streaming data into HDFS
Flink in purview of SCV use cases
User Log Data Generation
Flume Setup
Flink Processors
When to use Flink
When not to use Flink
Apache Spark
Apache Storm
Apache Tez
Chapter 9: Data Store Using Apache Hadoop
Context for Data Lake - Data Storage and lambda Batch layer
Data Storage and the Lambda Batch Layer
Data Storage and Lambda Batch Layer - technology mapping
What is Apache Hadoop?
Why Hadoop?
History of Hadoop
Advantages of Hadoop
Notes:
Includes index.
Description based on online resource; title from PDF title page (ebrary, viewed July 14, 2017).
OCLC:
991530196
