

Apache Spark 2.x cookbook : Cloud-ready recipes to do analytics and data science on Apache Spark / Rishi Yadav.

Ebook Central College Complete Available online

Format:
Book
Author/Creator:
Yadav, Rishi, author.
Language:
English
Subjects (All):
Big data.
Data mining--Computer programs.
Data mining.
Physical Description:
1 online resource (288 pages) : illustrations
Edition:
2nd ed.
Place of Publication:
Birmingham, [England] ; Mumbai, [India] : Packt Publishing, 2017.
Biography/History:
Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects was also named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger.
This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support and trust, and the freedom they gave me to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again). Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking charge of leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals. I cannot forget to thank our dog, Sparky, for keeping me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work and, needless to say, an immensely successful organization.
Contents:
Cover
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Getting Started with Apache Spark
Introduction
Leveraging Databricks Cloud
How to do it...
How it works...
Cluster
Notebook
Table
Library
Deploying Spark using Amazon EMR
What it represents is much bigger than what it looks
EMR's architecture
EC2 instance types
T2 - Free Tier Burstable (EBS only)
M4 - General purpose (EBS only)
C4 - Compute optimized
X1 - Memory optimized
R4 - Memory optimized
P2 - General purpose GPU
I3 - Storage optimized
D2 - Storage optimized
Installing Spark from binaries
Getting ready
How to do it...
Building the Spark source code with Maven
Launching Spark on Amazon EC2
See also
Deploying Spark on a cluster in standalone mode
How it works...
Deploying Spark on a cluster with Mesos
Deploying Spark on a cluster with YARN
Understanding SparkContext and SparkSession
SparkContext
SparkSession
Understanding resilient distributed dataset - RDD
Chapter 2: Developing Applications with Spark
Exploring the Spark shell
There's more...
Developing a Spark application in Eclipse with Maven
Developing a Spark application in Eclipse with SBT
Developing a Spark application in IntelliJ IDEA with Maven
Developing a Spark application in IntelliJ IDEA with SBT
Developing applications using the Zeppelin notebook
How to do it...
Setting up Kerberos to do authentication
Enabling Kerberos authentication for Spark
Securing data at rest
Securing data in transit
Chapter 3: Spark SQL
Understanding the evolution of schema awareness
DataFrames
Datasets
Schema-aware file formats
Understanding the Catalyst optimizer
Analysis
Logical plan optimization
Physical planning
Code generation
Inferring schema using case classes
Programmatically specifying the schema
Understanding the Parquet format
Partitioning
Predicate pushdown
Parquet Hive interoperability
Loading and saving data using the JSON format
Loading and saving data from relational databases
Loading and saving data from an arbitrary source
Understanding joins
Shuffle hash join
Broadcast hash join
The cartesian join
Analyzing nested structures
Chapter 4: Working with External Data Sources
Loading data from the local filesystem
Loading data from HDFS
Loading data from Amazon S3
Loading data from Apache Cassandra
How it works...
CAP Theorem
Cassandra partitions
Consistency levels
Chapter 5: Spark Streaming
Classic Spark Streaming
Structured Streaming
WordCount using Structured Streaming
Taking a closer look at Structured Streaming
How to do it...
There's more...
Streaming Twitter data
Streaming using Kafka
Understanding streaming challenges
Late arriving/out-of-order data
Maintaining the state in between batches
Message delivery reliability
Streaming is not an island
Chapter 6: Getting Started with Machine Learning
Creating vectors
Calculating correlation
Understanding feature engineering
Feature selection
Quality of features
Number of features
Feature scaling
Feature extraction
TF-IDF
Term frequency
Inverse document frequency
Understanding Spark ML
Understanding hyperparameter tuning
Chapter 7: Supervised Learning with MLlib - Regression
Using linear regression
Understanding the cost function
Doing linear regression with lasso
Bias versus variance
Doing ridge regression
Chapter 8: Supervised Learning with MLlib - Classification
Doing classification using logistic regression
What is ROC?
Doing binary classification using SVM
Doing classification using decision trees
Doing classification using random forest
Doing classification using gradient boosted trees
Doing classification with Naïve Bayes
Chapter 9: Unsupervised Learning
Introduction
Clustering using k-means
Dimensionality reduction with principal component analysis
Dimensionality reduction with singular value decomposition
Chapter 10: Recommendations Using Collaborative Filtering
Collaborative filtering using explicit feedback
Adding my recommendations and then testing predictions
Collaborative filtering using implicit feedback
Chapter 11: Graph Processing Using GraphX and GraphFrames
Fundamental operations on graphs
Using PageRank
Finding connected components
Performing neighborhood aggregation
Understanding GraphFrames
Chapter 12: Optimizations and Performance Tuning
Optimizing memory
Garbage collection
Mark and sweep
G1
Spark memory allocation
Leveraging speculation
Optimizing joins
Using compression to improve performance
Using serialization to improve performance
Optimizing the level of parallelism
Understanding project Tungsten
Tungsten phase 1
Bypassing GC
Cache conscious computation
Code generation for expression evaluation
Tungsten phase 2
Wholesale code generation
In-memory columnar format
Index.
Notes:
Includes bibliographical references and index.
Description based on online resource; title from PDF title page (ebrary, viewed July 14, 2017).
OCLC:
992573594

