Apache Spark 2.x cookbook : Cloud-ready recipes to do analytics and data science on Apache Spark / Rishi Yadav.
- Format:
- Book
- Author/Creator:
- Yadav, Rishi, author.
- Language:
- English
- Subjects (All):
- Big data.
- Data mining--Computer programs.
- Data mining.
- Physical Description:
- 1 online resource (288 pages) : illustrations
- Edition:
- 2nd ed.
- Place of Publication:
- Birmingham, [England] ; Mumbai, [India] : Packt Publishing, 2017.
- Biography/History:
Rishi Yadav: Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects was also named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger. This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support and trust, and the freedom they gave me to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again). Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking charge of leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals. I cannot miss thanking our dog, Sparky, for giving me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work and, needless to say, an immensely successful organization.
- Contents:
- Cover
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Getting Started with Apache Spark
- Introduction
- Leveraging Databricks Cloud
- How to do it...
- How it works...
- Cluster
- Notebook
- Table
- Library
- Deploying Spark using Amazon EMR
- What it represents is much bigger than what it looks
- EMR's architecture
- EC2 instance types
- T2 - Free Tier Burstable (EBS only)
- M4 - General purpose (EBS only)
- C4 - Compute optimized
- X1 - Memory optimized
- R4 - Memory optimized
- P2 - General purpose GPU
- I3 - Storage optimized
- D2 - Storage optimized
- Installing Spark from binaries
- Getting ready
- How to do it...
- Building the Spark source code with Maven
- Launching Spark on Amazon EC2
- See also
- Deploying Spark on a cluster in standalone mode
- How it works...
- Deploying Spark on a cluster with Mesos
- Deploying Spark on a cluster with YARN
- Understanding SparkContext and SparkSession
- SparkContext
- SparkSession
- Understanding resilient distributed dataset - RDD
- Chapter 2: Developing Applications with Spark
- Exploring the Spark shell
- There's more...
- Developing a Spark application in Eclipse with Maven
- Developing a Spark application in Eclipse with SBT
- Developing a Spark application in IntelliJ IDEA with Maven
- Developing a Spark application in IntelliJ IDEA with SBT
- Developing applications using the Zeppelin notebook
- How to do it...
- Setting up Kerberos to do authentication
- Enabling Kerberos authentication for Spark
- Securing data at rest
- Securing data in transit
- Chapter 3: Spark SQL
- Understanding the evolution of schema awareness
- DataFrames
- Datasets
- Schema-aware file formats
- Understanding the Catalyst optimizer
- Analysis
- Logical plan optimization
- Physical planning
- Code generation
- Inferring schema using case classes
- Programmatically specifying the schema
- Understanding the Parquet format
- Partitioning
- Predicate pushdown
- Parquet Hive interoperability
- Loading and saving data using the JSON format
- Loading and saving data from relational databases
- Loading and saving data from an arbitrary source
- Understanding joins
- Shuffle hash join
- Broadcast hash join
- The cartesian join
- Analyzing nested structures
- Chapter 4: Working with External Data Sources
- Loading data from the local filesystem
- Loading data from HDFS
- Loading data from Amazon S3
- Loading data from Apache Cassandra
- How it works
- CAP Theorem
- Cassandra partitions
- Consistency levels
- Chapter 5: Spark Streaming
- Classic Spark Streaming
- Structured Streaming
- WordCount using Structured Streaming
- Taking a closer look at Structured Streaming
- How to do it...
- There's more...
- Streaming Twitter data
- Streaming using Kafka
- Understanding streaming challenges
- Late arriving/out-of-order data
- Maintaining the state in between batches
- Message delivery reliability
- Streaming is not an island
- Chapter 6: Getting Started with Machine Learning
- Creating vectors
- Calculating correlation
- Understanding feature engineering
- Feature selection
- Quality of features
- Number of features
- Feature scaling
- Feature extraction
- TF-IDF
- Term frequency
- Inverse document frequency
- Understanding Spark ML
- Understanding hyperparameter tuning
- Chapter 7: Supervised Learning with MLlib - Regression
- Using linear regression
- Understanding the cost function
- Doing linear regression with lasso
- Bias versus variance
- Doing ridge regression
- Chapter 8: Supervised Learning with MLlib - Classification
- Doing classification using logistic regression
- What is ROC?
- Doing binary classification using SVM
- Doing classification using decision trees
- Doing classification using random forest
- Doing classification using gradient boosted trees
- Doing classification with Naïve Bayes
- Chapter 9: Unsupervised Learning
- Introduction
- Clustering using k-means
- Dimensionality reduction with principal component analysis
- Dimensionality reduction with singular value decomposition
- Chapter 10: Recommendations Using Collaborative Filtering
- Collaborative filtering using explicit feedback
- Adding my recommendations and then testing predictions
- Collaborative filtering using implicit feedback
- Chapter 11: Graph Processing Using GraphX and GraphFrames
- Fundamental operations on graphs
- Using PageRank
- Finding connected components
- Performing neighborhood aggregation
- Understanding GraphFrames
- Chapter 12: Optimizations and Performance Tuning
- Optimizing memory
- Garbage collection
- Mark and sweep
- G1
- Spark memory allocation
- Leveraging speculation
- Optimizing joins
- Using compression to improve performance
- Using serialization to improve performance
- Optimizing the level of parallelism
- Understanding project Tungsten
- Tungsten phase 1
- Bypassing GC
- Cache conscious computation
- Code generation for expression evaluation
- Tungsten phase 2
- Wholesale code generation
- In-memory columnar format
- Index
- Notes:
- Includes bibliographical references and index.
- Description based on online resource; title from PDF title page (ebrary, viewed July 14, 2017).
- OCLC:
- 992573594