Learning Spark SQL : architect streaming analytics and machine learning solutions / Aurobindo Sarkar.

EBSCOhost Academic eBook Collection (North America) Available online

Ebook Central College Complete Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Sarkar, Aurobindo, author.
Language:
English
Subjects (All):
Spark (Electronic resource : Apache Software Foundation).
Data mining.
Big data.
Physical Description:
1 online resource (1 volume) : illustrations
Edition:
1st edition
Place of Publication:
Birmingham, [England] ; Mumbai, [India] : Packt Publishing, 2017.
System Details:
text file
Biography/History:
Sarkar Aurobindo: Aurobindo Sarkar leads a team of data scientists and engineers at Session AI, developing cloud-based ML models for in-session marketing in e-commerce and retail. As a former CTO at multiple SaaS startups, he has architected secure, scalable, and highly available AWS cloud applications. His research interests now focus on AWS-based large-scale transformer models for NLP and HFT models for the futures and options market. Aurobindo holds a bachelor's degree in engineering from IIT Delhi, a master's in management from the Indian Institute of Science Bangalore, and a master's in computer science from New York University.
Summary:
Design, implement, and deliver successful streaming applications, machine learning pipelines, and graph applications using the Spark SQL API.
About This Book:
Learn about the design and implementation of streaming applications, machine learning pipelines, deep learning, and large-scale graph processing applications using Spark SQL APIs and Scala.
Learn data exploration, data munging, and how to process structured and semi-structured data using real-world datasets, and gain hands-on exposure to the issues and challenges of working with noisy and "dirty" real-world data.
Understand design considerations for scalability and performance in web-scale Spark application architectures.
Who This Book Is For:
If you are a developer, engineer, or architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you. It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.
What You Will Learn:
Familiarize yourself with Spark SQL programming, including working with the DataFrame/Dataset API and SQL
Perform a series of hands-on exercises with different types of data sources, including CSV, JSON, Avro, MySQL, and MongoDB
Perform data quality checks, data visualization, and basic statistical analysis tasks
Perform data munging tasks on publicly available datasets
Learn how to use Spark SQL and Apache Kafka to build streaming applications
Learn key performance-tuning tips and tricks in Spark SQL applications
Learn key architectural components and patterns in large-scale Spark SQL applications
In Detail:
In the past year, Apache Spark has been increasingly adopted for the development of distributed applications. Spark SQL APIs provide an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Hence, understanding the design and implementation best practices before you start your project will help you avoid these problems. This book gives an insight into the engineering practices used to design and build real-world, Spark-based applications. The book's hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL. It starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala. Extensive code examples will help yo...
Contents:
Cover
Title Page
Copyright
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Getting Started with Spark SQL
What is Spark SQL?
Introducing SparkSession
Understanding Spark SQL concepts
Understanding Resilient Distributed Datasets (RDDs)
Understanding DataFrames and Datasets
Understanding the Catalyst optimizer
Understanding Catalyst optimizations
Understanding Catalyst transformations
Introducing Project Tungsten
Using Spark SQL in streaming applications
Understanding Structured Streaming internals
Summary
Chapter 2: Using Spark SQL for Processing Structured and Semistructured Data
Understanding data sources in Spark applications
Selecting Spark data sources
Using Spark with relational databases
Using Spark with MongoDB (NoSQL database)
Using Spark with JSON data
Using Spark with Avro files
Using Spark with Parquet files
Defining and using custom data sources in Spark
Chapter 3: Using Spark SQL for Data Exploration
Introducing Exploratory Data Analysis (EDA)
Using Spark SQL for basic data analysis
Identifying missing data
Computing basic statistics
Identifying data outliers
Visualizing data with Apache Zeppelin
Sampling data with Spark SQL APIs
Sampling with the DataFrame/Dataset API
Sampling with the RDD API
Using Spark SQL for creating pivot tables
Chapter 4: Using Spark SQL for Data Munging
Introducing data munging
Exploring data munging techniques
Pre-processing of the household electric consumption Dataset
Computing basic statistics and aggregations
Augmenting the Dataset
Executing other miscellaneous processing steps
Pre-processing of the weather Dataset
Analyzing missing data
Combining data using a JOIN operation
Munging textual data
Processing multiple input data files
Removing stop words
Munging time series data
time-series Dataset
Processing date fields
Persisting and loading data
Defining a date-time index
Using the TimeSeriesRDD object
Handling missing time-series data
Dealing with variable length records
Converting variable-length records to fixed-length records
Extracting data from "messy" columns
Preparing data for machine learning
Pre-processing data for machine learning
Creating and running a machine learning pipeline
Chapter 5: Using Spark SQL in Streaming Applications
Introducing streaming data applications
Building Spark streaming applications
Implementing sliding window-based functionality
Joining a streaming Dataset with a static Dataset
Using the Dataset API in Structured Streaming
Using output sinks
Using the Foreach Sink for arbitrary computations on output
Using the Memory Sink to save output to a table
Using the File Sink to save output to a partitioned table
Monitoring streaming queries
Using Kafka with Spark Structured Streaming
Introducing Kafka concepts
Introducing ZooKeeper concepts
Introducing Kafka-Spark integration
Introducing Kafka-Spark Structured Streaming
Writing a receiver for a custom data source
Chapter 6: Using Spark SQL in Machine Learning Applications
Introducing machine learning applications
Understanding Spark ML pipelines and their components
Understanding the steps in a pipeline application development process
Introducing feature engineering
Creating new features from raw data
Estimating the importance of a feature
Understanding dimensionality reduction
Deriving good features
Implementing a Spark ML classification model
Exploring the diabetes Dataset
Pre-processing the data
Building the Spark ML pipeline
Using StringIndexer for indexing categorical features and labels
Using VectorAssembler for assembling features into one column
Using a Spark ML classifier
Creating a Spark ML pipeline
Creating the training and test Datasets
Making predictions using the PipelineModel
Selecting the best model
Changing the ML algorithm in the pipeline
Introducing Spark ML tools and utilities
Using Principal Component Analysis to select features
Using encoders
Using Bucketizer
Using VectorSlicer
Using Chi-squared selector
Using a Normalizer
Retrieving our original labels
Implementing a Spark ML clustering model
Chapter 7: Using Spark SQL in Graph Applications
Introducing large-scale graph applications
Exploring graphs using GraphFrames
Constructing a GraphFrame
Basic graph queries and operations
Motif analysis using GraphFrames
Processing subgraphs
Applying graph algorithms
Saving and loading GraphFrames
Analyzing JSON input modeled as a graph
Processing graphs containing multiple types of relationships
Understanding GraphFrame internals
Viewing GraphFrame physical execution plan
Understanding partitioning in GraphFrames
Chapter 8: Using Spark SQL with SparkR
Introducing SparkR
Understanding the SparkR architecture
Understanding SparkR DataFrames
Using SparkR for EDA and data munging tasks
Reading and writing Spark DataFrames
Exploring structure and contents of Spark DataFrames
Running basic operations on Spark DataFrames
Executing SQL statements on Spark DataFrames
Merging SparkR DataFrames
Using User Defined Functions (UDFs)
Using SparkR for computing summary statistics
Using SparkR for data visualization
Visualizing data on a map
Visualizing graph nodes and edges
Using SparkR for machine learning
Chapter 9: Developing Applications with Spark SQL
Introducing Spark SQL applications
Understanding text analysis applications
Using Spark SQL for textual analysis
Preprocessing textual data
Computing readability
Using word lists
Creating data preprocessing pipelines
Understanding themes in document corpuses
Using Naive Bayes classifiers
Developing a machine learning application
Chapter 10: Using Spark SQL in Deep Learning Applications
Introducing neural networks
Understanding deep learning
Understanding representation learning
Understanding stochastic gradient descent
Introducing deep learning in Spark
Introducing CaffeOnSpark
Introducing DL4J
Introducing TensorFrames
Working with BigDL
Tuning hyperparameters of deep learning models
Introducing deep learning pipelines
Understanding Supervised learning
Understanding convolutional neural networks
Using neural networks for text classification
Using deep neural networks for language processing
Understanding Recurrent Neural Networks
Introducing autoencoders
Chapter 11: Tuning Spark SQL Components for Performance
Introducing performance tuning in Spark SQL
Understanding DataFrame/Dataset APIs
Optimizing data serialization
Understanding the Dataset/DataFrame API
Visualizing Spark application execution
Exploring Spark application execution metrics
Using external tools for performance tuning
Cost-based optimizer in Apache Spark 2.2
Understanding the CBO statistics collection
Statistics collection functions
Filter operator
Join operator
Build side selection
Understanding multi-way JOIN ordering optimization
Understanding performance improvements using whole-stage code generation
Chapter 12: Spark SQL in Large-Scale Application Architectures
Understanding Spark-based application architectures
Using Apache Spark for batch processing
Using Apache Spark for stream processing
Understanding the Lambda architecture
Understanding the Kappa Architecture
Design considerations for building scalable stream processing applications
Building robust ETL pipelines using Spark SQL
Choosing appropriate data formats
Transforming data in ETL pipelines
Addressing errors in ETL pipelines
Implementing a scalable monitoring solution
Deploying Spark machine learning pipelines
Understanding the challenges in typical ML deployment environments
Understanding types of model scoring architectures
Using cluster managers
Index.
Notes:
Includes index.
Description based on online resource; title from PDF title page (ebrary, viewed October 12, 2017).
OCLC:
1005351391
