Modern Scala projects : leverage the power of Scala for building data-driven and high-performant projects / Ilango Gurusamy.

EBSCOhost Academic eBook Collection (North America) Available online

EBSCOhost Ebook Business Collection Available online

Ebook Central Academic Complete Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:
Book
Author/Creator:
Gurusamy, Ilango, author.
Language:
English
Subjects (All):
Scala (Computer program language).
Machine learning.
Electronic data processing.
Physical Description:
1 online resource (334 pages)
Edition:
First edition
Place of Publication:
Birmingham : Packt, 2018.
System Details:
text file
Summary:
Develop robust, Scala-powered projects with the help of machine learning libraries such as SparkML to harvest meaningful insight.
Key Features:
Gain hands-on experience in building data science projects with Scala
Exploit powerful functionalities of machine learning libraries
Use machine learning algorithms and decision tree models for enterprise apps
Book Description:
Scala, together with the Spark Framework, forms a rich and powerful data processing ecosystem. Modern Scala Projects is a journey into the depths of this ecosystem. The machine learning (ML) projects presented in this book enable you to create practical, robust data analytics solutions, with an emphasis on automating data workflows with the Spark ML pipeline API. This book showcases or carefully cherry-picks from Scala's functional libraries and other constructs to help readers roll out their own scalable data processing frameworks. The projects in this book enable data practitioners across all industries to gain insights into data that will help organizations gain a strategic and competitive advantage. Modern Scala Projects focuses on the application of supervised learning ML techniques that classify data and make predictions. You'll begin by working on a project to predict a class of flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine. By the end of this book, you will be able to build efficient data science projects that fulfil your software requirements.
What you will learn:
Create pipelines to extract data or analytics and visualizations
Automate your process pipeline with jobs that are reproducible
Extract intelligent data efficiently from large, disparate datasets
Automate the extraction, transformation, and loading of data
Develop tools that collate, model, and analyze data
Maintain the integrity of data as data flows become more complex
Develop tools that predict outcomes based on "pattern discovery"
Build really fast and accurate machine-learning models in Scala
Who this book is for:
Modern Scala Projects is for Scala developers who would like to gain some hands-on experience with some interesting real-world projects. Prior programming experience with Scala is necessary.
Contents:
Cover
Title Page
Copyright and Credits
Packt Upsell
Contributors
Table of Contents
Preface
Chapter 1: Predict the Class of a Flower from the Iris Dataset
A multivariate classification problem
Understanding multivariate
Different kinds of variables
Categorical variables
Fisher's Iris dataset
The Iris dataset represents a multiclass, multidimensional classification task
The training dataset
The mapping function
An algorithm and its mapping function
Supervised learning - how it relates to the Iris classification task
Random Forest classification algorithm
Project overview - problem formulation
Getting started with Spark
Setting up prerequisite software
Installing Spark in standalone deploy mode
Developing a simple interactive data analysis utility
Reading a data file and deriving DataFrame out of it
Implementing the Iris pipeline
Iris pipeline implementation objectives
Step 1 - getting the Iris dataset from the UCI Machine Learning Repository
Step 2 - preliminary EDA
Firing up Spark shell
Loading the iris.csv file and building a DataFrame
Calculating statistics
Inspecting your SparkConf again
Calculating statistics again
Step 3 - creating an SBT project
Step 4 - creating Scala files in SBT project
Step 5 - preprocessing, data transformation, and DataFrame creation
DataFrame Creation
Step 6 - creating, training, and testing data
Step 7 - creating a Random Forest classifier
Step 8 - training the Random Forest classifier
Step 9 - applying the Random Forest classifier to test data
Step 10 - evaluate Random Forest classifier
Step 11 - running the pipeline as an SBT application
Step 12 - packaging the application
Step 13 - submitting the pipeline application to Spark local
Summary
Questions
Chapter 2: Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala
Breast cancer classification problem
Breast cancer dataset at a glance
Logistic regression algorithm
Salient characteristics of LR
Binary logistic regression assumptions
A fictitious dataset and LR
LR as opposed to linear regression
Formulation of a linear regression classification model
Logit function as a mathematical equation
LR function
Getting started
Implementation objectives
Implementation objective 1 - getting the breast cancer dataset
Implementation objective 2 - deriving a dataframe for EDA
Step 1 - conducting preliminary EDA
Step 2 - loading data and converting it to an RDD[String]
Step 3 - splitting the resilient distributed dataset and reorganizing individual rows into an array
Step 4 - purging the dataset of rows containing question mark characters
Step 5 - running a count after purging the dataset of rows with questionable characters
Step 6 - getting rid of header
Step 7 - creating a two-column DataFrame
Step 8 - creating the final DataFrame
Random Forest breast cancer pipeline
Step 1 - creating an RDD and preprocessing the data
Step 2 - creating training and test data
Step 3 - training the Random Forest classifier
Step 4 - applying the classifier to the test data
Step 5 - evaluating the classifier
Step 6 - running the pipeline as an SBT application
Step 7 - packaging the application
Step 8 - deploying the pipeline app into Spark local
LR breast cancer pipeline
Implementation objectives 1 and 2
Implementation objective 3 - Spark ML workflow for the breast cancer classification task
Implementation objective 4 - coding steps for building the indexer and logit machine learning model
Extending our pipeline object with the WisconsinWrapper trait
Importing the StringIndexer algorithm and using it
Splitting the DataFrame into training and test datasets
Creating a LogisticRegression classifier and setting hyperparameters on it
Running the LR model on the test dataset
Building a breast cancer pipeline with two stages
Implementation objective 5 - evaluating the binary classifier's performance
Questions
Chapter 3: Stock Price Predictions
Stock price binary classification problem
Stock price prediction dataset at a glance
Support for hardware virtualization
Installing the supported virtualization application
Downloading the HDP Sandbox and importing it
Hortonworks Sandbox virtual appliance overview
Turning on the virtual machine and powering up the Sandbox
Setting up SSH access for data transfer between Sandbox and the host machine
Setting up PuTTY, a third-party SSH and Telnet client
Setting up WinSCP, an SFTP client for Windows
Updating the default Python required by Zeppelin
What is Zeppelin?
Updating our Zeppelin instance
Launching the Ambari Dashboard and Zeppelin UI
Updating Zeppelin Notebook configuration by adding or updating interpreters
Updating a Spark 2 interpreter
List of implementation goals
Step 1 - creating a Scala representation of the path to the dataset file
Step 2 - creating an RDD[String]
Step 3 - splitting the RDD around the newline character in the dataset
Step 4 - transforming the RDD[String]
Step 5 - carrying out preliminary data analysis
Creating DataFrame from the original dataset
Dropping the Date and Label columns from the DataFrame
Having Spark describe the DataFrame
Adding a new column to the DataFrame and deriving Vector out of it
Removing stop words - a preprocessing step
Transforming the merged DataFrame
Transforming a DataFrame into an array of NGrams
Adding a new column to the DataFrame, devoid of stop words
Constructing a vocabulary from our dataset corpus
Training CountVectorizer
Using StringIndexer to transform our input label column
Dropping the input label column
Adding a new column to our DataFrame
Dividing the DataSet into training and test sets
Creating labelIndexer to index the indexedLabel column
Creating StringIndexer to index a column label
Creating RandomForestClassifier
Creating a new data pipeline with three stages
Creating a new data pipeline with hyperparameters
Training our new data pipeline
Generating stock price predictions
Chapter 4: Building a Spam Classification Pipeline
Spam classification problem
Relevant background topics
Multidimensional data
Features and their importance
Classification task
Classification outcomes
Two possible classification outcomes
Spam classification pipeline
Implementation steps
Step 1 - setting up your project folder
Step 2 - upgrading your build.sbt file
Step 3 - creating a trait called SpamWrapper
Step 4 - describing the dataset
Description of the SpamHam dataset
Step 5 - creating a new spam classifier class
Step 6 - listing the data preprocessing steps
Step 7 - regex to remove punctuation marks and whitespaces
Step 8 - creating a ham dataframe with punctuation removed
Creating a labeled ham dataframe
Step 9 - creating a spam dataframe devoid of punctuation
Step 10 - joining the spam and ham datasets
Step 11 - tokenizing our features
Step 12 - removing stop words
Step 13 - feature extraction
Step 14 - creating training and test datasets
Further reading
Chapter 5: Build a Fraud Detection System
Fraud detection problem
Fraud detection dataset at a glance
Precision, recall, and the F1 score
Feature selection
The Gaussian Distribution function
Where does Spark fit in all this?
Fraud detection approach
Setting up Hortonworks Sandbox in the cloud
Creating your Azure free account, and signing in
The Azure Marketplace
The HDP Sandbox home page
Create the FraudDetection trait
Broadcasting mean and standard deviation vectors
Calculating PDFs
F1 score
Calculating the best error term and best F1 score
Maximum and minimum values of a probability density
Step size for best error term calculation
A loop to generate the best F1 and the best error term
Generating predictions - outliers that represent fraud
Generating the best error term and best F1 measure
Preparing to compute precision and recall
A recap of how we looped through a range of Epsilons, the best error term, and the best F1 measure
Function to calculate false positives
Chapter 6: Build Flights Performance Prediction Model
Overview of flight delay prediction
The flight dataset at a glance
Problem formulation of flight delay prediction
Increasing Java memory
Reviewing the JDK version
MongoDB installation
Implementation and deployment
Creating a new Scala project
Building the AirlineWrapper Scala trait
Chapter 7: Building a Recommendation Engine
Problem overviews
Notes:
Includes bibliographical references.
Description based on print version record.
OCLC:
1050169895
