Hands-on data science and Python machine learning : perform data mining and machine learning efficiently using Python and Spark / Frank Kane.

Ebook Central College Complete Available online
Format:
Book
Author/Creator:
Kane, Frank, author.
Language:
English
Subjects (All):
Python (Computer program language).
Machine learning.
Data mining.
Physical Description:
1 online resource (415 pages) : illustrations
Edition:
1st ed.
Place of Publication:
Birmingham, England ; Mumbai, [India] : Packt, 2017.
Biography/History:
Frank Kane spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and teaches others about big data analysis.
Summary:
This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark.
Key Features
Take your first steps in the world of data science by understanding the tools and techniques of data analysis
Train efficient Machine Learning models in Python using the supervised and unsupervised learning methods
Learn how to use Apache Spark for processing Big Data efficiently
Book Description
Join Frank Kane, who worked on Amazon and IMDb’s machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand them. Based on Frank’s successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and to develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.
What you will learn
Learn how to clean your data and ready it for analysis
Implement the popular clustering and regression methods in Python
Train efficient machine learning models using decision trees and random forests
Visualize the results of your analysis using Python’s Matplotlib library
Use Apache Spark’s MLlib package to perform machine learning on large datasets
Who this book is for
If you are a budding data scientist or a data analyst who wants to analyze and gain actionable insights from data using Python, this book is for you. Programmers with some experience in Python who want to enter the lucrative world of Data Science will also find this book to be very useful, but you don't need to be an expert Python coder or mathematician to get the most from this book.
Contents:
Intro
Copyright
Credits
About the Author
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Getting Started
Installing Enthought Canopy
Giving the installation a test run
If you occasionally get problems opening your IPNYB files
Using and understanding IPython (Jupyter) Notebooks
Python basics - Part 1
Understanding Python code
Importing modules
Data structures
Experimenting with lists
Pre colon
Post colon
Negative syntax
Adding list to list
The append function
Complex data structures
Dereferencing a single element
The sort function
Reverse sort
Tuples
Dereferencing an element
List of tuples
Dictionaries
Iterating through entries
Python basics - Part 2
Functions in Python
Lambda functions - functional programming
Understanding boolean expressions
The if statement
The if-else loop
Looping
The while loop
Exploring activity
Running Python scripts
More options than just the IPython/Jupyter Notebook
Running Python scripts in command prompt
Using the Canopy IDE
Summary
Chapter 2: Statistics and Probability Refresher, and Python Practice
Types of data
Numerical data
Discrete data
Continuous data
Categorical data
Ordinal data
Mean, median, and mode
Mean
Median
The factor of outliers
Mode
Using mean, median, and mode in Python
Calculating mean using the NumPy package
Visualizing data using matplotlib
Calculating median using the NumPy package
Analyzing the effect of outliers
Calculating mode using the SciPy package
Some exercises
Standard deviation and variance
Variance
Measuring variance
Standard deviation
Identifying outliers with standard deviation
Population variance versus sample variance
The Mathematical explanation
Analyzing standard deviation and variance on a histogram
Using Python to compute standard deviation and variance
Try it yourself
Probability density function and probability mass function
The probability density function and probability mass functions
Probability density functions
Probability mass functions
Types of data distributions
Uniform distribution
Normal or Gaussian distribution
The exponential probability distribution or Power law
Binomial probability mass function
Poisson probability mass function
Percentiles and moments
Percentiles
Quartiles
Computing percentiles in Python
Moments
Computing moments in Python
Chapter 3: Matplotlib and Advanced Probability Concepts
A crash course in Matplotlib
Generating multiple plots on one graph
Saving graphs as images
Adjusting the axes
Adding a grid
Changing line types and colors
Labeling axes and adding a legend
A fun example
Generating pie charts
Generating bar charts
Generating scatter plots
Generating histograms
Generating box-and-whisker plots
Covariance and correlation
Defining the concepts
Measuring covariance
Correlation
Computing covariance and correlation in Python
Computing correlation - The hard way
Computing correlation - The NumPy way
Correlation activity
Conditional probability
Conditional probability exercises in Python
Conditional probability assignment
My assignment solution
Bayes' theorem
Chapter 4: Predictive Models
Linear regression
The ordinary least squares technique
The gradient descent technique
The co-efficient of determination or r-squared
Computing r-squared
Interpreting r-squared
Computing linear regression and r-squared using Python
Activity for linear regression
Polynomial regression
Implementing polynomial regression using NumPy
Computing the r-squared error
Activity for polynomial regression
Multivariate regression and predicting car prices
Multivariate regression using Python
Activity for multivariate regression
Multi-level models
Chapter 5: Machine Learning with Python
Machine learning and train/test
Unsupervised learning
Supervised learning
Evaluating supervised learning
K-fold cross validation
Using train/test to prevent overfitting of a polynomial regression
Activity
Bayesian methods - Concepts
Implementing a spam classifier with Naïve Bayes
K-Means clustering
Limitations to k-means clustering
Clustering people based on income and age
Measuring entropy
Decision trees - Concepts
Decision tree example
Walking through a decision tree
Random forests technique
Decision trees - Predicting hiring decisions using Python
Ensemble learning - Using a random forest
Ensemble learning
Support vector machine overview
Using SVM to cluster people by using scikit-learn
Chapter 6: Recommender Systems
What are recommender systems?
User-based collaborative filtering
Limitations of user-based collaborative filtering
Item-based collaborative filtering
Understanding item-based collaborative filtering
How item-based collaborative filtering works?
Collaborative filtering using Python
Finding movie similarities
Understanding the code
The corrwith function
Improving the results of movie similarities
Making movie recommendations to people
Understanding movie recommendations with an example
Using the groupby command to combine rows
Removing entries with the drop command
Improving the recommendation results
Summary
Chapter 7: More Data Mining and Machine Learning Techniques
K-nearest neighbors - concepts
Using KNN to predict a rating for a movie
Dimensionality reduction and principal component analysis
Dimensionality reduction
Principal component analysis
A PCA example with the Iris dataset
Data warehousing overview
ETL versus ELT
Reinforcement learning
Q-learning
The exploration problem
The simple approach
The better way
Fancy words
Markov decision process
Dynamic programming
Chapter 8: Dealing with Real-World Data
Bias/variance trade-off
K-fold cross-validation to avoid overfitting
Example of k-fold cross-validation using scikit-learn
Data cleaning and normalisation
Cleaning web log data
Applying a regular expression on the web log
Modification one - filtering the request field
Modification two - filtering post requests
Modification three - checking the user agents
Filtering the activity of spiders/robots
Modification four - applying website-specific filters
Activity for web log data
Normalizing numerical data
Detecting outliers
Dealing with outliers
Activity for outliers
Chapter 9: Apache Spark - Machine Learning on Big Data
Installing Spark
Installing Spark on Windows
Installing Spark on other operating systems
Installing the Java Development Kit
Spark introduction
It's scalable
It's fast
It's young
It's not difficult
Components of Spark
Python versus Scala for Spark
Spark and Resilient Distributed Datasets (RDD)
The SparkContext object
Creating RDDs
Creating an RDD using a Python list
Loading an RDD from a text file
More ways to create RDDs
RDD operations
Transformations
Using map()
Actions
Introducing MLlib
Some MLlib Capabilities
Special MLlib data types
The vector data type
LabeledPoint data type
Rating data type
Decision Trees in Spark with MLlib
Exploring decision trees code
Creating the SparkContext
Importing and cleaning our data
Creating a test candidate and building our decision tree
Running the script
K-Means Clustering in Spark
Within set sum of squared errors (WSSSE)
Running the code
TF-IDF
TF-IDF in practice
Using TF-IDF
Searching wikipedia with Spark MLlib
Import statements
Creating the initial RDD
Creating and transforming a HashingTF object
Computing the TF-IDF score
Using the Wikipedia search engine algorithm
Running the algorithm
Using the Spark 2.0 DataFrame API for MLlib
How Spark 2.0 MLlib works
Implementing linear regression
Chapter 10: Testing and Experimental Design
A/B testing concepts
A/B tests
Measuring conversion for A/B testing
How to attribute conversions
Variance is your enemy
T-test and p-value
The t-statistic or t-test
The p-value
Measuring t-statistics and p-values using Python
Running A/B test on some experimental data
When there's no real difference between the two groups
Does the sample size make a difference?
Sample size increased to six-digits
Sample size increased seven-digits
A/A testing
Determining how long to run an experiment for
A/B test gotchas
Novelty effects
Seasonal effects
Selection bias
Auditing selection bias issues
Data pollution
Attribution errors
Index
Notes:
Includes index.
Description based on online resource; title from PDF title page (ebrary, viewed August 28, 2017).
OCLC:
999636604