Data science on the Google cloud platform : implementing end-to-end real-time data pipelines: from ingest to machine learning / Valliappa Lakshmanan.

Available online: O'Reilly Online Learning: Academic/Public Library Edition
Format:
Book
Author/Creator:
Lakshmanan, Valliappa, author.
Contributor:
Safari Books Online (Firm)
Language:
English
Subjects (All):
Real-time data processing.
Cloud computing.
Computing platforms.
Genre:
Electronic books.
Physical Description:
1 online resource (xiv, 393 pages) : illustrations
Edition:
First edition.
Other Title:
Data Science on the Google Cloud Platform
Implementing end-to-end real-time data pipelines : from ingest to machine learning
Place of Publication:
Sebastopol, CA : O'Reilly Media, 2018.
System Details:
text file
Summary:
Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on top of the Google Cloud Platform (GCP). This hands-on guide shows developers entering the data science field how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on GCP. Over the course of the book, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science. You'll learn how to: automate and schedule data ingest using an App Engine application; create and populate a dashboard in Google Data Studio; build a real-time analysis pipeline to carry out streaming analytics; conduct interactive data exploration with Google BigQuery; create a Bayesian model on a Cloud Dataproc cluster; build a logistic regression machine learning model with Spark; compute time-aggregate features with a Cloud Dataflow pipeline; create a high-performing prediction model with TensorFlow; and use your deployed model as a microservice you can access from both batch and real-time pipelines.
Contents:
1 Making Better Decisions Based on Data 1
Many Similar Decisions 2
The Role of Data Engineers 4
The Cloud Makes Data Engineers Possible 6
The Cloud Turbocharges Data Science 10
Case Studies Get at the Stubborn Facts 12
A Probabilistic Decision 13
Data and Tools 19
Getting Started with the Code 20
Summary 22
2 Ingesting Data into the Cloud 23
Airline On-Time Performance Data 23
Knowability 25
Training-Serving Skew 26
Download Procedure 27
Dataset Attributes 28
Why Not Store the Data in Situ? 29
Scaling Up 31
Scaling Out 33
Data in Situ with Colossus and Jupiter 35
Ingesting Data 38
Reverse Engineering a Web Form 39
Dataset Download 41
Exploration and Cleanup 43
Uploading Data to Google Cloud Storage 45
Scheduling Monthly Downloads 48
Ingesting in Python 51
Flask Web App 57
Running on App Engine 58
Securing the URL 59
Scheduling a Cron Task 59
Summary 61
Code Break 62
3 Creating Compelling Dashboards 65
Explain Your Model with Dashboards 66
Why Build a Dashboard First? 68
Accuracy, Honesty, and Good Design 69
Loading Data into Google Cloud SQL 71
Create a Google Cloud SQL Instance 72
Interacting with Google Cloud Platform 73
Controlling Access to MySQL 74
Create Tables 75
Populating Tables 77
Building Our First Model 78
Contingency Table 79
Threshold Optimization 80
Machine Learning 81
Building a Dashboard 81
Getting Started with Data Studio 82
Creating Charts 84
Adding End-User Controls 86
Showing Proportions with a Pie Chart 88
Explaining a Contingency Table 93
Summary 96
4 Streaming Data: Publication and Ingest 97
Designing the Event Feed 97
Time Correction 100
Apache Beam/Cloud Dataflow 101
Parsing Airports Data 103
Adding Time Zone Information 104
Converting Times to UTC 105
Correcting Dates 107
Creating Events 108
Running the Pipeline in the Cloud 109
Publishing an Event Stream to Cloud Pub/Sub 113
Get Records to Publish 115
Paging Through Records 116
Building a Batch of Events 117
Publishing a Batch of Events 118
Real-Time Stream Processing 119
Streaming in Java Dataflow 119
Executing the Stream Processing 124
Analyzing Streaming Data in BigQuery 126
Real-Time Dashboard 127
Summary 130
5 Interactive Data Exploration 131
Exploratory Data Analysis 132
Loading Flights Data into BigQuery 134
Advantages of a Serverless Columnar Database 134
Staging on Cloud Storage 136
Access Control 137
Federated Queries 142
Ingesting CSV Files 144
Exploratory Data Analysis in Cloud Datalab 149
Jupyter Notebooks 151
Cloud Datalab 151
Installing Packages in Cloud Datalab 154
Jupyter Magic for Google Cloud Platform 156
Quality Control 161
Oddball Values 162
Outlier Removal: Big Data Is Different 163
Filtering Data on Occurrence Frequency 165
Arrival Delay Conditioned on Departure Delay 166
Applying Probabilistic Decision Threshold 168
Empirical Probability Distribution Function 169
The Answer Is... 172
Evaluating the Model 172
Random Shuffling 173
Splitting by Date 174
Training and Testing 175
Summary 180
6 Bayes Classifier on Cloud Dataproc 181
MapReduce and the Hadoop Ecosystem 181
How MapReduce Works 182
Apache Hadoop 184
Google Cloud Dataproc 184
Need for Higher-Level Tools 186
Jobs, Not Clusters 188
Initialization Actions 189
Quantization Using Spark SQL 190
Google Cloud Datalab on Cloud Dataproc 192
Independence Check Using BigQuery 193
Spark SQL in Google Cloud Datalab 195
Histogram Equalization 198
Dynamically Resizing Clusters 202
Bayes Classification Using Pig 205
Running a Pig Job on Cloud Dataproc 207
Limiting to Training Days 208
The Decision Criteria 208
Evaluating the Bayesian Model 212
Summary 214
7 Machine Learning: Logistic Regression on Spark 217
Logistic Regression 218
Spark ML Library 221
Getting Started with Spark Machine Learning 222
Spark Logistic Regression 223
Creating a Training Dataset 224
Dealing with Corner Cases 226
Creating Training Examples 228
Training 229
Predicting by Using a Model 231
Evaluating a Model 232
Feature Engineering 235
Experimental Framework 236
Creating the Held-Out Dataset 238
Feature Selection 239
Scaling and Clipping Features 242
Feature Transforms 244
Categorical Variables 248
Scalable, Repeatable, Real Time 250
Summary 251
8 Time-Windowed Aggregate Features 253
The Need for Time Averages 253
Dataflow in Java 255
Setting Up Development Environment 256
Filtering with Beam 257
Pipeline Options and Text I/O 260
Run on Cloud 261
Parsing into Objects 263
Computing Time Averages 266
Grouping and Combining 266
Parallel Do with Side Input 268
Debugging 269
BigQueryIO 271
Mutating the Flight Object 272
Sliding Window Computation in Batch Mode 274
Running in the Cloud 275
Monitoring, Troubleshooting, and Performance Tuning 277
Troubleshooting Pipeline 278
Side Input Limitations 280
Redesigning the Pipeline 283
Removing Duplicates 285
Summary 288
9 Machine Learning Classifier Using TensorFlow 291
Toward More Complex Models 292
Reading Data into TensorFlow 295
Setting Up an Experiment 299
Linear Classifier 301
Training and Evaluating Input Functions 302
Serving Input Function 303
Creating an Experiment 304
Performing a Training Run 305
Distributed Training in the Cloud 307
Improving the ML Model 308
Deep Neural Network Model 309
Embeddings 312
Wide-and-Deep Model 314
Hyperparameter Tuning 317
Deploying the Model 325
Predicting with the Model 326
Explaining the Model 327
Summary 329
10 Real-Time Machine Learning 331
Invoking Prediction Service 332
Java Classes for Request and Response 333
Post Request and Parse Response 335
Client of Prediction Service 335
Adding Predictions to Flight Information 336
Batch Input and Output 336
Data Processing Pipeline 338
Identifying Inefficiency 339
Batching Requests 340
Streaming Pipeline 342
Flattening PCollections 343
Executing Streaming Pipeline 344
Late and Out-of-Order Records 345
Watermarks and Triggers 350
Transactions, Throughput, and Latency 352
Possible Streaming Sinks 352
Cloud Bigtable 354
Designing Tables 355
Designing the Row Key 356
Streaming into Cloud Bigtable 357
Querying from Cloud Bigtable 360
Evaluating Model Performance 361
The Need for Continuous Training 361
Evaluation Pipeline 362
Evaluating Performance 364
Marginal Distributions 364
Checking Model Behavior 366
Identifying Behavioral Change 369
Summary 370
Book Summary 371.
Notes:
Includes index.
Online resource; title from PDF title page (EBSCO, viewed December 20, 2017).
ISBN:
9781491974537
1491974532
9781491974513
1491974516
OCLC:
1015372122
Access Restriction:
Restricted for use by site license.
