Data science on the Google cloud platform : implementing end-to-end real-time data pipelines: from ingest to machine learning / Valliappa Lakshmanan.
- Format:
- Book
- Author/Creator:
- Lakshmanan, Valliappa, author.
- Language:
- English
- Subjects (All):
- Real-time data processing.
- Cloud computing.
- Computing platforms.
- Genre:
- Electronic books.
- Physical Description:
- 1 online resource (xiv, 393 pages) : illustrations
- Edition:
- First edition.
- Other Title:
- Data Science on the Google Cloud Platform
- Implementing end-to-end real-time data pipelines : from ingest to machine learning
- Place of Publication:
- Sebastopol, CA : O'Reilly Media, 2018.
- System Details:
- text file
- Summary:
- Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on top of the Google Cloud Platform (GCP). This hands-on guide shows developers entering the data science field how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on GCP. Over the course of the book, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science. You'll learn how to: automate and schedule data ingest using an App Engine application; create and populate a dashboard in Google Data Studio; build a real-time analysis pipeline to carry out streaming analytics; conduct interactive data exploration with Google BigQuery; create a Bayesian model on a Cloud Dataproc cluster; build a logistic regression machine learning model with Spark; compute time-aggregate features with a Cloud Dataflow pipeline; create a high-performing prediction model with TensorFlow; and use your deployed model as a microservice you can access from both batch and real-time pipelines.
- Contents:
- 1 Making Better Decisions Based on Data 1
- Many Similar Decisions 2
- The Role of Data Engineers 4
- The Cloud Makes Data Engineers Possible 6
- The Cloud Turbocharges Data Science 10
- Case Studies Get at the Stubborn Facts 12
- A Probabilistic Decision 13
- Data and Tools 19
- Getting Started with the Code 20
- Summary 22
- 2 Ingesting Data into the Cloud 23
- Airline On-Time Performance Data 23
- Knowability 25
- Training-Serving Skew 26
- Download Procedure 27
- Dataset Attributes 28
- Why Not Store the Data in Situ? 29
- Scaling Up 31
- Scaling Out 33
- Data in Situ with Colossus and Jupiter 35
- Ingesting Data 38
- Reverse Engineering a Web Form 39
- Dataset Download 41
- Exploration and Cleanup 43
- Uploading Data to Google Cloud Storage 45
- Scheduling Monthly Downloads 48
- Ingesting in Python 51
- Flask Web App 57
- Running on App Engine 58
- Securing the URL 59
- Scheduling a Cron Task 59
- Summary 61
- Code Break 62
- 3 Creating Compelling Dashboards 65
- Explain Your Model with Dashboards 66
- Why Build a Dashboard First? 68
- Accuracy, Honesty, and Good Design 69
- Loading Data into Google Cloud SQL 71
- Create a Google Cloud SQL Instance 72
- Interacting with Google Cloud Platform 73
- Controlling Access to MySQL 74
- Create Tables 75
- Populating Tables 77
- Building Our First Model 78
- Contingency Table 79
- Threshold Optimization 80
- Machine Learning 81
- Building a Dashboard 81
- Getting Started with Data Studio 82
- Creating Charts 84
- Adding End-User Controls 86
- Showing Proportions with a Pie Chart 88
- Explaining a Contingency Table 93
- Summary 96
- 4 Streaming Data: Publication and Ingest 97
- Designing the Event Feed 97
- Time Correction 100
- Apache Beam/Cloud Dataflow 101
- Parsing Airports Data 103
- Adding Time Zone Information 104
- Converting Times to UTC 105
- Correcting Dates 107
- Creating Events 108
- Running the Pipeline in the Cloud 109
- Publishing an Event Stream to Cloud Pub/Sub 113
- Get Records to Publish 115
- Paging Through Records 116
- Building a Batch of Events 117
- Publishing a Batch of Events 118
- Real-Time Stream Processing 119
- Streaming in Java Dataflow 119
- Executing the Stream Processing 124
- Analyzing Streaming Data in BigQuery 126
- Real-Time Dashboard 127
- Summary 130
- 5 Interactive Data Exploration 131
- Exploratory Data Analysis 132
- Loading Flights Data into BigQuery 134
- Advantages of a Serverless Columnar Database 134
- Staging on Cloud Storage 136
- Access Control 137
- Federated Queries 142
- Ingesting CSV Files 144
- Exploratory Data Analysis in Cloud Datalab 149
- Jupyter Notebooks 151
- Cloud Datalab 151
- Installing Packages in Cloud Datalab 154
- Jupyter Magic for Google Cloud Platform 156
- Quality Control 161
- Oddball Values 162
- Outlier Removal: Big Data Is Different 163
- Filtering Data on Occurrence Frequency 165
- Arrival Delay Conditioned on Departure Delay 166
- Applying Probabilistic Decision Threshold 168
- Empirical Probability Distribution Function 169
- The Answer Is... 172
- Evaluating the Model 172
- Random Shuffling 173
- Splitting by Date 174
- Training and Testing 175
- Summary 180
- 6 Bayes Classifier on Cloud Dataproc 181
- MapReduce and the Hadoop Ecosystem 181
- How MapReduce Works 182
- Apache Hadoop 184
- Google Cloud Dataproc 184
- Need for Higher-Level Tools 186
- Jobs, Not Clusters 188
- Initialization Actions 189
- Quantization Using Spark SQL 190
- Google Cloud Datalab on Cloud Dataproc 192
- Independence Check Using BigQuery 193
- Spark SQL in Google Cloud Datalab 195
- Histogram Equalization 198
- Dynamically Resizing Clusters 202
- Bayes Classification Using Pig 205
- Running a Pig Job on Cloud Dataproc 207
- Limiting to Training Days 208
- The Decision Criteria 208
- Evaluating the Bayesian Model 212
- Summary 214
- 7 Machine Learning: Logistic Regression on Spark 217
- Logistic Regression 218
- Spark ML Library 221
- Getting Started with Spark Machine Learning 222
- Spark Logistic Regression 223
- Creating a Training Dataset 224
- Dealing with Corner Cases 226
- Creating Training Examples 228
- Training 229
- Predicting by Using a Model 231
- Evaluating a Model 232
- Feature Engineering 235
- Experimental Framework 236
- Creating the Held-Out Dataset 238
- Feature Selection 239
- Scaling and Clipping Features 242
- Feature Transforms 244
- Categorical Variables 248
- Scalable, Repeatable, Real Time 250
- Summary 251
- 8 Time-Windowed Aggregate Features 253
- The Need for Time Averages 253
- Dataflow in Java 255
- Setting Up Development Environment 256
- Filtering with Beam 257
- Pipeline Options and Text I/O 260
- Run on Cloud 261
- Parsing into Objects 263
- Computing Time Averages 266
- Grouping and Combining 266
- Parallel Do with Side Input 268
- Debugging 269
- BigQueryIO 271
- Mutating the Flight Object 272
- Sliding Window Computation in Batch Mode 274
- Running in the Cloud 275
- Monitoring, Troubleshooting, and Performance Tuning 277
- Troubleshooting Pipeline 278
- Side Input Limitations 280
- Redesigning the Pipeline 283
- Removing Duplicates 285
- Summary 288
- 9 Machine Learning Classifier Using TensorFlow 291
- Toward More Complex Models 292
- Reading Data into TensorFlow 295
- Setting Up an Experiment 299
- Linear Classifier 301
- Training and Evaluating Input Functions 302
- Serving Input Function 303
- Creating an Experiment 304
- Performing a Training Run 305
- Distributed Training in the Cloud 307
- Improving the ML Model 308
- Deep Neural Network Model 309
- Embeddings 312
- Wide-and-Deep Model 314
- Hyperparameter Tuning 317
- Deploying the Model 325
- Predicting with the Model 326
- Explaining the Model 327
- Summary 329
- 10 Real-Time Machine Learning 331
- Invoking Prediction Service 332
- Java Classes for Request and Response 333
- Post Request and Parse Response 335
- Client of Prediction Service 335
- Adding Predictions to Flight Information 336
- Batch Input and Output 336
- Data Processing Pipeline 338
- Identifying Inefficiency 339
- Batching Requests 340
- Streaming Pipeline 342
- Flattening PCollections 343
- Executing Streaming Pipeline 344
- Late and Out-of-Order Records 345
- Watermarks and Triggers 350
- Transactions, Throughput, and Latency 352
- Possible Streaming Sinks 352
- Cloud Bigtable 354
- Designing Tables 355
- Designing the Row Key 356
- Streaming into Cloud Bigtable 357
- Querying from Cloud Bigtable 360
- Evaluating Model Performance 361
- The Need for Continuous Training 361
- Evaluation Pipeline 362
- Evaluating Performance 364
- Marginal Distributions 364
- Checking Model Behavior 366
- Identifying Behavioral Change 369
- Summary 370
- Book Summary 371.
- Notes:
- Includes index.
- Online resource; title from PDF title page (EBSCO, viewed December 20, 2017).
- ISBN:
- 9781491974537
- 1491974532
- 9781491974513
- 1491974516
- OCLC:
- 1015372122
- Access Restriction:
- Restricted for use by site license.