Cleaning data for effective data science : doing the other 80% of the work with Python, R, and command-line tools / David Mertz.

Format:
Book
Author/Creator:
Mertz, David, (author).
Language:
English
Subjects (All):
Database management.
Data integrity.
Physical Description:
1 online resource (499 pages)
Place of Publication:
Birmingham, England ; Mumbai : Packt Publishing, [2021]
Summary:
Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleaning as much as on concise and precise code examples that express these thoughts.
Contents:
Cover
Copyright
Contributors
Table of Contents
Preface
Part I - Data Ingestion
Chapter 1: Tabular Formats
Tidying Up
CSV
Sanity Checks
The Good, the Bad, and the Textual Data
The Bad
The Good
Spreadsheets Considered Harmful
SQL RDBMS
Massaging Data Types
Repeating in R
Where SQL Goes Wrong (and How to Notice It)
Other Formats
HDF5 and NetCDF-4
Tools and Libraries
SQLite
Apache Parquet
Data Frames
Spark/Scala
Pandas and Derived Wrappers
Vaex
Data Frames in R (Tidyverse)
Data Frames in R (data.table)
Bash for Fun
Exercises
Tidy Data from Excel
Tidy Data from SQL
Denouement
Chapter 2: Hierarchical Formats
JSON
What JSON Looks Like
NaN Handling and Data Types
JSON Lines
GeoJSON
Tidy Geography
JSON Schema
XML
User Records
Keyhole Markup Language
Configuration Files
INI and Flat Custom Formats
TOML
Yet Another Markup Language
NoSQL Databases
Document-Oriented Databases
Missing Fields
Denormalization and Its Discontents
Key/Value Stores
Exploring Filled Area
Create a Relational Model
Chapter 3: Repurposing Data Sources
Web Scraping
HTML Tables
Non-Tabular Data
Command-Line Scraping
Portable Document Format
Image Formats
Pixel Statistics
Channel Manipulation
Metadata
Binary Serialized Data Structures
Custom Text Formats
A Structured Log
Character Encodings
Enhancing the NPY Parser
Scraping Web Traffic
Part II - The Vicissitudes of Error
Chapter 4: Anomaly Detection
Missing Data
SQL
Hierarchical Formats
Sentinels
Miscoded Data
Fixed Bounds
Outliers
Z-Score
Interquartile Range
Multivariate Outliers
A Famous Experiment
Misspelled Words
Denouement
Chapter 5: Data Quality
Biasing Trends
Understanding Bias
Detecting Bias
Comparison to Baselines
Benford's Law
Class Imbalance
Normalization and Scaling
Applying a Machine Learning Model
Scaling Techniques
Factor and Sample Weighting
Cyclicity and Autocorrelation
Domain Knowledge Trends
Discovered Cycles
Bespoke Validation
Collation Validation
Transcription Validation
Data Characterization
Oversampled Polls
Part III - Rectification and Creation
Chapter 6: Value Imputation
Typical-Value Imputation
Typical Tabular Data
Locality Imputation
Trend Imputation
Types of Trends
A Larger Coarse Time Series
Understanding the Data
Removing Unusable Data
Imputing Consistency
Interpolation
Non-Temporal Trends
Sampling
Undersampling
Oversampling
Alternate Trend Imputation
Balancing Multiple Features
Chapter 7: Feature Engineering
Date/Time Fields
Creating Datetimes
Imposing Regularity
Duplicated Timestamps
Adding Timestamps
String Fields
Fuzzy Matching
Explicit Categories
String Vectors
Decompositions
Rotation and Whitening
Dimensionality Reduction
Visualization
Quantization and Binarization
One-Hot Encoding
Polynomial Features
Generating Synthetic Features
Feature Selection
Intermittent Occurrences
Characterizing Levels
Part IV - Ancillary Matters
Closure
What You Know
What You Don't Know (Yet)
Glossary
Other Books You May Enjoy
Index.
Notes:
Description based on print version record.
ISBN:
9781801074407
1801074402
OCLC:
1245420921
