R data mining : implement data mining techniques through practical use cases and real-world datasets / Andrea Cirillo.
- Format:
- Book
- Author/Creator:
- Cirillo, Andrea, author.
- Language:
- English
- Subjects (All):
- R (Computer program language).
- Data mining.
- Physical Description:
- 1 online resource (1 volume) : illustrations
- Edition:
- 1st edition
- Place of Publication:
- Birmingham, England ; Mumbai, [India] : Packt Publishing, 2017.
- System Details:
- text file
- Summary:
- Mine valuable insights from your data using popular tools and techniques in R. About This Book: Understand the basics of data mining and why R is a perfect tool for it. Manipulate your data using popular R packages such as ggplot2 and dplyr to gather valuable business insights from it. Apply effective data mining models to perform regression and classification tasks. Who This Book Is For: If you are a budding data scientist, or a data analyst with a basic knowledge of R, and want to get into the intricacies of data mining in a practical manner, this is the book for you. No previous experience of data mining is required. What You Will Learn: Master relevant packages such as dplyr and ggplot2 for data mining; learn how to effectively organize a data mining project through the CRISP-DM methodology; implement data cleaning and validation tasks to get your data ready for data mining activities; perform exploratory data analysis both numerically and graphically; develop simple and multiple regression models along with logistic regression; apply basic ensemble learning techniques to combine results from different data mining models; perform text mining analysis on unstructured PDF files and textual data; produce reports to effectively communicate the objectives, methods, and insights of your analyses. In Detail: R is widely used to apply data mining techniques across many different industries, including finance, medicine, scientific research, and more. This book will empower you to produce and present impressive analyses from data by selecting and implementing the appropriate data mining techniques in R. You will gain these skills while immersed in a one-of-a-kind data mining crime case, in which you are asked to help resolve a real fraud case affecting a commercial company by means of both basic and advanced data mining techniques.
As you follow the plot of the story, you will learn and practice, on real data, the various R packages commonly employed for these kinds of tasks. You will also get the chance to apply some of the most popular and effective data mining models and algorithms, from basic multiple linear regression to the more advanced Support Vector Machines. Unlike other data mining learning resources, this book will explain the theory behind these models, their relevant assumptions, and when they can be applied to the data you are facing. By the end of the book you w...
- Contents:
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Why to Choose R for Your Data Mining and Where to Start
- What is R?
- A bit of history
- R's points of strength
- Open source inside
- Plugin ready
- Data visualization friendly
- Installing R and writing R code
- Downloading R
- R installation for Windows and macOS
- R installation for Linux OS
- Main components of a base R installation
- Possible alternatives to write and run R code
- RStudio (all OSs)
- The Jupyter Notebook (all OSs)
- Visual Studio (Windows users only)
- R foundational notions
- A preliminary R session
- Executing R interactively through the R console
- Creating an R script
- Executing an R script
- Vectors
- Lists
- Creating lists
- Subsetting lists
- Data frames
- Functions
- R's weaknesses and how to overcome them
- Learning R effectively and minimizing the effort
- The tidyverse
- Leveraging the R community to learn R
- Where to find the R community
- Engaging with the community to learn R
- Handling large datasets with R
- Further references
- Summary
- Chapter 2: A First Primer on Data Mining Analysing Your Bank Account Data
- Acquiring and preparing your banking data
- Data model
- Summarizing your data with pivot-like tables
- A gentle introduction to the pipe operator
- An even more gentle introduction to the dplyr package
- Installing the necessary packages and loading your data into R
- Installing and loading the necessary packages
- Importing your data into R
- Defining the monthly and daily sum of expenses
- Visualizing your data with ggplot2
- Basic data visualization principles
- Less but better
- Not every chart is good for your message
- Scatter plot
- Line chart
- Bar plot
- Other advanced charts
- Colors have to be chosen carefully
- A bit of theory - chromatic circle, hue, and luminosity
- Visualizing your data with ggplot
- One more gentle introduction - the grammar of graphics
- A layered grammar of graphics - ggplot2
- Visualizing your banking movements with ggplot2
- Visualizing the number of movements per day of the week
- Chapter 3: The Data Mining Process - CRISP-DM Methodology
- The Crisp-DM methodology data mining cycle
- Business understanding
- Data understanding
- Data collection
- How to perform data collection with R
- Data import from TXT and CSV files
- Data import from different types of format already structured as tables
- Data import from unstructured sources
- Data description
- How to perform data description with R
- Data exploration
- What to use in R to perform this task
- The summary() function
- Box plot
- Histograms
- Data preparation
- Modelling
- Defining a data modeling strategy
- How similar problems were solved in the past
- Emerging techniques
- Classification of modeling problems
- How to perform data modeling with R
- Evaluation
- Clustering evaluation
- Classification evaluation
- Regression evaluation
- How to judge the adequacy of a model's performance
- Deployment
- Deployment plan development
- Maintenance plan development
- Chapter 4: Keeping the House Clean - The Data Mining Architecture
- A general overview
- Data sources
- Types of data sources
- Unstructured data sources
- Structured data sources
- Key issues of data sources
- Databases and data warehouses
- The third wheel - the data mart
- One-level database
- Two-level database
- Three-level database
- Technologies
- SQL
- MongoDB
- Hadoop
- The data mining engine
- The interpreter
- The interface between the engine and the data warehouse
- The data mining algorithms
- User interface
- Clarity
- Clarity and mystery
- Clarity and simplicity
- Efficiency
- Consistency
- Syntax highlight
- Auto-completion
- How to build a data mining architecture in R
- The data warehouse
- The interface between the engine and the data warehouse
- The user interface
- Chapter 5: How to Address a Data Mining Problem - Data Cleaning and Validation
- On a quiet day
- Data cleaning
- Tidy data
- Analysing the structure of our data
- The str function
- The describe function
- head, tail, and View functions
- Evaluating your data tidiness
- Every row is a record
- Every column shows an attribute
- Every table represents an observational unit
- Tidying our data
- The tidyr package
- Long versus wide data
- The spread function
- The gather function
- The separate function
- Applying tidyr to our dataset
- Validating our data
- Fitness for use
- Conformance to standards
- Data quality controls
- Consistency checks
- Data type checks
- Logical checks
- Domain checks
- Uniqueness checks
- Performing data validation on our data
- Data type checks with str()
- The final touch - data merging
- left_join function
- moving beyond left_join
- Chapter 6: Looking into Your Data Eyes - Exploratory Data Analysis
- Introducing summary EDA
- Describing the population distribution
- Quartiles and Median
- Mean
- The mean and phenomenon going on within sub populations
- The mean being biased by outlier values
- Computing the mean of our population
- Variance
- Standard deviation
- Skewness
- Measuring the relationship between variables
- Correlation
- The Pearson correlation coefficient
- Distance correlation
- Weaknesses of summary EDA - the Anscombe quartet
- Graphical EDA
- Visualizing a variable distribution
- Histogram
- Reporting date histogram
- Geographical area histogram
- Cash flow histogram
- Boxplot
- Checking for outliers
- Visualizing relationships between variables
- Scatterplots
- Adding title, subtitle, and caption to the plot
- Setting axis and legend
- Adding explicative text to the plot
- Final touches on colors
- Chapter 7: Our First Guess - a Linear Regression
- Defining a data modelling strategy
- Data modelling notions
- Supervised learning
- Unsupervised learning
- The modeling strategy
- Applying linear regression to our data
- The intuition behind linear regression
- The math behind the linear regression
- Ordinary least squares technique
- Model requirements - what to look for before applying the model
- Residuals' uncorrelation
- Residuals' homoscedasticity
- How to apply linear regression in R
- Fitting the linear regression model
- Validating model assumption
- Visualizing fitted values
- Preparing the data for visualization
- Developing the data visualization
- Chapter 8: A Gentle Introduction to Model Performance Evaluation
- Defining model performance
- Fitting versus interpretability
- Making predictions with models
- Measuring performance in regression models
- Mean squared error
- R-squared
- R-squared meaning and interpretation
- R-squared computation in R
- Adjusted R-squared
- R-squared misconceptions
- The R-squared doesn't measure the goodness of fit
- A low R-squared doesn't mean your model is not statistically significant
- Measuring the performance in classification problems
- The confusion matrix
- Confusion matrix in R
- Accuracy
- How to compute accuracy in R
- Sensitivity
- How to compute sensitivity in R
- Specificity
- How to compute specificity in R
- How to choose the right performance statistics
- A final general warning - training versus test datasets
- Chapter 9: Don't Give up - Power up Your Regression Including Multiple Variables
- Moving from simple to multiple linear regression
- Notation
- Assumptions
- Variables' collinearity
- Tolerance
- Variance inflation factors
- Addressing collinearity
- Dimensionality reduction
- Stepwise regression
- Backward stepwise regression
- From the full model to the n-1 model
- Forward stepwise regression
- Double direction stepwise regression
- Principal component regression
- Fitting a multiple linear model with R
- Model fitting
- Variable assumptions validation
- Residual assumptions validation
- Linear model cheat sheet
- Chapter 10: A Different Outlook to Problems with Classification Models
- What is classification and why do we need it?
- Linear regression limitations for categorical variables
- Common classification algorithms and models
- Logistic regression
- The intuition behind logistic regression
- The logistic function estimates a response variable enclosed within an upper and lower bound
- The logistic function estimates the probability of an observation pertaining to one of the two available categories
- The math behind logistic regression
- Maximum likelihood estimator
- Model assumptions
- Absence of multicollinearity between variables
- Linear relationship between explanatory variables and log odds
- Large enough sample size
- How to apply logistic regression in R
- Fitting the model
- Reading the glm() estimation output
- The level of statistical significance of the association between the explanatory variable and the response variable
- Notes:
- Includes bibliographical references at the end of each chapter and index.
- Description based on online resource; title from PDF title page (EBC, viewed December 29, 2017).
- ISBN:
- 9781787129238
- 1787129233
- OCLC:
- 1018480584