Data analysis and graphics using R : an example-based approach / John Maindonald and John Braun.
Van Pelt Library QA276.4 .M245 2003
Available
- Format:
- Book
- Author/Creator:
- Maindonald, J. H. (John Hilary), 1937-
- Series:
- Cambridge series on statistical and probabilistic mathematics
- Cambridge series in statistical and probabilistic mathematics
- Language:
- English
- Subjects (All):
- Statistics--Data processing.
- Statistics.
- Statistics--Graphic methods--Data processing.
- R (Computer program language).
- Statistics--Graphic methods.
- Physical Description:
- xxiii, 362 pages : illustrations ; 26 cm.
- Place of Publication:
- Cambridge, UK ; New York : Cambridge University Press, 2003.
- Summary:
- Modern statistical software such as the powerful and freely available R system provides sophisticated tools for researchers who need to manipulate and display their data. Maindonald and Braun begin this example-based introduction to data analysis with a tutorial in R, and this allows them to demonstrate elementary concepts and methodologies for data analysis with real world examples drawn from their experience as teachers and consultants. The detailed discussion of regression methods that makes up the core of the book leads on to more advanced statistical concepts. As these are explained, the facilities that allow them to be implemented in the R system are illustrated. R code and data sets for all examples are available on the Internet, and can be reworked easily by the reader. This allows mathematical content to be kept to a minimum while statistical and scientific issues are still covered in depth. This book is therefore suitable for either the researcher requiring practical skills in data analysis, or the student looking for examples of applications to complement a more theoretically oriented course. Only basic statistical knowledge is assumed, approximately that for a first undergraduate course, and the methods demonstrated are suitable for use in fields as diverse as biology, social science, medicine and engineering.
- Contents:
- 1.1 A Short R Session 1
- 1.1.1 R must be installed! 1
- 1.1.2 Using the console (or command line) window 1
- 1.1.3 Reading data from a file 2
- 1.1.4 Entry of data at the command line 3
- 1.1.5 Online help 4
- 1.1.6 Quitting R 5
- 1.2 The Uses of R 5
- 1.3 The R Language 6
- 1.3.1 R objects 7
- 1.3.2 Retaining objects between sessions 7
- 1.4 Vectors in R 8
- 1.4.1 Concatenation: joining vector objects 8
- 1.4.2 Subsets of vectors 8
- 1.4.3 Patterned data 9
- 1.4.4 Missing values 9
- 1.4.5 Factors 10
- 1.5 Data Frames 11
- 1.5.1 Variable names 12
- 1.5.2 Applying a function to the columns of a data frame 13
- 1.5.3 Data frames and matrices 13
- 1.5.4 Identification of rows that include missing values 13
- 1.6 R Packages 14
- 1.6.1 Data sets that accompany R packages 14
- 1.7 Looping 14
- 1.8 R Graphics 15
- 1.8.1 The function plot() and allied functions 16
- 1.8.2 Identification and location on the figure region 19
- 1.8.3 Plotting mathematical symbols 20
- 1.8.4 Row by column layouts of plots 20
- 1.8.5 Graphs: additional notes 22
- 1.9 Additional Points on the Use of R in This Book 23
- 2 Styles of Data Analysis 29
- 2.1 Revealing Views of the Data 29
- 2.1.1 Views of a single sample 30
- 2.1.2 Patterns in grouped data 33
- 2.1.3 Patterns in bivariate data: the scatterplot 34
- 2.1.4 Multiple variables and times 36
- 2.1.5 Lattice (trellis style) graphics 37
- 2.1.6 What to look for in plots 41
- 2.2 Data Summary 42
- 2.2.1 Mean and median 43
- 2.2.2 Standard deviation and inter-quartile range 44
- 2.2.3 Correlation 46
- 2.3 Statistical Analysis Strategies 47
- 2.3.1 Helpful and unhelpful questions 48
- 2.3.2 Planning the formal analysis 48
- 2.3.3 Changes to the intended plan of analysis 49
- 3 Statistical Models 52
- 3.1 Regularities 53
- 3.1.1 Mathematical models 53
- 3.1.2 Models that include a random component 54
- 3.1.3 Smooth and rough 55
- 3.1.4 The construction and use of models 56
- 3.1.5 Model formulae 56
- 3.2 Distributions: Models for the Random Component 57
- 3.2.1 Discrete distributions 57
- 3.2.2 Continuous distributions 58
- 3.3 The Uses of Random Numbers 60
- 3.3.1 Simulation 60
- 3.3.2 Sampling from populations 61
- 3.4 Model Assumptions 62
- 3.4.1 Random sampling assumptions: independence 62
- 3.4.2 Checks for normality 63
- 3.4.3 Checking other model assumptions 66
- 3.4.4 Are non-parametric methods the answer? 66
- 3.4.5 Why models matter: adding across contingency tables 67
- 4 An Introduction to Formal Inference 71
- 4.1 Standard Errors 71
- 4.1.1 Population parameters and sample statistics 71
- 4.1.2 Assessing accuracy: the standard error 71
- 4.1.3 Standard errors for differences of means 72
- 4.1.4 The standard error of the median 73
- 4.1.5 Resampling to estimate standard errors: bootstrapping 73
- 4.2 Calculations Involving Standard Errors: the t-Distribution 74
- 4.3 Confidence Intervals and Hypothesis Tests 77
- 4.3.1 One- and two-sample intervals and tests for means 78
- 4.3.2 Confidence intervals and tests for proportions 82
- 4.3.3 Confidence intervals for the correlation 83
- 4.4 Contingency Tables 84
- 4.4.1 Rare and endangered plant species 86
- 4.4.2 Additional notes 87
- 4.5 One-Way Unstructured Comparisons 88
- 4.5.1 Displaying means for the one-way layout 90
- 4.5.2 Multiple comparisons 91
- 4.5.3 Data with a two-way structure 92
- 4.5.4 Presentation issues 92
- 4.6 Response Curves 93
- 4.7 Data with a Nested Variation Structure 94
- 4.7.1 Degrees of freedom considerations 95
- 4.7.2 General multi-way analysis of variance designs 96
- 4.8 Resampling Methods for Tests and Confidence Intervals 96
- 4.8.1 The one-sample permutation test 96
- 4.8.2 The two-sample permutation test 97
- 4.8.3 Bootstrap estimates of confidence intervals 98
- 4.9 Further Comments on Formal Inference 100
- 4.9.1 Confidence intervals versus hypothesis tests 100
- 4.9.2 If there is strong prior information, use it! 101
- 5 Regression with a Single Predictor 107
- 5.1 Fitting a Line to Data 107
- 5.1.1 Lawn roller example 108
- 5.1.2 Calculating fitted values and residuals 109
- 5.1.3 Residual plots 110
- 5.1.4 The analysis of variance table 113
- 5.2 Outliers, Influence and Robust Regression 114
- 5.3 Standard Errors and Confidence Intervals 116
- 5.3.1 Confidence intervals and tests for the slope 116
- 5.3.2 SEs and confidence intervals for predicted values 117
- 5.3.3 Implications for design 118
- 5.4 Regression versus Qualitative ANOVA Comparisons 119
- 5.5 Assessing Predictive Accuracy 121
- 5.5.1 Training/test sets, and cross-validation 121
- 5.5.3 Bootstrapping 123
- 5.6 A Note on Power Transformations 126
- 5.7 Size and Shape Data 127
- 5.7.1 Allometric growth 128
- 5.7.2 There are two regression lines! 129
- 5.8 The Model Matrix in Regression 130
- 6 Multiple Linear Regression 134
- 6.1 Basic Ideas: Book Weight and Brain Weight Examples 134
- 6.1.1 Omission of the intercept term 137
- 6.1.2 Diagnostic plots 138
- 6.1.3 Further investigation of influential points 139
- 6.1.4 Example: brain weight 140
- 6.2 Multiple Regression Assumptions and Diagnostics 142
- 6.2.1 Influential outliers and Cook's distance 143
- 6.2.2 Component plus residual plots 143
- 6.2.3 Further types of diagnostic plot 145
- 6.2.4 Robust and resistant methods 145
- 6.3 A Strategy for Fitting Multiple Regression Models 145
- 6.3.2 Model fitting 146
- 6.3.3 An example: the Scottish hill race data 147
- 6.4 Measures for the Comparison of Regression Models 152
- 6.4.1 R² and adjusted R² 152
- 6.4.2 AIC and related statistics 152
- 6.4.3 How accurately does the equation predict? 153
- 6.4.4 An external assessment of predictive accuracy 155
- 6.5 Interpreting Regression Coefficients: the Labor Training Data 155
- 6.6 Problems with Many Explanatory Variables 161
- 6.6.1 Variable selection issues 162
- 6.6.2 Principal components summaries 163
- 6.7 Multicollinearity 164
- 6.7.2 The variance inflation factor (VIF) 167
- 6.7.3 Remedying multicollinearity 168
- 6.8 Multiple Regression Models: Additional Points 168
- 6.8.1 Confusion between explanatory and dependent variables 168
- 6.8.2 Missing explanatory variables 169
- 6.8.3 The use of transformations 169
- 6.8.4 Non-linear methods: an alternative to transformation? 170
- 7 Exploiting the Linear Model Framework 175
- 7.1 Levels of a Factor: Using Indicator Variables 175
- 7.1.1 Example: sugar weight 175
- 7.1.2 Different choices for the model matrix when there are factors 178
- 7.2 Polynomial Regression 179
- 7.2.1 Issues in the choice of model 181
- 7.3 Fitting Multiple Lines 183
- 7.4 Methods for Passing Smooth Curves through Data 187
- 7.4.1 Scatterplot smoothing: regression splines 188
- 7.4.2 Other smoothing methods 191
- 7.4.3 Generalized additive models 191
- 7.5 Smoothing Terms in Multiple Linear Models 192
- 8 Logistic Regression and Other Generalized Linear Models 197
- 8.1 Generalized Linear Models 197
- 8.1.1 Transformation of the expected value on the left 197
- 8.1.2 Noise terms need not be normal 198
- 8.1.3 Log odds in contingency tables 198
- 8.1.4 Logistic regression with a continuous explanatory variable 199
- 8.2 Logistic Multiple Regression 202
- 8.2.1 A plot of contributions of explanatory variables 208
- 8.2.2 Cross-validation estimates of predictive accuracy 209
- 8.3 Logistic Models for Categorical Data: an Example 210
- 8.4 Poisson and Quasi-Poisson Regression 211
- 8.4.1 Data on aberrant crypt foci 211
- 8.4.2 Moth habitat example 213
- 8.4.3 Residuals, and estimating the dispersion 215
- 8.5 Ordinal Regression Models 216
- 8.5.1 Exploratory analysis 217
- 8.5.2 Proportional odds logistic regression 217
- 8.6 Other Related Models 220
- 8.6.1 Loglinear models 220
- 8.6.2 Survival analysis 220
- 8.7 Transformations for Count Data 221
- 9 Multi-level Models, Time Series and Repeated Measures 224
- 9.2 Example: Survey Data, with Clustering 225
- 9.2.1 Alternative models 225
- 9.2.2 Instructive, though faulty, analyses 228
- 9.2.3 Predictive accuracy 229
- 9.3 A Multi-level Experimental Design 230
- 9.3.1 The ANOVA table 232
- 9.3.2 Expected values of mean squares 232
- 9.3.3 The sums of squares breakdown 234
- 9.3.4 The variance components 236
- 9.3.5 The mixed model analysis 237
- 9.3.6 Predictive accuracy 239
- 9.3.7 Different sources of variance: complication or focus of interest? 239
- 9.4 Within and between Subject Effects: an Example 239
- 9.5 Time Series: Some Basic Ideas 242
- 9.5.1 Preliminary graphical explorations 242
- 9.5.2 The autocorrelation function 243
- 9.5.3 Autoregressive (AR) models 244
- 9.5.4 Autoregressive moving average (ARMA) models: theory 245
- 9.6 Regression Modeling with Moving Average Errors: an Example 246
- 9.7 Repeated Measures in Time: Notes on the Methodology 252
- 9.7.1 The theory of repeated measures modeling 253
- 9.7.2 Correlation structure 253
- 9.7.3 Different approaches to repeated measures analysis 254
- 9.8 Further Notes on Multi-level Modeling 255
- 9.8.1 An historical perspective on multi-level models 255
- 9.8.2 Meta-analysis 256
- 10 Tree-based Classification and Regression 259
- 10.1 The Uses of Tree-based Methods 259
- 10.1.1 Problems for which tree-based regression may be used 259
- 10.1.2 Tree-based regression versus parametric approaches 260
- 10.1.3 Summary of pluses and minuses 261
- 10.2 Detecting Email Spam: an Example 261
- 10.2.1 Choosing the number of splits 264
- 10.3 Terminology and Methodology 264
- 10.3.1 Choosing the split: regression trees 265
- 10.3.2 Within and between sums of squares 266
- 10.3.3 Choosing the split: classification trees 267
- 10.3.4 The mechanics of tree-based regression: a trivial example 268
- 10.4 Assessments of Predictive Accuracy 270
- 10.4.1 Cross-validation 270
- 10.4.2 The training/test set methodology 271
- 10.4.3 Predicting the future 271
- 10.5 A Strategy for Choosing the Optimal Tree 271
- 10.5.1 Cost-complexity pruning 271
- 10.5.2 Prediction error versus tree size 272
- 10.6 Detecting Email Spam: the Optimal Tree 273
- 10.6.1 The one-standard-deviation rule 273
- 10.7 Interpretation and Presentation of the rpart Output 275
- 10.7.1 Data for female heart attack patients 275
- 10.7.2 Printed Information on Each Split 276
- 11 Multivariate Data Exploration and Discrimination 281
- 11.1 Multivariate Exploratory Data Analysis 282
- 11.1.1 Scatterplot matrices 282
- 11.1.2 Principal components analysis 282
- 11.2 Discriminant Analysis 285
- 11.2.1 Example: plant architecture 286
- 11.2.2 Classical Fisherian discriminant analysis 287
- 11.2.3 Logistic discriminant analysis 289
- 11.2.4 An example with more than two groups 290
- 11.3 Principal Component Scores in Regression 291
- 11.4 Propensity Scores in Regression Comparisons: Labor Training Data 295
- 12 The R System: Additional Topics 300
- 12.1 Graphs in R 300
- 12.2.1 Common useful functions 303
- 12.2.2 User-written R functions 307
- 12.2.3 Functions for working with dates 310
- 12.3 Data input and output 310
- 12.5 Missing Values 317
- 12.6 Lists and Data Frames 320
- 12.6.1 Data frames as lists 320
- 12.6.2 Reshaping data frames; reshape() 320
- 12.6.3 Joining data frames and vectors: cbind() 322
- 12.6.4 Conversion of tables and arrays into data frames 322
- 12.6.5 Merging data frames: merge() 322
- 12.6.6 The function sapply() and related functions 323
- 12.6.7 Splitting vectors and data frames into lists: split() 324
- 12.7 Matrices and Arrays 324
- 12.7.1 Outer products 326
- 12.7.2 Arrays 327
- 12.8 Classes and Methods 328
- 12.8.1 Printing and summarizing model objects 328
- 12.8.2 Extracting information from model objects 329
- 12.9 Data-bases and Environments 330
- 12.9.1 Workspace management 331
- 12.9.2 Function environments, and lazy evaluation 331
- 12.10 Manipulation of Language Constructs 333
- Epilogue: Models 338
- Appendix: S-PLUS Differences 341.
- Notes:
- Includes bibliographical references (pages [346]-351) and indexes.
- Local Notes:
- Acquired for the Penn Libraries with assistance from the Class of 1924 Book Fund.
- ISBN:
- 0521813360
- OCLC:
- 50520552