2 options

Sharing data and models in software engineering / Tim Menzies [and four others] ; designer, Mark Rogers.

Ebook Central Academic Complete Available online

O'Reilly Online Learning: Academic/Public Library Edition Available online

Format:: Book
Author/Creator:: Menzies, Tim, author.
Contributor:: Rogers, Mark, designer.
Language:: English
Subjects (All):: Software engineering.; Computer-aided software engineering.
Physical Description:: 1 online resource (415 pages) : illustrations (some color), graphs
Edition:: First edition.
Place of Publication:: Waltham, Massachusetts : Morgan Kaufmann, 2015.
Language Note:: English
System Details:: text file
Summary:: Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginning data scientists in software engineering, this edited volume goes on to identify critical questions in contemporary software engineering related to data and models. Readers learn how to adapt data from other organizations to local problems, mine privatized data, prune spurious information, simplify complex results, update models for new platforms, and more. Chapters present broadly applicable experimental results informed by practitioner-focused domain expertise, with commentary that highlights the methods that are most useful and applicable to the widest range of projects. Each chapter is written by a prominent expert and offers a state-of-the-art solution to an identified problem facing data scientists in software engineering. Throughout, the editors share best practices collected from their experience training software engineering students and practitioners to master data science. The book shares the specific experience of leading researchers and the techniques developed to handle data problems in software engineering; explains how to start a data science project for software engineering and how to identify and avoid likely pitfalls; provides a wide range of useful qualitative and quantitative principles, ranging from the very simple to cutting-edge research; and addresses current challenges with software engineering data, such as lack of local data, access issues due to data privacy, and improving data quality by cleaning spurious chunks of data.
Contents:: Front Cover; Sharing Data and Models in Software Engineering; Copyright; Why this book?; Foreword; Contents; List of Figures; Chapter 1: Introduction; 1.1 Why Read This Book?; 1.2 What Do We Mean by "Sharing"?; 1.2.1 Sharing Insights; 1.2.2 Sharing Models; 1.2.3 Sharing Data; 1.2.4 Sharing Analysis Methods; 1.2.5 Types of Sharing; 1.2.6 Challenges with Sharing; 1.2.7 How to Share; 1.3 What? (Our Executive Summary); 1.3.1 An Overview; 1.3.2 More Details; 1.4 How to Read This Book; 1.4.1 Data Analysis Patterns; 1.5 But What About …? (What Is Not in This Book); 1.5.1 What About "Big Data"?; 1.5.2 What About Related Work?; 1.5.3 Why All the Defect Prediction and Effort Estimation?; 1.6 Who? (About the Authors); 1.7 Who Else? (Acknowledgments); Part I: Data Mining for Managers; Chapter 2: Rules for Managers; 2.1 The Inductive Engineering Manifesto; 2.2 More Rules; Chapter 3: Rule #1: Talk to the Users; 3.1 Users Biases; 3.2 Data Mining Biases; 3.3 Can We Avoid Bias?; 3.4 Managing Biases; 3.5 Summary; Chapter 4: Rule #2: Know the Domain; 4.1 Cautionary Tale #1: "Discovering" Random Noise; 4.2 Cautionary Tale #2: Jumping at Shadows; 4.3 Cautionary Tale #3: It Pays to Ask; 4.4 Summary; Chapter 5: Rule #3: Suspect Your Data; 5.1 Controlling Data Collection; 5.2 Problems with Controlled Data Collection; 5.3 Rinse (and Prune) Before Use; 5.3.1 Row Pruning; 5.3.2 Column Pruning; 5.4 On the Value of Pruning; 5.5 Summary; Chapter 6: Rule #4: Data Science Is Cyclic; 6.1 The Knowledge Discovery Cycle; 6.2 Evolving Cyclic Development; 6.2.1 Scouting; 6.2.2 Surveying; 6.2.3 Building; 6.2.4 Effort; 6.3 Summary; Part II: Data Mining: A Technical Tutorial; Chapter 7: Data Mining and SE; 7.1 Some Definitions; 7.2 Some Application Areas; Chapter 8: Defect Prediction; 8.1 Defect Detection Economics; 8.2 Static Code Defect Prediction; 8.2.1 Easy to Use; 8.2.2 Widely Used; 8.2.3 Useful; Chapter 9: Effort Estimation; 9.1 The Estimation Problem; 9.2 How to Make Estimates;
9.2.1 Expert-Based Estimation; 9.2.2 Model-Based Estimation; 9.2.3 Hybrid Methods; Chapter 10: Data Mining (Under the Hood); 10.1 Data Carving; 10.2 About the Data; 10.3 Cohen Pruning; 10.4 Discretization; 10.4.1 Other Discretization Methods; 10.5 Column Pruning; 10.6 Row Pruning; 10.7 Cluster Pruning; 10.7.1 Advantages of Prototypes; 10.7.2 Advantages of Clustering; 10.8 Contrast Pruning; 10.9 Goal Pruning; 10.10 Extensions for Continuous Classes; 10.10.1 How RTs Work; 10.10.2 Creating Splits for Categorical Input Features; 10.10.3 Splits on Numeric Input Features; 10.10.4 Termination Condition and Predictions; 10.10.5 Potential Advantages of RTs for Software Effort Estimation; 10.10.6 Predictions for Multiple Numeric Goals; Part III: Sharing Data; Chapter 11: Sharing Data: Challenges and Methods; 11.1 Houston, We Have a Problem; 11.2 Good News, Everyone; Chapter 12: Learning Contexts; 12.1 Background; 12.2 Manual Methods for Contextualization; 12.3 Automatic Methods; 12.4 Other Motivation to Find Contexts; 12.4.1 Variance Reduction; 12.4.2 Anomaly Detection; 12.4.3 Certification Envelopes; 12.4.4 Incremental Learning; 12.4.5 Compression; 12.4.6 Optimization; 12.5 How to Find Local Regions; 12.5.1 License; 12.5.2 Installing CHUNK; 12.5.3 Testing Your Installation; 12.5.4 Applying CHUNK to Other Models; 12.6 Inside CHUNK; 12.6.1 Roadmap to Functions; 12.6.2 Distance Calculations; 12.6.2.1 Normalize; 12.6.2.2 SquaredDifference; 12.6.3 Dividing the Data; 12.6.3.1 FastDiv; 12.6.3.2 TwoDistantPoints; 12.6.3.3 Settings; 12.6.3.4 Chunk (main function); 12.6.4 Support Utilities; 12.6.4.1 Some standard tricks; 12.6.4.2 Tree iterators; 12.6.4.3 Pretty printing; 12.7 Putting It all Together; 12.7.1 _nasa93; 12.8 Using CHUNK; 12.9 Closing Remarks; Chapter 13: Cross-Company Learning: Handling the Data Drought; 13.1 Motivation; 13.2 Setting the Ground for Analyses; 13.2.1 Wait … Is This Really CC Data?; 13.2.2 Mining the Data; 13.2.3 Magic Trick: NN Relevancy Filtering;
13.3 Analysis #1: Can CC Data be Useful for an Organization?; 13.3.1 Design; 13.3.2 Results from Analysis #1; 13.3.3 Checking the Analysis #1 Results; 13.3.4 Discussion of Analysis #1; 13.4 Analysis #2: How to Cleanup CC Data for Local Tuning?; 13.4.1 Design; 13.4.2 Results; 13.4.3 Discussions; 13.5 Analysis #3: How Much Local Data Does an Organization Need for a Local Model?; 13.5.1 Design; 13.5.2 Results from Analysis #3; 13.5.3 Checking the Analysis #3 Results; 13.5.4 Discussion of Analysis #3; 13.6 How Trustworthy Are These Results?; 13.7 Are These Useful in Practice or Just Number Crunching?; 13.8 What's New on Cross-Learning?; 13.8.1 Discussion; 13.9 What's the Takeaway?; Chapter 14: Building Smarter Transfer Learners; 14.1 What Is Actually the Problem?; 14.2 What Do We Know So Far?; 14.2.1 Transfer Learning; 14.2.2 Transfer Learning and SE; 14.2.3 Data Set Shift; 14.3 An Example Technology: TEAK; 14.4 The Details of the Experiments; 14.4.1 Performance Comparison; 14.4.2 Performance Measures; 14.4.3 Retrieval Tendency; 14.5 Results; 14.5.1 Performance Comparison; 14.5.2 Inspecting Selection Tendencies; 14.6 Discussion; 14.7 What Are the Takeaways?; Chapter 15: Sharing Less Data (Is a Good Thing); 15.1 Can We Share Less Data?; 15.2 Using Less Data; 15.3 Why Share Less Data?; 15.3.1 Less Data Is More Reliable; 15.3.2 Less Data Is Faster to Discuss; 15.3.3 Less Data Is Easier to Process; 15.4 How to Find Less Data; 15.4.1 Input; 15.4.2 Comparisons to Other Learners; 15.4.3 Reporting the Results; 15.4.4 Discussion of Results; 15.5 What's Next?; Chapter 16: How to Keep Your Data Private; 16.1 Motivation; 16.2 What Is PPDP and Why Is It Important?; 16.3 What Is Considered a Breach of Privacy?; 16.4 How to Avoid Privacy Breaches?; 16.4.1 Generalization and Suppression; 16.4.2 Anatomization and Permutation; 16.4.3 Perturbation; 16.4.4 Output Perturbation; 16.5 How Are Privacy-Preserving Algorithms Evaluated?; 16.5.1 Privacy Metrics;
16.5.2 Modeling the Background Knowledge of an Attacker; 16.6 Case Study: Privacy and Cross-Company Defect Prediction; 16.6.1 Results and Contributions; 16.6.2 Privacy and CCDP; 16.6.3 CLIFF; 16.6.4 MORPH; 16.6.5 Example of CLIFF&MORPH; 16.6.6 Evaluation Metrics; 16.6.7 Evaluating Utility via Classification; 16.6.8 Evaluating Privatization; 16.6.8.1 Defining privacy; 16.6.9 Experiments; 16.6.9.1 Data; 16.6.10 Design; 16.6.11 Defect Predictors; 16.6.12 Query Generator; 16.6.13 Benchmark Privacy Algorithms; 16.6.14 Experimental Evaluation; 16.6.15 Discussion; 16.6.16 Related Work: Privacy in SE; 16.6.17 Summary; Chapter 17: Compensating for Missing Data; 17.1 Background Notes on SEE and Instance Selection; 17.1.1 Software Effort Estimation; 17.1.2 Instance Selection in SEE; 17.2 Data Sets and Performance Measures; 17.2.1 Data Sets; 17.2.2 Error Measures; 17.3 Experimental Conditions; 17.3.1 The Algorithms Adopted; 17.3.2 Proposed Method: POP1; 17.3.3 Experiments; 17.4 Results; 17.4.1 Results Without Instance Selection; 17.4.2 Results with Instance Selection; 17.5 Summary; Chapter 18: Active Learning: Learning More with Less; 18.1 How Does the QUICK Algorithm Work?; 18.1.1 Getting Rid of Similar Features: Synonym Pruning; 18.1.2 Getting Rid of Dissimilar Instances: Outlier Pruning; 18.2 Notes on Active Learning; 18.3 The Application and Implementation Details of QUICK; 18.3.1 Phase 1: Synonym Pruning; 18.3.2 Phase 2: Outlier Removal and Estimation; 18.3.3 Seeing QUICK in Action with a Toy Example; 18.3.3.1 Phase 1: Synonym pruning; 18.3.3.2 Phase 2: Outlier removal and estimation; 18.4 How the Experiments Are Designed; 18.5 Results; 18.5.1 Performance; 18.5.2 Reduction via Synonym and Outlier Pruning;
18.5.3 Comparison of QUICK vs. CART; 18.5.4 Detailed Look at the Statistical Analysis; 18.5.5 Early Results on Defect Data Sets; 18.6 Summary; Part IV: Sharing Models; Chapter 19: Sharing Models: Challenges and Methods; Chapter 20: Ensembles of Learning Machines; 20.1 When and Why Ensembles Work; 20.1.1 Intuition; 20.1.2 Theoretical Foundation; 20.2 Bootstrap Aggregating (Bagging); 20.2.1 How Bagging Works; 20.2.2 When and Why Bagging Works; 20.2.3 Potential Advantages of Bagging for SEE; 20.3 Regression Trees (RTs) for Bagging; 20.4 Evaluation Framework; 20.4.1 Choice of Data Sets and Preprocessing Techniques; 20.4.1.1 PROMISE data; 20.4.1.2 ISBSG data; 20.4.2 Choice of Learning Machines; 20.4.3 Choice of Evaluation Methods; 20.4.4 Choice of Parameters; 20.5 Evaluation of Bagging+RTs in SEE; 20.5.1 Friedman Ranking; 20.5.2 Approaches Most Often Ranked First or Second in Terms of MAE, MMRE and PRED(25); 20.5.3 Magnitude of Performance Against the Best; 20.5.4 Discussion; 20.6 Further Understanding of Bagging+RTs in SEE; 20.7 Summary; Chapter 21: How to Adapt Models in a Dynamic World; 21.1 Cross-Company Data and Questions Tackled; 21.2 Related Work.
Notes:: Bibliographic Level Mode of Issuance: Monograph; Includes bibliographical references and indexes.
ISBN:: 9780124173071; 0124173071
OCLC:: 896901265
