Analyzing textual information : from words to meanings through numbers / Johannes Ledolter, The University of Iowa, Lea S. VanderVelde, The University of Iowa.

Lippincott Library H61 .L4196 2022

Available This item is available for access.

Format:
Book
Author/Creator:
Ledolter, Johannes, author.
VanderVelde, Lea, author.
Series:
Quantitative applications in the social sciences ; volume 188
Language:
English
Subjects (All):
Social sciences--Research.
Social sciences.
Social sciences--Methodology.
Physical Description:
xix, 168 pages : illustrations ; 22 cm.
Place of Publication:
Thousand Oaks, California : SAGE Publications, Inc., [2022]
Summary:
"Researchers in the social sciences and beyond are dealing more and more with massive quantities of text data requiring analysis, from historical letters to the constant stream of content in social media. Traditional texts on statistical analysis have focused on numbers, but this book will provide a practical introduction to the quantitative analysis of textual data. Using up-to-date R methods, this book will take readers through the text analysis process, from text mining and pre-processing the text to final analysis. It includes two major case studies using historical and more contemporary text data to demonstrate the practical applications of these methods. Currently, there is no introductory how-to book on textual data analysis with R that is up-to-date and applicable across the social sciences. Code and a variety of additional resources are available on an accompanying website for the book"-- Provided by publisher.
Contents:
1.1 Text Data p. 1
1.1.1 Introducing the Definitions p. 2
1.1.2 Types of Text Data p. 4
1.1.3 File Formats to Save and Store Text Information p. 5
1.2 The Two Applications Considered in This Book p. 6
1.3 Introductory Example and Its Analysis Using the R Statistical Software p. 7
1.4 The Introductory Example Revisited, Illustrating Concordance and Collocation Using Alternative Software p. 22
Chapter 2 A Description of the Studied Text Corpora and A Discussion of Our Modeling Strategy p. 26
2.1 Introduction to the Corpora: Selecting the Texts p. 26
2.2 Debates of the 39th U.S. Congress, as Recorded in the Congressional Globe p. 27
2.3 The Territorial Papers of the United States p. 29
2.4 Analyzing Text Data: Bottom-Up or Top-Down Analysis p. 32
Appendix to Chapter 2: The Complete Congressional Record p. 35
Chapter 3 Preparing Text for Analysis: Text Cleaning and Formatting p. 36
3.1 Text Cleaning p. 36
3.1.1 Compacting Multiple Word Sets Into a Single Word p. 42
3.2 Text Formatting p. 43
3.2.1 Formatting by Marking Versus Formatting by Deleting p. 44
3.2.2 Formatting Beyond Metavariables: Telling the Computer What Sections to Skip When Running the Analysis p. 44
Chapter 4 Word Distributions: Document-Term Matrices of Word Frequencies and the "Bag of Words" Representation p. 49
4.1 Document-Term Matrices of Frequencies p. 49
4.1.1 Creating the Document-Term Matrix in R p. 51
4.1.2 Dropping Sparse Words That Do Not Occur in Many Documents p. 52
4.2 Displaying Word Frequencies p. 53
4.3 Co-Occurrence of Terms in the Same Document p. 56
4.4 The Zipf Law: An Interesting Fact About the Distribution of Word Frequencies p. 59
Chapter 5 Metavariables and Text Analysis Stratified on Metavariables p. 62
5.1 The Significance of Stratification and the Importance of Metavariables p. 62
5.2 Analysis of the Territorial Papers p. 63
5.2.1 Territorial Papers: Visualization of the Metavariables p. 64
5.2.2 Territorial Papers: Stratified Text Analysis p. 69
5.3 Analysis of Speeches From the 39th Congress p. 72
5.3.1 Speeches From the 39th Congress: Visualization of the Metavariables p. 73
5.3.2 Speeches From the 39th Congress: Stratified Text Analysis p. 77
Chapter 6 Sentiment Analysis p. 84
6.1 Lexicons of Sentiment-Charged Words p. 84
6.1.1 Attaching Sentiment to a Document p. 85
6.1.2 Sentiment Analysis for the Corpus and Its Documents p. 87
6.1.3 Importance of Sentiment Analysis p. 88
6.2 Applying Sentiment Analysis to the Letters of the Territorial Papers p. 88
6.3 Using Other Sentiment Dictionaries and the R Software tidytext for Sentiment Analysis p. 91
6.4 Concluding Remarks: An Alternative Approach for Sentiment Analysis p. 94
Chapter 7 Clustering of Documents p. 97
7.1 Clustering Documents p. 97
7.2 Measures for the Closeness and the Distance of Documents p. 98
7.3 Methods for Clustering Documents p. 101
7.3.1 Hierarchical Agglomerative Clustering and Dendrograms p. 101
7.3.2 k-Means Clustering p. 103
7.4 Illustrating Clustering Methods on a Simulated Example p. 106
Chapter 8 Classification of Documents p. 110
8.2 Classification Procedures p. 111
8.2.1 The k-Nearest Neighbor Algorithm p. 111
8.2.2 Naive Bayesian Analysis p. 113
8.2.3 Fisher Linear Discriminant Method and Linear Scoring (SVM) Methods p. 115
8.2.4 Evaluating Classification Rules on Hold-Out Samples p. 116
8.3 Two Examples Using the Congressional Speech Database p. 116
8.4 Concluding Remarks on Authorship Attribution: Commenting on the Field of Stylometry p. 119
Chapter 9 Modeling Text Data: Topic Models p. 121
9.1 Topic Models p. 121
9.1.1 Some More Technical Details and a Brief Primer on Dirichlet Distributions p. 126
9.1.2 Model Extensions and Useful Software, With a Tip of the Hat to Their Developers p. 128
9.1.3 Further Comments p. 129
9.2 Fitting Topic Models to the Two Corpora Studied in This Book p. 130
9.2.1 Topic Models for the Corpus of the Territorial Papers p. 130
9.2.2 Topic Models for the Corpus of Speeches From the 39th U.S. Congress p. 134
Chapter 10 n-Grams and Other Ways of Analyzing Adjacent Words p. 142
10.1 Analysis of Bigrams p. 142
10.2 Text Windows to Measure Word Associations Within a Neighborhood of Words and a Discussion of the R Package text2vec p. 143
10.3 Illustrating the Use of n-Grams: Speeches of the 39th Congress p. 146
Notes:
Includes bibliographical references (pages 155-159) and index.
ISBN:
9781544390000
1544390009
OCLC:
1238129429
