Challenges in Corpus Linguistics : Rethinking Corpus Compilation and Analysis / edited by Mark Kaunisto and Marco Schilk.
- Format:
- Book
- Series:
- Studies in corpus linguistics ; Volume 118.
- Studies in Corpus Linguistics Series ; Volume 118
- Language:
- English
- Subjects (All):
- English language--Grammar, Comparative.
- English language.
- Physical Description:
- 1 online resource (182 pages)
- Edition:
- First edition.
- Place of Publication:
- Amsterdam, Netherlands : John Benjamins Publishing Company, [2024]
- Summary:
- This book discusses the challenges faced in different areas of corpus linguistics, namely the compilation, annotation, and analysis of linguistic corpora.
- Contents:
- Intro
- Table of contents
- Acknowledgements
- From fallacies and pitfalls to solutions and future directions
- References
- Engaging with bad (meta)data in historical corpus linguistics
- 1. Introduction
- 2. POS annotation in diachronic datasets
- 2.1 Accounting for category change
- 2.2 Theoretical choices in the design of the annotation scheme
- 2.3 Annotation tailored to specific research questions
- 3. Large corpora
- 3.1 Inaccuracies in text sampling
- 3.2 Changes in the balance of subgenres
- 4. Historical databases
- 4.1 Issues with balance and metadata
- 4.2 OCR errors
- 4.2.1 Hapax legomena
- 4.2.2 Historical lexis
- 5. Discussion and conclusion
- Funding
- Named entities as potentially problematic items in corpora
- 2. Background
- 2.1 The concepts of proper nouns and proper names
- 2.2 Annotation of named entities
- 3. Case studies
- 3.1 Common nouns used as (parts of) proper nouns
- 3.2 Near-synonymous adjectives in named entities
- 4. Discussion and conclusion
- Challenges in the compilation, annotation, and analysis of learner corpus data
- 1. Introduction and general remarks
- 2. Challenges and how to respond to them
- 2.1 Multilingual practices and metalinguistic language use
- Response
- 2.2 Task effects
- 2.3 "Discourse of deficit" and learner corpus annotation
- 3. Summary and conclusion
- Early newspapers as data for corpus linguistics (and Digital Humanities)
- 2. Digital text analysis in the humanities
- 2.1 Digital Humanities
- 2.2 Corpus linguistics
- 2.3 Towards a useful synergy
- 3. Historical newspaper prose and the British Library Newspapers database
- 3.1 Problems with available search tools
- 3.2 Sampling, balance, and representativeness
- 3.3 Registers and subregisters
- 3.4 Optical Character Recognition (OCR)
- 4. Discussion
- Open Corpus Linguistics - or How to overcome common problems in dealing with corpus data by adopting open research practices
- 2. Revisiting Rissanen's problems
- 3. Open Corpus Linguistics
- 4. Conclusion
- Text length and short texts
- 2.1 Text length, corpora, and social media
- 2.2 The importance of text length
- 3. Solutions and workarounds
- 3.1 Manipulation of the data
- 3.1.1 Exclusion
- 3.1.2 Combining
- 3.1.3 Chunking
- 3.2 Computational and statistical approaches
- 3.2.1 Lengthwise analysis
- 3.2.2 Multiple Correspondence Analysis
- 3.2.3 Resampling methods
- 3.3 A related problem
- Corpus genre categories
- 2. Looking up from the pit
- 3. Text genre categorization in literature
- 4. Text genre categorization in linguistics
- 5. The genre category pitfall
- 6. Conclusion
- Modeling fine-grained sociolinguistic variation
- 2. Theoretical and methodological background
- 2.1 Semantic shifts in Quebec English
- 2.2 Twitter-based corpora for language variation
- 2.3 Vector space models for lexical semantic variation
- 3. Data and method
- 3.1 A corpus of tweets
- 3.2 A set of semantic shifts in Quebec English
- 3.3 Neural word embeddings
- 3.4 Clustering and annotating the uses of a lexical item
- 4. Results
- 4.1 An overview of regionally specific clusters
- 4.2 Types of variation captured by the analysis
- 4.2.1 True positives
- A clear-cut distinction
- A subtler distinction
- 4.2.2 False positives
- Cultural effects
- Proper names
- French homographs in codeswitched tweets
- Structural patterns affecting model performance
- 4.3 Deploying coarsely annotated data for linguistic description
- Subject index.
- Notes:
- Description based on publisher supplied metadata and other sources.
- Description based on print version record.
- Includes bibliographical references.
- ISBN:
- 9789027246530
- 902724653X
- OCLC:
- 1455385919