Topic segmentation: Algorithms and applications.
- Format:
- Book
- Thesis/Dissertation
- Author/Creator:
- Reynar, Jeffrey C.
- Language:
- English
- Subjects (All):
- Computer science.
- Information science.
- Language and culture.
- 0679.
- 0723.
- 0984.
- Penn dissertations--Computer and information science.
- Computer and information science--Penn dissertations.
- Physical Description:
- 169 pages
- Contained In:
- Dissertation Abstracts International 59-04B.
- System Details:
- Mode of access: World Wide Web.
- text file
- Summary:
- Most documents are about more than one subject, but the majority of natural language processing algorithms and information retrieval techniques implicitly assume that every document has just one topic. The work described herein is about clues which mark shifts to new topics, algorithms for identifying topic boundaries and the uses of such boundaries once identified.
- A number of topic shift indicators have been proposed in the literature. We review these features, suggest several new ones and test most of them in implemented topic segmentation algorithms. Hints about topic boundaries include repetitions of character sequences, patterns of word and word n-gram repetition, word frequency, the presence of cue words and phrases and the use of synonyms.
- The algorithms we present use cues singly or in combination to identify topic shifts in several kinds of documents. One algorithm tracks compression performance, which is an indicator of topic shift because self-similarity within topic segments should be greater than between-segment similarity. Another technique relies on word repetition and places boundaries by minimizing word repetitions across segment boundaries. A third method compares the performance of a language model with and without knowledge of the contents of preceding sentences to determine whether a topic shift has occurred. We use the output of this algorithm in a statistical model which incorporates synonymy, bigram repetition and other features for topic segmentation.
- We benchmark our algorithms and compare them to algorithms from the literature using concatenations of documents, and then perform further evaluation of our techniques using a collection of news broadcasts transcribed both by annotators and using a speech recognition system. We also test the effectiveness of our algorithms for identifying both chapter boundaries in works of literature and story boundaries in Spanish news broadcasts.
- We suggest ways to improve information retrieval, language modeling and various natural language processing algorithms by exploiting the topic segmentation.
- Notes:
- Thesis (Ph.D. in Computer and Information Science) -- University of Pennsylvania, 1998.
- Source: Dissertation Abstracts International, Volume: 59-04, Section: B, page: 1741.
- Adviser: Mitchell P. Marcus.
- Local Notes:
- School code: 0175.
- ISBN:
- 9780591828061
- Access Restriction:
- Restricted for use by site license.
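
The Summary describes a technique that "places boundaries by minimizing word repetitions across segment boundaries." The record gives no implementation details, so the following is only a minimal illustrative sketch of that general idea, not the dissertation's algorithm. It assumes whitespace tokenization, a fixed window of sentences on each side of a candidate gap, no stop-word filtering, and a single boundary; the names `word_set`, `repetition_across`, and `best_boundary` are hypothetical.

```python
from typing import List, Set


def word_set(sentence: str) -> Set[str]:
    """Lowercased word types in a sentence, with surrounding punctuation stripped."""
    words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
    return words - {""}


def repetition_across(sentences: List[str], gap: int, window: int = 2) -> int:
    """Count word types that occur both in the `window` sentences before
    the gap and in the `window` sentences after it."""
    before: Set[str] = set()
    for s in sentences[max(0, gap - window):gap]:
        before |= word_set(s)
    after: Set[str] = set()
    for s in sentences[gap:gap + window]:
        after |= word_set(s)
    return len(before & after)


def best_boundary(sentences: List[str], window: int = 2) -> int:
    """Return the gap index g (a boundary between sentences g-1 and g)
    with the fewest word types repeated across it."""
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to place a boundary")
    gaps = range(1, len(sentences))
    return min(gaps, key=lambda g: repetition_across(sentences, g, window))
```

For a document whose first two sentences share vocabulary with each other but not with the last two, `best_boundary` should pick the gap between the second and third sentence. A practical segmenter would also need to choose the number of boundaries and discount frequent function words, neither of which this sketch attempts.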