Three machine learning algorithms for lexical ambiguity resolution.
- Format:
- Book
- Thesis/Dissertation
- Author/Creator:
- Yarowsky, David Eric.
- Language:
- English
- Subjects (All):
- Computer science.
- Artificial intelligence.
- Information science.
- 0723.
- 0800.
- 0984.
- Penn dissertations--Computer and information science.
- Computer and information science--Penn dissertations.
- Local Subjects:
- Penn dissertations--Computer and information science.
- Computer and information science--Penn dissertations.
- 0723.
- 0800.
- 0984.
- Physical Description:
- 179 pages
- Contained In:
- Dissertation Abstracts International 57-04B.
- System Details:
- Mode of access: World Wide Web.
- text file
- Summary:
- Lexical ambiguity resolution is a pervasive problem in natural language processing. An important example is target-word choice in machine translation, such as deciding whether the English word sentence should be translated into French as peine (legal sentence) or phrase (grammatical sentence) depending upon analysis of surrounding context. The same problem arises in text-to-speech synthesis, where pronunciations such as lead role and lead mine must be resolved through context. Similar problems include capitalization and accent restoration, proper-name classification, and general word-sense disambiguation for many applications.
- This dissertation describes three original algorithms for solving this class of problems. The first is a Bayesian discriminator for semantic word classes. It uses statistical models of context to identify the most likely thesaurus category at each position in a document. Sense and translation differences are resolved through these class models. Applications of this work to discourse analysis and language modelling are explored.
- The second algorithm is a supervised statistical decision procedure using a variant of decision lists. It offers an efficient mechanism for utilizing diverse, non-independent sources of evidence in a very large parameter space. The dissertation includes empirical studies in language polysemy on which this algorithm and its smoothing procedures are based. The algorithm is evaluated on a wide range of homographs, including ambiguities in text-to-speech synthesis and accent restoration in Spanish and French.
- The third algorithm is an essentially unsupervised decision procedure that bootstraps from a small number of seed words automatically extracted from machine-readable dictionaries. The algorithm is driven by the joint exploitation of two empirically studied properties--that words tend to exhibit only one sense in a given collocation and in a given discourse. Accuracy exceeds 96% on diverse test sets. This performance rivals that of previous fully supervised methods while eliminating the need for costly hand-tagged training data, the lack of which has been a severe bottleneck for progress in this area.
- Notes:
- Thesis (Ph.D. in Computer and Information Science) -- University of Pennsylvania, 1996.
- Source: Dissertation Abstracts International, Volume: 57-04, Section: B, page: 2688.
- Supervisor: Mitchell Marcus.
- Local Notes:
- School code: 0175.
- Access Restriction:
- Restricted for use by site license.
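
As an illustration of the supervised decision-list procedure mentioned in the Summary, the following is a minimal sketch and not the dissertation's actual implementation. It assumes a two-sense problem, ranks each context feature by the absolute value of a smoothed log-likelihood ratio, and classifies with the single strongest matching rule. The function names (`train_decision_list`, `classify`), the additive smoothing constant `alpha`, and the toy data for the homograph "lead" are all hypothetical; the record does not describe the dissertation's own smoothing procedures, so simple additive smoothing stands in for them here.

```python
import math
from collections import Counter


def train_decision_list(examples, senses, alpha=0.1):
    """Build a decision list for a two-sense ambiguity.

    examples: list of (features, sense) pairs, where features is an
              iterable of context-feature strings.
    senses:   tuple (sense_a, sense_b) naming the two senses.
    alpha:    additive smoothing constant (an assumption of this sketch).
    Returns (rules, default): rules sorted strongest first, plus the
    majority sense used when no rule matches.
    """
    sense_a, sense_b = senses
    counts = {sense_a: Counter(), sense_b: Counter()}
    for features, sense in examples:
        counts[sense].update(set(features))

    rules = []
    for feat in set(counts[sense_a]) | set(counts[sense_b]):
        a = counts[sense_a][feat] + alpha
        b = counts[sense_b][feat] + alpha
        llr = math.log(a / b)
        if llr == 0:
            continue  # feature carries no evidence either way
        rules.append((abs(llr), feat, sense_a if llr > 0 else sense_b))

    # Strongest evidence first; ties broken alphabetically for determinism.
    rules.sort(key=lambda r: (-r[0], r[1]))
    default = Counter(s for _, s in examples).most_common(1)[0][0]
    return rules, default


def classify(rules, default, features):
    """Return the sense chosen by the first (strongest) matching rule."""
    features = set(features)
    for _, feat, sense in rules:
        if feat in features:
            return sense
    return default


# Hypothetical toy data for the homograph "lead" (guide vs. metal).
examples = [
    ({"role", "actor"}, "guide"),
    ({"role", "singer"}, "guide"),
    ({"role", "film"}, "guide"),
    ({"mine", "zinc"}, "metal"),
    ({"mine", "pipe"}, "metal"),
]
rules, default = train_decision_list(examples, ("guide", "metal"))
```

Calling `classify(rules, default, {"mine", "actor"})` consults only the highest-ranked rule that matches, so conflicting evidence is not combined. That is how a decision list can draw on non-independent features without modelling their dependencies.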