English CTS treebank with structural metadata.
- Format:
- Datafile
- Language:
- English
- Subjects (All):
- English language--Data processing.
- English language.
- Automatic speech recognition.
- Genre:
- Dictionaries.
- Physical Description:
- 1 DVD-ROM ; 4 3/4 in.
- Place of Publication:
- [Philadelphia, PA] : Linguistic Data Consortium, [2004-2005, c2009]
- System Details:
- data file
- Summary:
- English CTS treebank with structural metadata, Linguistic Data Consortium (LDC) catalog number LDC2009T01 and ISBN 1-58563-476-X, consists of metadata and syntactic structure annotations for 144 English telephone conversations, or 140,000 words, from data used in the EARS (Effective, Affordable, Reusable Speech-to-Text) program. English CTS treebank with structural metadata was created to support EARS work in English. It applies EARS metadata extraction annotations and Penn Treebank methods to conversations from Switchboard-1 Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol (released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73). The purpose of the EARS program was to develop robust speech recognition technology to address a range of languages and speaking styles. LDC provided conversational and broadcast speech and transcripts, annotations, lexicons and texts for language modeling in each of the EARS languages (Arabic, Chinese, English). LDC also supported a metadata extraction (MDE) research evaluation, the goal of which was to enable technology to take raw speech-to-text (STT) output and refine it into forms of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries at natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity, are further elements in the readable transcript.
Some of the data developed by LDC for the MDE task is contained in the LDC Catalog, i.e., RT-04 MDE Training Data Speech, LDC2005S16 and RT-04 MDE Training Data Text/Annotations, LDC2005T24.
- Notes:
- Title from disc label.
- "Authors: Ann Bies, Haejoong Lee, Stephanie Strassel, Christopher Walker" -- LDC catalogue.
- Data type: text.
- Data source: Telephone conversations.
- ISBN:
- 158563476X
- 9781585634767
- OCLC:
- 327967928
- Access Restriction:
- Restricted for use by site license.
- Online:
- LDC catalog entry
- Using LDC Data general information