rows 10 / 296
ldc_id | copyright | title | url | intro | samples |
---|---|---|---|---|---|
LDC2008T03
|
Portions 2003 Agence France-Presse, 2003 The Associated Press, 2003 Cable News Network, LP, LLLP, 2007 The MITRE Corporation, 2003 New York Times, 2003 Xinhua News Agency, 2003, 2005, 2006, 2008 Trustees of the University of Pennsylvania About LDC | Members | Catalog | Projects | Papers | LDC Online | Search / Help | Contact Us | UPenn | Home | Obtaining Data | Creating Data | Using Data | Providing Data Contact: ldc@ldc.upenn.edu (c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.
|
ACE 2005 English SpatialML Annotations
|
The ACE (Automatic Content Extraction) program focuses on developing automatic content extraction technology to support automatic processing of human language in text form. The kind of information recognized and extracted from text includes entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML's focus is primarily on geography and culturally-relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for potentially better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services. In ACE 2005 English SpatialML Annotations, the authors applied SpatialML tags to the English training data (originally annotated for entities, relations and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06. (NOTE: 2005 ACE training data and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23) to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English SpatialML Annotations may originate from one of those e-corpora, not from LDC2006T06). The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML and the 2005 ACE guidelines. The main SpatialML tag is the PLACE tag. The central goal of SpatialML is to map PLACE information in text to data from gazetteers and other databases to the extent possible. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. 
Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag. To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program. In particular, the English Annotation Guidelines for Entities (Version 5.6.6 2006.08.01) were exploited, specifically the GPE, Location, and Facility entity tags, and the Physical relation tags, all of which are mapped to SpatialML tags. Ideas were also borrowed from Toponym Resolution Markup Language of Leidner (2006), the research of Schilder et al. (2004) and the annotation scheme in Garbin and Mani (2005). Information recorded in the annotation is compatible with the feature types in the Alexandria Digital Library. This corpus also leverages the integrated gazetteer database (IGDB) of Mardis and Burger (2005). Last but not least, this annotation scheme can be related to the Geography Markup Language (GML) defined by the Open Geospatial Consortium (OGC), as well as Google Earth's Keyhole Markup Language (KML), to express geographical features. SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such a markup can be useful for (i) disambiguation, (ii) integration with mapping services, and (iii) spatial reasoning. In relation to (iii), it is possible to use spatial reasoning not only for integration with applications, but for better information extraction, e.g., for disambiguating a place name based on the locations of other place names in the document. 
SpatialML goes to some length to represent topological relationships among places, derived from the RCC8 Calculus (Randell et al. 1992, Cohn et al. 1997). Additional information about SpatialML is contained in the paper "SpatialML: Annotation Scheme for Marking Spatial Expressions in Natural Language," which is included in the online documentation for this corpus. Please direct all questions about this corpus to Janet Hitzeman (hitz@mitre.org).
|
[{'url': u'./desc/addenda/LDC2008T03.xml', 'title': u'sample'}]
|
|
LDC2011T02
|
Portions 2003 Agence France Presse, 2003 The Associated Press, 2003 Cable News Network, LP, LLLP, 2007, 2010 The MITRE Corporation, 2003 New York Times, 2003 Xinhua News Agency, 2003, 2005, 2006, 2008, 2011 Trustees of the University of Pennsylvania
|
ACE 2005 English SpatialML Annotations Version 2
|
ACE 2005 English SpatialML Annotations Version 2, Linguistic Data Consortium (LDC) catalog number LDC2011T02 and ISBN 1-58563-573-1, was developed by researchers at The MITRE Corporation and applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus LDC2006T06. This second version eliminates a number of annotation inconsistencies and errors identified in ACE 2005 English SpatialML Annotations LDC2008T03. In addition, the SpatialML annotation schema has been updated from version 2.0 to version 3.0.1; the revised annotation guidelines are included in this release. The ACE (Automatic Content Extraction) program focused on developing automatic content extraction technology to support automatic processing of human language in text form, specifically entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. It is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines. SpatialML includes syntax for marking up PLACEs mentioned in text and for linking them to data from gazetteers and other databases. LINKs are used to express relations between places, and RLINKs to capture trajectories for relative locations. To the extent possible, SpatialML leverages ISO and other standards with the goal of making the scheme compatible with existing and future corpora. SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such markup can be useful for disambiguation, integration with mapping services and spatial reasoning.
Data
This corpus contains 210065 total words and 17821 unique words.
Counts of unique words can be found in doc/ldc_wordcount.csv, which includes all words that are not part of XML markup (e.g., without tag names, attribute names or values). Unique words are counted by comparing case insensitive transformations with preceding and trailing punctuation stripped off. "Words" consisting solely of punctuation are discarded.
The principal change in the annotation schema is that "PATH" has been generalized to "RLINK" for relative link. At the top level, there is now a version attribute on the root SpatialML tag to capture which version of SpatialML was used. A number of smaller changes have been made to the annotation specification; these are listed in Section 2 of the updated guidelines. The files are provided in both in-line xml format and aif format. The gaz-deref files contain multiple gazetteer references when they exist for a single location; these different gazrefs sometimes correspond to slightly different latlongs. The sgm.dtd validated files do not contain document structure tags (such as , ) that would prevent them from being validated with the SpatialML DTD. These files total 22624650 bytes uncompressed.
Updates
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011T02.
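The unique-word counting rule described in this entry can be illustrated with a minimal Python sketch. It assumes whitespace tokenization, ASCII punctuation only, and input from which XML markup has already been removed; the function name `count_words` and the sample sentence are hypothetical, and this is not the script LDC used to produce doc/ldc_wordcount.csv.

```python
import string
from collections import Counter


def count_words(text):
    """Count words case-insensitively after stripping leading and
    trailing punctuation; tokens that are only punctuation are dropped."""
    counts = Counter()
    for token in text.split():  # assumption: whitespace tokenization
        word = token.strip(string.punctuation).lower()
        if word:  # empty string means the token was punctuation only
            counts[word] += 1
    return counts


# Hypothetical sample text, not taken from the corpus.
sample = 'The "Baghdad" office, near baghdad. -- THE end!'
counts = count_words(sample)
total_words = sum(counts.values())
unique_words = len(counts)
print(total_words, unique_words)
```

Under these assumptions, "The" and "THE" collapse into one entry, as do "Baghdad" and "baghdad.", while the bare "--" token is discarded.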
|
[{'url': u'./desc/addenda/LDC2011T02.jpg', 'title': u''}]
|
|
LDC2010T09
|
Portions 2000-2001 China Broadcasting System, 2000-2001 China Central TV, 2000-2001 China National Radio, 2000-2001 China Television System, 2008-2009 The MITRE Corporation, 2005, 2006, 2010 Trustees of the University of Pennsylvania
|
ACE 2005 Mandarin SpatialML Annotations
|
ACE 2005 Mandarin SpatialML Annotations was developed by researchers at The MITRE Corporation (MITRE). ACE 2005 Mandarin SpatialML Annotations applies SpatialML tags to a subset of the source Mandarin training data in ACE 2005 Multilingual Training Corpus (LDC2006T06). Annotations for entities, relations, and events, which were included in ACE 2005 Multilingual Training Corpus, are not included in the current SpatialML release. For SpatialML markup to ACE 2005 English data, see ACE 2005 English SpatialML Annotations (LDC2008T03). SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML focuses on geography and culturally-relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services. The ACE (Automatic Content Extraction) Program seeks to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from automatic speech recognition and optical character recognition). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The annotation efforts of the ACE program support the development of automatic content extraction technology to support automatic processing of human language in text form. The kind of information recognized and extracted from text includes entities, values, temporal expressions, relations and events. The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines.
The main SpatialML tag is the PLACE tag which encodes information about location. The central goal of SpatialML is to map location information in text to data from gazetteers and other databases to the extent possible by defining attributes in the PLACE tag. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag. To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program.
|
|
|
LDC2006T06
|
Portions 2000-2003 Agence France Presse, 2003 The Associated Press, 2003 New York Times, 2000-2001, 2003 Xinhua News Agency, 2003 Cable News Network LP, LLLP, 2000-2001 SPH AsiaOne Ltd, 2000-2001 China Broadcasting System, 2000-2001 China National Radio, 2000-2001 China Television System, 2000-2001 China Central TV, 2000-2001 Al Hayat, 2000-2001 An-Nahar, 2000-2001 Nile TV, 2005, 2006 Trustees of the University of Pennsylvania
|
ACE 2005 Multilingual Training Corpus
|
This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus, which consists of data of various types annotated for entities, relations and events, was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks with the exception of event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. The current publication comprises the official training data for these evaluation tasks. A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST). For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website. Below is information about the amount of data included in the current release and its annotation status.
1P: data subject to first pass (complete) annotation
DUAL: data also subject to dual first pass (complete) annotation
ADJ: data also subject to discrepancy resolution/adjudication
NORM: data also subject to TIMEX2 normalization

English
       words                               files
       1P      DUAL    ADJ     NORM        1P    DUAL  ADJ   NORM
NW     60658   57807   33459   48399       128   124   81    106
BN     59239   58144   52444   55967       239   234   217   226
BC     46612   46110   33874   40415       68    67    52    60
WL     45210   43648   35529   37897       127   122   114   119
UN     45161   44473   26371   37366       58    57    37    49
CTS    47003   47003   34868   39845       46    46    34    39
Total  303833  297185  216545  259889      666   650   535   599

Chinese
Note: Chinese data expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
       chars                       files
       1P      DUAL    ADJ         1P    DUAL  ADJ
NW     127319  124175  121797      248   242   238
BN     134963  133696  120513      332   328   298
WL     71839   68063   65681       107   101   97
Total  334121  325834  307991      687   671   633

Arabic
       words                       files
       1P      DUAL    ADJ         1P    DUAL  ADJ
NW     61287   56158   53026       239   226   221
BN     29259   27165   26907       134   128   127
WL     21687   20181   20181       60    55    55
Total  112233  103504  100114      433   409   403
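The note on Chinese data gives a rough conversion of 1.5 characters per word. As a small worked example, the sketch below applies that ratio to the first-pass (1P) character counts listed for Chinese. The variable names are illustrative only, and the results are approximations that follow from the stated ratio, not figures published by LDC.

```python
# Rough ratio stated in the entry: about 1.5 Chinese characters per word.
CHARS_PER_WORD = 1.5

# First-pass (1P) character counts for Chinese, copied from the entry.
chinese_chars_1p = {"NW": 127319, "BN": 134963, "WL": 71839}

# Estimated word counts per genre, rounded to the nearest whole word.
estimated_words = {
    genre: round(chars / CHARS_PER_WORD)
    for genre, chars in chinese_chars_1p.items()
}

total_chars = sum(chinese_chars_1p.values())
estimated_total_words = round(total_chars / CHARS_PER_WORD)

for genre, words in estimated_words.items():
    print(genre, words)
print("Total", total_chars, "chars, about", estimated_total_words, "words")
```

The per-genre character counts sum to the 334121 total given in the entry, which corresponds to roughly 222747 words at this ratio.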
|
[{'url': u'desc/addenda/LDC2006T06.txt', 'title': u'English'}, {'url': u'desc/addenda/LDC2006T06_ara.jpg', 'title': u'Arabic'}, {'url': u'desc/addenda/LDC2006T06_ch.jpg', 'title': u'Chinese'}]
|
|
LDC2011T08
|
Portions 2000 American Broadcasting Corporation, 2000, 2003 Cable News Network, LP, LLP, 2000 National Broadcasting Company, 2000 New York Times, 2000 Public Radio International, 2000 The Associated Press, 2005, 2006, 2011 Trustees of the University of Pennsylvania
|
Datasets for Generic Relation Extraction (reACE)
|
Datasets for Generic Relation Extraction (reACE), Linguistic Data Consortium (LDC) catalog number LDC2011T08 and ISBN 1-58563-582-0, was developed at The University of Edinburgh, Edinburgh, Scotland. It consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied. The Edinburgh relation extraction (RE) task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and to recode it in a format such as a relational database or RDF triple store (a database for the storage and retrieval of Resource Description Framework (RDF) metadata) that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluation of automatic systems for RE in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and different notions of what constitutes a relation. reACE solves this problem by converting data to a common document type using token standoff and including detailed linguistic markup while maintaining all information in the original annotation. The subsequent reannotation process normalises the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. The data in this corpus consists of newswire and broadcast news material from ACE 2004 Multilingual Training Corpus LDC2005T09 and ACE 2005 Multilingual Training Corpus LDC2006T06. This material has been standardised for evaluation of multi-type RE across domains. Complete documentation for this corpus is available at the publication provider's web site Datasets for Generic Relation Extraction.
Data
Annotation includes (1) a refactored version of the original data to a common XML document type; (2) linguistic information from LT-TTT (a system for tokenizing text and adding markup) and MINIPAR (an English parser); and (3) a normalised version of the original RE markup that complies with a shared notion of what constitutes a relation across domains. The data sources represented in the corpus were collected by LDC in 2000 and 2003 and consist of the following: ABC, Agence France Presse, Associated Press, Cable News Network, MSNBC/NBC, New York Times, Public Radio International, Voice of America and Xinhua News Agency.
|
[{'url': u'http://benhachey.info/data/gre/examples/ace.xml', 'title': u'sample file'}]
|
|
LDC2010T22
|
Portions 2000 The Associated Press, 1987-1989 Dow Jones & Company, Inc., 2000 New York Times, 1997-2002, 2010 Trustees of the University of Pennsylvania
|
Manually Annotated Sub-Corpus First Release
|
The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium (LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three releases of 500,000 words of MASC data developed as part of the American National Corpus (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena. The MASC project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. Researchers from Vassar College, Columbia University and the International Computer Science Institute, University of California at Berkeley are the principal participants; the WordNet project provides consulting. The source texts in MASC I are drawn from the open portion of the American National Corpus (ANC) Second Release LDC2005T35, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990; and from the Language Understanding Annotation Corpus (LU Corpus) LDC2009T09, a collection of various genres including broadcast, newswire, email and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations. All of the words of data in MASC I have validated annotations for token, part of speech, sentence boundary, noun chunks, verb chunks, named entities and Penn Treebank syntax. Full-text FrameNet annotations are available for seventeen texts and WordNet word sense annotations are available for 1000 occurrences of each of fifty-three words. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects.
Software and services available from the ANC project website enable transduction of MASC into a wide variety of physical formats.
Data
The MASC directory contains two folders: "masc-1.0.3" and "masc_wordsense". masc-1.0.3 contains the actual MASC corpus and consists of two folders, "spoken" and "written". The spoken folder contains data and annotations for spoken material, and the written folder contains the same for written texts. The files in each of the respective folders have naming conventions that describe the contents of the file. masc_wordsense contains the MASC sentence samples with word sense annotations using WordNet sense numbers as the annotation values.
Updates
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T22.
|
[{'url': u'./desc/addenda/LDC2010T22_spoken.pdf', 'title': u''}, {'url': u'./desc/addenda/LDC2010T22_written.pdf', 'title': u''}]
|
|
LDC2008T25
|
Portions 2004-2006 Agence France Presse, The Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc., New York Times, Xinhua News Agency, 2004-2006, 2007, 2008 Trustees of the University of Pennsylvania
|
AQUAINT-2 Information-Retrieval Text Research Collection
|
AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute of Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31). The AQUAINT (Advanced Question-Answering for Intelligence) program addresses interactivity with scenarios or tasks. The scenario provides a context in which questions will be asked and answered, and the task reflects the overall assignment. The program is committed to solving a single problem: how to find topically relevant, semantically related, timely information in massive amounts of data in diverse languages, formats, and genres. AQUAINT technology is advancing the development of components and functions that allow users to pose a series of intertwined, complex questions and obtain comprehensive answers in the context of broad information-gathering tasks. In addition, while most information retrieval systems present only links to documents, AQUAINT is producing technology that will present answers to the user's questions. This question-answering technology is being developed with features for managing semantic similarity, co-reference, event characterization, opinions, linguistic and social and world inferencing, redundancy, deception, and missing or contradictory information.
In order to allow the analyst to guide the exploration in concert with the machine, AQUAINT technology employs interactive question-answering, the automatic suggestion of additional paths of exploration, and the inferencing of the social context of the information search.
Data
AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07). The collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period October 2004 - March 2006. For each source, all of the usable data collected by LDC was processed into a consistent XML format in which the stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element; each story is a single "DOC" element within that stream and has a globally unique "id" attribute. The collection consists of newswire data in English drawn from six distinct sources, listed below in terms of their file name designations and full names:
afp_eng Agence France Presse
apw_eng Associated Press
cna_eng Central News Agency (Taiwan) English Service
ltw_eng Los Angeles Times - Washington Post News Service
nyt_eng New York Times
xin_eng Xinhua News Agency (Beijing) English Service
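The DOCSTREAM layout described in this entry (one "DOCSTREAM" element per month, one "DOC" element per story, each with a unique "id" attribute) can be read with a short Python sketch like the one below. Only those two element names and the "id" attribute come from the description; the sample ids, the story text, the function name `index_stories`, and the assumption that story text can be taken from whatever sits inside each DOC element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file: the ids and text are invented for illustration.
SAMPLE = """<DOCSTREAM>
<DOC id="example_eng_0001">First story text.</DOC>
<DOC id="example_eng_0002">Second story text.</DOC>
</DOCSTREAM>"""


def index_stories(xml_text):
    """Map each DOC id to the text found inside that DOC element."""
    root = ET.fromstring(xml_text)
    if root.tag != "DOCSTREAM":
        raise ValueError("expected a DOCSTREAM root element")
    stories = {}
    for doc in root.iter("DOC"):
        doc_id = doc.get("id")
        if doc_id in stories:
            # The description says ids are globally unique.
            raise ValueError("duplicate DOC id: %s" % doc_id)
        # Join all nested text, whatever child elements a DOC may contain.
        stories[doc_id] = "".join(doc.itertext()).strip()
    return stories


stories = index_stories(SAMPLE)
for doc_id, text in stories.items():
    print(doc_id, text)
```

Because each monthly file may be large, a streaming parser such as `xml.etree.ElementTree.iterparse` may be a better fit for real data than loading the whole file as done here.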
|
[{'url': u'./desc/addenda/LDC2008T25.jpg', 'title': u'image'}]
|
|
LDC93S4A
|
|
ATIS0 Complete
|
|
|
|
LDC93S4B
|
|
ATIS0 Pilot
|
|
|
|
LDC93S4B-2
|
|
ATIS0 Read
|
|
|
rows 10 / 49
url | ldc_id |
---|---|
LDC93S4A
|
|
LDC93S4B
|
|
LDC93S4B-2
|
|
LDC93S4B-3
|
|
LDC93S5
|
|
LDC95S26
|
|
LDC94S19
|
|
LDC93S6A
|
|
LDC93S6C
|
|
LDC93S6B
|