City University of Hong Kong
Department of Chinese, Translation and Linguistics,
Language Information Sciences Research Centre &
The Halliday Centre for Intelligent Applications of Language Studies
The American National Corpus
Prof. Nancy Ide
The Computer Science Department, Vassar College, Poughkeepsie, New York
Date: 29 August 2006, Tuesday
Time: 4:30pm - 5:30pm
Venue: B7603 (Lift 3, 7/F, Blue Zone), Academic Building, CityU
This presentation will provide an overview of the American National Corpus (ANC): its contents, annotations, representation format, and plans for future development. The ANC is being developed to provide, for American English, the kind of linguistic documentation that the British National Corpus (BNC) provides for British English. The goal is for the ANC to parallel the general structure of the BNC while adding genres, such as blogs and instant messaging, that did not exist when the BNC was created. The ANC currently contains 22 million words and continues to grow. Like the BNC, the ANC is annotated for part of speech, but it is also annotated for other linguistic phenomena such as noun chunks and verb chunks, and annotations for additional phenomena are being added. The two corpora also differ in how the data and annotations are represented: the ANC has been built using state-of-the-art techniques that let users choose which annotations to include, and in which format, in their working version of the corpus.
Nancy Ide is Professor and Chair of the Computer Science Department at Vassar College in Poughkeepsie, New York. She has been involved in the development of representation standards for language data since 1987, when she founded the Text Encoding Initiative. Since then she has been involved in several corpus-building projects, including MULTEXT, MULTEXT-EAST, and now the American National Corpus, and she continues to work on standards as a member of the International Organization for Standardization's committee on Language Resource Management. Professor Ide has published extensively in the fields of computational linguistics and computational lexicography, especially in the area of word sense disambiguation. Since 1997 she has been a co-organizer of the biennial EUROLAN summer schools on computational linguistics. She is currently co-editor-in-chief of the journal "Language Resources and Evaluation" (formerly "Computers and the Humanities") and co-edits a book series for Springer entitled "Text, Speech, and Language Technology".
~ All Are Welcome ~