Department of Chinese, Translation and Linguistics
Research Degree Forum
Mining Knowledge from Comparable Chinese-English Patents
Mr. LU Bin
PhD candidate, Department of Chinese, Translation and Linguistics, City University of Hong Kong
Date: 2 February 2010, Tuesday
Time: 2:30 - 3:30 pm
Venue: B7603 (7/F, Blue Zone), Academic Building, CityU
Patent documents constitute a special text genre and present special challenges because they use technical language in a legalistic context. The processing of patent documents is therefore not only important to industry and the legal profession, but is also attracting more researchers specializing in natural language processing and information retrieval. However, relatively little research has been conducted on the processing of multilingual patents. In this presentation, we address some major issues in mining parallel knowledge from comparable Chinese-English patents, which contain both equivalent sentences and noise.
Compared with comparable patents, a parallel corpus of matched, equivalent sentences is a more valuable resource for many NLP applications, such as machine translation, multilingual lexicography, and cross-lingual information retrieval. However, obtaining a large-scale parallel corpus is much more expensive than obtaining a comparable bilingual corpus. Using our corpus of about 7,000 Chinese-English comparable patents with titles, abstracts, claims and full texts, we try to address the following three issues:
1) Parallel sentence extraction: aligns only parallel sentences in the comparable patents by combining three quality measures, thereby deriving a useful parallel corpus;
2) Bilingual term extraction: identifies correct bilingual terms by combining both linguistic and statistical information under an SVM classifier;
3) Chinese to English SMT: automatically translates patents from Chinese to English based on an SMT engine trained on the mined parallel sentences.
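The idea in item 1 of combining several quality measures to keep only parallel sentence pairs can be sketched as follows. This is a minimal illustration, not the method presented in the talk: the measure names, the weights, the threshold and the function names are all hypothetical, and each measure is assumed to be precomputed and scaled to the range 0 to 1.

```python
from typing import Dict, List, Tuple

# Hypothetical measure names and weights; the abstract does not say
# which three quality measures are combined or how they are weighted.
WEIGHTS: Dict[str, float] = {
    "length": 0.2,
    "dictionary": 0.4,
    "translation_prob": 0.4,
}

Candidate = Tuple[str, str, Dict[str, float]]  # (Chinese, English, measures)


def combined_score(measures: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-measure scores, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total


def extract_parallel(candidates: List[Candidate],
                     threshold: float = 0.6) -> List[Tuple[str, str]]:
    """Keep candidate sentence pairs whose combined score meets the threshold."""
    return [(zh, en) for zh, en, measures in candidates
            if combined_score(measures) >= threshold]
```

Raising the threshold trades corpus size for precision, which matters when the mined sentences are later used to train an SMT engine.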
The experiments show that the proposed methods achieve good performance, and that the SMT engine trained on the mined parallel sentences achieves a BLEU score of 0.25.
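For readers unfamiliar with the metric, BLEU is the geometric mean of modified n-gram precisions (usually up to 4-grams) multiplied by a brevity penalty. The sketch below is a simplified corpus-level version with a single reference per sentence and no smoothing. The abstract does not say which BLEU implementation or tokenization produced the reported score, so the function names and the pre-tokenized input format here are assumptions.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[List[str]],
                references: List[List[str]],
                max_n: int = 4) -> float:
    """Corpus-level BLEU with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t)
                        for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)
```

A score of 1.0 means the output matches the reference exactly, and a score of 0.0 means at least one n-gram order has no matches at all.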
Given the relative paucity of parallel patent data, the use of such a comparable corpus for mining parallel knowledge would be a helpful step towards MT research and other cross-lingual access applications in the patent domain, such as cross-lingual information retrieval.
Mr. LU Bin is a PhD candidate in the Department of Chinese, Translation and Linguistics. His research interests include Sentiment Analysis and Opinion Mining, Statistical Machine Translation (SMT), Computational Linguistics, and Natural Language Processing (NLP).
~ CTL Staff and Research Degree Students only ~