Third International Chinese Language Processing Bakeoff
Detailed Instructions
The following comprises the complete description of the training and testing for the Third International Chinese Language Processing Bakeoff. By participating in this competition, you are declaring that you understand these descriptions, and that you agree to abide by the specific terms as laid out below.
This document will first present instructions common to both tasks, and then detailed information about the corpora, formats and evaluation methodology for each individual task.
You may decide to participate in either an open test or a closed test, or both.
In the open test you will be allowed to train on the training set for a particular corpus, and in addition you may use any other material including material from other training corpora, proprietary dictionaries, material from the WWW and so forth.
If you elect the open test, you will be required, in the writeup of your results, to explain what percentage of your correct/incorrect results came from which sources. For example, if you score an F-measure of 0.7 on words in the testing corpus that are out-of-vocabulary with respect to the training corpus, you must explain how you got that result: was it because you have a dictionary with good coverage, because you have a good unknown word detection algorithm, or for some other reason?
In the closed test you may only use training material from the training data for the particular corpus you are testing on. No other material or knowledge is allowed, including (but not limited to):
Both training and testing data will be published in the original coding schemes used by the data sources. Additionally, they will be transcoded by the organizers into Unicode UTF-8 (or, if provided in Unicode, into the de facto encoding for the locale). The training data will be formatted as follows.
The corpora have been made available by the providers for the purposes of this competition only. By downloading the training and testing corpora, you agree that you will not use these corpora for any other purpose than as material for this competition. Petitions to use the data for any other purpose MUST be directed to the original providers of the data. Neither SIGHAN nor the ACL will assume any liability for a participant's misuse of the data.
The test data will be available for each corpus at the website at 12:00 GMT, May 15, 2006. The test data will be in the formats described below for each task.
The data must be returned in the same coding scheme as they were published in. (For example, if you utilize the UTF-8 encoded version of the testing data, then the results must be returned in UTF-8.) Participants are reminded that ASCII character codes may occur in Chinese text to represent Latin letters, numbers and so forth: such codes should be left in their original coding scheme. Do not convert them to their GB/Big5 equivalents. Similarly, GB/Big5 codings of Latin letters or Arabic numerals should be left in their original coding, and not converted to ASCII.
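The encoding requirement above can be sketched as a small round trip: decode with the codec the data arrived in, process, and encode with that same codec, without normalizing full-width or ASCII characters along the way. This is only an illustration; the helper name `process_in_same_encoding` and the `transform` callback are hypothetical, and the Python codec names (`cp936`, `big5hkscs`, `utf-8`) are assumed to correspond to the encodings used for the corpora.

```python
def process_in_same_encoding(raw: bytes, encoding: str, transform) -> bytes:
    """Decode raw bytes, apply transform to the text, re-encode with the same codec.

    No Unicode normalization is applied, so ASCII letters stay ASCII and
    full-width (GB/Big5-style) letters and digits stay full-width.
    """
    text = raw.decode(encoding)
    result = transform(text)
    return result.encode(encoding)


# Mixed input: full-width "Ａ" and "１" next to ASCII "BC" and "23".
sample_text = "ＡBC１23中文"
sample_bytes = sample_text.encode("cp936")

# A stand-in for a real system: it leaves the text untouched.
unchanged = process_in_same_encoding(sample_bytes, "cp936", lambda t: t)
```

With an identity transform the output bytes are identical to the input bytes, which is the property the rule asks a real system to preserve for every character it does not deliberately modify.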
The results will be scored completely automatically. The scripts that were used to score will be made publicly available. The measures that will be reported are precision, recall, and an evenly-weighted F-measure. We will also report scores for in-vocabulary and out-of-vocabulary words for Word Segmentation.
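As a rough illustration of the measures named above, the following sketch computes precision, recall, and the evenly weighted F-measure for one segmented sentence by comparing word spans, plus recall restricted to out-of-vocabulary words. It is not the organizers' scoring script; the function names (`word_spans`, `score`, `oov_recall`) and the in-memory word-list input are assumptions made for the example.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def score(gold_words, test_words):
    """Return (precision, recall, F) where a word is correct if its span matches gold."""
    gold = word_spans(gold_words)
    test = word_spans(test_words)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def oov_recall(gold_words, test_words, train_vocab):
    """Recall over gold words that do not appear in the training vocabulary."""
    test = word_spans(test_words)
    start = 0
    total = 0
    found = 0
    for word in gold_words:
        end = start + len(word)
        if word not in train_vocab:
            total += 1
            if (start, end) in test:
                found += 1
        start = end
    return found / total if total else 0.0


gold = ["我", "喜欢", "北京"]
system = ["我", "喜", "欢", "北京"]
p, r, f = score(gold, system)
```

In this example two of the four system words match gold spans, so precision is 0.5, recall is 2/3, and the F-measure is 4/7.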
Note: by downloading the test material and submitting results on this material you are thereby declaring that you have not previously seen the test material for the given corpus.
You are also declaring that your testing will be fully automatic. This means that any kind of manual intervention is disallowed, including, but not limited to:
Results will be provided in two phases: privately to individual participants by May 19, 2006, and then publicly to all participants and to the community at large at the SIGHAN Workshop. By participating in this contest, you are agreeing that the results of the test may be published, including the names of the participants.
By electing to participate in any part of this contest, you are agreeing to provide, by June 2, 2006, a four-page writeup that briefly describes your segmentation system and/or NER system, and a summary of your results. In the closed tests you may describe the technical details of how you came by the particular results. In the open test you must describe the technical details of how you came by the particular results.
The format of the paper must adhere to the style guidelines for ACL-COLING 2006, except for the four-page limit and submission location.
Four corpora are available for this bakeoff:
Corpus | Encoding |
---|---|
Traditional Chinese | |
Academia Sinica | Unicode/Big Five Plus |
City University of Hong Kong | HKSCS Unicode/Big Five |
Simplified Chinese | |
Microsoft Research | CP936/Unicode |
University of Pennsylvania/University of Colorado | CP936/Unicode |
Participants may also obtain data in XML format with sentences delimited
by "
Corpus | Encoding | NE Types |
---|---|---|
Traditional Chinese | | |
City University of Hong Kong | HKSCS Unicode/Big Five | PER, LOC, ORG |
Simplified Chinese | | |
Microsoft Research | CP936/Unicode | PER, ORG, LOC |
Linguistic Data Consortium | CP936/Unicode | PER, LOC, ORG, GPE |
Tag | Meaning |
---|---|
0 (zero) | Not part of a named entity |
B-PER | Beginning character of a person name |
I-PER | Non-beginning character of a person name |
B-ORG | Beginning character of an organization name |
I-ORG | Non-beginning character of an organization name |
B-LOC | Beginning character of a location name |
I-LOC | Non-beginning character of a location name |
B-GPE | Beginning character of a geopolitical entity |
I-GPE | Non-beginning character of a geopolitical entity |
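To show how the tags in the table combine into entities, here is a minimal sketch that groups per-character tags into named-entity strings. It assumes the input is already available as (character, tag) pairs in memory; the function name `extract_entities` is hypothetical, and the choice to ignore an `I-` tag that does not continue an entity of the same type is an assumption of this sketch, not a rule stated by the organizers.

```python
def extract_entities(tagged_chars):
    """Group (character, tag) pairs into a list of (entity_text, entity_type)."""
    entities = []
    current_chars = []
    current_type = None

    def flush():
        # Close the entity being built, if any.
        if current_chars:
            entities.append(("".join(current_chars), current_type))

    for char, tag in tagged_chars:
        if tag.startswith("B-"):
            # A beginning tag always starts a new entity.
            flush()
            current_chars = [char]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_chars and tag[2:] == current_type:
            # A non-beginning tag extends the open entity of the same type.
            current_chars.append(char)
        else:
            # "0" (not part of an entity) or a stray I- tag ends any open entity.
            flush()
            current_chars = []
            current_type = None
    flush()
    return entities


sentence = [
    ("李", "B-PER"), ("明", "I-PER"), ("在", "0"),
    ("北", "B-LOC"), ("京", "I-LOC"), ("工", "0"), ("作", "0"),
]
found = extract_entities(sentence)
```

For the sample sentence, `found` holds the person name 李明 and the location name 北京, each paired with its type.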
A format in which sentences and named entities are tagged in-line will also be available.
The LDC also uses a different offset tagging scheme for named entities. Software will be provided by the Bakeoff organizers to convert the LDC files to the format described above for MSRA and CityU data.