SIGHAN Third International Chinese Language Processing Bakeoff
Detailed Instructions

The following comprises the complete description of the training and testing for the Third International Chinese Language Processing Bakeoff. By participating in this competition, you are declaring that you understand these descriptions, and that you agree to abide by the specific terms as laid out below.

General Instructions

Participants may enter either (or both) of two tasks in the Third International Chinese Language Processing Bakeoff:

  1. Word Segmentation
  2. Named Entity Recognition

This document will first present instructions common to both tasks, and then detailed information about the corpora, formats and evaluation methodology for each individual task.

Training: Description of Tracks

Dimension 1: Corpora

Several corpora are available for each bakeoff task: 4 for Word Segmentation and 3 for Named Entity Recognition, as described below. You may declare that you will return results on any subset of these corpora. For example, you may decide that you will test on the Sinica Corpus and the City University corpus. The only constraint is that you must not select a corpus where you have knowingly had previous access to the testing portion of the corpus. A corollary of this is that a team may not test on the data from their own institution.

Dimension 2: Open or Closed Test

You may decide to participate in either an open test or a closed test, or both.

In the open test you will be allowed to train on the training set for a particular corpus, and in addition you may use any other material including material from other training corpora, proprietary dictionaries, material from the WWW and so forth.

If you elect the open test, you will be required, in the writeup of your results, to explain what percentage of your correct/incorrect results came from which sources. For example, if you score an F measure of 0.7 on words in the testing corpus that are out-of-vocabulary with respect to the training corpus, you must explain how you got that result: was it just because you have a good coverage dictionary, do you have a good unknown word detection algorithm, etc?

In the closed test you may only use training material from the training data for the particular corpus you are testing on. No other material or knowledge is allowed, including (but not limited to):

  1. Part-of-speech information
  2. Externally generated word-frequency counts
  3. Arabic and Chinese numbers
  4. Feature characters for place names
  5. Common Chinese surnames

Format of the data

Both training and testing data will be published in the original coding schemes used by the data sources. Additionally, they will be transcoded by the organizers into Unicode UTF-8 (or, if provided in Unicode, into the de facto encoding for the locale). The training data will be formatted as follows.

  1. Task specific annotations and format will be described below for Word Segmentation and Named Entity Recognition.
  2. There will be no further annotations, such as part-of-speech tags: if the original corpus includes those, those will be removed.

Licensing

The corpora have been made available by the providers for the purposes of this competition only. By downloading the training and testing corpora, you agree that you will not use these corpora for any other purpose than as material for this competition. Petitions to use the data for any other purpose MUST be directed to the original providers of the data. Neither SIGHAN nor the ACL will assume any liability for a participant's misuse of the data.

Testing

The test data will be available for each corpus at the website at 12:00 GMT, May 15, 2006. The test data will be in the formats described below for each task.

The data must be returned in the same coding scheme as they were published in. (For example, if you use the UTF-8 encoded version of the testing data, then the results must be returned in UTF-8.) Participants are reminded that ASCII character codes may occur in Chinese text to represent Latin letters, numbers and so forth: such codes should be left in their original coding scheme. Do not convert them to their GB/Big5 equivalents. Similarly, GB/Big5 codings of Latin letters or Arabic numerals should be left in their original coding, and not converted to ASCII.
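As a sketch of how to honor this requirement, the snippet below reads the test data in its published encoding and writes the segmented results back in that same encoding. The file paths and the segment function are hypothetical placeholders, not part of the bakeoff distribution.

```python
import codecs

def roundtrip(in_path, out_path, encoding, segment):
    """Read unsegmented test lines in the published encoding and write the
    segmented results back using that same encoding (e.g. "big5", "gb2312",
    or "utf-8"), leaving every character code untouched."""
    with codecs.open(in_path, "r", encoding) as fin, \
         codecs.open(out_path, "w", encoding) as fout:
        for line in fin:
            # segment() stands in for the participant's own system
            fout.write(segment(line.rstrip("\n")) + "\n")
```

Decoding and re-encoding with the same codec ensures that ASCII, GB, or Big5 codings of letters and digits survive unchanged in the returned results.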

The results will be scored completely automatically, and the scoring scripts will be made publicly available. The measures that will be reported are precision, recall, and an evenly-weighted F-measure. We will also report scores for in-vocabulary and out-of-vocabulary words for Word Segmentation.
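For illustration only (the official scripts are the reference implementation), these metrics can be computed by mapping each segmentation to character-offset word spans and counting matching spans; the example sentence below is invented:

```python
def spans(words):
    """Map a word sequence to the set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, test_words):
    """Precision, recall, and evenly-weighted F-measure over word spans."""
    gold, test = spans(gold_words), spans(test_words)
    correct = len(gold & test)
    p = correct / len(test)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Invented example: one of the three system words matches the two gold words.
p, r, f = prf(["上海", "浦东"], ["上海", "浦", "东"])   # p=1/3, r=1/2, f=0.4
```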

Note: by downloading the test material and submitting results on this material you are thereby declaring that you have not previously seen the test material for the given corpus.

You are also declaring that your testing will be fully automatic. This means that any kind of manual intervention is disallowed, including, but not limited to:

  1. Manual correction of the output of your segmentation.
  2. Prepopulating the dictionary with words derived by manual inspection of the test corpus.

Results

Results will be provided in two phases: privately to individual participants by May 19, 2006, and then publicly to all participants and to the community at large at the SIGHAN Workshop. By participating in this contest, you are agreeing that the results of the test may be published, including the names of the participants.

Writeup

By electing to participate in any part of this contest, you are agreeing to provide, by June 2, 2006, a four-page writeup that briefly describes your segmentation system and/or NER system, and a summary of your results. In the closed tests you may describe the technical details of how you came by the particular results. In the open test you must describe the technical details of how you came by the particular results.

The format of the paper must adhere to the style guidelines for ACL-COLING 2006, except for the four-page limit and the submission location.

Word Segmentation Task

Four corpora are available for this bakeoff:

  Corpus                                              Encoding
  Traditional Chinese:
    Academia Sinica                                   Unicode/Big Five Plus
    City University of Hong Kong                      HKSCS Unicode/Big Five
  Simplified Chinese:
    Microsoft Research                                CP936/Unicode
    University of Pennsylvania/University of Colorado CP936/Unicode

Data Formats

Training data will be provided one-sentence-per-line with word and punctuation separated by spaces. Test data will be in the same format, except that, of course, the spaces will be absent.
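To make the relationship between the two formats concrete, a training line and the corresponding test line differ only in the spaces (the sentence below is an invented example, not drawn from the corpora):

```python
# A training line: words and punctuation separated by spaces.
train_line = "共同 创造 美好 的 新 世纪"

# The gold-standard word sequence is recovered by splitting on spaces ...
words = train_line.split()

# ... and the corresponding test line is the same text with spaces removed.
test_line = "".join(words)   # "共同创造美好的新世纪"
```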

Participants may also obtain the data in an XML format, with sentences and words delimited by XML tags, if they wish.

Named Entity Recognition Task

There are three corpora available for this task:
  Corpus                          Encoding                 NE Types
  Traditional Chinese:
    City University of Hong Kong  HKSCS Unicode/Big Five   PER, LOC, ORG
  Simplified Chinese:
    Microsoft Research            CP936/Unicode            PER, ORG, LOC
    Linguistic Data Consortium    CP936/Unicode            PER, LOC, ORG, GPE

Data Formats

Training data will be available for CityU and MSRA in two formats. The primary format will be similar to that of the CoNLL-2002 NER task, adapted for Chinese. The data will be presented in two-column format, where the first column consists of the character and the second is a tag. The tag is specified as follows:
  Tag       Meaning
  0 (zero)  Not part of a named entity
  B-PER     Beginning character of a person name
  I-PER     Non-beginning character of a person name
  B-ORG     Beginning character of an organization name
  I-ORG     Non-beginning character of an organization name
  B-LOC     Beginning character of a location name
  I-LOC     Non-beginning character of a location name
  B-GPE     Beginning character of a geopolitical entity
  I-GPE     Non-beginning character of a geopolitical entity

A format in which sentences and named entities are tagged in-line will also be available.
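A minimal reader for the two-column format might collect entities as (type, start, end) character spans, as sketched below. It assumes one "character tag" pair per line for a single sentence and skips malformed tag sequences; it is not the official conversion or scoring software.

```python
def read_entities(rows):
    """Collect (type, start, end) spans from "char tag" rows of one sentence."""
    entities, start, etype = [], None, None
    for i, row in enumerate(rows):
        tag = row.split()[1]
        if tag.startswith("B-"):                  # a new entity begins here
            if etype is not None:
                entities.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag == "0" and etype is not None:    # "0" closes any open entity
            entities.append((etype, start, i))
            etype = None
        # "I-" tags simply extend the currently open entity
    if etype is not None:
        entities.append((etype, start, len(rows)))
    return entities

# Invented five-character example: a location followed by a person name.
rows = ["上 B-LOC", "海 I-LOC", "是 0", "张 B-PER", "三 I-PER"]
```

Here read_entities(rows) yields [("LOC", 0, 2), ("PER", 3, 5)], i.e. the character spans of the two tagged names.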

LDC Data Formats

Because of copyright restrictions, the LDC data MUST be downloaded directly from the LDC, and only upon completion of the LDC's copyright form. The completed form should be faxed to the LDC at +1-215-573-2175. Download instructions and a password will be sent to participants by the LDC once the copyright form has been received.

The LDC also uses a different, offset-based tagging scheme for named entities. Software will be provided by the Bakeoff organizers to convert the LDC files to the format described above for the MSRA and CityU data.

Test data

Test data will be provided one sentence per line, unsegmented and with no tags. Participants should format their results to conform to the training data format described above. Scoring will be done automatically using a variant of the CoNLL-2003 scoring script; comments at the beginning of the script describe its usage.
levow at cs.uchicago.edu
Last edited: April 16, 2006 10:44 CDT