Third International Chinese Language Processing Bakeoff

Bakeoff 2006 Result Submission Instructions

Thank you for participating in the 3rd International Chinese Language Processing Bakeoff. Please read this page completely: it contains very important information on result submission.

You should submit your results by email to me at this address:

levow@cs.uchicago.edu

The subject of the message should be

Bakeoff 2006 Result Submission

The message should be sent by the primary investigator, i.e., the one who registered for participation. This is how I will match your submission to your registration.

Please use the following conventions:

Submit a single archive file (.zip, .rar, .tar.gz, or .tar.bz2) containing your output file(s). Do this even if you are submitting a single result file.
The archive should be named with the email user id of the submitter. For example, "levow@cs.uchicago.edu" would submit "levow.zip".
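As a minimal sketch of the archive convention above, the following Python builds a .zip named after the email user id. The helper names archive_name and build_archive are illustrative assumptions, not part of the bakeoff tooling; any of the other accepted archive formats would do equally well.

```python
import zipfile
from pathlib import Path


def archive_name(email: str, suffix: str = ".zip") -> str:
    """Return the archive name: the email user id plus an archive suffix."""
    user_id, sep, _domain = email.partition("@")
    if not sep or not user_id:
        raise ValueError(f"not an email address: {email!r}")
    return user_id + suffix


def build_archive(email, result_files, out_dir="."):
    """Pack the result files into <user id>.zip, even if there is only one."""
    target = Path(out_dir) / archive_name(email)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in result_files:
            # Store each file under its bare name, without directories.
            zf.write(name, arcname=Path(name).name)
    return target
```

For example, archive_name("levow@cs.uchicago.edu") gives "levow.zip", matching the example above.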

Each test file should be named X_test_result_TASK_[OC]_Y.Z where

X	is the name of the corpus: ckip, cityu, msra, upuc, ldc
TASK	is whether the result is for Word Segmentation (WS) or Named Entity Recognition (NER)
[OC]	is whether the result is for the Open or Closed track
Y	is an optional identifier (a lower-case letter) for multiple runs of the system; for a single run, omit it together with the underscore before it
Z	is the file suffix, txt or utf8

For example, the results of a single run on the UTF-8 encoded cityu corpus in the closed track of the Word Segmentation task would be named

cityu_test_result_WS_C.utf8

The results of running two alternate systems on the GB version of the Microsoft corpus in the open track for Named Entity Recognition would be named

msra_test_result_NER_O_a.txt
msra_test_result_NER_O_b.txt

Note that 'a' and 'b' are used to distinguish two separate runs on this corpus.
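If it helps, the naming convention can be checked before you send anything. The sketch below is only an illustration: the pattern RESULT_NAME and the helper result_name are my own names, and the pattern assumes that the run letter and its underscore are dropped for a single run, as in the cityu example.

```python
import re

# X_test_result_TASK_[OC], an optional _Y run letter, then .txt or .utf8
RESULT_NAME = re.compile(
    r"(?P<corpus>ckip|cityu|msra|upuc|ldc)_test_result_"
    r"(?P<task>WS|NER)_(?P<track>[OC])"
    r"(?:_(?P<run>[a-z]))?\.(?P<suffix>txt|utf8)"
)


def result_name(corpus, task, track, suffix, run=None):
    """Build a result file name and reject anything outside the convention."""
    name = f"{corpus}_test_result_{task}_{track}"
    if run is not None:
        name += f"_{run}"
    name += f".{suffix}"
    if RESULT_NAME.fullmatch(name) is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return name


print(result_name("cityu", "WS", "C", "utf8"))
print(result_name("msra", "NER", "O", "txt", run="a"))
```

The two calls reproduce the example names given above; an unknown corpus, task, track, or suffix raises ValueError.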

The format for the file should be consistent with the NON-XML format of the training data for that task.

Specifically, for word segmentation, the results file should contain one line for each sentence/line in the test file, with words and punctuation separated by whitespace.
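A rough sketch of producing this layout from a list of token lists is shown below. The function names and the sample sentence are made up for illustration, a single space is used as the separator (any whitespace satisfies the description above), and the character check assumes that segmentation only inserts whitespace.

```python
def format_segmented(sentences):
    """Render one output line per sentence, tokens separated by a space.

    `sentences` is a list of token lists (words and punctuation marks).
    """
    return "".join(" ".join(tokens) + "\n" for tokens in sentences)


def same_characters(test_line, result_line):
    """True if the result line is the test line with only whitespace added."""
    return "".join(result_line.split()) == "".join(test_line.split())


# Illustrative sentence, not taken from any bakeoff corpus.
output = format_segmented([["我们", "参加", "评测", "。"]])
assert same_characters("我们参加评测。", output)
```

Running same_characters over each pair of test and result lines is a cheap way to catch dropped lines or altered characters before submitting.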

For named entity recognition, the results file should be in CoNLL two-column format: one character per line, followed by a single whitespace character and the appropriate tag in the second column. This is the format of the CoNLL 2002 NER task, adapted for Chinese. The tags are specified as follows:

Tag	Meaning
0 (zero)	Not part of a named entity
B-PER	Beginning character of a person name
I-PER	Non-beginning character of a person name
B-ORG	Beginning character of an organization name
I-ORG	Non-beginning character of an organization name
B-LOC	Beginning character of a location name
I-LOC	Non-beginning character of a location name
B-GPE	Beginning character of a geopolitical entity
I-GPE	Non-beginning character of a geopolitical entity
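To illustrate the tag scheme, here is a small sketch that turns entity spans into per-character tags and prints them in two columns. The span representation (start, end, type), the function names, and the sample text are assumptions made for this example; the outside tag is written exactly as it appears in the table above, and sentence boundaries are not handled here.

```python
OUTSIDE = "0"  # outside tag, copied as written in the tag table above
ENTITY_TYPES = {"PER", "ORG", "LOC", "GPE"}


def tag_characters(text, entities, outside=OUTSIDE):
    """Return one tag per character of `text`.

    `entities` holds non-overlapping (start, end, type) spans, end exclusive.
    """
    tags = [outside] * len(text)
    for start, end, etype in entities:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype!r}")
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end)}")
        tags[start] = f"B-{etype}"          # beginning character
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # non-beginning characters
    return tags


def to_two_column(text, tags):
    """Render character and tag separated by a single space, one per line."""
    return "".join(f"{ch} {tag}\n" for ch, tag in zip(text, tags))


# Illustrative sentence, not taken from any bakeoff corpus.
text = "李明在北京"
print(to_two_column(text, tag_characters(text, [(0, 2, "PER"), (3, 5, "LOC")])), end="")
```

For the sample text this prints five lines, beginning with the B-PER and I-PER tags for the person name and ending with the B-LOC and I-LOC tags for the location.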

The deadline for result submissions is 14:00 GMT on Wednesday, May 17. Results received after this time will not be scored unless the team can provide me with a *very* good reason to allow an extension (e.g., the building burned down or the national computer network went down).

To minimize confusion, 14:00 GMT on 2006/05/17 is:

07:00 2006/05/17 in San Francisco
10:00 2006/05/17 in New York
22:00 2006/05/17 in Hong Kong
22:00 2006/05/17 in Beijing
22:00 2006/05/17 in Taipei
22:00 2006/05/17 in Singapore
23:00 2006/05/17 in Seoul
23:00 2006/05/17 in Tokyo
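For any city not listed, the local deadline can be worked out from the UTC offset in effect on that date. The sketch below treats GMT as UTC and uses fixed offsets for May 2006 (daylight saving time in the two US cities); the table and function names are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Submission deadline: 14:00 GMT on 2006/05/17, with GMT treated as UTC.
DEADLINE_UTC = datetime(2006, 5, 17, 14, 0, tzinfo=timezone.utc)

# UTC offsets in hours that applied on that date.
OFFSET_HOURS = {
    "San Francisco": -7,  # PDT
    "New York": -4,       # EDT
    "Hong Kong": 8,
    "Beijing": 8,
    "Taipei": 8,
    "Singapore": 8,
    "Seoul": 9,
    "Tokyo": 9,
}


def local_deadline(city):
    """Format the deadline in the given city's local time."""
    tz = timezone(timedelta(hours=OFFSET_HOURS[city]))
    return DEADLINE_UTC.astimezone(tz).strftime("%H:%M %Y/%m/%d")


def is_on_time(received):
    """True if a timezone-aware receipt time is not after the deadline."""
    return received <= DEADLINE_UTC


for city in OFFSET_HOURS:
    print(f"{local_deadline(city)} in {city}")
```

The loop reproduces the list above; adding another city only requires its offset from UTC on May 17, 2006.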