SIGHAN Third International Chinese Language Processing Bakeoff
Bakeoff 2006 Result Submission Instructions
Thank you for participating in the 3rd International Chinese Language Processing Bakeoff. Please read this page completely: it contains very important information on result submission.

You should submit your results by email to me at this address:

levow@cs.uchicago.edu

The subject of the message should be

Bakeoff 2006 Result Submission

The message should be sent by the primary investigator, i.e., the one who registered for participation. This is how I will match your submission to your registration.

Please use the following conventions:

  1. Submit a single archive file (.zip, .rar, .tar.gz, or .tar.bz2) containing your output file(s). Do this even if you are submitting a single result file.
  2. The archive should be named with the email user id of the submitter. For example, "levow@cs.uchicago.edu" would submit "levow.zip".
  3. Each result file should be named X_test_result_TASK_[OC]_Y.Z, where
    X is the name of the corpus: ckip, cityu, msra, upuc, or ldc
    TASK is the task: Word Segmentation (WS) or Named Entity Recognition (NER)
    [OC] is the track: Open (O) or Closed (C)
    Y is an optional identifier (a lower-case letter) that distinguishes multiple runs of the system; omit it, along with its underscore, for a single run
    Z is the file suffix: txt or utf8

    For example, the result file for a single run on the UTF-8 encoded cityu corpus in the closed track of the Word Segmentation task would be named

    cityu_test_result_WS_C.utf8

    Running the GB version of the Microsoft corpus in the open track for Named Entity Recognition with two alternate systems would produce

    msra_test_result_NER_O_a.txt
    msra_test_result_NER_O_b.txt

    Note that 'a' and 'b' are used to distinguish the two separate runs on this corpus.
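
As a rough illustration of conventions 2 and 3, the following Python sketch builds the archive name and result file names from their parts. The function names `result_filename` and `archive_name` are hypothetical helpers invented for this example; they are not part of any official Bakeoff tooling.

```python
import re

# Values taken from the naming convention above.
CORPORA = {"ckip", "cityu", "msra", "upuc", "ldc"}
TASKS = {"WS", "NER"}
TRACKS = {"O", "C"}
SUFFIXES = {"txt", "utf8"}


def result_filename(corpus, task, track, suffix, run=None):
    """Build a name of the form X_test_result_TASK_[OC]_Y.Z (hypothetical helper)."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if track not in TRACKS:
        raise ValueError(f"unknown track: {track!r}")
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    if run is not None and not re.fullmatch(r"[a-z]", run):
        raise ValueError("run identifier must be a single lower-case letter")
    # The run identifier and its underscore are omitted for a single run.
    run_part = f"_{run}" if run is not None else ""
    return f"{corpus}_test_result_{task}_{track}{run_part}.{suffix}"


def archive_name(email, extension="zip"):
    """Name the archive after the user id part of the submitter's email address."""
    user_id = email.split("@", 1)[0]
    return f"{user_id}.{extension}"


if __name__ == "__main__":
    print(archive_name("levow@cs.uchicago.edu"))
    print(result_filename("cityu", "WS", "C", "utf8"))
    print(result_filename("msra", "NER", "O", "txt", run="a"))
```

Called with the values from the examples above, the sketch reproduces "levow.zip", "cityu_test_result_WS_C.utf8", and "msra_test_result_NER_O_a.txt".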

  4. The format for the file should be consistent with the NON-XML format of the training data for that task.

    Specifically, for word segmentation, the results file should appear with one line for each sentence/line in the test file with words and punctuation separated by whitespace.

    For named entity recognition, the results file should be in CoNLL two-column format: one character per line, with the character in the first column and the appropriate tag in the second, separated by a single whitespace character. The primary format is that of the CoNLL 2002 NER task, adapted for Chinese. The tags are specified as follows:

    Tag        Meaning
    0 (zero)   Not part of a named entity
    B-PER      Beginning character of a person name
    I-PER      Non-beginning character of a person name
    B-ORG      Beginning character of an organization name
    I-ORG      Non-beginning character of an organization name
    B-LOC      Beginning character of a location name
    I-LOC      Non-beginning character of a location name
    B-GPE      Beginning character of a geopolitical entity
    I-GPE      Non-beginning character of a geopolitical entity
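
The two-column NER format described above can be sketched in Python as follows. This is only an illustration: the function `tag_characters`, the `(start, end, type)` span representation, and the sample sentence are all assumptions made for the example, and your system's internal representation will likely differ.

```python
# Tag for characters outside any named entity, as listed in the table above.
OUTSIDE_TAG = "0"
ENTITY_TYPES = {"PER", "ORG", "LOC", "GPE"}


def tag_characters(sentence, entities):
    """Return one "character tag" line per character of the sentence.

    entities is a list of (start, end, type) spans with end exclusive;
    this span representation is an assumption of the sketch.
    """
    tags = [OUTSIDE_TAG] * len(sentence)
    for start, end, entity_type in entities:
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type!r}")
        if not 0 <= start < end <= len(sentence):
            raise ValueError(f"span out of range: ({start}, {end})")
        # The first character gets the B- tag, the remaining ones the I- tag.
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"
    # Character and tag are separated by a single space.
    return [f"{char} {tag}" for char, tag in zip(sentence, tags)]


if __name__ == "__main__":
    # Made-up sentence: a person name (2 characters) and a location (2 characters).
    for line in tag_characters("李明在北京", [(0, 2, "PER"), (3, 5, "LOC")]):
        print(line)
```

For the made-up sentence, the sketch emits five lines: the first two characters tagged B-PER and I-PER, the third tagged 0, and the last two tagged B-LOC and I-LOC.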

  5. The deadline for result submissions is 14:00 GMT on Wednesday, May 17. Results received after this time will not be scored unless the team can give me a *very* good reason to allow an extension (e.g., the building burned down, the national computer network went down, etc.).
    1. To minimize confusion, 14:00 GMT on 2006/05/17 is: