SIGHAN Bakeoff Registration

Registration

The Third SIGHAN Chinese Language Processing Bakeoff will feature two tasks: word segmentation and named entity recognition. To participate, please fill out the text registration form with some basic information about the participants and the tasks and resources you plan to use. Please email the completed form to levow@cs.uchicago.edu no later than May 8, 2006. You will be sent an acknowledgment and passwords for data download within 24 hours.

Additional information about the tasks and data sources appears below:

Word Segmentation Task

The Word Segmentation task requires identification of word boundaries in running Chinese text.
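As a rough illustration of what identifying word boundaries means, the sketch below takes a segmented sentence and reports the character offsets at which words end in the corresponding running text. The space-delimited format, the function name `word_boundaries`, and the example sentence are assumptions made for illustration only; they are not the official bakeoff data format.

```python
from typing import List


def word_boundaries(segmented: str) -> List[int]:
    """Return the character offsets (in the unsegmented text) where each word ends.

    Assumes words are separated by whitespace, which is a hypothetical
    format chosen for this sketch.
    """
    offsets = []
    position = 0
    for word in segmented.split():
        position += len(word)      # advance past the characters of this word
        offsets.append(position)   # a word boundary falls after this character
    return offsets


# Illustrative sentence (not taken from any bakeoff corpus).
segmented = "我们 喜欢 自然 语言 处理"
raw = "".join(segmented.split())   # the running, unsegmented text
boundaries = word_boundaries(segmented)
```

Here `raw` is the kind of input a system would receive, and `boundaries` is one possible way to represent the answer it has to produce.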

The following resources will be available:

Matched training and (new) test sets from:

Source Institution                                            Character Encoding   Approximate Size (chars)
CKIP, Academia Sinica, Taiwan                                 Traditional, Big5    8.3M
City University of Hong Kong                                  Traditional, Big5    2.4M
Microsoft Research                                            Simplified, CP936    5M
University of Pennsylvania / University of Colorado, Boulder  Simplified           1M
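Because the corpora arrive in different legacy encodings, participants may want to convert them to a single encoding before processing. The following is a minimal sketch using Python's built-in `big5` and `cp936` codecs. The corpus labels in `CORPUS_CODECS` and the helper name `to_utf8` are hypothetical, and the University of Pennsylvania / University of Colorado data is left out because no encoding is listed for it above.

```python
# Hypothetical short labels for the corpora, mapped to Python codec names.
# The labels are assumptions for this sketch, not official identifiers.
CORPUS_CODECS = {
    "CKIP": "big5",    # Traditional, Big5
    "CityU": "big5",   # Traditional, Big5
    "MSR": "cp936",    # Simplified, CP936
}


def to_utf8(data: bytes, corpus: str) -> bytes:
    """Decode raw corpus bytes with the codec for `corpus` and re-encode as UTF-8."""
    text = data.decode(CORPUS_CODECS[corpus])
    return text.encode("utf-8")


# Round-trip a short illustrative string through the Big5 codec.
sample = "中文".encode("big5")
converted = to_utf8(sample, "CKIP")
```

If a file contains characters outside plain Big5, decoding will raise `UnicodeDecodeError`; in that case a different codec (for example Python's `big5hkscs`) may be needed, depending on how the data was actually produced.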

Segmentation guidelines for the following corpora are available. These were supplied to SIGHAN by each data provider, and converted into PDF by the organizer:

Corpus                         MS Word   PDF
Academia Sinica                516 KB    336 KB
City University of Hong Kong   154 KB    237 KB
Microsoft Research             41 KB     70 KB

Named Entity Recognition Task

The Named Entity Recognition task requires participants to identify named entities (person, location, and organization) in running unsegmented Chinese text.

The following resources will be available:

Matched training and (new) test sets from:

Source Institution             Character Encoding
City University of Hong Kong   Traditional, Big5
Microsoft Research             Simplified, CP936
Linguistic Data Consortium     Simplified, CP936

You may declare that you will return results on any subset of these corpora. For example, you may decide that you will test on the Sinica Corpus and the City University corpus. The only constraint is that you must not select a corpus where you have knowingly had previous access to the testing portion of the corpus. A corollary of this is that a team may not test on the data from their own institution.