Second International Chinese Word Segmentation Bakeoff
The Second International Chinese Word Segmentation Bakeoff took place over the summer of 2005, and the results were presented at the 4th SIGHAN Workshop, held at IJCNLP'05 on October 14-15, 2005.
Corpora from the following organizations were used: Academia Sinica, City University of Hong Kong, Peking University, and Microsoft Research.
The complete training, testing, and gold-standard data sets, as well as the scoring script, are available for research use.
The Detailed Instructions for the bakeoff are available. Please read them carefully.
Segmentation guidelines for the following corpora are available. Each data provider supplied its guidelines to SIGHAN, and the organizer converted them to PDF:
Corpus | MS Word | PDF |
---|---|---|
Academia Sinica | 516 KB | 336 KB |
City University of Hong Kong | 154 KB | 237 KB |
Peking University | 177 KB | 294 KB |
Microsoft Research | 41 KB | 70 KB |
The collected results of all participating systems are also available.
Many thanks to the data providers and the bakeoff participants!
The bakeoff was organized by Tom Emerson of Basis Technology Corp.
Questions on the bakeoff should be addressed to Tom Emerson.