Additional information about the tasks and data sources appears below:
The following resources will be available:
Matched training and (new) test sets from:
Source Institution | Character Encoding | Approximate Size (chars) |
---|---|---|
CKIP, Academia Sinica, Taiwan | Traditional, Big5 | 8.3M |
City University of Hong Kong | Traditional, Big5 | 2.4M |
Microsoft Research | Simplified, CP936 | 5M |
University of Pennsylvania / University of Colorado, Boulder | Simplified | 1M |
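
Because each corpus is distributed in its own legacy encoding, it usually has to be decoded explicitly before any further processing. The following is a minimal Python sketch, not part of the bakeoff materials: the short corpus keys (`as`, `cityu`, `msr`) and the function names are hypothetical, and the mapping simply mirrors the table above using Python's built-in `big5` and `cp936` codecs. The Simplified corpus with no listed code page is left out rather than guessed.

```python
from pathlib import Path

# Hypothetical short keys for the corpora; the codec names are Python's
# standard-library codecs matching the encodings listed in the table.
CORPUS_ENCODINGS = {
    "as": "big5",      # CKIP, Academia Sinica (Traditional)
    "cityu": "big5",   # City University of Hong Kong (Traditional)
    "msr": "cp936",    # Microsoft Research (Simplified)
}


def decode_corpus(raw: bytes, corpus: str) -> str:
    """Decode raw corpus bytes using the encoding listed for that corpus."""
    # Strict decoding (the default) raises UnicodeDecodeError on bytes that
    # fall outside the codec, instead of silently corrupting the text.
    return raw.decode(CORPUS_ENCODINGS[corpus])


def convert_to_utf8(src: Path, dst: Path, corpus: str) -> None:
    """Read a corpus file in its legacy encoding and rewrite it as UTF-8."""
    text = decode_corpus(src.read_bytes(), corpus)
    dst.write_text(text, encoding="utf-8")
```

If strict decoding fails on a Traditional Chinese file, one possible cause is characters outside plain Big5; trying a codec variant such as Python's `big5hkscs` is a reasonable next step, though whether it applies to these corpora is an assumption.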
Segmentation guidelines for the following corpora are available. These were supplied to SIGHAN by each data provider, and converted into PDF by the organizer:
Corpus | MS Word | PDF |
---|---|---|
Academia Sinica | 516 KB | 336 KB |
City University of Hong Kong | 154 KB | 237 KB |
Microsoft Research | 41 KB | 70 KB |
The following resources will be available:
Matched training and (new) test sets from:
Source Institution | Character Encoding |
---|---|
City University of Hong Kong | Traditional, Big5 |
Microsoft Research | Simplified, CP936 |
Linguistic Data Consortium | Simplified, CP936 |