Bakeoff Registration
Additional information about the tasks and data sources appears below:
The following resources will be available:
Matched training and (new) test sets from:
| Source Institution | Character Encoding | Approximate Size (chars) |
|---|---|---|
| CKIP, Academia Sinica, Taiwan | Traditional, Big5 | 8.3M |
| City University of Hong Kong | Traditional, Big5 | 2.4M |
| Microsoft Research | Simplified, CP936 | 5M |
| University of Pennsylvania / University of Colorado, Boulder | Simplified | 1M |
Segmentation guidelines for the following corpora are available. These were supplied to SIGHAN by each data provider, and converted into PDF by the organizer:
| Corpus | MS Word | PDF |
|---|---|---|
| Academia Sinica | 516 KB | 336 KB |
| City University of Hong Kong | 154 KB | 237 KB |
| Microsoft Research | 41 KB | 70 KB |
The following resources will be available:
Matched training and (new) test sets from:
| Source Institution | Character Encoding |
|---|---|
| City University of Hong Kong | Traditional, Big5 |
| Microsoft Research | Simplified, CP936 |
| Linguistic Data Consortium | Simplified, CP936 |
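The corpora above are distributed in legacy encodings (Big5 and CP936), so a common first step is to convert them to UTF-8. The following is a minimal sketch, not part of the bakeoff materials: the helper names `to_utf8` and `convert_file` are hypothetical, and it assumes Python's standard `big5` and `cp936` codecs cover every character in a given file.

```python
from pathlib import Path

# Map the encoding labels used in the tables to Python codec names.
# Assumption: the standard codecs are sufficient for these corpora.
CODEC_BY_LABEL = {
    "Big5": "big5",    # Traditional Chinese
    "CP936": "cp936",  # Simplified Chinese
}


def to_utf8(raw, label):
    """Decode bytes in the labelled legacy encoding and re-encode as UTF-8."""
    text = raw.decode(CODEC_BY_LABEL[label])
    return text.encode("utf-8")


def convert_file(src, dst, label):
    """Hypothetical helper: convert one corpus file to UTF-8 on disk."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes(), label))
```

If a file contains a byte sequence the chosen codec does not recognize, `decode` raises `UnicodeDecodeError`. In that case a wider codec may be worth trying, for example Python's `big5hkscs` for Traditional Chinese text.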