Noun phrase chunking with FlexCRFs
We describe a case study of noun phrase chunking (NP chunking) with FlexCRFs as an example of labeling and segmenting sequence data. Text chunking (also known as phrase chunking, phrase recognition, or shallow parsing) - an intermediate step towards full parsing of natural language - recognizes phrase types (e.g., noun phrase - NP, verb phrase - VP, prepositional phrase - PP, etc.) in input text sentences. NP chunking deals with a part of this task: it recognizes only those chunks that are noun phrases (NPs). Here is an example of a sentence with its NP chunks marked: "[NP Rolls-Royce Motor Cars Inc.] expects [NP its U.S. sales] to remain steady at [NP about 1,200 cars] in [NP 1990]."
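In the CoNLL2000 shared task data used below, each token appears on its own line together with its part-of-speech tag and a chunk tag; with NP chunks encoded in the IOB2 scheme used in the experiments below (B-NP starts a noun phrase, I-NP continues it, O is outside any noun phrase), the beginning of the example sentence would look roughly as follows. The part-of-speech tags shown here are illustrative only and not taken verbatim from the corpus:

```
Rolls-Royce NNP  B-NP
Motor       NNP  I-NP
Cars        NNPS I-NP
Inc.        NNP  I-NP
expects     VBZ  O
its         PRP$ B-NP
U.S.        NNP  I-NP
sales       NNS  I-NP
to          TO   O
remain      VB   O
...
```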
The standard dataset put forward by Ramshaw and Marcus consists of sections 15-18 of the Wall Street Journal (WSJ) corpus as the training set and section 20 of that corpus as the test set. The description of the NP chunking task and the dataset can be downloaded from this site: http://staff.science.uva.nl/~erikt/research/np-chunking.html
The evaluation measures for this task are precision, recall, and F1-score computed over whole chunks: precision = a / b, recall = a / c, and F1-score = (2 x precision x recall) / (precision + recall), where a is the number of NP phrases correctly predicted by the CRF model, b is the total number of NP phrases predicted by the model, and c is the number of actual NP phrases annotated by humans.
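As a concrete illustration of these formulas, the sketch below computes phrase-based precision, recall, and F1-score from gold-standard and predicted IOB2 tag sequences. It is not part of FlexCRFs; the function names and the IOB2 tag set are our assumptions for the example.

```python
# Minimal sketch of chunk-based precision/recall/F1, assuming IOB2-encoded
# NP tags (B-NP, I-NP, O); function and variable names are illustrative only.

def np_chunks(tags):
    """Return NP chunks as (start, end) index pairs from an IOB2 tag sequence."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-NP":                      # a new chunk begins here
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag != "I-NP":                    # "O" (or any non-NP tag) ends a chunk
            if start is not None:
                chunks.append((start, i))
            start = None
    if start is not None:                      # chunk running to the end of the sentence
        chunks.append((start, len(tags)))
    return chunks

def evaluate(gold_sentences, pred_sentences):
    """Phrase-based precision, recall, and F1 over parallel lists of tag sequences."""
    a = b = c = 0                              # a: correct, b: predicted, c: actual
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_set, pred_set = set(np_chunks(gold)), set(np_chunks(pred))
        a += len(gold_set & pred_set)
        b += len(pred_set)
        c += len(gold_set)
    precision, recall = a / b, a / c
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```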
NP chunking on the dataset of the CoNLL2000 shared task with FlexCRFs (using second-order Markov CRFs):

| Item | File | Note |
|---|---|---|
| Raw training data file (sections 15-18 of WSJ) | train.txt (gzip) | provided by users |
| Raw testing data file (section 20 of WSJ) | test.txt (gzip) | provided by users |
| Training data after feature selection | train.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Testing data after feature selection | test.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Option file for the CRF model | option.txt | prepared by users |
| Output CRF model file (containing the model parameters) | model.txt (gzip) | automatically generated after training |
| Output training log file (saving information during training) | trainlog.txt | automatically generated during training |
| Output of the final testing data | test.tagged.model (gzip) | automatically generated after training |

The highest phrase-based result: precision = 94.70%, recall = 94.44%, and F1-score = 94.57% (at the 70th iteration).
The results of NP chunking with FlexCRFs (using second-order Markov dependencies) on the dataset of the CoNLL2000 shared task are presented in the table above. Please see the manual for instructions on how to run this experiment.
All-phrase chunking with PCRFs
We performed the full text chunking task (recognizing all phrase types: NP, VP, PP, etc.) on the dataset of the CoNLL2000 shared task using PCRFs on 90 parallel processors of a Cray XT3 system. The experimental settings and results are given in the following table. Please see the manual for complete instructions on how to run this experiment on massively parallel systems.
All-phrase chunking on the dataset of the CoNLL2000 shared task with PCRFs (using second-order Markov CRFs):

| Item | File | Note |
|---|---|---|
| Raw training data file (sections 15-18 of WSJ) | train.txt (gzip) | provided by users |
| Raw testing data file (section 20 of WSJ) | test.txt (gzip) | provided by users |
| Training data after feature selection | train.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Testing data after feature selection | test.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Option file for the CRF model | option.txt | prepared by users |
| Output CRF model file (containing the model parameters) | model.txt (gzip) | automatically generated after training |
| Output training log file (saving information during training) | trainlog.txt | automatically generated during training |
| Output of the final testing data | test.tagged.model (gzip) | automatically generated after training |

The highest phrase-based result: precision = 94.11%, recall = 93.98%, and F1-score = 94.05% (at the 63rd iteration).
Noun phrase chunking on a larger dataset (CoNLL2000-L)
Using PCRFs with second-order Markov CRFs, trained on 45 parallel processors of a Cray XT3 system. The training set consists of 39,832 sentences from sections 02-21 of WSJ, and the test set has 1,921 sentences from section 00. See this page for the accuracy comparison.
Number of features: 1,351,627 (IOB2) and 1,350,514 (IOE2).

| Initial q | IOB2 Precision | IOB2 Recall | IOB2 F1-score | IOE2 Precision | IOE2 Recall | IOE2 F1-score |
|---|---|---|---|---|---|---|
| 0.00 | 96.54% | 96.37% | 96.45 | 96.49% | 96.37% | 96.43 |
| 0.01 | 96.50% | 96.32% | 96.44 | 96.51% | 96.44% | 96.48 |
| 0.02 | 96.63% | 96.31% | 96.47 | 96.59% | 96.36% | 96.47 |
| 0.03 | 96.53% | 96.31% | 96.42 | 96.50% | 96.44% | 96.47 |
| 0.04 | 96.67% | 96.35% | 96.51 | 96.57% | 96.33% | 96.45 |
| 0.05 | 96.59% | 96.29% | 96.44 | 96.63% | 96.55% | 96.59 |
| 0.06 | 96.54% | 96.40% | 96.47 | 96.72% | 96.43% | 96.58 |
| 0.07 | 96.59% | 96.33% | 96.46 | 96.49% | 96.54% | 96.51 |

Voting: precision = 96.80%, recall = 96.68%, F1-score = 96.74.
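IOB2 and IOE2 above are two encodings of the same chunk annotation as per-token tags: IOB2 marks the first token of every chunk (e.g., B-NP), whereas IOE2 marks the last token of every chunk (e.g., E-NP), with I-NP for the remaining tokens inside a chunk and O for tokens outside. A minimal sketch of converting an IOB2-tagged sequence to IOE2, assuming these standard definitions (the conversion helper is ours, not part of FlexCRFs or PCRFs):

```python
# Minimal sketch: convert IOB2 chunk tags to IOE2 tags. Assumes the standard
# definitions of the two encodings; this helper is illustrative only.

def iob2_to_ioe2(tags):
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        chunk_type = tag.split("-", 1)[1]
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if nxt == "I-" + chunk_type:      # chunk continues at the next token
            out[i] = "I-" + chunk_type
        else:                             # this is the last token of the chunk
            out[i] = "E-" + chunk_type
    return out

print(iob2_to_ioe2(["B-NP", "I-NP", "O", "B-NP", "B-NP", "I-NP"]))
# -> ['I-NP', 'E-NP', 'O', 'E-NP', 'I-NP', 'E-NP']
```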
All-phrase chunking on a larger dataset (CoNLL2000-L)
Using PCRFs with second-order Markov CRFs, trained on 90 parallel processors of a Cray XT3 system. The training set consists of 39,832 sentences from sections 02-21 of WSJ, and the test set has 1,921 sentences from section 00.
Number of features: 1,471,007 (IOB2) and 1,466,312 (IOE2).

| Initial q | IOB2 Precision | IOB2 Recall | IOB2 F1-score | IOE2 Precision | IOE2 Recall | IOE2 F1-score |
|---|---|---|---|---|---|---|
| 0.00 | 96.09% | 96.04% | 96.06 | 96.10% | 96.10% | 96.10 |
| 0.01 | 96.09% | 96.04% | 96.06 | 96.12% | 96.09% | 96.11 |
| 0.02 | 96.11% | 96.10% | 96.10 | 96.19% | 96.09% | 96.14 |
| 0.03 | 96.09% | 96.01% | 96.05 | 96.13% | 96.08% | 96.11 |
| 0.04 | 96.07% | 95.98% | 96.03 | 96.16% | 96.04% | 96.10 |
| 0.05 | 96.12% | 96.01% | 96.07 | 96.13% | 96.04% | 96.09 |
| 0.06 | 96.10% | 96.00% | 96.05 | 96.20% | 96.17% | 96.18 |
| 0.07 | 96.03% | 96.07% | 96.05 | 96.12% | 96.17% | 96.15 |

Voting: precision = 96.33%, recall = 96.33%, F1-score = 96.33.
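The voting results above and in the previous table combine the outputs of the individual models (trained with different initial q values and with the two encodings). Purely as an illustration, and assuming a simple per-token majority vote over predicted tags after mapping all outputs to a common encoding (the actual combination scheme used in the experiments may differ), the idea can be sketched as follows:

```python
# Illustrative sketch of per-token majority voting over several models'
# predicted tag sequences for the same sentence. The voting scheme here is an
# assumption for illustration, not a description of the exact method used.

from collections import Counter

def vote(predictions):
    """predictions: list of tag sequences (one per model) for the same sentence."""
    voted = []
    for position_tags in zip(*predictions):    # tags of all models at one token
        tag, _count = Counter(position_tags).most_common(1)[0]
        voted.append(tag)
    return voted

models_output = [
    ["B-NP", "I-NP", "O", "B-NP"],             # e.g., model trained with q = 0.00
    ["B-NP", "I-NP", "O", "O"],                # e.g., model trained with q = 0.01
    ["B-NP", "O",    "O", "B-NP"],             # e.g., model trained with q = 0.02
]
print(vote(models_output))                     # -> ['B-NP', 'I-NP', 'O', 'B-NP']
```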
25-fold cross-validation test of noun phrase chunking on all 25 sections of the WSJ corpus
Using PCRFs with second-order Markov CRFs, each fold is trained on 45 parallel processors of a Cray XT3 system:
| Fold | IOB2 F1-score | IOE2 F1-score | Max F1-score | Fold | IOB2 F1-score | IOE2 F1-score | Max F1-score |
|---|---|---|---|---|---|---|---|
| 00 | 96.56 | 96.54 | 96.56 | 13 | 97.17 | 97.17 | 97.17 |
| 01 | 96.72 | 96.76 | 96.76 | 14 | 96.29 | 96.51 | 96.51 |
| 02 | 96.76 | 96.81 | 96.81 | 15 | 96.04 | 96.19 | 96.19 |
| 03 | 96.56 | 96.53 | 96.56 | 16 | 96.42 | 96.33 | 96.42 |
| 04 | 96.65 | 96.67 | 96.67 | 17 | 96.50 | 96.52 | 96.52 |
| 05 | 96.55 | 96.48 | 96.55 | 18 | 96.46 | 96.62 | 96.62 |
| 06 | 96.07 | 96.78 | 96.78 | 19 | 96.90 | 96.92 | 96.92 |
| 07 | 95.42 | 95.54 | 95.54 | 20 | 95.91 | 96.05 | 96.05 |
| 08 | 96.79 | 97.12 | 97.12 | 21 | 96.28 | 96.25 | 96.28 |
| 09 | 96.08 | 96.06 | 96.08 | 22 | 96.47 | 96.52 | 96.52 |
| 10 | 96.59 | 96.61 | 96.61 | 23 | 96.45 | 96.43 | 96.45 |
| 11 | 96.01 | 96.06 | 96.06 | 24 | 95.42 | 95.26 | 95.42 |
| 12 | 95.68 | 95.97 | 95.97 | Avg. | 96.35 | 96.42 | 96.45 |
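The Max F1-score column is the better of the IOB2 and IOE2 results for each fold, and the Avg. row averages each column over the 25 folds; a small sketch reproducing these derived values (the fold data are abbreviated here):

```python
# Sketch reproducing the derived columns of the table above: the per-fold
# maximum of the IOB2 and IOE2 F1-scores, and the per-column averages over
# all 25 folds. Only the first few folds are listed for brevity.

folds = {                     # fold -> (IOB2 F1, IOE2 F1); abbreviated
    "00": (96.56, 96.54),
    "01": (96.72, 96.76),
    "02": (96.76, 96.81),
    # ... remaining folds as in the table above
}

max_f1 = {fold: max(iob2, ioe2) for fold, (iob2, ioe2) in folds.items()}

n = len(folds)
avg_iob2 = sum(iob2 for iob2, _ in folds.values()) / n
avg_ioe2 = sum(ioe2 for _, ioe2 in folds.values()) / n
avg_max = sum(max_f1.values()) / n

print(max_f1)                                      # per-fold Max F1-score column
print(round(avg_iob2, 2), round(avg_ioe2, 2), round(avg_max, 2))   # Avg. row
```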