Noun phrase chunking with FlexCRFs
We describe a case study of noun phrase chunking (NP chunking) with FlexCRFs as an example of labeling and segmenting sequence data. Text chunking (also known as phrase chunking, phrase recognition, or shallow parsing) - an intermediate step towards full parsing of natural language - recognizes phrase types (e.g., noun phrase - NP, verb phrase - VP, prepositional phrase - PP, etc.) in input text sentences. NP chunking deals with a part of this task: it recognizes only those chunks that are noun phrases (NPs). Here is an example of a sentence with its NP chunks marked: "[NP Rolls-Royce Motor Cars Inc.] expects [NP its U.S. sales] to remain steady at [NP about 1,200 cars] in [NP 1990]."
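In the CoNLL2000 shared task data used below, each token appears on its own line together with its part-of-speech tag and a chunk tag; with NP chunks encoded in the IOB2 scheme used in the experiments below (B-NP starts a noun phrase, I-NP continues it, O is outside any noun phrase), the beginning of the example sentence would look roughly as follows. The part-of-speech tags shown here are illustrative only and not taken verbatim from the corpus:

```
Rolls-Royce NNP  B-NP
Motor       NNP  I-NP
Cars        NNPS I-NP
Inc.        NNP  I-NP
expects     VBZ  O
its         PRP$ B-NP
U.S.        NNP  I-NP
sales       NNS  I-NP
to          TO   O
remain      VB   O
...
```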
The standard dataset put forward by Ramshaw and Marcus consists of sections 15-18 of the Wall Street Journal (WSJ) corpus as the training set and section 20 of that corpus as the test set. The description of the NP chunking task and the dataset can be downloaded from this site: http://staff.science.uva.nl/~erikt/research/np-chunking.html
The evaluation measures for this task are precision, recall, and F1-score computed over whole chunks: precision = a / b, recall = a / c, and F1-score = (2 x precision x recall) / (precision + recall), where a is the number of NP phrases correctly predicted by the CRF model, b is the total number of NP phrases predicted by the model, and c is the number of actual NP phrases annotated by humans.
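As a concrete illustration of these formulas, the sketch below computes phrase-based precision, recall, and F1-score from gold-standard and predicted IOB2 tag sequences. It is not part of FlexCRFs; the function names and the IOB2 tag set are our assumptions for the example.

```python
# Minimal sketch of chunk-based precision/recall/F1, assuming IOB2-encoded
# NP tags (B-NP, I-NP, O); function and variable names are illustrative only.

def np_chunks(tags):
    """Return NP chunks as (start, end) index pairs from an IOB2 tag sequence."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-NP":                      # a new chunk begins here
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag != "I-NP":                    # "O" (or any non-NP tag) ends a chunk
            if start is not None:
                chunks.append((start, i))
            start = None
    if start is not None:                      # chunk running to the end of the sentence
        chunks.append((start, len(tags)))
    return chunks

def evaluate(gold_sentences, pred_sentences):
    """Phrase-based precision, recall, and F1 over parallel lists of tag sequences."""
    a = b = c = 0                              # a: correct, b: predicted, c: actual
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_set, pred_set = set(np_chunks(gold)), set(np_chunks(pred))
        a += len(gold_set & pred_set)
        b += len(pred_set)
        c += len(gold_set)
    precision, recall = a / b, a / c
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```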
NP chunking on the dataset of the CoNLL2000 shared task with FlexCRFs (using second-order Markov CRFs):

| Item | File | Note |
|---|---|---|
| Raw training data file (sections 15-18 of WSJ) | train.txt (gzip) | provided by users |
| Raw testing data file (section 20 of WSJ) | test.txt (gzip) | provided by users |
| Training data after feature selection | train.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Testing data after feature selection | test.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Option file for the CRF model | option.txt | prepared by users |
| Output CRF model file (containing the model parameters) | model.txt (gzip) | automatically generated after training |
| Output training log file (saving information during training) | trainlog.txt | automatically generated during training |
| Output of the final testing data | test.tagged.model (gzip) | automatically generated after training |

The highest phrase-based result: precision = 94.70%, recall = 94.44%, and F1-score = 94.57% (at the 70th iteration).
The results of NP chunking with FlexCRFs (using second-order Markov dependencies) on the dataset of the CoNLL2000 shared task are presented in the table above. Please see the manual for instructions on how to run this experiment.
All-phrase chunking with PCRFs
We performed the full text chunking task (recognizing all phrase types: NP, VP, PP, etc.) on the dataset of the CoNLL2000 shared task using PCRFs on 90 parallel processors of a Cray XT3 system. The experimental settings and results are given in the following table. Please see the manual for complete instructions on how to run this experiment on massively parallel systems.
All-phrase chunking on the dataset of the CoNLL2000 shared task with PCRFs (using second-order Markov CRFs):

| Item | File | Note |
|---|---|---|
| Raw training data file (sections 15-18 of WSJ) | train.txt (gzip) | provided by users |
| Raw testing data file (section 20 of WSJ) | test.txt (gzip) | provided by users |
| Training data after feature selection | train.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Testing data after feature selection | test.tagged (gzip) | prepared using the "chunkingfeasel" utility |
| Option file for the CRF model | option.txt | prepared by users |
| Output CRF model file (containing the model parameters) | model.txt (gzip) | automatically generated after training |
| Output training log file (saving information during training) | trainlog.txt | automatically generated during training |
| Output of the final testing data | test.tagged.model (gzip) | automatically generated after training |

The highest phrase-based result: precision = 94.11%, recall = 93.98%, and F1-score = 94.05% (at the 63rd iteration).
Noun phrase chunking on a larger dataset (CoNLL2000-L)
Using PCRFs with second-order Markov CRFs, trained on 45 parallel processors of a Cray XT3 system. The training set consists of 39,832 sentences from sections 02-21 of WSJ, and the test set has 1,921 sentences from section 00. See this page for the accuracy comparison.
Number of features: 1,351,627 (IOB2) and 1,350,514 (IOE2).

| Initial q | IOB2 Precision | IOB2 Recall | IOB2 F1-score | IOE2 Precision | IOE2 Recall | IOE2 F1-score |
|---|---|---|---|---|---|---|
| 0.00 | 96.54% | 96.37% | 96.45 | 96.49% | 96.37% | 96.43 |
| 0.01 | 96.50% | 96.32% | 96.44 | 96.51% | 96.44% | 96.48 |
| 0.02 | 96.63% | 96.31% | 96.47 | 96.59% | 96.36% | 96.47 |
| 0.03 | 96.53% | 96.31% | 96.42 | 96.50% | 96.44% | 96.47 |
| 0.04 | 96.67% | 96.35% | 96.51 | 96.57% | 96.33% | 96.45 |
| 0.05 | 96.59% | 96.29% | 96.44 | 96.63% | 96.55% | 96.59 |
| 0.06 | 96.54% | 96.40% | 96.47 | 96.72% | 96.43% | 96.58 |
| 0.07 | 96.59% | 96.33% | 96.46 | 96.49% | 96.54% | 96.51 |

Voting: precision = 96.80%, recall = 96.68%, F1-score = 96.74.
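IOB2 and IOE2 above are two encodings of the same chunk annotation as per-token tags: IOB2 marks the first token of every chunk (e.g., B-NP), whereas IOE2 marks the last token of every chunk (e.g., E-NP), with I-NP for the remaining tokens inside a chunk and O for tokens outside. A minimal sketch of converting an IOB2-tagged sequence to IOE2, assuming these standard definitions (the conversion helper is ours, not part of FlexCRFs or PCRFs):

```python
# Minimal sketch: convert IOB2 chunk tags to IOE2 tags. Assumes the standard
# definitions of the two encodings; this helper is illustrative only.

def iob2_to_ioe2(tags):
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        chunk_type = tag.split("-", 1)[1]
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if nxt == "I-" + chunk_type:      # chunk continues at the next token
            out[i] = "I-" + chunk_type
        else:                             # this is the last token of the chunk
            out[i] = "E-" + chunk_type
    return out

print(iob2_to_ioe2(["B-NP", "I-NP", "O", "B-NP", "B-NP", "I-NP"]))
# -> ['I-NP', 'E-NP', 'O', 'E-NP', 'I-NP', 'E-NP']
```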
All-phrase chunking on a larger dataset (CoNLL2000-L)
Using PCRFs with second-order Markov CRFs, trained on 90 parallel processors of a Cray XT3 system. The training set consists of 39,832 sentences from sections 02-21 of WSJ, and the test set has 1,921 sentences from section 00.
Number of features: 1,471,007 (IOB2) and 1,466,312 (IOE2).

| Initial q | IOB2 Precision | IOB2 Recall | IOB2 F1-score | IOE2 Precision | IOE2 Recall | IOE2 F1-score |
|---|---|---|---|---|---|---|
| 0.00 | 96.09% | 96.04% | 96.06 | 96.10% | 96.10% | 96.10 |
| 0.01 | 96.09% | 96.04% | 96.06 | 96.12% | 96.09% | 96.11 |
| 0.02 | 96.11% | 96.10% | 96.10 | 96.19% | 96.09% | 96.14 |
| 0.03 | 96.09% | 96.01% | 96.05 | 96.13% | 96.08% | 96.11 |
| 0.04 | 96.07% | 95.98% | 96.03 | 96.16% | 96.04% | 96.10 |
| 0.05 | 96.12% | 96.01% | 96.07 | 96.13% | 96.04% | 96.09 |
| 0.06 | 96.10% | 96.00% | 96.05 | 96.20% | 96.17% | 96.18 |
| 0.07 | 96.03% | 96.07% | 96.05 | 96.12% | 96.17% | 96.15 |

Voting: precision = 96.33%, recall = 96.33%, F1-score = 96.33.
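The voting results above and in the previous table combine the outputs of the individual models (trained with different initial q values and with the two encodings). Purely as an illustration, and assuming a simple per-token majority vote over predicted tags after mapping all outputs to a common encoding (the actual combination scheme used in the experiments may differ), the idea can be sketched as follows:

```python
# Illustrative sketch of per-token majority voting over several models'
# predicted tag sequences for the same sentence. The voting scheme here is an
# assumption for illustration, not a description of the exact method used.

from collections import Counter

def vote(predictions):
    """predictions: list of tag sequences (one per model) for the same sentence."""
    voted = []
    for position_tags in zip(*predictions):    # tags of all models at one token
        tag, _count = Counter(position_tags).most_common(1)[0]
        voted.append(tag)
    return voted

models_output = [
    ["B-NP", "I-NP", "O", "B-NP"],             # e.g., model trained with q = 0.00
    ["B-NP", "I-NP", "O", "O"],                # e.g., model trained with q = 0.01
    ["B-NP", "O",    "O", "B-NP"],             # e.g., model trained with q = 0.02
]
print(vote(models_output))                     # -> ['B-NP', 'I-NP', 'O', 'B-NP']
```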
25-fold cross-validation test of noun phrase chunking on all 25 sections of the WSJ corpus
Using PCRFs with second-order Markov CRFs, each fold is trained on 45 parallel processors of a Cray XT3 system:
| Fold | IOB2 F1-score | IOE2 F1-score | Max F1-score | Fold | IOB2 F1-score | IOE2 F1-score | Max F1-score |
|---|---|---|---|---|---|---|---|
| 00 | 96.56 | 96.54 | 96.56 | 13 | 97.17 | 97.17 | 97.17 |
| 01 | 96.72 | 96.76 | 96.76 | 14 | 96.29 | 96.51 | 96.51 |
| 02 | 96.76 | 96.81 | 96.81 | 15 | 96.04 | 96.19 | 96.19 |
| 03 | 96.56 | 96.53 | 96.56 | 16 | 96.42 | 96.33 | 96.42 |
| 04 | 96.65 | 96.67 | 96.67 | 17 | 96.50 | 96.52 | 96.52 |
| 05 | 96.55 | 96.48 | 96.55 | 18 | 96.46 | 96.62 | 96.62 |
| 06 | 96.07 | 96.78 | 96.78 | 19 | 96.90 | 96.92 | 96.92 |
| 07 | 95.42 | 95.54 | 95.54 | 20 | 95.91 | 96.05 | 96.05 |
| 08 | 96.79 | 97.12 | 97.12 | 21 | 96.28 | 96.25 | 96.28 |
| 09 | 96.08 | 96.06 | 96.08 | 22 | 96.47 | 96.52 | 96.52 |
| 10 | 96.59 | 96.61 | 96.61 | 23 | 96.45 | 96.43 | 96.45 |
| 11 | 96.01 | 96.06 | 96.06 | 24 | 95.42 | 95.26 | 95.42 |
| 12 | 95.68 | 95.97 | 95.97 | Avg. | 96.35 | 96.42 | 96.45 |
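The Max F1-score column is the better of the IOB2 and IOE2 results for each fold, and the Avg. row averages each column over the 25 folds; a small sketch reproducing these derived values (the fold data are abbreviated here):

```python
# Sketch reproducing the derived columns of the table above: the per-fold
# maximum of the IOB2 and IOE2 F1-scores, and the per-column averages over
# all 25 folds. Only the first few folds are listed for brevity.

folds = {                     # fold -> (IOB2 F1, IOE2 F1); abbreviated
    "00": (96.56, 96.54),
    "01": (96.72, 96.76),
    "02": (96.76, 96.81),
    # ... remaining folds as in the table above
}

max_f1 = {fold: max(iob2, ioe2) for fold, (iob2, ioe2) in folds.items()}

n = len(folds)
avg_iob2 = sum(iob2 for iob2, _ in folds.values()) / n
avg_ioe2 = sum(ioe2 for _, ioe2 in folds.values()) / n
avg_max = sum(max_f1.values()) / n

print(max_f1)                                      # per-fold Max F1-score column
print(round(avg_iob2, 2), round(avg_ioe2, 2), round(avg_max, 2))   # Avg. row
```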