SourceForge

FlexCRFs: Flexible Conditional Random Fields

(Including PCRFs - a parallel version of FlexCRFs)

URL: http://flexcrfs.sourceforge.net/

Copyright (c) 2004-2005 by

Xuan-Hieu Phan (pxhieu at gmail dot com), Graduate School of Information Sciences, Tohoku University

Le-Minh Nguyen (nguyenml at jaist dot ac dot jp), Graduate School of Information Sciences, JAIST

Cam-Tu Nguyen (ncamtu at gmail dot com), College of Technology, Vietnam National University, Hanoi 




Introduction

FlexCRFs is a conditional random field (CRF) toolkit for segmenting and labeling sequence data, written in C/C++ using the Standard Template Library (STL). It is implemented based on the theoretical model presented in (Lafferty et al., 2001) and (Sha and Pereira, 2003). The toolkit uses L-BFGS (Liu and Nocedal, 1989), a limited-memory quasi-Newton optimization method, to train CRF models. FlexCRFs is designed to handle hundreds of thousands of data sequences and millions of features. It supports both first-order and second-order Markov CRFs. We have tested FlexCRFs on Linux (Red Hat, Fedora), Sun Solaris, and MS Windows with MS Visual C++.
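To make the training objective concrete, here is a minimal Python sketch of the conditional log-likelihood of a first-order linear-chain CRF, computed with the forward algorithm. It is not FlexCRFs code: the score tables `node` and `edge`, the toy values, and all function names are hypothetical, and the brute-force function exists only to check the forward recursion on a tiny example.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(labels, node, edge):
    """Unnormalized log-score of one label sequence.

    node[t][y] is the score of label y at position t;
    edge[p][y] is the score of moving from label p to label y.
    """
    score = node[0][labels[0]]
    for t in range(1, len(labels)):
        score += edge[labels[t - 1]][labels[t]] + node[t][labels[t]]
    return score


def log_partition(node, edge):
    """log Z(x) by the forward algorithm, in O(T * K^2) time."""
    n_labels = len(edge)
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [
            node[t][y] + logsumexp([alpha[p] + edge[p][y] for p in range(n_labels)])
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def log_likelihood(labels, node, edge):
    """Conditional log-likelihood log p(labels | x) of one sequence."""
    return sequence_score(labels, node, edge) - log_partition(node, edge)


def brute_force_log_partition(node, edge):
    """log Z(x) by enumerating every label sequence (tiny inputs only)."""
    n_labels = len(edge)
    return logsumexp([
        sequence_score(ys, node, edge)
        for ys in product(range(n_labels), repeat=len(node))
    ])


# Toy scores for a 3-token sequence with 2 labels (made-up numbers).
NODE = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.9]]
EDGE = [[0.2, -0.1], [0.0, 0.4]]
```

An optimizer such as L-BFGS would maximize the sum of `log_likelihood` over all training sequences with respect to the weights that produce `node` and `edge`; a second-order model would additionally condition each transition on the two previous labels.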

PCRFs is a parallel version of FlexCRFs for training conditional random fields on massively parallel processing systems that support the Message Passing Interface (MPI). PCRFs makes it practical to train CRFs on large-scale datasets containing up to millions of data sequences. We have tested PCRFs on large parallel systems such as Cray XT3, SGI Altix, and IBM SP.
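One common way to parallelize this kind of training is data parallelism: each process computes the gradient on its own share of the training sequences, and the partial gradients are summed across processes (the role that MPI_Allreduce with MPI_SUM plays in MPI). The Python sketch below simulates that pattern serially. It is an assumption about the general technique, not a description of the PCRFs implementation: the function names are hypothetical, and the per-example gradient is a toy squared-error stand-in rather than a CRF gradient.

```python
def split_into_shards(data, n_workers):
    """Round-robin partition of the training data across workers."""
    return [data[rank::n_workers] for rank in range(n_workers)]


def local_gradient(weights, examples):
    """Toy stand-in for a per-shard gradient.

    Uses the squared error of a linear model; a CRF trainer would
    compute the log-likelihood gradient of its sequences here instead.
    """
    grad = [0.0] * len(weights)
    for features, target in examples:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def all_reduce_sum(vectors):
    """Element-wise sum of the workers' partial gradients."""
    return [sum(column) for column in zip(*vectors)]


def parallel_gradient(weights, data, n_workers):
    """Full-data gradient assembled from per-shard gradients."""
    shards = split_into_shards(data, n_workers)
    # In a real MPI program each shard would live on a different process;
    # here the "workers" simply run one after another.
    partials = [local_gradient(weights, shard) for shard in shards]
    return all_reduce_sum(partials)


# Made-up data: (feature vector, target) pairs.
DATA = [
    ([1.0, 2.0], 1.0),
    ([0.5, -1.0], 0.0),
    ([2.0, 0.0], 3.0),
    ([1.5, 1.0], -1.0),
    ([0.0, 1.0], 2.0),
]
WEIGHTS = [0.1, -0.3]
```

Because the objective is a sum over training examples, the reduced gradient equals the gradient computed on the whole dataset, so the optimizer sees the same values regardless of the number of workers.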

All comments, suggestions, and bug reports are highly appreciated.

License

FlexCRFs and PCRFs are free software. You can redistribute and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version.

FlexCRFs and PCRFs are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Please see the GNU General Public License for more details.

Any research using this tool for running experiments should include the following citation:

Xuan-Hieu Phan, Le-Minh Nguyen, and Cam-Tu Nguyen, "FlexCRFs: Flexible Conditional Random Field Toolkit", http://flexcrfs.sourceforge.net, 2005.

Here is the BibTeX entry:

@misc{PhanFlexCRFs,
  author = "Xuan-Hieu Phan and Le-Minh Nguyen and Cam-Tu Nguyen",
  title = "FlexCRFs: Flexible Conditional Random Field Toolkit",
  note = "http://flexcrfs.sourceforge.net",
  year = 2005
}

News

- April 03, 2007: Win-32 binary version available.

Document & Source Code

The source code and documentation of FlexCRFs and PCRFs can be downloaded here. Please see the manual for how to compile and run FlexCRFs and PCRFs.

Several related tools are now available online:

GibbsLDA++: A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs sampling for parameter estimation and inference. GibbsLDA++ is fast and is designed to analyze hidden/latent topic structures of large-scale (text) data collections.

CRFTagger: A Java-based Conditional Random Fields Part-of-Speech (POS) tagger for English. The model was trained on sections 01..24 of the WSJ corpus, with section 00 used as the development test set (accuracy of 97.00%). Tagging speed: 500 sentences / second.

CRFChunker: A Java-based Conditional Random Fields phrase chunker (phrase chunking tool) for English. The model was trained on sections 01..24 of the WSJ corpus, with section 00 used as the development test set (F1-score of 95.77). Chunking speed: 700 sentences / second.

JTextPro: A Java-based text processing tool that includes sentence boundary detection (using maximum entropy classifier), word tokenization (following Penn convention), part-of-speech tagging (using CRFTagger), and phrase chunking (using CRFChunker).

JWebPro: A Java-based tool that can interact with Google search via Google Web APIs and then process the returned Web documents in a couple of ways. The outputs of JWebPro can serve as inputs for natural language processing, information retrieval, information extraction, Web data mining, online social network extraction/analysis, and ontology development applications.

JVnSegmenter: A Java-based, open-source Vietnamese word segmentation tool. The segmentation model in this tool was trained on about 8,000 labeled sentences using FlexCRFs. It should be useful to the Vietnamese NLP community.

Case Studies

Noun phrase chunking with FlexCRFs:

We describe a case study of noun phrase chunking (NP chunking) with FlexCRFs as an example of labeling and segmenting sequence data. Text chunking (also known as phrase chunking, phrase recognition, or shallow parsing) - an intermediate step towards full parsing of natural language - recognizes phrase types (e.g., noun phrase - NP, verb phrase - VP, prepositional phrase - PP, etc.) in input text sentences. NP chunking deals with a part of this task: it involves recognizing the chunks that consist of noun phrases (NPs). Here is an example of a sentence with NP chunks marked: "[NP Rolls-Royce Motor Cars Inc.] expects [NP its U.S. sales] to remain steady at [NP about 1,200 cars] in [NP 1990]."

The standard dataset put forward by Ramshaw and Marcus consists of sections 15-18 of the Wall Street Journal (WSJ) corpus as the training set and section 20 of that corpus as the testing set. The description of the NP chunking task and the dataset can be downloaded from this site: http://staff.science.uva.nl/~erikt/research/np-chunking.html

The evaluation measures for this task are precision, recall, and F1-score based on whole chunks: precision = a / b; recall = a / c; F1-score = 2 x precision x recall / (precision + recall), in which a is the number of NP phrases correctly predicted by the CRF model, b is the number of NP phrases predicted by the CRF model, and c is the number of actual NP phrases annotated by humans.
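The chunk-based measures above can be sketched in Python as follows. This is an illustrative sketch, not the evaluation script used for the reported results: it assumes tags in the IOB2 format (B-NP, I-NP, O), counts a predicted chunk as correct only when both its boundaries and its type match a gold chunk, and uses hypothetical function names. The predicted tags in the example are made up to show one boundary error.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks in an IOB2 tag sequence.

    `end` is exclusive. An I- tag that does not continue an open chunk
    of the same type is tolerated and treated as the start of a new chunk.
    """
    chunks = set()
    start, chunk_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if start is not None and not continues:
            chunks.add((start, i, chunk_type))
            start, chunk_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, chunk_type = i, tag[2:]
    return chunks


def chunk_scores(gold_tags, predicted_tags):
    """Return (precision, recall, F1) computed over whole chunks."""
    gold = extract_chunks(gold_tags)
    predicted = extract_chunks(predicted_tags)
    a = len(gold & predicted)  # correctly predicted chunks
    b = len(predicted)         # chunks predicted by the model
    c = len(gold)              # chunks annotated by humans
    precision = a / b if b else 0.0
    recall = a / c if c else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "Rolls-Royce Motor Cars Inc. expects its U.S. sales to remain steady
#  at about 1,200 cars in 1990 ." (18 tokens, 4 gold NP chunks)
GOLD = ["B-NP", "I-NP", "I-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP", "O",
        "O", "O", "O", "B-NP", "I-NP", "I-NP", "O", "B-NP", "O"]
# Hypothetical prediction that misses "about" in "about 1,200 cars".
PREDICTED = ["B-NP", "I-NP", "I-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP", "O",
             "O", "O", "O", "O", "B-NP", "I-NP", "O", "B-NP", "O"]
```

For a whole test set, the counts a, b, and c would be summed over all sentences before the three ratios are computed.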

NP chunking on the CoNLL2000 shared task dataset with FlexCRFs (using second-order Markov CRFs):

Description | File | Origin
Raw training data (sections 15-18 of WSJ) | train.txt (gzip) | provided by users
Raw testing data (section 20 of WSJ) | test.txt (gzip) | provided by users
Training data after feature selection | train.tagged (gzip) | prepared with the "chunkingfeasel" utility
Testing data after feature selection | test.tagged (gzip) | prepared with the "chunkingfeasel" utility
Option file for the CRF model | option.txt | prepared by users
Output CRF model file (model parameters) | model.txt (gzip) | generated automatically after training
Output training log file (information saved during training) | trainlog.txt | generated automatically during training
Output of final testing data | test.tagged.model (gzip) | generated automatically after training

The highest phrase-based result: precision = 94.70%, recall = 94.44%, and F1-score = 94.57% (at the 70th iteration)

The results of NP chunking with FlexCRFs (using second-order Markov dependency) on the dataset of the CoNLL2000 shared task are presented in the above table. Please see the manual document for how to run the experiment.

All-phrase chunking with PCRFs:

Text chunking (also known as phrase chunking, phrase recognition, or shallow parsing) - an intermediate step towards full parsing of natural language - recognizes phrase types (e.g., noun phrase - NP, verb phrase - VP, prepositional phrase - PP, etc.) in input text sentences. We performed the chunking task on the dataset of the CoNLL2000 shared task using PCRFs on 90 parallel processors of a Cray XT3 system. The experimental settings and results are given in the following table. Please see the manual for complete instructions on how to run the experiment on massively parallel systems.

All-phrase chunking on the CoNLL2000 shared task dataset with PCRFs (using second-order Markov CRFs):

Description | File | Origin
Raw training data (sections 15-18 of WSJ) | train.txt (gzip) | provided by users
Raw testing data (section 20 of WSJ) | test.txt (gzip) | provided by users
Training data after feature selection | train.tagged (gzip) | prepared with the "chunkingfeasel" utility
Testing data after feature selection | test.tagged (gzip) | prepared with the "chunkingfeasel" utility
Option file for the CRF model | option.txt | prepared by users
Output CRF model file (model parameters) | model.txt (gzip) | generated automatically after training
Output training log file (information saved during training) | trainlog.txt | generated automatically during training
Output of final testing data | test.tagged.model (gzip) | generated automatically after training

The highest phrase-based result: precision = 94.11%, recall = 93.98%, and F1-score = 94.05% (at the 63rd iteration)

NP chunking on a larger dataset (CoNLL2000-L) (using PCRFs with second-order Markov CRFs, training on 45 parallel processors of a Cray XT3 system):

The training set consists of 39,832 sentences from sections 02-21 of WSJ, and the testing set has 1,921 sentences from section 00. See this page for an accuracy comparison.

IOB2, #features: 1,351,627
IOE2, #features: 1,350,514

Initial θ | IOB2 Precision | IOB2 Recall | IOB2 F1-score | IOE2 Precision | IOE2 Recall | IOE2 F1-score
0.00 | 96.54% | 96.37% | 96.45 | 96.49% | 96.37% | 96.43
0.01 | 96.50% | 96.32% | 96.44 | 96.51% | 96.44% | 96.48
0.02 | 96.63% | 96.31% | 96.47 | 96.59% | 96.36% | 96.47
0.03 | 96.53% | 96.31% | 96.42 | 96.50% | 96.44% | 96.47
0.04 | 96.67% | 96.35% | 96.51 | 96.57% | 96.33% | 96.45
0.05 | 96.59% | 96.29% | 96.44 | 96.63% | 96.55% | 96.59
0.06 | 96.54% | 96.40% | 96.47 | 96.72% | 96.43% | 96.58
0.07 | 96.59% | 96.33% | 96.46 | 96.49% | 96.54% | 96.51
Voting | Precision = 96.80%, Recall = 96.68%, F1-score = 96.74
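IOB2 and IOE2 in the results above are two ways of encoding the same chunks as per-token tags. In IOB2 the first token of every chunk is tagged B-X and the remaining tokens I-X; in IOE2 the last token of every chunk is tagged E-X and the preceding tokens I-X; tokens outside any chunk are tagged O in both. The Python sketch below shows the two encodings side by side. It only illustrates the tagging schemes, not the toolkit's own data preparation, and the function name and chunk tuples are hypothetical.

```python
def chunks_to_tags(n_tokens, chunks, scheme="IOB2"):
    """Encode chunks as per-token tags.

    `chunks` is a list of (start, end, type) with `end` exclusive.
    IOB2 marks the first token of each chunk with B-;
    IOE2 marks the last token of each chunk with E-.
    """
    if scheme not in ("IOB2", "IOE2"):
        raise ValueError("unknown scheme: " + scheme)
    tags = ["O"] * n_tokens
    for start, end, chunk_type in chunks:
        for i in range(start, end):
            tags[i] = "I-" + chunk_type
        if scheme == "IOB2":
            tags[start] = "B-" + chunk_type
        else:
            tags[end - 1] = "E-" + chunk_type
    return tags


# "its U.S. sales to remain steady at about 1,200 cars" (10 tokens)
CHUNKS = [(0, 3, "NP"), (7, 10, "NP")]
IOB2_TAGS = chunks_to_tags(10, CHUNKS, "IOB2")
IOE2_TAGS = chunks_to_tags(10, CHUNKS, "IOE2")
```

Because the two encodings place the boundary marker on different tokens, models trained on them see different label sequences and different features, which is consistent with the slightly different feature counts reported for IOB2 and IOE2.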

All-phrase chunking on a larger dataset (CoNLL2000-L) (using PCRFs with second-order Markov CRFs, training on 90 parallel processors of a Cray XT3 system):

The training set consists of 39,832 sentences from sections 02-21 of WSJ, and the testing set has 1,921 sentences from section 00.

IOB2, #features: 1,471,007
IOE2, #features: 1,466,312

Initial θ | IOB2 Precision | IOB2 Recall | IOB2 F1-score | IOE2 Precision | IOE2 Recall | IOE2 F1-score
0.00 | 96.09% | 96.04% | 96.06 | 96.10% | 96.10% | 96.10
0.01 | 96.09% | 96.04% | 96.06 | 96.12% | 96.09% | 96.11
0.02 | 96.11% | 96.10% | 96.10 | 96.19% | 96.09% | 96.14
0.03 | 96.09% | 96.01% | 96.05 | 96.13% | 96.08% | 96.11
0.04 | 96.07% | 95.98% | 96.03 | 96.16% | 96.04% | 96.10
0.05 | 96.12% | 96.01% | 96.07 | 96.13% | 96.04% | 96.09
0.06 | 96.10% | 96.00% | 96.05 | 96.20% | 96.17% | 96.18
0.07 | 96.03% | 96.07% | 96.05 | 96.12% | 96.17% | 96.15
Voting | Precision = 96.33%, Recall = 96.33%, F1-score = 96.33

25-fold Cross-Validation Test of NP chunking on all 25 sections of WSJ corpus (using PCRFs with second-order Markov CRFs, each fold is trained on 45 parallel processors of a Cray XT3 system):

Fold | IOB2 F1-score | IOE2 F1-score | Max F1-score
00 | 96.56 | 96.54 | 96.56
01 | 96.72 | 96.76 | 96.76
02 | 96.76 | 96.81 | 96.81
03 | 96.56 | 96.53 | 96.56
04 | 96.65 | 96.67 | 96.67
05 | 96.55 | 96.48 | 96.55
06 | 96.07 | 96.78 | 96.78
07 | 95.42 | 95.54 | 95.54
08 | 96.79 | 97.12 | 97.12
09 | 96.08 | 96.06 | 96.08
10 | 96.59 | 96.61 | 96.61
11 | 96.01 | 96.06 | 96.06
12 | 95.68 | 95.97 | 95.97
13 | 97.17 | 97.17 | 97.17
14 | 96.29 | 96.51 | 96.29
15 | 96.04 | 96.19 | 96.19
16 | 96.42 | 96.33 | 96.42
17 | 96.50 | 96.52 | 96.52
18 | 96.46 | 96.62 | 96.62
19 | 96.90 | 96.92 | 96.92
20 | 95.91 | 96.05 | 96.05
21 | 96.28 | 96.25 | 96.28
22 | 96.47 | 96.52 | 96.52
23 | 96.45 | 96.43 | 96.45
24 | 95.42 | 95.26 | 95.42
Avg. | 96.35 | 96.42 | 96.45
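The fold structure of this experiment can be sketched in Python as follows. The sketch assumes that fold k holds out WSJ section k for testing and trains on the other 24 sections; the results above label folds 00-24 but do not spell out the split, so treat that as an assumption. The function names are hypothetical, and the IOB2 scores are copied from the table only to show how the average row is obtained.

```python
def leave_one_section_out(sections):
    """Yield (training_sections, held_out_section) for each fold."""
    for held_out in sections:
        training = [s for s in sections if s != held_out]
        yield training, held_out


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# WSJ section names "00" .. "24".
SECTIONS = ["%02d" % i for i in range(25)]
FOLDS = list(leave_one_section_out(SECTIONS))

# IOB2 F1-scores for folds 00 .. 24, copied from the table above.
IOB2_F1 = [
    96.56, 96.72, 96.76, 96.56, 96.65, 96.55, 96.07, 95.42, 96.79,
    96.08, 96.59, 96.01, 95.68, 97.17, 96.29, 96.04, 96.42, 96.50,
    96.46, 96.90, 95.91, 96.28, 96.47, 96.45, 95.42,
]
IOB2_AVERAGE = mean(IOB2_F1)
```

Averaging the 25 IOB2 scores this way gives 96.35, which matches the Avg. row.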

Links

Citations

      Here is an incomplete list of published papers that used FlexCRFs to conduct their experiments:

      Query the Web for more information, discussion, and comments.

Acknowledgements

We would like to thank Professor Tu-Bao Ho for providing us with the Penn Treebank data for evaluation.

We would like to thank Professor Jorge Nocedal, Department of Electrical and Computer Engineering, School of Engineering and Applied Science, Northwestern University, for providing the L-BFGS FORTRAN source code. www.ece.northwestern.edu/~nocedal

The C version of L-BFGS used in this project is borrowed from the CRF++ project developed by Taku Kudo (www.chasen.org/~taku/software/CRF++). We would like to thank him for his open source project.

A part of this project, the training section (e.g., the computation of the log-likelihood function and its gradient vector), is based on the Java source code of the CRF project developed by Professor Sunita Sarawagi, KR School of Information Technology, IIT Bombay. We would like to thank Professor Sunita Sarawagi for sharing her CRF package and answering related questions. www.it.iitb.ac.in/~sunita

We would like to thank SourceForge.net for hosting this project.


Reference

(Berger et al., 1996): A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, Vol.22, No.1, pp.39-71, 1996.

(Lafferty et al., 2001): J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML), pp.282-289, 2001.

(Liu and Nocedal, 1989): D. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, Vol.45, pp.503-528, 1989.

(Sha and Pereira, 2003): F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2003.


Last updated: February 26, 2008