Conditional random fields | We use CRF++ (Kudo, 2007) as our implementation of conditional random fields.
Conditional random fields | We adopt the default parameter settings of CRF++, so no development or tuning set is needed in our work.
Experimental design | These are learning experiments, so we also use tenfold cross-validation, in the same way as with CRF++.
Experimental results | Figure 1 shows how the error rate is affected by increasing the CRF probability threshold for each language. |
Experimental results | For English, the CRF using the Viterbi path has an overall error rate of 0.84%, compared to 6.81% for the TeX algorithm using American English patterns, roughly eight times higher.
Experimental results | However, the serious error rate for the CRF is worse: 0.41% compared to 0.24%.
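Experimental results | As a quick sanity check on the reported ratios (a minimal sketch; the percentages are the figures quoted above, not new measurements):

```python
# The error rates below are those quoted in the text (as percentages).
crf_overall = 0.84   # CRF (Viterbi path), overall error rate
tex_overall = 6.81   # TeX algorithm, American English patterns
crf_serious = 0.41   # CRF, serious error rate
tex_serious = 0.24   # TeX, serious error rate

print(f"overall: TeX is {tex_overall / crf_overall:.1f}x the CRF rate")  # ≈ 8.1
print(f"serious: CRF is {crf_serious / tex_serious:.1f}x the TeX rate")  # ≈ 1.7
```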
Abstract | This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method.
Abstract | In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. |
Introduction | In the “NEWS 2009 Machine Transliteration Shared Task”, a new two-step CRF model for the transliteration task was proposed (Yang et al., 2009), in which the first step segments a word in the source language into character chunks and the second step performs a context-dependent mapping from each chunk to one written unit in the target language.
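Introduction | The two-step structure described above can be sketched as follows; the B/I chunk tags, the example word, and the conversion table are invented for illustration, and each step would in reality be the output of a trained CRF:

```python
# Illustrative sketch of the two-step pipeline (not the authors' code).
# Step 1: a segmentation CRF tags source characters with B/I chunk labels;
# step 2: a conversion CRF maps each chunk to one target-language unit.
# Here both steps are replaced by fixed toy outputs and a toy lookup table.

def chunks_from_tags(chars, tags):
    """Group characters into chunks using B/I tags (B starts a new chunk)."""
    chunks = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not chunks:
            chunks.append(ch)
        else:  # "I": extend the current chunk
            chunks[-1] += ch
    return chunks

# Pretend step-1 output for the source word "kyoto"
chars = list("kyoto")
tags = ["B", "I", "I", "B", "I"]
chunks = chunks_from_tags(chars, tags)  # ['kyo', 'to']

# Toy step-2 conversion table (hypothetical; a real system uses the second CRF)
convert = {"kyo": "キョ", "to": "ト"}
target = "".join(convert[c] for c in chunks)
print(chunks, target)  # ['kyo', 'to'] キョト
```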
Introduction | In this paper, we propose to jointly optimize a two-step CRF model. |
Introduction | The rest of this paper is organized as follows: Section 2 explains the two-step CRF method, followed by Section 3, which describes our joint optimization method and its fast decoding algorithm; Section 4 introduces a rapid implementation of a JSCM system in the weighted finite-state transducer (WFST) framework; and the last section reports the experimental results and conclusions.
Two-step CRF method | 2.1 CRF introduction |
Two-step CRF method | CRF training is usually performed through the L-BFGS algorithm (Wallach, 2002), and decoding is performed by the Viterbi algorithm.
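Two-step CRF method | Viterbi decoding, mentioned above, can be sketched as follows in log space; the toy states and scores are illustrative, not tied to any of the cited systems:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Viterbi decoding for a first-order sequence model, in log space."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for t in states:
            # Best previous state for each current state t
            prev = max(states, key=lambda s: V[-1][s] + log_trans[s][t])
            scores[t] = V[-1][prev] + log_trans[prev][t] + log_emit[t][o]
            ptrs[t] = prev
        V.append(scores)
        back.append(ptrs)
    # Backtrack from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Toy two-tag model (scores are illustrative, not from any cited system)
states = ["B", "I"]
log_start = {"B": math.log(0.9), "I": math.log(0.1)}
log_trans = {"B": {"B": math.log(0.4), "I": math.log(0.6)},
             "I": {"B": math.log(0.5), "I": math.log(0.5)}}
log_emit = {"B": {"x": math.log(0.7), "y": math.log(0.3)},
            "I": {"x": math.log(0.2), "y": math.log(0.8)}}
print(viterbi(["x", "y", "y"], states, log_start, log_trans, log_emit))
# ['B', 'I', 'I']
```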
Two-step CRF method | We formalize machine transliteration as a CRF tagging problem, as shown in Figure 2. |
Base Models | Semi-CRFs segment and label the text simultaneously, whereas a linear-chain CRF will only label each word, and segmentation is implied by the labels assigned to the words. |
Base Models | When doing named entity recognition, a semi-CRF will have one node for each entity, unlike a regular CRF, which will have one node for each word. See Figure 3(a,b) for an example of a semi-CRF and a linear-chain CRF over the same sentence.
Base Models | Note that the entity Hilary Clinton has one node in the semi-CRF representation, but two nodes in the linear-chain CRF.
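Base Models | The relationship between the two representations can be illustrated by collapsing a linear-chain CRF's per-word BIO labels into the per-entity nodes a semi-CRF uses directly (a minimal sketch; the tags and tokens are invented):

```python
def bio_to_entities(tokens, tags):
    """Collapse BIO tags into (entity_text, type) spans: the nodes a
    semi-CRF would represent directly, one per entity."""
    entities, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" closes any open entity
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        entities.append((" ".join(current), ctype))
    return entities

tokens = ["Hilary", "Clinton", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]   # linear-chain CRF: one tag per word
print(bio_to_entities(tokens, tags))
# [('Hilary Clinton', 'PER'), ('Paris', 'LOC')]  -- semi-CRF: one node per entity
```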
Experiments and Discussion | For each section of the data (ABC, MNB, NBC, PRI, VOA) we ran experiments training a linear-chain CRF on only the named entity information, a CRF-CFG parser on only the parse information, a joint parser and named entity recognizer, and our hierarchical model. |
Hierarchical Joint Learning | Figure 3: A linear-chain CRF (a) labels each word, whereas a semi-CRF (b) labels entire entities. |
Extraction with Lexicons | Domain-independence requires access to an extremely large number of lists, but our tight integration of lexicon acquisition and CRF learning requires that relevant lists be accessed instantaneously. |
Extraction with Lexicons | While training a CRF extractor for a given relation, LUCHS uses its corpus of lists to automatically generate a set of semantic lexicons — specific to that relation. |
Extraction with Lexicons | The semantic lexicons are added as features to the CRF learning algorithm. |
Introduction | These lexicons form Boolean features which, along with lexical and dependency parser-based features, are used to produce a CRF extractor for each relation — one which performs much better than lexicon-free extraction on sparse training data. |
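Introduction | A minimal sketch of how lexicon membership becomes a Boolean token feature (the lexicon names and contents here are hypothetical, not LUCHS's actual lists):

```python
# Hypothetical semantic lexicons; LUCHS derives these automatically per relation.
lexicons = {
    "in_city_list": {"paris", "london", "seattle"},
    "in_person_list": {"hilary", "clinton"},
}

def lexicon_features(token):
    """One Boolean feature per lexicon: does the lowercased token appear in it?
    These would be added alongside lexical and parser-based features."""
    t = token.lower()
    return {name: (t in entries) for name, entries in lexicons.items()}

print(lexicon_features("Paris"))
# {'in_city_list': True, 'in_person_list': False}
```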
Learning Extractors | We use a linear-chain conditional random field (CRF), an undirected graphical model connecting a sequence of input and output random variables.
Learning Extractors | The CRF models are represented with a log-linear distribution |
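Learning Extractors | The equation this sentence introduces is missing from the excerpt; the conventional log-linear form of a linear-chain CRF, which it presumably states, is:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

where the $f_k$ are feature functions, the $\lambda_k$ their learned weights, and $Z(x)$ the partition function summing over all label sequences.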
Conclusion | WOE can run in two modes: a CRF extractor (WOEPOS) trained with shallow features like POS tags, and a pattern classifier (WOEparse) learned from dependency-path patterns.
Experiments | To compare with TextRunner, we tested four different ways to generate training examples from Wikipedia for learning a CRF extractor. |
Experiments | The CRF extractors are trained using the same learning algorithm and feature selection as TextRunner. |
Related Work | Wu and Weld proposed the KYLIN system (Wu and Weld, 2007; Wu et al., 2008) which has the same spirit of matching Wikipedia sentences with infoboxes to learn CRF extractors. |
Wikipedia-based Open IE | [Figure: WOE architecture. A primary-entity matcher and sentence matcher feed a learner, which produces a pattern classifier over parser features and a CRF extractor over shallow features.]
Wikipedia-based Open IE | In contrast, WOEPOS (like TextRunner) trains a conditional random field (CRF) to output certain text between noun phrases when the text denotes such a relation.
Wikipedia-based Open IE | Since high speed can be crucial when processing Web-scale corpora, we additionally learn a CRF extractor WOEPOS based on shallow features like POS-tags. |
Introduction | To address the task of open-domain predicate identification, we construct a Conditional Random Field (CRF) (Lafferty et al., 2001) model with target labels of B-Pred, I-Pred, and O-Pred (for the beginning, interior, and outside of a predicate).
Introduction | We use an open-source CRF software package to implement our CRF models. We use words, POS tags, chunk labels, and the predicate label at the preceding and following nodes as features for our Baseline system.
Introduction | We refer to the CRF model with these features as our Baseline SRL system; in what follows we extend the Baseline model with more sophisticated features. |
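Introduction | A hedged sketch of such a per-token feature function (the feature names and window layout are illustrative, not the authors' exact template):

```python
# Sketch of a Baseline-style feature set: word, POS tag, and chunk label for
# the current token, plus neighboring observations. The predicate labels of
# neighboring nodes would be captured by the CRF's transition features at
# decoding time rather than precomputed here.
def baseline_features(i, words, pos, chunks):
    feats = {
        "word": words[i],
        "pos": pos[i],
        "chunk": chunks[i],
    }
    if i > 0:
        feats["prev_word"] = words[i - 1]
    if i < len(words) - 1:
        feats["next_word"] = words[i + 1]
    return feats

words = ["He", "gave", "up"]
pos = ["PRP", "VBD", "RP"]
chunks = ["B-NP", "B-VP", "B-PRT"]
print(baseline_features(1, words, pos, chunks))
```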
Background | To this end, several popular machine-learning methods could be utilized, like the Bayesian classifier (BC) (Kupiec et al., 1999), Gaussian mixture model (GMM) (Fattah and Ren, 2009), hidden Markov model (HMM) (Conroy and O'Leary, 2001), support vector machine (SVM) (Kolcz et al., 2001), maximum entropy (ME) (Ferrier, 2001), and conditional random field (CRF) (Galley, 2006; Shen et al., 2007), to name a few.
Background | Although such supervised summarizers are effective, most of them (except the CRF) implicitly assume that sentences are independent of each other (the so-called “bag-of-sentences” assumption) and classify each sentence individually, without leveraging the relationships among sentences (Shen et al., 2007).
Experimental results and discussions 6.1 Baseline experiments | In the final set of experiments, we compare our proposed summarization methods with a few existing summarization methods that have been widely used in various summarization tasks, including LEAD, VSM, LexRank, and CRF; the corresponding results are shown in Table 5.
Experimental results and discussions 6.1 Baseline experiments | To our surprise, CRF does not provide superior results as compared to the other summarization methods. |
Experimental results and discussions 6.1 Baseline experiments | One possible explanation is that the structural evidence of the spoken documents in the test set is not strong enough for CRF to show its advantage of modeling the local structural information among sentences. |
Clustering-based word representations | However, the CRF chunker in Huang and Yates (2009), which uses their HMM word clusters as extra features, achieves F1 lower than a baseline CRF chunker (Sha & Pereira, 2003).
Supervised evaluation tasks | The linear CRF chunker of Sha and Pereira (2003) is a standard, near-state-of-the-art baseline chunker.
Supervised evaluation tasks | In fact, many off-the-shelf CRF implementations now replicate Sha and Pereira (2003), including their choice of feature set: |
Supervised evaluation tasks | - CRF++ by Taku Kudo (http://crfpp.
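Supervised evaluation tasks | As an illustration of how such a feature set is expressed in CRF++, a feature template in CRF++'s own format might look like the following; the window sizes and column layout are illustrative, not Sha and Pereira's exact configuration (column 0 holds the word, column 1 the POS tag):

```
# Unigram features: words in a +/-2 window, current POS, and a POS bigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]
U06:%x[-1,1]/%x[0,1]

# Bigram template: adds features over pairs of adjacent output tags
B
```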
Conditional Random Fields | Hence, any numerical optimization strategy may be used and practical solutions include limited memory BFGS (L-BFGS) (Liu and Nocedal, 1989), which is used in the popular CRF++ (Kudo, 2005) and CRFsuite (Okazaki, 2007) packages; conjugate gradient (Nocedal and Wright, 2006) and Stochastic Gradient Descent (SGD) (Bottou, 2004; Vishwanathan et al., 2006), used in CRFsgd (Bottou, 2007). |
Conditional Random Fields | CRF++ and CRFsgd perform the forward-backward recursions in the log domain, which guarantees that numerical over- and underflows are avoided regardless of the sequence length T^(i).
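Conditional Random Fields | The log-domain trick can be sketched as follows: all quantities stay as log scores, and sums of probabilities are computed with a numerically stable log-sum-exp, so neither overflow nor underflow occurs however long the sequence (a generic illustration, not CRF++'s actual code):

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_forward(n_steps, states, log_start, log_trans, log_emit_seq):
    """Forward recursion kept entirely in the log domain; returns log Z.
    log_emit_seq[t][s] is the log emission/potential score at step t."""
    alpha = {s: log_start[s] + log_emit_seq[0][s] for s in states}
    for t in range(1, n_steps):
        alpha = {
            s2: logsumexp([alpha[s1] + log_trans[s1][s2] for s1 in states])
            + log_emit_seq[t][s2]
            for s2 in states
        }
    return logsumexp(list(alpha.values()))

# Toy uniform model: all mass sums to 1, so log Z should be ~0
ls = {"A": math.log(0.5), "B": math.log(0.5)}
lt = {s: {"A": math.log(0.5), "B": math.log(0.5)} for s in ("A", "B")}
le = [{"A": 0.0, "B": 0.0}] * 3
print(log_forward(3, ["A", "B"], ls, lt, le))  # ~0.0
```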
Conditional Random Fields | The CRF package developed in the course of this study implements many algorithmic optimizations and allows the design of innovative training strategies, such as the one presented in Section 5.4.