Introduction | §2 introduces the maximum entropy (maxent) and conditional random field ( CRF ) learning techniques employed, along with specifications for the design and training of our hierarchical prior. |
Investigation | Specifically, we compared our approximate hierarchical prior model (HIER), implemented as a CRF, against three baselines: • GAUSS: CRF model tuned on a single domain’s data, using a standard N(0, 1) prior; • CAT: CRF model tuned on a concatenation of multiple domains’ data, using a N(0, 1) prior; • CHELBA: CRF model tuned on one domain’s data, using a prior trained on a different, related domain’s data (cf. |
Investigation | Line a shows the F1 performance of a CRF model tuned only on the target MUC6 domain (GAUSS) across a range of tuning data sizes. |
Investigation | Line b shows the same experiment, but this time the CRF model has been tuned on a dataset composed of a simple concatenation of the training MUC6 data from (a), along with a different training set from MUC7 (CAT). |
Models considered 2.1 Basic Conditional Random Fields | The parametric form of the CRF for a sentence of length n is given as follows: |
Models considered 2.1 Basic Conditional Random Fields | CRF learns a model consisting of a set of weights Λ = {λ_1 ... λ_F} over the features so as to maximize the conditional likelihood of the training data, p(Y_train|X_train), given the model p_Λ. |
Models considered 2.1 Basic Conditional Random Fields | 2.2 CRF with Gaussian priors |
Models 2.1 Baseline Models | A conditional random field ( CRF ) (Lafferty et al., 2001) defines the conditional probability as a linear score for each candidate y and a global normalization term: |
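The definition above (a linear score for each candidate y, divided by a global normalization term) can be illustrated with a brute-force sketch. This is not any of the excerpted papers' implementations: the feature keys `("emit", token, label)` and `("trans", prev, label)` and the function names are assumptions made for illustration, and the normalizer is computed by enumerating every labelling, which is only feasible for tiny inputs.

```python
import itertools
import math


def score(weights, x, y):
    """Linear score of labelling y for tokens x (hypothetical emission/transition features)."""
    total = 0.0
    for i, (token, label) in enumerate(zip(x, y)):
        total += weights.get(("emit", token, label), 0.0)
        if i > 0:
            total += weights.get(("trans", y[i - 1], label), 0.0)
    return total


def crf_probability(weights, x, y, labels):
    """p(y | x) = exp(score(x, y)) / Z(x), where Z(x) sums over all labellings of x."""
    z = sum(
        math.exp(score(weights, x, candidate))
        for candidate in itertools.product(labels, repeat=len(x))
    )
    return math.exp(score(weights, x, tuple(y))) / z
```

Practical CRF packages obtain the same normalizer with dynamic programming instead of enumeration.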
Models 2.1 Baseline Models | However, the output y* from the CRF decoder is still only a sequence of abstract suffix tags. |
Models 2.1 Baseline Models | The abstract suffix tags are extracted from the unsupervised morpheme learning process, and are carefully designed to enable CRF training and decoding. |
Abstract | We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields ( CRF ) model under a semi-supervised learning framework to tackle these challenges. |
Abstract | The KNN based classifier conducts pre-labeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. |
Introduction | Following the two-stage prediction aggregation methods (Krishnan and Manning, 2006), such pre-labeled results, together with other conventional features used by the state-of-the-art NER systems, are fed into a linear Conditional Random Fields ( CRF ) (Lafferty et al., 2001) model, which conducts fine-grained tweet level NER. |
Introduction | Furthermore, the KNN and CRF models are repeatedly retrained with an incrementally augmented training set, into which tweets labeled with high confidence are added. |
Introduction | Indeed, it is the combination of KNN and CRF under a semi-supervised learning framework that differentiates our approach from existing ones. |
Related Work | (2010) use Amazon's Mechanical Turk service and CrowdFlower to annotate named entities in tweets and train a CRF model to evaluate the effectiveness of human labeling. |
Related Work | To achieve this, a KNN classifier is combined with a CRF model to leverage cross-tweet information, and semi-supervised learning is adopted to leverage unlabeled tweets. |
Related Work | (2005) use CRF to train a sequential NE labeler, in which the BIO (meaning Beginning, the Inside and the Outside of |
Evaluation | [Figure legend: LowThresh, CRF, List, Our Work, Our Work+Con] |
Evaluation | The CRF and hard-constrained consensus lines terminate because of low record yield. |
Evaluation Setup | [Figure legend: LowThresh, CRF, List, Our Work] |
Evaluation Setup | The CRF lines terminate because of low record yield. |
Evaluation Setup | Our List Baseline labels messages by finding string overlaps against a list of musical artists and venues scraped from web data (the same lists used as features in our CRF component). |
Inference | Since a uniform initialization of all factors is a saddle-point of the objective, we opt to initialize the q(y) factors with the marginals obtained using just the CRF parameters, accomplished by running forwards-backwards on all messages using only the |
Inference | To do so, we run the CRF component of our model (φ_SEQ) over the corpus and extract, for each ℓ, all spans that have a token-level probability of being labeled ℓ greater than λ = 0.1. |
Introduction | We bias local decisions made by the CRF to be consistent with canonical record values, thereby facilitating consistency within an event cluster. |
Model | The sequence labeling factor is similar to a standard sequence CRF (Lafferty et al., 2001), where the potential over a message label sequence decomposes |
Model | The weights of the CRF component of our model, θ_SEQ, are the only weights learned at training time, using a distant supervision process described in Section 6. |
Abstract | In order to automatically generate a novel and non-redundant community answer summary, we segment the complex original multi-sentence question into several sub questions and then propose a general Conditional Random Field ( CRF ) based answer summary method with group L1 regularization. |
Introduction | We tackle the answer summary task as a sequential labeling process under the general Conditional Random Fields ( CRF ) framework: every answer sentence in the question thread is labeled as a summary sentence or non-summary sentence, and we concatenate the sentences with summary label to form the final summarized answer. |
Introduction | First, we present a general CRF based framework |
Introduction | Second, we propose a group L1-regularization approach in the CRF model for automatic optimal feature learning to unleash the potential of the features and enhance the performance of answer summarization. |
The Summarization Framework | Then under CRF (Lafferty et al., 2001), the conditional probability of y given X obeys the following distribution: $p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{v \in V,\, l} \mu_l\, g_l(v, y|_v, x) + \cdots\right)$ |
The Summarization Framework | Therefore, to explore the optimal combination of these features, we propose a group L1 regularization term in the general CRF model (Section 3.3) for feature learning. |
The Summarization Framework | These sentence-level features can be easily utilized in the CRF framework. |
Abstract | This paper presents a joint optimization method of a two-step conditional random field ( CRF ) model for machine transliteration and a fast decoding algorithm for the proposed method. |
Abstract | In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. |
Introduction | In the “NEWS 2009 Machine Transliteration Shared Task”, a new two-step CRF model for the transliteration task was proposed (Yang et al., 2009), in which the first step is to segment a word in the source language into character chunks and the second step is to perform a context-dependent mapping from each chunk into one written unit in the target language. |
Introduction | In this paper, we propose to jointly optimize a two-step CRF model. |
Introduction | The rest of this paper is organized as follows: Section 2 explains the two-step CRF method, followed by Section 3 which describes our joint optimization method and its fast decoding algorithm; Section 4 introduces a rapid implementation of a JSCM system in the weighted finite state transducer (WFST) framework; and the last section reports the experimental results and conclusions. |
Two-step CRF method | 2.1 CRF introduction |
Two-step CRF method | CRF training is usually performed through the L-BFGS algorithm (Wallach, 2002) and decoding is performed by the Viterbi algorithm. |
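Viterbi decoding, as mentioned above, can be sketched in a few lines. This is a generic illustration rather than the excerpted paper's decoder: it assumes per-position emission scores stored as dicts, transition scores keyed by `(previous_label, current_label)`, a non-empty input, and a default score of 0.0 for anything missing.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list of dicts, one per position, mapping label -> score
    transitions: dict mapping (previous_label, current_label) -> score
    Missing entries are treated as a score of 0.0 (an illustrative assumption).
    """
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back = []
    for emit in emissions[1:]:
        scores, pointers = {}, {}
        for cur in labels:
            # Best predecessor for the current label at this position.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emit.get(cur, 0.0)
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

The cost is linear in the sequence length and quadratic in the number of labels, which is what makes exact decoding practical for linear-chain models.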
Two-step CRF method | We formalize machine transliteration as a CRF tagging problem, as shown in Figure 2. |
Abstract | We use linear-chain conditional random fields (CRF) for sentence type tagging, and a 2D CRF to label the dependency relation between sentences. |
Introduction | We use linear-chain conditional random fields ( CRF ) to take advantage of many long-distance and nonlocal features. |
Introduction | First each sentence is considered as a source, and we run a linear-chain CRF to label whether each of the other sentences is its target. |
Introduction | Because multiple runs of separate linear-chain CRFs ignore the dependency between source sentences, the second approach we propose is to use a 2D CRF that models all pair relationships jointly. |
Related Work | In (Ding et al., 2008), a two-pass approach was used to find relevant solutions for a given question, and a skip-chain CRF was adopted to model long range de- |
Thread Structure Tagging | Linear-chain CRF is a special case of the general CRFs. |
Thread Structure Tagging | In linear-chain CRF , cliques only involve two adjacent variables in the sequence. |
Thread Structure Tagging | Figure 3 shows the graphical structure of a linear-chain CRF . |
Conditional random fields | Training a CRF means |
Conditional random fields | The software we use as an implementation of conditional random fields is named CRF++ (Kudo, 2007). |
Conditional random fields | We adopt the default parameter settings of CRF++ , so no development set or tuning set is needed in our work. |
Experimental design | These are learning experiments so we also use tenfold cross validation in the same way as with CRF++ . |
Experimental results | Figure 1 shows how the error rate is affected by increasing the CRF probability threshold for each language. |
Experimental results | For the English language, the CRF using the Viterbi path has an overall error rate of 0.84%, compared to 6.81% for the TeX algorithm using American English patterns, which is eight times worse. |
Experimental results | However, the serious error rate for the CRF is worse: 0.41% compared to 0.24%. |
Introduction | It initially starts with training supervised Conditional Random Fields ( CRF ) (Lafferty et al., 2001) on the source training data which has been semantically tagged. |
Introduction | Our SSL uses MTR to smooth the semantic tag posteriors on the unlabeled target data (decoded using the CRF model) and later obtains the best tag sequences. |
Introduction | While retrospective learning iteratively trains CRF models with the automatically annotated target data (explained above), it keeps track of the errors of the previous iterations so as to carry the properties of both the source and target domains. |
Markov Topic Regression - MTR | We use the word-tag posterior probabilities obtained from a CRF sequence model trained on labeled utterances as features. |
Related Work and Motivation | We present a retrospective SSL for CRF , in that, the iterative learner keeps track of the errors of the previous iterations so as to carry the properties of both the source and target domains. |
Semi-Supervised Semantic Labeling | 4.1 Semi Supervised Learning (SSL) with CRF |
Semi-Supervised Semantic Labeling | They decode unlabeled queries from target domain (t) using a CRF model trained on the POS-labeled newswire data (source domain (0)). |
Semi-Supervised Semantic Labeling | Since the CRF tagger only uses local features of the input to score tag pairs, they try to capture all the context with the graph with additional context features on types. |
Cohesion across Utterances | This paper focuses on surface realisation from these trees using a CRF as shown in the surface realisation module. |
Cohesion across Utterances | As shown in the architecture diagram in Figure 1, a CRF surface realiser takes a semantic tree as input. |
Cohesion across Utterances | Figure 2: (a) Graphical representation of a linear-chain Conditional Random Field (CRF), where empty nodes correspond to the labelled sequence, shaded nodes to linguistic observations, and dark squares to feature functions between states and observations; (b) Example semantic trees that are updated at each time step in order to provide linguistic features to the CRF (only one possible surface realisation is shown and parse categories are omitted for brevity); (c) Finite state machine of phrases (labels) for this example. |
Introduction | Our main hypothesis is that the use of global context in a CRF with semantic trees can lead to surface realisations that are better phrased, more natural and less repetitive than taking only local features into account. |
Bilingual NER by Agreement | The English-side CRF model assigns the following probability for a tag sequence y^e: |
Bilingual NER by Agreement | where V^e is the set of vertices in the CRF and D^e is the set of edges. |
Bilingual NER by Agreement | ψ(v_i) and ω(v_i, v_j) are the node and edge clique potentials, and Z^e(e) is the partition function for input sequence e under the English CRF model. |
Experimental Setup | We train the two CRF models on all portions of the OntoNotes corpus that are annotated with named entity tags, except the parallel-aligned portion which we reserve for development and test purposes. |
Experimental Setup | 1The exact feature set and the CRF implementation |
Experiment | To benchmark the improvement that the factorial CRF model has by doing the two tasks jointly, we compare with a LCRF solution that chains these two tasks together. |
Experiment | We note that the CRF based models achieve much higher precision score than baseline systems, which means that the CRF based models can make accurate predictions without enlarging the scope of prospective informal words. |
Experiment | Compared with the CRF based models, the SVM and DT both over-predict informal words, incurring a larger precision penalty. |
Introduction | Our techniques significantly outperform both research and commercial state-of-the-art for these problems, including two-step linear CRF baselines which perform the two tasks sequentially. |
Methodology | 2.2.1 Linear-Chain CRF |
Methodology | A linear-chain CRF (LCRF; Figure 2a) predicts the output label based on feature functions provided by the scientist on the input. |
Methodology | 2.2.2 Factorial CRF |
Related Work | This is a weakness as their linear CRF model requires retraining. |
Experiments | Each extracts opinion entities first using the same CRF employed in our approach, and then predicts opinion relations on the opinion entity candidates obtained from the CRF prediction. |
Experiments | We report results using opinion entity candidates from the best CRF output and from the merged 10-best CRF output. The motivation of merging the 10-best output is to increase recall for the pipeline methods. |
Model | We randomly split the training data into 10 parts and obtained the 50-best CRF predictions on each part for the generation of candidates. |
Model | We also experimented with candidates generated from more CRF predictions, but did not find any performance improvement for the task. |
Results | We compare our approach with the pipeline baselines and CRF (the first step of the pipeline). |
Results | by adding the relation extraction step, the pipeline baselines are able to improve precision over the CRF but fail at recall. |
Results | CRF+Syn and CRF+Adj provide the same performance as CRF , since the relation extraction step only affects the results of opinion arguments. |
Background | For this underlying model, we employ a chain-structured conditional random field ( CRF ), since CRFs have been shown to perform better than other simple unconstrained models like hidden markov models for citation extraction (Peng and McCallum, 2004). |
Background | The MAP inference task in a CRF can be expressed as an optimization problem with a lin- |
Background | Since the log probability of some y in the CRF is proportional to the sum of the scores of all the factors, we can concatenate the indicator variables as a vector y and the scores as a vector w and write the MAP problem as |
Citation Extraction Data | Here, y_k represents an output tag of the CRF , so if y_k(i) = 1, then we have that y_k was given a label with index i. |
Citation Extraction Data | We constrain the output labeling of the chain-structured CRF to be a valid BIO encoding. |
Citation Extraction Data | Rather than enforcing these constraints using dual decomposition, they can be enforced directly when performing MAP inference in the CRF by modifying the dynamic program of the Viterbi algorithm to only allow valid pairs of adjacent labels. |
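The idea of restricting Viterbi to valid pairs of adjacent labels can be sketched as follows. This is a minimal illustration under stated assumptions, not the excerpted paper's code: tags are assumed to follow the `B-X` / `I-X` / `O` naming scheme, the score layout (dicts of emission and transition scores, defaulting to 0.0) is hypothetical, and an `I-X` tag is assumed to be invalid at the start of a sequence.

```python
import math


def is_valid_bio_pair(prev, cur):
    """An I-X tag may only follow B-X or I-X; prev=None marks the sequence start."""
    if not cur.startswith("I-"):
        return True
    return prev is not None and prev[0] in "BI" and prev[2:] == cur[2:]


def constrained_viterbi(emissions, transitions, labels):
    """Viterbi decoding that only considers valid BIO pairs of adjacent labels."""
    best = {
        lab: emissions[0].get(lab, 0.0) if is_valid_bio_pair(None, lab) else -math.inf
        for lab in labels
    }
    back = []
    for emit in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            # Only predecessors forming a valid BIO pair with `cur` are candidates.
            candidates = [
                (best[p] + transitions.get((p, cur), 0.0), p)
                for p in labels
                if is_valid_bio_pair(p, cur)
            ]
            prev_score, prev = max(candidates) if candidates else (-math.inf, labels[0])
            new_best[cur] = prev_score + emit.get(cur, 0.0)
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    path = [max(labels, key=lambda lab: best[lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because invalid pairs are skipped inside the dynamic program, the decoder stays exact and no penalty terms need to be learned for these constraints.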
Soft Constraints in Dual Decomposition | Since we truncate penalties at 0, this suggests that we will learn a penalty of 0 for constraints in three categories: constraints that do not hold in the ground truth, constraints that hold in the ground truth but are satisfied in practice by performing inference in the base CRF model, and constraints that are satisfied in practice as a side-effect of imposing nonzero penalties on some other constraints. |
Experiments | We investigate the use of smoothing in two test systems, conditional random field ( CRF ) models for POS tagging and chunking. |
Experiments | Finally, we train the CRF model on the annotated training set and apply it to the test set. |
Experiments | We use an open source CRF software package designed by Sunita Sarawagi and William W. Cohen to implement our CRF models. We use a set of boolean features listed in Table 1. |
Experimental Comparison with Unsupervised Learning | We compare GE and supervised training of an edge-factored CRF with unsupervised learning of a DMV model (Klein and Manning, 2004) using EM and contrastive estimation (CE) (Smith and Eisner, 2005). |
Experimental Comparison with Unsupervised Learning | We note that there are considerable differences between the DMV and CRF models. |
Experimental Comparison with Unsupervised Learning | The DMV model is more expressive than the CRF because it can model the arity of a head as well as sibling relationships. |
Generalized Expectation Criteria | In the following sections we apply GE to non-projective CRF dependency parsing. |
Generalized Expectation Criteria | We first consider an arbitrarily structured conditional random field (Lafferty et al., 2001) p_λ(y|x). We describe the CRF for non-projective dependency parsing in Section 3.2. |
Generalized Expectation Criteria | We now define a CRF p_λ(y|x) for unlabeled, non-projective dependency parsing. |
Introduction | With GE we may add a term to the objective function that encourages a feature-rich CRF to match this expectation on unlabeled data, and in the process learn about related features. |
Introduction | In this paper we use a non-projective dependency tree CRF (Smith and Smith, 2007). |
Abstract | To enhance the accuracy of the pipeline, we add additional constraints in the Viterbi decoding of the first CRF . |
Bottom-up tree-building | Secondly, as a joint model, it is mandatory to use a dynamic CRF , for which exact inference is usually intractable or slow. |
Bottom-up tree-building | Figure 4a shows our intra-sentential structure model in the form of a linear-chain CRF . |
Bottom-up tree-building | Thus, different CRF chains have to be formed for different pairs of constituents. |
Features | In our local models, to encode two adjacent units, U_j and U_{j+1}, within a CRF chain, we use the following 10 sets of features, some of which are modified from Joty et al.’s model. |
Introduction | Specifically, in the Viterbi decoding of the first CRF , we include additional constraints elicited from common sense, to make more effective local decisions. |
Related work | (2013) approach the problem of text-level discourse parsing using a model trained by Conditional Random Fields ( CRF ). |
Abstract | The context-aware constraints provide additional power to the CRF model and can guide semi-supervised learning when labeled data is limited. |
Approach | The CRF models the following conditional probabilities: |
Approach | The objective function for a standard CRF is to maximize the log-likelihood over a collection of labeled doc- |
Approach | Most of our constraints can be factorized in the same way as factorizing the model features in the first-order CRF model, and we can compute the expectations under q very efficiently using the forward-backward algorithm. |
Experiments | We trained our model using a CRF incorporated with the proposed posterior constraints. |
Experiments | For the CRF features, we include the tokens, the part-of-speech tags, the prior polarities of lexical patterns indicated by the opinion lexicon and the negator lexicon, the number of positive and negative tokens and the output of the voteflip algorithm (Choi and Cardie, 2009). |
Experiments | We set the CRF regularization parameter σ = 1 and set the posterior regularization parameters β and γ (a tradeoff parameter we introduce to balance the supervised objective and the posterior regularizer in 2) by using grid search. |
Introduction | Specifically, we use the Conditional Random Field (CRF) model as the learner for sentence-level sentiment classification, and incorporate rich discourse and lexical knowledge as soft constraints into the learning of CRF parameters via Posterior Regularization (PR) (Ganchev et al., 2010). |
Introduction | Unlike most previous work, we explore a rich set of structural constraints that cannot be naturally encoded in the feature-label form, and show that such constraints can improve the performance of the CRF model. |
Context and Answer Detection | Finally, we will briefly introduce CRF models and the features that we used for the CRF model. |
Context and Answer Detection | A CRF is an undirected graphical model G of the conditional distribution P(Y|X). |
Context and Answer Detection | The linear CRF model has been successfully applied in NLP and text mining tasks (McCallum and Li, 2003; Sha and Pereira, 2003). |
Introduction | To capture the dependency between contexts and answers, we introduce a skip-chain CRF model for answer detection. |
Introduction | Experimental results show that 1) Linear CRFs outperform SVM and decision tree in both context and answer detection; 2) Skip-chain CRFs outperform Linear CRFs for answer finding, which demonstrates that context improves answer finding; 3) 2D CRF model improves the performance of Linear CRFs and the combination of 2D CRFs and Skip-chain CRFs achieves better performance for context detection. |
Base Models | Semi-CRFs segment and label the text simultaneously, whereas a linear-chain CRF will only label each word, and segmentation is implied by the labels assigned to the words. |
Base Models | doing named entity recognition, a semi-CRF will have one node for each entity, unlike a regular CRF which will have one node for each word. See Figure 3ab for an example of a semi-CRF and a linear-chain CRF over the same sentence. |
Base Models | Note that the entity Hilary Clinton has one node in the semi-CRF representation, but two nodes in the linear-chain CRF . |
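The contrast above (one node per word in a linear-chain CRF, one node per entity in a semi-CRF) can be made concrete by collapsing word-level labels into segments. The sketch below is illustrative only: it assumes BIO-style tags such as `B-PER` / `I-PER` / `O`, which the excerpt does not specify, and the function name `bio_to_segments` is hypothetical.

```python
def bio_to_segments(tokens, tags):
    """Collapse word-level BIO tags into (entity_text, entity_type) segments."""
    segments, start = [], None
    # A trailing "O" sentinel closes an entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and not tag.startswith("I-"):
            segments.append((" ".join(tokens[start:i]), tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return segments
```

For the tokens `["Hilary", "Clinton", "spoke"]` tagged `["B-PER", "I-PER", "O"]`, the two word-level entity nodes collapse into the single segment `("Hilary Clinton", "PER")`. The sketch does not check that an `I-` tag matches the type of the preceding `B-` tag.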
Experiments and Discussion | For each section of the data (ABC, MNB, NBC, PRI, VOA) we ran experiments training a linear-chain CRF on only the named entity information, a CRF-CFG parser on only the parse information, a joint parser and named entity recognizer, and our hierarchical model. |
Hierarchical Joint Learning | Figure 3: A linear-chain CRF (a) labels each word, whereas a semi-CRF (b) labels entire entities. |
Extraction with Lexicons | Domain-independence requires access to an extremely large number of lists, but our tight integration of lexicon acquisition and CRF learning requires that relevant lists be accessed instantaneously. |
Extraction with Lexicons | While training a CRF extractor for a given relation, LUCHS uses its corpus of lists to automatically generate a set of semantic lexicons — specific to that relation. |
Extraction with Lexicons | The semantic lexicons are added as features to the CRF learning algorithm. |
Introduction | These lexicons form Boolean features which, along with lexical and dependency parser-based features, are used to produce a CRF extractor for each relation — one which performs much better than lexicon-free extraction on sparse training data. |
Learning Extractors | We use a linear-chain conditional random field ( CRF ), an undirected graphical model connecting a sequence of input and output random variables, x = (x0, . |
Learning Extractors | The CRF models are represented with a log-linear distribution |
Comparable Question Mining | Therefore, we decided to employ a Conditional Random Fields ( CRF ) tagger (Lafferty et al., 2001) for the task, since CRF was shown to be state-of-the-art for sequential relation extraction (Mooney and Bunescu, 2005; Culotta et al., 2006; Jindal and Liu, 2006). |
Comparable Question Mining | This transformation helps us to design a simpler CRF than that of (Jindal and Liu, 2006), since our CRF utilizes the known positions of the target entities in the text. |
Related Work | Our extraction of comparable relations falls within the field of Relation Extraction, in which CRF is a state-of-the-art method (Mooney and Bunescu, 2005; Culotta et al., 2006). |
A Class-based Model of Agreement | We treat segmentation as a character-level sequence modeling problem and train a linear-chain conditional random field ( CRF ) model (Lafferty et al., 2001). |
A Class-based Model of Agreement | Notation for the class-based agreement model: t ∈ T, set of morpho-syntactic classes; s ∈ S, set of all word segments; θ_seg, learned weights for the CRF-based segmenter; θ_tag, learned weights for the CRF-based tagger; φ_o, φ_t, CRF potential functions (emission and transition) |
A Class-based Model of Agreement | For this task we also train a standard CRF model on full sentences with gold classes and segmentation. |
Conclusion and Outlook | The model can be implemented with a standard CRF package, trained on existing treebanks for many languages, and integrated easily with many MT feature APIs. |
Hybrid Relation Extraction | Due to the sequential nature of our RE task, H-CRF employs a CRF as the meta-learner, as opposed to a decision tree or regression-based classifier. |
Hybrid Relation Extraction | To obtain the probability at each position of a linear-chain CRF , the constrained forward-backward technique described in (Culotta and McCallum, 2004) is used. |
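Per-position probabilities in a linear-chain CRF come from the forward-backward recursions. The sketch below shows the plain, unconstrained algorithm in log space. It is not the constrained variant of Culotta and McCallum (2004) referred to above, and its score layout (dicts of emission and transition scores, defaulting to 0.0) is an assumption made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def token_marginals(emissions, transitions, labels):
    """Return one dict per position mapping each label to its marginal probability."""
    n = len(emissions)
    # Forward pass: alpha[t][lab] is the log-score of all prefixes ending in lab at t.
    alpha = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    for t in range(1, n):
        alpha.append({
            cur: emissions[t].get(cur, 0.0) + logsumexp(
                [alpha[t - 1][p] + transitions.get((p, cur), 0.0) for p in labels]
            )
            for cur in labels
        })
    # Backward pass: beta[t][lab] is the log-score of all suffixes following lab at t.
    beta = [dict.fromkeys(labels, 0.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = {
            p: logsumexp([
                transitions.get((p, nxt), 0.0) + emissions[t + 1].get(nxt, 0.0) + beta[t + 1][nxt]
                for nxt in labels
            ])
            for p in labels
        }
    log_z = logsumexp([alpha[-1][lab] for lab in labels])
    return [
        {lab: math.exp(alpha[t][lab] + beta[t][lab] - log_z) for lab in labels}
        for t in range(n)
    ]
```

The marginals at each position sum to one, so they can be thresholded or used as confidence estimates for individual labels.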
Related Work | (2006) used a CRF for RE, yet their task differs greatly from open extraction. |
Relation Extraction | Figure 1: Relation Extraction as Sequence Labeling: A CRF is used to identify the relationship, born in, between Kafka and Prague |
Relation Extraction | The resulting set of labeled examples are described using features that can be extracted without syntactic or semantic analysis and used to train a CRF , a sequence model that learns to identify spans of tokens believed to indicate explicit mentions of relationships between entities. |
Relation Extraction | The entity pair serves to anchor each end of a linear-chain CRF , and both entities in the pair are assigned a fixed label of ENT. |
Conclusion | WOE can run in two modes: a CRF extractor (WOEpos) trained with shallow features like POS tags; a pattern classifier (WOEparse) learned from dependency path patterns. |
Experiments | To compare with TextRunner, we tested four different ways to generate training examples from Wikipedia for learning a CRF extractor. |
Experiments | The CRF extractors are trained using the same learning algorithm and feature selection as TextRunner. |
Related Work | Wu and Weld proposed the KYLIN system (Wu and Weld, 2007; Wu et al., 2008) which has the same spirit of matching Wikipedia sentences with infoboxes to learn CRF extractors. |
Wikipedia-based Open IE | [Figure: Matcher (Primary Entity Matching, Sentence Matching); Learner (Pattern Classifier over Parser Features, CRF Extractor over Shallow Features)] |
Wikipedia-based Open IE | In contrast, WOEpos (like TextRunner) trains a conditional random field ( CRF ) to output certain text between noun phrases when the text denotes such a relation. |
Wikipedia-based Open IE | Since high speed can be crucial when processing Web-scale corpora, we additionally learn a CRF extractor WOEpos based on shallow features like POS-tags. |
Previous work | For case prediction, they trained a CRF with access to lemmas and POS-tags within a given window. |
Previous work | While the CRF used for case prediction in Fraser et al. |
Translation pipeline | Morphological features are predicted on four separate CRF models, one for each feature. |
Translation pipeline | Instead, we limited the power of the CRF model by experimenting with the removal of features until we had a system that was robust to this problem. |
Using subcategorization information | subject, object or modifier) are passed on to the system at the level of nouns and integrated into the CRF through the derived probabilities. |
Using subcategorization information | The classification task of the CRF consists in predicting a sequence of labels: case values for NPs/PPs or no value otherwise, cf. |
Using subcategorization information | In addition to the probability/frequency of the respective functions, we also provide the CRF with bigrams containing the two parts of the tuple, |
CRF and features | The choice of CRF for sequence labelling was mainly influenced by its successful application to chunking of Polish (Radziszewski and Pawlaczek, 2012). |
Evaluation | The performed evaluation assumed training of the CRF on the whole development set annotated with the induced transformations and then applying the trained model to tag the evaluation part with transformations. |
Evaluation | We decided to implement both baseline algorithms using the same CRF model but trained on fabricated data. |
Evaluation | Also, it turns out that the variation of the matching procedure using the ‘lem’ transformation (row labelled CRF lem) performs slightly worse than the procedure without this transformation (row CRF nolem). |
Introduction | In this paper we present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem and tackled using a method devised for such problems, namely Conditional Random Fields ( CRF ). |
Phrase lemmatisation as a tagging problem | Our idea is simple: by expressing phrase lemmatisation in terms of word-level transformations we can reduce the task to a tagging problem and apply well-known Machine Learning techniques that have been devised for solving such problems (e.g. CRF ). |
Abbreviator with Nonlocal Information | The DPLVM is a natural extension of the CRF model (see Figure 2), which is a special case of the DPLVM, with only one latent variable assigned for each label. |
Introduction | Figure 2: CRF vs. DPLVM. |
Recognition as a Generation Task | As can be seen in Table 6, using the latent variables significantly improved the performance (see DPLVM vs. CRF), and using the GI encoding improved the performance of both the DPLVM and the CRF . |
Recognition as a Generation Task | Table 7 shows that the back-off method further improved the performance of both the DPLVM and the CRF model. |
Results and Discussion | special case of the DPLVM is exactly the CRF (see Section 2.1), this case is hereinafter denoted as the CRF . |
Results and Discussion | The results revealed that the latent variable model significantly improved the performance over the CRF model. |
Results and Discussion | All of its top-1, top-2, and top-3 accuracies were consistently better than those of the CRF model. |
Introduction | To address the task of open-domain predicate identification, we construct a Conditional Random Field ( CRF ) (Lafferty et al., 2001) model with target labels of B-Pred, I-Pred, and O-Pred (for the beginning, interior, and outside of a predicate). |
Introduction | We use an open source CRF software package to implement our CRF models. We use words, POS tags, chunk labels, and the predicate label at the preceding and following nodes as features for our Baseline system. |
Introduction | We refer to the CRF model with these features as our Baseline SRL system; in what follows we extend the Baseline model with more sophisticated features. |
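The B-Pred / I-Pred / O-Pred labelling described above can be sketched as a conversion from predicate spans to per-token tags. This is a minimal illustration only: the function name `spans_to_bio`, the half-open `(start, end)` span convention, and the example sentence are assumptions, not taken from the source.

```python
def spans_to_bio(n_tokens, spans, tag="Pred"):
    """Convert half-open [start, end) token spans into B-/I-/O- labels.

    The span convention is an assumption made for this sketch.
    """
    labels = ["O-" + tag] * n_tokens
    for start, end in spans:
        labels[start] = "B-" + tag          # first token of the predicate
        for i in range(start + 1, end):
            labels[i] = "I-" + tag          # remaining tokens of the predicate
    return labels


# Hypothetical example sentence with one multi-word predicate.
tokens = ["She", "has", "been", "reading", "books"]
print(list(zip(tokens, spans_to_bio(len(tokens), [(1, 4)]))))
```

A sequence model such as a CRF is then trained to predict one of these three labels per token.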
Proposed Method | We use a sequence discrimination technique based on CRF (Lafferty et al., 2001) to identify the label sequence that corresponds to the NP. |
Proposed Method | There are two differences between our task and the CRF task. |
Proposed Method | One difference is that CRF discriminates label sequences that consist of labels from all of the label candidates, whereas we constrain the label sequences to sequences where the label at the CP is C, the label at an NPC is N, and the labels between the CP and the NPC are I. |
Named Entity Recognition | Conditional Random Fields ( CRF ) (Lafferty et. |
Named Entity Recognition | We employed a linear chain CRF with L2 regularization as the baseline algorithm to which we added phrase cluster features. |
Named Entity Recognition | The features in our baseline CRF classifier are a subset of the conventional features. |
Algorithm | (2) under the log-loss results in a probabilistic model commonly known as a conditional random field ( CRF ) (Lafferty et al., 2001). |
Discussion | Large-margin learning, using the Passive-Aggressive and Pegasos algorithms, has benefits over CRF learning for our task: It produces sparser models, is faster, and produces better lexical access results. |
Experiments | 4 We use the term “CRF” since the learning algorithm corresponds to CRF learning, although the task is multiclass classification rather than a sequence or structure prediction task. |
Experiments | CRF learning with the same features performs about 6% worse than the corresponding PA and Pegasos models. |
Experiments | The single-threaded running time for PA/DP+ and Pegasos/DP+ is about 40 minutes per epoch, measured on a dual-core AMD 2.4GHz CPU with 8GB of memory; for CRF, it takes about 100 minutes for each epoch, which is almost entirely because the weight vector θ is less sparse with CRF learning. |
Approaches | We define a conditional random field ( CRF ) (Lafferty et al., 2001) for this task. |
Approaches | This model extends the CRF model in Section 3.1 to include the projective syntactic dependency parse for a sentence. |
Approaches | train our CRF models by maximizing conditional log-likelihood using stochastic gradient descent with an adaptive learning rate (AdaGrad) (Duchi et al., 2011) over mini-batches. |
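As a rough illustration of the AdaGrad update mentioned above, the sketch below applies the standard per-coordinate rule (divide each gradient component by the square root of its accumulated squared gradients) to a toy quadratic objective. The function name `adagrad_step`, the learning rate, and the toy objective are assumptions; in CRF training the gradient would instead be that of the negative conditional log-likelihood over a mini-batch.

```python
import math


def adagrad_step(weights, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step for minimising a loss.

    weights, grad and accum are equal-length lists; accum holds the running
    sum of squared gradients. weights and accum are updated in place.
    """
    for i, g in enumerate(grad):
        accum[i] += g * g
        weights[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return weights


# Toy objective (hypothetical): minimise sum((w - target)^2).
target = [1.0, -2.0]
w = [0.0, 0.0]
acc = [0.0, 0.0]
for _ in range(500):
    grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
    adagrad_step(w, grad, acc, lr=0.5)
```

Because the step size is scaled per coordinate, rarely active features keep a larger effective learning rate than frequent ones.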
Experiments | Similarly, improving the non-convex optimization of our latent-variable CRF (Marginalized) may offer further gains. |
Introduction | • Simpler joint CRF for syntactic and semantic dependency parsing than previously reported. |
Introduction | The joint models use a non-loopy conditional random field ( CRF ) with a global factor constraining latent syntactic edge variables to form a tree. |
Background | To this end, several popular machine-learning methods could be utilized, like Bayesian classifier (BC) (Kupiec et al., 1999), Gaussian mixture model (GMM) (Fattah and Ren, 2009), hidden Markov model (HMM) (Conroy and O'Leary, 2001), support vector machine (SVM) (Kolcz et al., 2001), maximum entropy (ME) (Ferrier, 2001), conditional random field ( CRF ) (Galley, 2006; Shen et al., 2007), to name a few. |
Background | Although such supervised summarizers are effective, most of them (except CRF ) usually implicitly assume that sentences are independent of each other (the so-called “bag-of-sentences” assumption) and classify each sentence individually without leveraging the relationship among the sentences (Shen et al., 2007). |
Experimental results and discussions 6.1 Baseline experiments | In the final set of experiments, we compare our proposed summarization methods with a few existing summarization methods that have been widely used in various summarization tasks, including LEAD, VSM, LexRank and CRF ; the corresponding results are shown in Table 5. |
Experimental results and discussions 6.1 Baseline experiments | To our surprise, CRF does not provide superior results as compared to the other summarization methods. |
Experimental results and discussions 6.1 Baseline experiments | One possible explanation is that the structural evidence of the spoken documents in the test set is not strong enough for CRF to show its advantage of modeling the local structural information among sentences. |
Evaluation | As seen in this plot, when there is a small amount of training data, the parser performs better than the CRF module and parser+SVM module performs better than the other two. |
Evaluation | With a large amount of training data, the CRF and parser almost have the same performance. |
Evaluation | 4 The CRF module also uses the lexical resources and regular expressions. |
Introduction | MEMM and CRF are discriminative models; hence they are highly dependent on the training data. |
Summary | Test (N0 = 3000), columns P, R, F, Q. CRF: 0.815, 0.812, 0.813, 0.509. Parser: 0.808, 0.814, 0.811, 0.494. |
Baseline Arabic NER System | For the baseline system, we used the CRF++1 implementation of CRF sequence labeling with default parameters. |
Cross-lingual Features | translation was capitalized or not respectively; and the weights were binned because CRF++ only takes nominal features. |
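Binning a real-valued weight into a nominal feature, as described above, can be sketched as follows. The bin edges, the feature name `tw`, and the function name `bin_feature` are hypothetical; the source does not state which bins were used.

```python
import bisect


def bin_feature(name, value, edges=(0.2, 0.4, 0.6, 0.8)):
    """Map a real value to a nominal feature string such as 'tw_bin2'.

    The edges are illustrative assumptions, not the bins used in the source.
    """
    idx = bisect.bisect_right(edges, value)  # index of the bin containing value
    return f"{name}_bin{idx}"


print(bin_feature("tw", 0.05), bin_feature("tw", 0.5), bin_feature("tw", 0.95))
```

Each resulting string can then be written as one column of a nominal-feature input file.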
Introduction | Conditional Random Fields ( CRF )) can often identify such indicative words. |
Related Work | Benajiba and Rosso (2008) used CRF sequence labeling and incorporated many language specific features, namely POS tagging, base-phrase chunking, Arabic tokenization, and adjectives indicating nationality. |
Related Work | (2008), they examined the same feature set on the Automatic Content Extraction (ACE) datasets using CRF |
Related Work | The use of CRF sequence labeling for NER has shown success (McCallum and Li, 2003; Nadeau and Sekine, 2009; Benajiba and Rosso, 2008). |
System Architecture | Based on our CRF word segmentation system, we can compute a probability for each segment. |
System Architecture | Note that, although our model is a Markov CRF model, we can still use word features to learn word information in the training data. |
System Architecture | For traditional implementations of CRF systems (e.g., the HCRF package), the edge features usually contain only the information of y_{i-1} and y_i, and without the information of |
Clustering-based word representations | However, the CRF chunker in Huang and Yates (2009), which uses their HMM word clusters as extra features, achieves F1 lower than |
Clustering-based word representations | a baseline CRF chunker (Sha & Pereira, 2003). |
Supervised evaluation tasks | The linear CRF chunker of Sha and Pereira (2003) is a standard near-state-of-the-art baseline chunker. |
Supervised evaluation tasks | In fact, many off-the-shelf CRF implementations now replicate Sha and Pereira (2003), including their choice of feature set: |
Supervised evaluation tasks | • CRF++ by Taku Kudo (http://crfpp. |
Experiment | Columns: P, R, F, OOV. CRF: 87.8, 85.7, 86.7, 57.1. NN: 92.4, 92.2, 92.3, 60.0. NN+Tag Embed: 93.0, 92.7, 92.9, 61.0. MMTNN: 93.7, 93.4, 93.5, 64.2. |
Experiment | We also compare our model with the CRF model (Lafferty et al., 2001), which is a widely used log-linear model for Chinese word segmentation. |
Experiment | The input feature to the CRF model is simply the context characters (unigram feature) without any additional feature engineering. |
Domain Adaptation | Next, we train a CRF model using all features (i.e. |
Domain Adaptation | Finally, the trained CRF model is applied to a target domain test sentence. |
Experiments and Results | The L-BFGS (Liu and Nocedal, 1989) method is used to train the CRF and logistic regression models. |
Experiments and Results | Specifically, in POS tagging, a CRF trained on source domain labeled sentences is applied to target domain test sentences, whereas in sentiment classification, a logistic regression classifier trained using source domain labeled reviews is applied to the target domain test reviews. |
Related Work | Huang and Yates (2009) train a Conditional Random Field ( CRF ) tagger with features retrieved from a smoothing model trained using both source and target domain unlabeled data. |
Conclusion | In this paper, we have presented a novel discourse parser that applies an optimal parsing algorithm to probabilities inferred from two CRF models: one for intra-sentential parsing and the other for multi-sentential parsing. |
Introduction | The CRF models effectively represent the structure and the label of a DT constituent jointly, and whenever possible, capture the sequential dependencies between the constituents. |
Introduction | To cope with this limitation, our CRF models support a probabilistic bottom-up parsing |
Parsing Models and Parsing Algorithm | Figure 6: A CRF as a multi-sentential parsing model. |
Parsing Models and Parsing Algorithm | It becomes a CRF if we directly model the hidden (output) variables by conditioning its clique potential (or factor) φ on the observed (input) variables: |
Experiments | 6 This is because the weights of unigram to trigram features in a log-linear CRF model are a balanced consequence for maximization. |
Introduction | The QA system employs a linear chain Conditional Random Field ( CRF ) (Lafferty et al., 2001) and tags each token as either an answer (ANS) or not (O). |
Introduction | With weights optimized by CRF training (Table 1), we can learn how answer features are correlated with question features. |
Introduction | These features, whose weights are optimized by the CRF training, directly reflect what the most important answer types associated with each question type are. |
Evaluation | We first tested a standalone MWE recognizer based on CRF . |
Evaluation | The CRF recognizer relies on the software Wapiti6 (Lavergne et al., 2010) to train and apply the model, and on the software Unitex (Paumier, 2011) to apply lexical resources. |
Evaluation | Table 3: MWE identification with CRF : base are the features corresponding to token properties and word n-grams. |
MWE-dedicated Features | In order to deal with unknown words and special tokens, we incorporate standard tagging features in the CRF : lowercase forms of the words, word prefixes of length 1 to 4, word suffixes of length 1 to 4, whether the word is capitalized, whether the token has a digit, whether it is a hyphen. |
Two strategies, two discriminative models | For such a task, we used linear-chain Conditional Random Fields ( CRF ) that are discriminative prob- |
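The tagging features listed above can be sketched as a small per-token feature function. The feature names and the dictionary layout are assumptions for illustration; the source does not specify how the features are encoded for the CRF toolkit.

```python
def token_features(token):
    """Surface features for one token, following the list in the text."""
    feats = {
        "lower": token.lower(),                          # lowercase form
        "is_capitalized": token[:1].isupper(),           # first character is uppercase
        "has_digit": any(ch.isdigit() for ch in token),  # token contains a digit
        "is_hyphen": token == "-",                       # token is a hyphen
    }
    # Prefixes and suffixes of length 1 to 4, skipped when the token is too short.
    for n in range(1, 5):
        if len(token) >= n:
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    return feats


print(token_features("Paris"))
```

Skipping affixes longer than the token is a design choice of this sketch; padding short tokens would be an equally plausible alternative.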
Log-Linear Models | If the structure is a sequence, the model is called a linear-chain CRF model, and the marginal probabilities of the features and the partition function can be efficiently computed by using the forward-backward algorithm. |
Log-Linear Models | If the structure is a tree, the model is called a tree CRF model, and the marginal probabilities can be computed by using the inside-outside algorithm. |
Log-Linear Models | We evaluate the effectiveness of our training algorithm using linear-chain CRF models and three NLP tasks: text chunking, named entity recognition, and POS tagging. |
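For the linear-chain case described above, the partition function can be computed with the forward pass of the forward-backward algorithm. The sketch below is a minimal log-domain version checked against brute-force enumeration on a tiny input; the `node`/`edge` log-potential layout and all function names are assumptions for illustration, not the training algorithm of the source.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(node, edge):
    """log Z of a linear-chain model via the forward recursion.

    node[t][y] : log-potential of label y at position t
    edge[a][b] : log-potential of the transition a -> b
    """
    n_labels = len(node[0])
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [
            node[t][b] + logsumexp([alpha[a] + edge[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return logsumexp(alpha)


def log_partition_brute(node, edge):
    """Same quantity by enumerating every label sequence (tiny inputs only)."""
    n_labels = len(node[0])
    scores = []
    for ys in product(range(n_labels), repeat=len(node)):
        s = sum(node[t][y] for t, y in enumerate(ys))
        s += sum(edge[a][b] for a, b in zip(ys, ys[1:]))
        scores.append(s)
    return logsumexp(scores)


# Hypothetical potentials: 3 positions, 2 labels.
node = [[0.5, -1.0], [0.2, 0.3], [-0.4, 1.1]]
edge = [[0.1, -0.2], [0.0, 0.6]]
print(log_partition(node, edge), log_partition_brute(node, edge))
```

The recursion costs time linear in the sequence length and quadratic in the number of labels, whereas the enumeration is exponential in the length.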
Experiments | The CRF model training in line (6) of the algorithm is implemented using CRF++ toolkit3. |
Joint Query Annotation | Accordingly, we can directly use a supervised sequential probabilistic model such as CRF (Lafferty |
Joint Query Annotation | In this CRF |
Joint Query Annotation | It then produces a set of independent annotation estimates, which are jointly used, together with the ground truth annotations, to learn a CRF model for each annotation type. |
Results | Comparison against Supervised CRF Our final set of experiments compares a semi-supervised version of our model against a conditional random field ( CRF ) model. |
Results | The CRF model was trained using the same features as our model’s argument features. |
Results | At the sentence level, our model compares very favorably to the supervised CRF . |
UK and XP stand for unknown and X phrase, respectively. | 6“CRFTagger: CRF English POS Tagger,” Xuan-Hieu Phan, http: //crftagger . |
UK and XP stand for unknown and X phrase, respectively. | Columns: Method, Native Corpus, Learner Corpus. CRF: 0.970, 0.932. HMM: 0.887, 0.926. |
UK and XP stand for unknown and X phrase, respectively. | HMM CRF POS Freq. |
Introduction | Formally, our model is a CRF where the features factor over anchored rules of a small backbone grammar, as shown in Figure 1. |
Parsing Model | All of these past CRF parsers do also exploit span features, as did the structured margin parser of Taskar et al. |
Surface Feature Framework | Recall that our CRF factors over anchored rules r, where each r has identity rule(r) and anchoring span(r). |
Surface Feature Framework | As far as we can tell, all past CRF parsers have used “positive” features only. |
Background: Pairwise Coreference | For higher accuracy, a graphical model such as a conditional random field ( CRF ) is constructed from the compatibility functions to jointly reason about the pairwise decisions (McCallum and Wellner, 2004). |
Background: Pairwise Coreference | We now describe the pairwise CRF for coreference as a factor graph. |
Background: Pairwise Coreference | Given the pairwise CRF , the problem of coreference is then solved by searching for the setting of the coreference decision variables that has the highest probability according to Equation 1 subject to the |
Conclusion | Indeed, inference in the hierarchy is orders of magnitude faster than a pairwise CRF , allowing us to infer accurate coreference on |
Introduction | (2011) develop a system that exploits a CRF model to segment named |
Related Work | A linear CRF model |
Related Work | (2010) use Amazon's Mechanical Turk service 3 and CrowdFlower 4 to annotate named entities in tweets and train a CRF model to evaluate the effectiveness of human labeling. |
Related Work | (2011) rebuild the NLP pipeline for tweets beginning with POS tagging, through chunking, to NER, which first exploits a CRF model to segment named entities and then uses a distantly supervised approach based on LabeledLDA to classify named entities. |
Introduction | Our work is similar to Jakob and Gurevych (2010) which proposed a Conditional Random Field ( CRF ) for cross-domain topic word extraction. |
Introduction | We denote iSVM and iCRF the in-domain SVM and CRF classifiers in experiments, and compare our proposed methods, |
Introduction | Cross-Domain CRF (Cross-CRF): we implement the cross-domain CRF algorithm proposed by Jakob and Gurevych (2010). |
Experimental Setup | The dataset from Clarke and Lapata (2008) is used to train the CRF and MaxEnt classifiers (Section 4). |
Sentence Compression | The CRF model is built using the features shown in Table 3. |
Sentence Compression | During inference, we find the maximally likely sequence Y according to a CRF with parameter θ (Y = arg max_{Y'} P(Y'|X; θ)), while simultaneously enforcing the rules of Table 2 to reduce the hypothesis space and encourage grammatical compression. |
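The constrained arg-max search described above can be sketched as Viterbi decoding with a hard-constraint predicate. This is an illustration under stated assumptions: the `allowed(t, y)` callback is a hypothetical stand-in for the rules of Table 2, which are not reproduced here, and the `node`/`edge` log-potential layout is also assumed.

```python
def viterbi(node, edge, allowed=None):
    """Highest-scoring label sequence of a linear-chain model.

    node[t][y] : log-potential of label y at position t
    edge[a][b] : log-potential of the transition a -> b
    allowed    : optional predicate (t, y) -> bool; disallowed labels get
                 score -inf (hypothetical stand-in for hard rules)
    """
    n_labels = len(node[0])
    neg_inf = float("-inf")

    def ok(t, y):
        return allowed is None or allowed(t, y)

    score = [node[0][y] if ok(0, y) else neg_inf for y in range(n_labels)]
    back = []  # back[t-1][b] = best previous label for label b at position t
    for t in range(1, len(node)):
        new_score, ptr = [], []
        for b in range(n_labels):
            best_a = max(range(n_labels), key=lambda a: score[a] + edge[a][b])
            ptr.append(best_a)
            s = score[best_a] + edge[best_a][b] + node[t][b]
            new_score.append(s if ok(t, b) else neg_inf)
        score = new_score
        back.append(ptr)

    # Follow back-pointers from the best final label.
    y = max(range(n_labels), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


# Hypothetical potentials: 3 positions, 2 labels, uniform transitions.
node = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
edge = [[0.0, 0.0], [0.0, 0.0]]
print(viterbi(node, edge))
print(viterbi(node, edge, allowed=lambda t, y: not (t == 1 and y == 1)))
```

Pruning disallowed labels inside the recursion shrinks the hypothesis space without changing the scores of the remaining sequences.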
Distant Supervision | Following this assumption, we train MaxEnt and Conditional Random Field ( CRF ) (McCallum, 2002) classifiers on the k% of documents that have the lowest maximum flow values f, where k is a parameter which we optimize using the run count method introduced in Section 4. |
Distant Supervision | Conditions 4-8 train supervised classifiers based on the labels from DSlabels+MinCut: (4) MaxEnt with named entities (NE); (5) MaxEnt with NE and semantic (SEM) features; (6) CRF with NE; (7) MaxEnt with NE and sequential (SQ) features; (8) MaxEnt with NE, SQ, and SEM. |
Distant Supervision | quence model, we train a CRF (line 6); however, the improvement over MaxEnt (line 4) is not significant. |
Conditional Random Fields | Hence, any numerical optimization strategy may be used and practical solutions include limited memory BFGS (L-BFGS) (Liu and Nocedal, 1989), which is used in the popular CRF++ (Kudo, 2005) and CRFsuite (Okazaki, 2007) packages; conjugate gradient (Nocedal and Wright, 2006) and Stochastic Gradient Descent (SGD) (Bottou, 2004; Vishwanathan et al., 2006), used in CRFsgd (Bottou, 2007). |
Conditional Random Fields | CRF++ and CRFsgd perform the forward-backward recursions in the log-domain, which guarantees that numerical over/underflows are avoided no matter the length T (i) of the sequence. |
Conditional Random Fields | The CRF package developed in the course of this study implements many algorithmic optimizations and allows the design of innovative training strategies, such as the one presented in section 5.4. |
Experiments | We generated the node and the edge features of a CRF model as described in Table 3 using these atomic features. |
Experiments | To train CRF models, we used Taku Kudo’s CRF++ (ver. |
Using Gazetteers as Features of NER | These annotated IOB tags can be used in the same way as other features in a CRF tagger. |
Existing algorithms 3.1 Yarowsky | Note that this is a linear model similar to a conditional random field ( CRF ) (Lafferty et al., 2001) for unstructured multiclass problems. |
Existing algorithms 3.1 Yarowsky | It uses a CRF (Lafferty et al., 2001) as the underlying supervised learner. |
Existing algorithms 3.1 Yarowsky | It differs significantly from Yarowsky in two other ways: First, instead of only training a CRF it also uses a step of graph propagation between distributions over the n-grams in the data. |
Arabic Word Segmentation Model | A CRF model (Lafferty et al., 2001) defines a distribution p(Y|X; θ), where X = {x_1, . |
Arabic Word Segmentation Model | The model of Green and DeNero is a third-order (i.e., 4-gram) Markov CRF , employing the following indicator features: |
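For reference, the distribution p(Y|X; θ) of a CRF is commonly written in the following log-linear form. This is the standard first-order linear-chain formulation, not a reconstruction of the excerpted paper's own equation; the feature functions f_k and the index k are assumed notation. A third-order model replaces the pair (y_{i-1}, y_i) with the window (y_{i-3}, ..., y_i).

```latex
p(Y \mid X; \theta)
  = \frac{1}{Z(X; \theta)}
    \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \theta_k \, f_k(y_{i-1}, y_i, X, i) \Big),
\qquad
Z(X; \theta)
  = \sum_{Y'} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \theta_k \, f_k(y'_{i-1}, y'_i, X, i) \Big)
```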
Introduction | The model is an extension of the character-level conditional random field ( CRF ) model of Green and DeNero (2012). |
Introduction | For example, in (Lafferty et al., 2001), when switching from a generatively trained hidden Markov model (HMM) to a discriminatively trained, linear chain, conditional random field ( CRF ) for part-of-speech tagging, their error drops from 5.7% to 5.6%. |
Introduction | When they add in only a small set of orthographic features, their CRF error rate drops considerably more to 4.3%, and their out-of-vocabulary error rate drops by more than half. |
The Model | We then define a conditional probability distribution over entire trees, using the standard CRF distribution, shown in (1). |