Abstract | We propose treating the induced word boundaries as soft constraints to bias the continued learning of a supervised CRFs model, trained on the labeled treebank data, on the unlabeled bilingual data.
Introduction | Crucially, the GP expression with the bilingual knowledge is then used as side information to regularize the learning of a CRFs (conditional random fields) model over treebank and bitext data, based on the posterior regularization (PR) framework (Ganchev et al., 2010).
Introduction | This constrained learning amounts to a joint coupling of GP and CRFs, i.e., integrating GP into the estimation of a parametric structural model.
Methodology | Require: labeled treebank data and unlabeled bilingual data. Ensure: the CRFs model parameters θ. 1: A ← char_align_bitext(…) 2: r ← learn_word_bound(…) 3: Q ← encode_graph_constraint(…, r) 4: θ ← pr_crf_graph(…, Q)
Methodology | The GP expression will be defined in Section 3.3 as a PR constraint that reflects the interactions between the graph and the CRFs model.
Methodology | Supervised linear-chain CRFs can be trained with a standard conditional log-likelihood objective with a Gaussian prior:
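In standard notation (ours, since the excerpt omits the paper's exact formula), this objective is:

```latex
\mathcal{L}(\theta) \;=\; \sum_{n} \log p_\theta\!\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}\big) \;-\; \frac{\lVert \theta \rVert_2^2}{2\sigma^2}
```

where the first term is the conditional log-likelihood over the labeled sequences and the second term is the Gaussian prior on the parameters (an L2 regularizer with variance σ²).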
Related Work | (2008) enhanced a CRFs segmentation model in MT tasks by tuning the word granularity and improving the segmentation consistency.
Related Work | Rather than making "hard" use of the bilingual segmentation knowledge, i.e., directly merging "char-to-word" alignments into words as supervision, this study extracts word-boundary information for characters from the alignments as soft constraints to regularize a CRFs model's learning.
Related Work | (2014) proposed GP for inferring the label information of unlabeled data, and then leveraged these GP outcomes to learn a scalable semi-supervised model (e.g., CRFs).
A Sentence Trimmer with CRFs | Our idea on how to make CRFs comply with grammar is quite simple: we focus on only those label sequences that are associated with grammatically correct compressions, by making CRFs look at only those that comply with some grammatical constraints G, and ignore others, regardless of how probable they are.1 But how do we find compressions that are grammatical? |
A Sentence Trimmer with CRFs | 1Assume as usual that CRFs take the form, p(y|x) ∝ exp(Σ_k (Σ_j λ_j f_j(y_k, y_{k−1}, x) + Σ_i μ_i g_i(x_k, y_k, x)))
A Sentence Trimmer with CRFs | (1) where f_j and g_i are 'features' associated with edges and vertices, respectively, and k ∈ C, where C denotes a set of cliques in CRFs.
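The footnote's definition can be made concrete with a tiny brute-force sketch. The labels, features, and weights below are illustrative choices of ours, not taken from the paper; the point is only that p(y|x) is the exponentiated sum of edge and vertex feature scores, normalised over all label sequences.

```python
import itertools
import math

# Toy linear-chain CRF in the footnote's form:
# p(y|x) ∝ exp(Σ_k (Σ_j λ_j f_j(y_k, y_{k-1}, x) + Σ_i μ_i g_i(x_k, y_k, x)))
LABELS = ("KEEP", "DROP")

def edge_score(prev_y, y):
    # λ·f: a single hand-set transition weight favouring label continuity
    return 0.5 if prev_y == y else 0.0

def node_score(x_k, y):
    # μ·g: a single hand-set emission weight preferring to KEEP alphabetic tokens
    return 1.0 if (x_k.isalpha() and y == "KEEP") else 0.0

def score(x, ys):
    s = sum(node_score(xk, yk) for xk, yk in zip(x, ys))
    s += sum(edge_score(a, b) for a, b in zip(ys, ys[1:]))
    return s

def crf_prob(x, ys):
    # Normalise by enumerating every label sequence (feasible for toy inputs only)
    Z = sum(math.exp(score(x, cand))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, ys)) / Z
```

Brute-force normalisation is exponential in sequence length; real implementations replace it with the forward algorithm, but the probability it defines is the same.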
Abstract | This paper presents a novel sentence trimmer for Japanese, which combines a non-statistical yet generic tree generation model with Conditional Random Fields (CRFs), to improve the grammaticality of compression while retaining its relevance.
Features in CRFs | We use an array of features in CRFs that are either derived or borrowed from the taxonomy that JUMAN, a Japanese tokenizer, and KNP, a Japanese dependency parser (aka Kurohashi-Nagao Parser), use in characterizing their output: both JUMAN and KNP are part of the compression model we build.
Introduction | What sets this work apart from them, however, is a novel use we make of Conditional Random Fields (CRFs) to select among possible compressions (Lafferty et al., 2001; Sutton and McCallum, 2006).
Introduction | An obvious benefit of using CRFs for sentence compression is that the model provides a general (and principled) probabilistic framework which permits information from various sources to be integrated towards compressing a sentence, a property K&M do not share.
Introduction | Nonetheless, there is some cost that comes with the straightforward use of CRFs as a discriminative classifier in sentence compression; its outputs are often ungrammatical, and it allows no control over the length of the compressions it generates (Nomoto, 2007).
Abstract | Our model adopts a greedy bottom-up approach, with two linear-chain CRFs applied in cascade as local classifiers. |
Bottom-up tree-building | 4.1 Linear-chain CRFs as Local Models
Bottom-up tree-building | While our bottom-up tree-building shares the greedy framework with HILDA, unlike HILDA, our local models are implemented using CRFs.
Bottom-up tree-building | Therefore, our model incorporates the strengths of both HILDA and Joty et al.'s model, i.e., the efficiency of a greedy parsing algorithm, and the ability to incorporate sequential information with CRFs.
Experiments | For local models, our structure models are trained using MALLET (McCallum, 2002) to include constraints over transitions between adjacent labels, and our relation models are trained using CRFSuite (Okazaki, 2007), which is a fast implementation of linear-chain CRFs.
Introduction | DCRF (Dynamic Conditional Random Fields) is a generalization of linear-chain CRFs, in which each time slice contains a set of state variables and edges (Sutton et al., 2007).
Introduction | Second, by using two linear-chain CRFs to label a sequence of discourse constituents, we can incorporate contextual information in a more natural way, compared to using traditional discriminative classifiers, such as SVMs. |
Post-editing | is almost identical to their counterparts of the bottom-up tree-building, except that the linear-chain CRFs in post-editing include additional features to represent information from constituents on higher levels (to be introduced in Section 7).
Abstract | The derived label distributions are regarded as virtual evidences to regularize the learning of linear conditional random fields (CRFs) on unlabeled data.
Abstract | Empirical results on the Chinese Treebank (CTB-7) and Microsoft Research (MSR) corpora reveal that the proposed model can yield better results than the supervised baselines and other competitive semi-supervised CRFs in this task.
Background | The first-order CRFs model (Lafferty et al., 2001) has been the most common one in this task. |
Background | The goal is to learn a CRFs model of the form:
Introduction | The derived label distributions are regarded as prior knowledge to regularize the learning of a sequential model, conditional random fields (CRFs) in this case, on both
Introduction | Section 3 reviews the background, including supervised character-based joint S&T model based on CRFs and graph-based label propagation. |
Related Work | Sun and Xu (2011) enhanced a CWS model by interpolating statistical features of unlabeled data into the CRFs model. |
Related Work | also differs from other semi-supervised CRFs algorithms. |
Related Work | (2006), extended by Mann and McCallum (2007), reported a semi-supervised CRFs model which aims to guide the learning by minimizing the conditional entropy of unlabeled data. |
Experiments | In this section, we describe the baseline setup, the CRFs training results, the RNN training results |
Experiments | • Wapiti toolkit (Lavergne et al., 2010) used for CRFs; RNN is built by the RNNLIB toolkit.
Experiments | 5.2 CRFs Training Results |
Tagging-style Reordering Model | For this supervised learning task, we choose two approaches: conditional random fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006; Lavergne et al., 2010) and recurrent neural networks (RNN) (Elman, 1990; Jordan, 1990; Lang et al., 1990).
Tagging-style Reordering Model | For the first method, we adopt the linear-chain CRFs . |
Tagging-style Reordering Model | However, even for the simple linear-chain CRFs, the complexity of learning and inference grows quadratically with respect to the number of output labels and the number of structural features, which are defined over adjacent pairs of labels.
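The quadratic growth can be made explicit with a standard complexity sketch (ours, not quoted from the paper). For a sequence of length T over a label set 𝒴, each forward-backward pass must sum over all label pairs at every position:

```latex
O\big(T \cdot |\mathcal{Y}|^2\big)
```

and the number of transition (label-bigram) features likewise grows as |𝒴|², which is what makes large label sets expensive in both training and decoding.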
Abstract | Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels.
Abstract | In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features.
Abstract | Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets. |
Introduction | Conditional Random Fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) constitute a widely-used and effective approach for supervised structure learning tasks involving the mapping between complex objects such as strings and trees.
Introduction | An important property of CRFs is their ability to handle large and redundant feature sets and to integrate structural dependency between output labels. |
Introduction | However, even for simple linear-chain CRFs, the complexity of learning and inference grows quadratically with the number of output labels.
Abstract | In this paper, we propose a general framework based on Conditional Random Fields (CRFs) to detect the contexts and answers of questions from forum threads.
Abstract | We improve the basic framework with Skip-chain CRFs and 2D CRFs to better accommodate the features of forums for better performance.
Context and Answer Detection | We first discuss using Linear CRFs for context and answer detection, and then extend the basic framework to Skip-chain CRFs and 2D CRFs to better model our problem. |
Context and Answer Detection | 3.1 Using Linear CRFs |
Context and Answer Detection | For ease of presentation, we focus on detecting contexts using Linear CRFs . |
Introduction | First, we employ Linear Conditional Random Fields (CRFs) to identify contexts and answers, which can capture the relationships between contiguous sentences.
Introduction | We also extend the basic model to 2D CRFs to model dependency between contiguous questions in a forum thread for context and answer identification. |
Introduction | Experimental results show that 1) Linear CRFs outperform SVM and decision tree in both context and answer detection; 2) Skip-chain CRFs outperform Linear CRFs for answer finding, which demonstrates that context improves answer finding; 3) 2D CRF model improves the performance of Linear CRFs and the combination of 2D CRFs and Skip-chain CRFs achieves better performance for context detection. |
Introduction | Because multiple runs of separate linear-chain CRFs ignore the dependency between source sentences, the second approach we propose is to use a 2D CRF that models all pair relationships jointly. |
Introduction | Our experimental results show that our proposed sentence type tagging method works very well, even for the minority categories, and that using 2D CRF further improves performance over linear-chain CRFs for identifying dependency relation between sentences. |
Introduction | In Section 3, we introduce the use of CRFs for sentence type and dependency tagging. |
Related Work | Our study is different in several aspects: we are using forum domains, unlike most work of DA tagging on conversational speech; we use CRFs for sentence type tagging; and more importantly, we also propose to use different CRFs for sentence relation detection. |
Thread Structure Tagging | To automatically label sentences in a thread with their types, we adopt a sequence labeling approach, specifically linear-chain conditional random fields (CRFs), which have shown good performance in many other tasks (Lafferty et al., 2001).
Thread Structure Tagging | Linear-chain CRFs are a type of undirected graphical model.
Thread Structure Tagging | CRFs are a special case of undirected graphical models in which the potential functions are log-linear:
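In standard notation (not copied from the paper), a log-linear potential for a linear-chain CRF is:

```latex
\Psi_k(y_k, y_{k-1}, \mathbf{x}) \;=\; \exp\Big(\sum_j \lambda_j f_j(y_k, y_{k-1}, \mathbf{x})\Big)
```

so that p(y|x) is the product of the Ψ_k over all positions, divided by the normalization Z(x).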
Experiment | The feature templates in (Zhao et al., 2006) and (Zhang and Clark, 2007) are used in training the CRFs model and Perceptrons model, respectively. |
Introduction | Sun and Xu (2011) enhanced the segmentation results by interpolating statistics-based features derived from unlabeled data into a CRFs model.
Segmentation Models | This section briefly reviews two supervised models in these categories, a character-based CRFs model, and a word-based Perceptrons model, which are used in our approach. |
Segmentation Models | 2.1 Character-based CRFs Model |
Segmentation Models | Xue (2003) first proposed the use of the CRFs model (Lafferty et al., 2001) for character-based CWS.
Semi-supervised Learning via Co-regularizing Both Models | The model induction process is described in Algorithm 1: given labeled dataset Dl and unlabeled dataset Du, the first two steps are training a CRFs (character-based) and a Perceptrons (word-based) model on the labeled data Dl, respectively.
Semi-supervised Learning via Co-regularizing Both Models | Afterwards, the agreements A are used as a set of constraints to bias the learning of CRFs (§ 3.2) and Perceptron (§ 3.3) on the unlabeled data. |
Semi-supervised Learning via Co-regularizing Both Models | 3.2 CRFs with Constraints |
Abstract | We formulate surface realisation as a sequence labelling task and combine the use of conditional random fields (CRFs) with semantic trees.
Abstract | Due to their extended notion of context, CRFs are able to take the global utterance context into account and are less constrained by local features than other realisers. |
Cohesion across Utterances | This grammar defines the surface realisation space for the CRFs . |
Conclusion and Future Directions | We have argued that CRFs are well suited for this task because they are not restricted by independence assumptions. |
Conclusion and Future Directions | In addition, we may compare different sequence labelling algorithms for surface realisation (Nguyen and Guo, 2007) or segmented CRFs (Sarawagi and Cohen, 2005) and apply our method to more complex surface realisation domains such as text generation or summarisation. |
Evaluation | CRFs and other state-of-the-art methods, we also compare our system to two other baselines: |
Incremental Surface Realisation | Since CRFs are not restricted by the Markov condition, they are less constrained by local context than other models and can take nonlocal dependencies into account. |
Incremental Surface Realisation | While their extended context awareness can often make CRFs slow to train, they are fast at execution and therefore very applicable to the incremental scenario. |
Introduction | to surface realisation within incremental systems, because CRFs are able to model context across full as well as partial generator inputs which may undergo modifications during generation. |
Related Work | (2009) who also use CRFs to find the best surface realisation from a semantic tree. |
Introduction | While most of the state-of-the-art CWS systems used semi-Markov conditional random fields or latent variable conditional random fields, we simply use a single first-order conditional random field (CRF) for the joint modeling.
Introduction | The semi-Markov CRFs and latent variable CRFs relax the Markov assumption of CRFs to express more complicated dependencies, and therefore to achieve higher disambiguation power. |
Introduction | Alternatively, our plan is not to relax Markov assumption of CRFs , but to exploit more complicated dependencies via using refined high-dimensional features. |
System Architecture | 3.1 A Joint Model Based on CRFs |
System Architecture | First, we briefly review CRFs . |
System Architecture | CRFs were proposed as a method for structured classification that solves "the label bias problem" (Lafferty et al., 2001).
Introduction | Sequence tagging algorithms including HMMs (Rabiner, 1989), CRFs (Lafferty et al., 2001), and Collins's perceptron (Collins, 2002) have been widely employed in NLP applications.
Problem formulation | In this section, we formulate the sequential decoding problem in the context of perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001). |
Problem formulation | and a CRFs model is |
Problem formulation | For CRFs , Z (x) is an instance-specific normalization function |
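Because Z(x) sums over every possible label sequence, it is computed in practice with the forward algorithm rather than by enumeration. The sketch below is ours for illustration: `emit` and `trans` are hypothetical per-position label scores standing in for the learned feature weights, and the brute-force version serves only as a correctness check.

```python
import itertools
import math

def log_Z_forward(emit, trans):
    """Forward algorithm for log Z(x).

    emit: list with one {label: score} dict per position;
    trans: {prev_label: {label: score}} transition scores.
    """
    labels = list(emit[0])
    alpha = dict(emit[0])  # forward scores in log space at position 0
    for k in range(1, len(emit)):
        alpha = {
            y: math.log(sum(math.exp(alpha[p] + trans[p][y]) for p in labels))
               + emit[k][y]
            for y in labels
        }
    return math.log(sum(math.exp(v) for v in alpha.values()))

def log_Z_brute(emit, trans):
    """Reference implementation: enumerate every label sequence."""
    labels = list(emit[0])
    total = 0.0
    for seq in itertools.product(labels, repeat=len(emit)):
        s = emit[0][seq[0]]
        for k, (a, b) in enumerate(zip(seq, seq[1:]), start=1):
            s += trans[a][b] + emit[k][b]
        total += math.exp(s)
    return math.log(total)
```

The forward recursion costs O(T·|Y|²) instead of the O(|Y|^T) of enumeration, which is what makes CRF training tractable.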
Related work | A similar idea was applied to CRFs as well (Cohn, 2006; Jeong, 2009). |
Experimental Comparison with Unsupervised Learning | Figure 2: Comparison of GE training of the restricted and full CRFs with unsupervised learning of DMV. |
Generalized Expectation Criteria | GE has been applied to logistic regression models (Mann and McCallum, 2007; Druck et al., 2008) and linear chain CRFs (Mann and McCallum, 2008). |
Generalized Expectation Criteria | 3.1 GE in General CRFs |
Generalized Expectation Criteria | 3.2 Non-Projective Dependency Tree CRFs |
Introduction | Generalized expectation (GE) (Mann and McCallum, 2008; Druck et al., 2008) is a recently proposed framework for incorporating prior knowledge into the learning of conditional random fields (CRFs) (Lafferty et al., 2001).
The Chunking-based Segmentation for Chinese ONs | 4.3 The CRFs Model for Chunking |
The Chunking-based Segmentation for Chinese ONs | Considered a discriminative probabilistic model for joint sequence labeling, with the advantage of flexible feature fusion, Conditional Random Fields (CRFs) (Lafferty et al., 2001) are believed to be among the best probabilistic models for sequence labeling tasks.
The Chunking-based Segmentation for Chinese ONs | So the CRFs model is employed for chunking. |
The Framework of Our System | CRFs Chunking Model
Background | For this underlying model, we employ a chain-structured conditional random field (CRF), since CRFs have been shown to perform better than other simple unconstrained models, like hidden Markov models, for citation extraction (Peng and McCallum, 2004).
Background | The algorithms we present in later sections for handling soft global constraints and for learning the penalties of these constraints can be applied to general structured linear models, not just CRFs , provided we have an available algorithm for performing MAP inference. |
Citation Extraction Data | Later, CRFs were shown to perform better on CORA, improving the results from the HMM's token-level F1 of 86.6 to 91.5 with a CRF (Peng and McCallum, 2004).
Citation Extraction Data | This approach is limited in its use of an HMM as an underlying model, as it has been shown that CRFs perform significantly better, achieving 95.37 token-level accuracy on CORA (Peng and McCallum, 2004). |
Introduction | Hidden Markov models and linear-chain conditional random fields (CRFs) have previously been applied to citation extraction (Hetzner, 2008; Peng and McCallum, 2004).
Conclusions | application of CRFs, which are a major advance of recent years in machine learning.
Conclusions | A third contribution of our work is a demonstration that current CRF methods can be used straightforwardly for an important application and outperform state-of-the-art commercial and open-source software; we hope that this demonstration accelerates the widespread use of CRFs.
Introduction | Research on structured learning has been highly successful, with sequence classification as its most important and successful subfield, and with conditional random fields (CRFs) as the most influential approach to learning sequence classifiers.
Introduction | we show that CRFs can achieve extremely good performance on the hyphenation task. |
Causal Relations for Why-QA | We regard this task as a sequence labeling problem and use Conditional Random Fields (CRFs) (Lafferty et al., 2001) as a machine learning framework.
Causal Relations for Why-QA | In our task, CRFs take three sentences of a causal relation candidate as input and generate their cause-effect annotations with a set of possible cause-effect IOB labels, including Begin-Cause (BC), Inside-Cause (IC), Begin-Effect (BE), Inside-Effect (IE), and Outside (O).
Causal Relations for Why-QA | We used the three types of feature sets in Table 3 for training the CRFs, where j is in the range i − 4 ≤ j ≤ i + 4 for the current position i in a causal relation candidate.
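Once the CRF has tagged each token with the BC/IC/BE/IE/O scheme, the tags must be decoded back into cause and effect spans. The helper below is our own illustrative sketch of that decoding step (it is not from the paper); spans are returned as half-open (start, end) token ranges.

```python
def extract_spans(tags):
    """Decode BC/IC/BE/IE/O tags into cause and effect token spans."""
    spans = {"cause": [], "effect": []}
    cur_kind, start = None, None

    def close(end):
        # Finish the currently open span, if any
        nonlocal cur_kind, start
        if cur_kind is not None:
            spans[cur_kind].append((start, end))
        cur_kind, start = None, None

    for i, tag in enumerate(tags):
        if tag == "BC":            # Begin-Cause opens a new cause span
            close(i)
            cur_kind, start = "cause", i
        elif tag == "BE":          # Begin-Effect opens a new effect span
            close(i)
            cur_kind, start = "effect", i
        elif (tag == "IC" and cur_kind == "cause") or \
             (tag == "IE" and cur_kind == "effect"):
            pass                   # Inside tag extends the open span
        else:                      # Outside, or an inconsistent Inside tag
            close(i)
    close(len(tags))
    return spans
```

Treating an inconsistent Inside tag (e.g., IE with no open effect span) as a span boundary is one common repair policy; a real system might instead constrain the CRF's transitions so such sequences never occur.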
Introduction | We evaluate the effectiveness of our method by using linear-chain conditional random fields (CRFs) and three traditional NLP tasks, namely, text chunking (shallow parsing), named entity recognition, and POS tagging.
Log-Linear Models | CRF++ version 0.50, a popular CRF library developed by Taku Kudo, is reported to take 4,021 seconds on Xeon 3.0GHz processors to train the model using a richer feature set. CRFsuite version 0.4, a much faster library for CRFs, is reported to take 382 seconds on Xeon 3.0GHz, using the same feature set as ours. Their library uses the OWL-QN algorithm for optimization.
Log-Linear Models | (2006) report an f-score of 71.48 on the same data, using semi-Markov CRFs . |
Log-Linear Models | We have conducted experiments using CRFs and three NLP tasks, and demonstrated empirically that our training algorithm can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularization. |
Experiments | We think this shows one of the strengths of machine learning methods such as CRFs . |
Gazetteer Induction 2.1 Induction by MN Clustering | We expect that tagging models (CRFs in our case) can learn an appropriate weight for each gazetteer match, regardless of whether it is an NE or not.
Related Work and Discussion | Using models such as Semi-Markov CRFs (Sarawagi and Cohen, 2004), which handle the features on overlapping regions, is one possible direction. |
Using Gazetteers as Features of NER | The NER task is then treated as a tagging task, which assigns IOB tags to each character in a sentence. We use Conditional Random Fields (CRFs) (Lafferty et al., 2001) to perform this tagging.
Methodology | Given their general performance and discriminative framework, Conditional Random Fields (CRFs) (Lafferty et al., 2001) are a suitable framework for tackling sequence labeling problems.
Methodology | CRFs represent a basic, simple and well-understood framework for sequence labeling, making it a suitable framework for adapting to perform joint inference. |
Methodology | Figure 2: Graphical representations of the two types of CRFs used in this work. |
Experiments | We trained CRFs for opinion entity identification using the following features: indicators for words, POS tags, and lexicon features (the subjectivity strength of the word in the Subjectivity Lexicon). |
Model | We formulate the task of opinion entity identification as a sequence labeling problem and employ conditional random fields (CRFs) (Lafferty et al., 2001) to learn the probability of a sequence assignment y for a given sentence x.
Model | We define a potential function that gives the probability of assigning a span i the entity label z, where the probability is estimated based on the learned parameters from CRFs.
Introduction | The most widely used approaches to these problems have been sequential models including hidden Markov models (HMMs), maximum entropy Markov models (MEMMs) (McCallum, 2000), and conditional random fields (CRFs) (Lafferty et al.
Introduction | Because of this limitation, Viola and Narasimhan (2007) use a discriminative context-free (phrase structure) grammar for extracting information from semistructured data and report higher performance than CRFs.
Introduction | Contextual information often plays a big role in resolving tagging ambiguities and is one of the key benefits of discriminative models such as CRFs . |