Abstract | We propose a log-linear model to compute the paraphrase likelihood of two patterns and exploit feature functions based on maximum likelihood estimation (MLE) and lexical weighting (LW).
Conclusion | We use a log-linear model to compute the paraphrase likelihood and exploit feature functions based on MLE and LW.
Conclusion | In addition, the log-linear model with the proposed feature functions significantly outperforms the conventional models. |
Experiments | 4.1 Evaluation of the Log-linear Model |
Experiments | As previously mentioned, in the log-linear model of this paper, we use both MLE-based and LW-based feature functions.
Experiments | In this section, we evaluate the log-linear model (LL-Model) and compare it with the MLE-based model (MLE-Model) presented by Bannard and Callison-Burch (2005).
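For reference, the MLE-based pivot estimate of Bannard and Callison-Burch (2005) marginalizes over shared foreign phrases f; a sketch of its usual form (reproduced from the cited work, not from this text):

```latex
\hat{p}(e_2 \mid e_1) = \sum_{f} p(f \mid e_1)\, p(e_2 \mid f)
```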
Introduction | parsing and English-foreign language word alignment, (2) aligned patterns induction, which produces English patterns along with the aligned pivot patterns in the foreign language, (3) paraphrase patterns extraction, in which paraphrase patterns are extracted based on a log-linear model.
Introduction | Secondly, we propose a log-linear model for computing the paraphrase likelihood. |
Introduction | Moreover, the log-linear model is more effective than the conventional model presented in (Bannard and Callison-Burch, 2005).
Proposed Method | In order to exploit more and richer information to estimate the paraphrase likelihood, we propose a log-linear model: |
Proposed Method | In this paper, four feature functions are used in our log-linear model, which include:
Abstract | Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. |
Introduction | Recently, great progress has been achieved in SMT, especially since Och and Ney (2002) proposed the log-linear model: almost all the state-of-the-art SMT systems are based on the log-linear model.
Introduction | Regardless of how successful the log-linear model is in SMT, it still has some shortcomings. |
Introduction | Compared with the log-linear model, it has more powerful expressive abilities and can deeply interpret and represent features with hidden units in neural networks.
Abstract | This paper describes log-linear models for a general-purpose sentence realizer based on dependency structures. |
Abstract | Then the best linearizations compatible with the relative order are selected by log-linear models.
Abstract | The log-linear models incorporate three types of feature functions, including dependency relations, surface words and headwords. |
Introduction | The other is a log-linear model with different syntactic and semantic features (Velldal and Oepen, 2005; Nakanishi et al., 2005; Cahill et al., 2007).
Introduction | Compared with the n-gram model, the log-linear model is more powerful in that it can easily integrate a variety of features and tune their weights to maximize the probability.
Introduction | This paper presents a general-purpose realizer based on log-linear models for directly linearizing dependency relations given dependency structures. |
Log-linear Models | We use log-linear models for selecting the sequence with the highest probability from all the possible linearizations of a subtree. |
Log-linear Models | 4.1 The Log-linear Model |
Log-linear Models | Log-linear models employ a set of feature functions to describe properties of the data, and a set of learned weights to determine the contribution of each feature. |
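The description above (feature functions describing the data, learned weights scaling each feature's contribution, normalized into a probability) can be sketched minimally as follows; all feature names, weights, and values here are illustrative, not taken from any of the quoted papers:

```python
import math

def score(weights, features):
    """Linear score: sum of weight * feature value for each active feature."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def log_linear_prob(weights, candidates):
    """Normalize exponentiated scores over all candidate outputs.

    `candidates` maps each output label to its feature dict.
    """
    scores = {y: score(weights, feats) for y, feats in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())  # partition function
    return {y: math.exp(s) / z for y, s in scores.items()}

# Toy usage: two candidate outputs described by two hypothetical features.
weights = {"lm": 0.6, "tm": 1.2}
candidates = {
    "good": {"lm": 1.0, "tm": 1.0},
    "bad": {"lm": 0.2, "tm": 0.1},
}
probs = log_linear_prob(weights, candidates)
```

The resulting `probs` is a proper distribution over the candidates, with higher-scoring feature vectors receiving more mass.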
Abstract | Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
Introduction | Log-linear models (a.k.a. maximum entropy models) are one of the most widely-used probabilistic models in the field of natural language processing (NLP).
Introduction | Log-linear models have a major advantage over other |
Introduction | Kazama and Tsujii (2003) describe a method for training a L1-regularized log-linear model with a bound constrained version of the BFGS algorithm (Nocedal, 1980).
Log-Linear Models | In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization. |
Log-Linear Models | A log-linear model defines the following probabilistic distribution over possible structures y for input x:
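The equation elided here is presumably the standard conditional log-linear form (a hedged reconstruction; w_i and f_i are generic weight and feature names, not taken from the source):

```latex
p(y \mid x) = \frac{\exp\left(\sum_i w_i f_i(x, y)\right)}{\sum_{y'} \exp\left(\sum_i w_i f_i(x, y')\right)}
```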
Log-Linear Models | The weights of the features in a log-linear model are optimized in such a way that they maximize the regularized conditional log-likelihood of the training data: |
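Given the L1 regularization discussed in this section, the objective referred to is presumably of this standard form (a sketch; C is a generic regularization constant, not the paper's notation):

```latex
w^{*} = \arg\max_{w} \left[ \sum_{j} \log p(y_j \mid x_j; w) \;-\; C \sum_{i} |w_i| \right]
```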
Argument Identification | where p_θ is a log-linear model normalized over the set R_y, with features described in Table 1.
Argument Identification | Inference Although our learning mechanism uses a local log-linear model, we perform inference globally on a per-frame basis by applying hard structural constraints.
Discussion | However, since the input representation is shared across all frames, every other training example from all the lexical units affects the optimal estimate, since they all modify the joint parameter matrix M. By contrast, in the log-linear models each label has its own set of parameters, and they interact only via the normalization constant. |
Discussion | They also use a log-linear model, but they incorporate a latent variable that uses WordNet (Fellbaum, 1998) to get lexical-semantic relationships and smooths over frames for ambiguous lexical units.
Discussion | Another difference is that when training the log-linear model , they normalize over all frames, while we normalize over the allowed frames for the current lexical unit. |
Experiments | The baselines use a log-linear model that models the following probability at training time: |
Experiments | For comparison with our model from §3, which we call WSABIE EMBEDDING, we implemented two baselines with the log-linear model . |
Experiments | So the second baseline has the same input representation as WSABIE EMBEDDING but uses a log-linear model instead of WSABIE. |
A Generic Phrase Training Procedure | Note that under the log-linear model, applying a threshold for filtering is equivalent to comparing the “likelihood” ratio.
Conclusions | In this paper, the problem of extracting phrase translation is formulated as an information retrieval process implemented with a log-linear model aiming for balanced precision and recall.
Discussions | The generic phrase training algorithm follows an information retrieval perspective as in (Venugopal et al., 2003) but aims to improve both precision and recall with the trainable log-linear model.
Discussions | Under the general framework, one can put as many features as possible together under the log-linear model to evaluate the quality of a phrase and a phrase pair.
Experimental Results | Our decoder is a phrase-based multi-stack implementation of the log-linear model similar to Pharaoh (Koehn et al., 2003). |
Experimental Results | Like other log-linear model based decoders, active features in our translation engine include translation models in two directions, lexicon weights in two directions, language model, lexicalized distortion models, sentence length penalty and other heuristics. |
Experimental Results | Since the translation engine implements a log-linear model , the discriminative training of feature weights in the decoder should be embedded in the whole end-to-end system jointly with the discriminative phrase table training process. |
Abstract | We build a log-linear model that incorporates these asymmetries for ranking German string realisations from input LFG F-structures.
Generation Ranking | (2007), a log-linear model based on the Lexical Functional Grammar (LFG) Framework (Kaplan and Bresnan, 1982). |
Generation Ranking | (2007) describe a log-linear model that uses linguistically motivated features and improves over a simple trigram language model baseline. |
Generation Ranking | We take this log-linear model as our starting point.
Generation Ranking Experiments | We tune the parameters of the log-linear model on a small development set of 63 sentences, and carry out the final evaluation on 261 unseen sentences. |
Generation Ranking Experiments | We evaluate the string chosen by the log-linear model against the original treebank string in terms of exact match and BLEU score (Papineni et al., |
Discriminative Synchronous Transduction | 3.1 A global log-linear model |
Discriminative Synchronous Transduction | Our findings echo those observed for latent variable log-linear models successfully used in monolingual parsing (Clark and Curran, 2007; Petrov et al., 2007). |
Discriminative Synchronous Transduction | This method has been demonstrated to be effective for (non-convex) log-linear models with latent variables (Clark and Curran, 2004; Petrov et al., 2007). |
Introduction | First, we develop a log-linear model of translation which is globally trained on a significant number of parallel sentences. |
Product of Experts | For these bigram or trigram overlap features, a similar log-linear model has to be normalized with a partition function, which considers the (unnormalized) scores of all possible target sentences, given the source sentence. |
QG for Paraphrase Modeling | We use log-linear models three times: for the configuration, the lexical semantics class, and the word.
QG for Paraphrase Modeling | (2007), we employ a 14-feature log-linear model over all logically possible combinations of the 14 WordNet relations (Miller, 1995). Similarly to Eq.
QG for Paraphrase Modeling | 14, we normalize this log-linear model based on the set of relations that are nonempty in WordNet for the word 3360-). |
Introduction | Word embedding is used as the input to learn translation confidence score, which is combined with commonly used features in the conventional log-linear model.
Our Model | The differences between our model and the conventional log-linear model include:
Phrase Pair Embedding | Instead of integrating the sparse features directly into the log-linear model, we use them as the input to learn a phrase pair embedding.
Phrase Pair Embedding | To train the neural network, we add the confidence scores to the conventional log-linear model as features. |
Related Work | Together with other commonly used features, the translation confidence score is integrated into a conventional log-linear model.
A Joint Model for Two Formalisms | Instead, we assume that the distribution over y_CFG is a log-linear model with parameters θ_CFG (i.e., a sub-vector of θ), namely:
Evaluation Setup | In this setup, the model reduces to a normal log-linear model for the target formalism. |
Experiment and Analysis | It is not surprising that Cahill’s model outperforms our log-linear model, because it relies heavily on handcrafted rules optimized for the dataset.
Features | Feature functions in log-linear models are designed to capture the characteristics of each derivation in the tree. |
Baselines | where m ranges over IN and OUT, p_m(ē|f) is an estimate from a component phrase table, and each λ_m is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003).
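The mixture that this clause glosses is presumably a log-linear combination of the component phrase-table estimates (a hedged reconstruction, not the paper's own equation):

```latex
p(\bar{e} \mid f) \;\propto\; \prod_{m \in \{\text{IN},\, \text{OUT}\}} p_m(\bar{e} \mid f)^{\lambda_m}
```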
Ensemble Decoding | In the typical log-linear model SMT, the posterior |
Ensemble Decoding | Since in log-linear models, the model scores are not normalized to form probability distributions, the scores that different models assign to each phrase-pair may not be in the same scale.
Experiments & Results 4.1 Experimental Setup | It was filtered to retain the top 20 translations for each source phrase using the TM part of the current log-linear model . |
Conclusion and Future Work | The consensus statistics are integrated into the conventional log-linear model as features. |
Experiments and Results | Instead of using graph-based consensus confidence as features in the log-linear model , we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)). |
Features and Training | Therefore, we can alternatively update graph-based consensus features and feature weights in the log-linear model . |
Graph-based Translation Consensus | Our MT system with graph-based translation consensus adopts the conventional log-linear model . |
Abstract | We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. |
Introduction | Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability. |
Lexical-phonetic model | (2008), we parameterize these distributions with a log-linear model . |
Lexical-phonetic model | In modern phonetics and phonology, these generalizations are usually expressed as Optimality Theory constraints; log-linear models such as ours have previously been used to implement stochas- |
A semantic span can include one or more eus. | Following Och and Ney (2002), our model is framed as a log-linear model: |
A semantic span can include one or more eus. | encourage the decoder to generate transitional words and phrases; the score is utilized as an additional feature h_k(e_s, f_t) in the log-linear model.
A semantic span can include one or more eus. | In general, according to formula (3), the translation quality based on the log-linear model is tightly related to the features chosen.
Conclusion | Our contributions can be summarized as: 1) the new translation rules are more discriminative and sensitive to cohesive information by converting the source string into a CSS-based tagged-flattened string; 2) the new additional features embedded in the log-linear model can encourage the decoder to produce transitional expressions.
Related Work | (2013) went beyond the log-linear model for SMT and proposed a novel additive neural networks based translation model, which overcomes some of the shortcomings suffered by the log-linear model: linearity and the lack of deep interpretation and representation in features.
Semi-Supervised Deep Auto-encoder Features Learning for SMT | Each translation rule in the phrase-based translation model has a set of features that are combined in the log-linear model (Och and Ney, 2002), and our semi-supervised DAE features can also be combined in this model.
Semi-Supervised Deep Auto-encoder Features Learning for SMT | To combine these learned features (DBN and DAE features) into the log-linear model, we need to eliminate the impact of the nonlinear learning mechanism.
Experimental Setup | The decoder’s log-linear model includes a standard feature set. |
Experimental Setup | The decoder’s log-linear model is tuned with MERT (Och, 2003). |
Experimental Setup | Both the decoder’s log-linear model and the re-ranking models are trained on the same development set. |
Experiments | We evaluate the performance of adding new topic-related features to the log-linear model and compare the translation accuracy with the method in (Xiao et al., 2012). |
Introduction | We integrate topic similarity features in the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. |
Topic Similarity Model with Neural Network | The similarity scores are integrated into the standard log-linear model for making translation decisions. |
Our Proposal: A Latent LC Approach | We address the task with a latent variable log-linear model , representing the LCs of the predicates. |
Our Proposal: A Latent LC Approach | The introduction of latent variables into the log-linear model leads to a non-convex objective function. |
Our Proposal: A Latent LC Approach | Once h has been fixed, the model collapses to a convex log-linear model . |
Batch Models | We use log-linear models over reasonable alternatives such as perceptron or SVM, following the practice of a wide range of previous work in related areas (Smith, 2004; Liu et al., 2005; Poon et al., 2009) including text classification in social media (Van Durme, 2012b; Yang and Eisenstein, 2013).
Batch Models | The corresponding log-linear model is defined as: |
Experimental Setup | We experiment with log-linear models defined in Eq. |
Experimental Setup | But instead of using just the PMI scores of bilingual NE pairs, as in our work, they employed a feature-rich log-linear model to capture bilingual correlations. |
Experimental Setup | Parameters in their log-linear model require training with bilingually annotated data, which is not readily available. |
Related Work | (2010a) presented a supervised learning method for performing joint parsing and word alignment using log-linear models over parse trees and an ITG model over alignment. |
Building Dialog Trees from Instructions | Given a single instruction i with category a_i, we use a log-linear model to represent the distribution
Understanding Initial Queries | We employ a log-linear model and try to maximize initial dialog state distribution over the space of all nodes in a dialog network: |
Understanding Query Refinements | Dialog State Update Model We use a log-linear model to maximize a dialog state distribution over the space of all nodes in a dialog network: |
Inference | We then report the corresponding chains c(a) as the system output. For learning, the gradient takes the standard form of the gradient of a log-linear model, a difference of expected feature counts under the gold annotation and under no annotation.
Introduction | We use a log-linear model that can be expressed as a factor graph. |
Models | The final log-linear model is given by the following formula: |
Markov Topic Regression - MTR | log-linear models with parameters, λ_i ∈ R^M, is
Markov Topic Regression - MTR | labeled data, 712?, based on the log-linear model in Eq. |
Semi-Supervised Semantic Labeling | The x is used as the input matrix of the k-th log-linear model (corresponding to the k-th semantic tag (topic)) to infer the β hyper-parameter of MTR in Eq.
Abstract | Och (2003) proposed using a log-linear model to incorporate multiple features for translation, and proposed a minimum error rate training (MERT) method to train the feature weights to optimize a desirable translation metric. |
Abstract | While the log-linear model itself is discriminative, the phrase and lexicon translation features, which are among the most important components of SMT, are derived from either generative models or heuristics (Koehn et al., 2003, Brown et al., 1993). |
Abstract | In that work, multiple features, most of which are derived from generative models, are incorporated into a log-linear model, and their relative weights are tuned discriminatively on a small tuning set.
Background | where Pr(e|f) is the probability that e is the translation of the given source string f. To model the posterior probability Pr(e|f), most of the state-of-the-art SMT systems utilize the log-linear model proposed by Och and Ney (2002), as follows,
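The Och and Ney (2002) formulation introduced here is standard in the SMT literature; a sketch of its usual form, using the feature and weight notation of the surrounding sentences:

```latex
\Pr(e \mid f) \approx p_{\lambda}(e \mid f)
  = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e, f)\right)}
         {\sum_{e'} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e', f)\right)}
```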
Background | In this paper, u denotes a log-linear model that has M fixed features {h_1(f,e), ..., h_M(f,e)}, λ = {λ_1, ..., λ_M} denotes the M parameters of u, and u(λ) denotes an SMT system based on u with parameters λ.
Background | In this paper, we use the term training set to emphasize the training of the log-linear model.
Collaborative Decoding | The requirement for a log-linear model aims to provide a natural way to integrate the new co-decoding features. |
Collaborative Decoding | Referring to the log-linear model formulation, the translation posterior P(e'|dk) can be computed as: |
Conclusion | In this paper, we present a framework of collaborative decoding, in which multiple MT decoders are coordinated to search for better translations by re-ranking partial hypotheses using augmented log-linear models with translation-consensus-based features.
Consensus Decoding Algorithms | The distribution P(e|f) can be induced from a translation system’s features and weights by exponentiating with base b to form a log-linear model:
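With a tunable base b (the same b whose tuning is reported later in this section), the induced distribution would take the following form (a hedged sketch, using generic feature and weight symbols):

```latex
P(e \mid f) = \frac{b^{\sum_m \lambda_m h_m(e, f)}}{\sum_{e'} b^{\sum_m \lambda_m h_m(e', f)}}
```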
Experimental Results | The log-linear model weights were trained using MIRA, a margin-based optimization procedure that accommodates many features (Crammer and Singer, 2003; Chiang et al., 2008). |
Experimental Results | We tuned b, the base of the log-linear model , to optimize consensus decoding performance. |
Cohesive Decoding | This count becomes a feature in the decoder’s log-linear model , the weight of which is trained with MERT. |
Experiments | Weights for the log-linear model are set using MERT, as implemented by Venugopal and Vogel (2005). |
Experiments | Since adding features to the decoder’s log-linear model is straightforward, we also experiment with a combined system that uses both the cohesion constraint and a lexical reordering model. |