Abstract | We propose a log-linear model to compute the paraphrase likelihood of two patterns and exploit feature functions based on maximum likelihood estimation (MLE) and lexical weighting (LW). |
Conclusion | We use a log-linear model to compute the paraphrase likelihood and exploit feature functions based on MLE and LW. |
Conclusion | In addition, the log-linear model with the proposed feature functions significantly outperforms the conventional models. |
Experiments | 4.1 Evaluation of the Log-linear Model |
Experiments | As previously mentioned, the log-linear model in this paper uses both MLE-based and LW-based feature functions.
Experiments | In this section, we evaluate the log-linear model (LL-Model) and compare it with the MLE-based model (MLE-Model) presented by Bannard and Callison-Burch (2005).
Introduction | parsing and English-foreign language word alignment, (2) aligned patterns induction, which produces English patterns along with the aligned pivot patterns in the foreign language, (3) paraphrase patterns extraction, in which paraphrase patterns are extracted based on a log-linear model. |
Introduction | Secondly, we propose a log-linear model for computing the paraphrase likelihood. |
Introduction | Moreover, the log-linear model is more effective than the conventional model presented in (Bannard and Callison-Burch, 2005).
Proposed Method | In order to exploit more and richer information to estimate the paraphrase likelihood, we propose a log-linear model: |
Proposed Method | In this paper, 4 feature functions are used in our log-linear model, which include: |
Abstract | Most statistical machine translation (SMT) systems are modeled using a log-linear framework. |
Abstract | Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. |
Abstract | additive neural networks, for SMT to go beyond the log-linear translation model. |
Introduction | Recently, great progress has been achieved in SMT, especially since Och and Ney (2002) proposed the log-linear model: almost all the state-of-the-art SMT systems are based on the log-linear model. |
Introduction | Regardless of how successful the log-linear model is in SMT, it still has some shortcomings. |
Introduction | Compared with the log-linear model, it has more powerful expressive abilities and can deeply interpret and represent features with hidden units in neural networks. |
Abstract | This paper describes log-linear models for a general-purpose sentence realizer based on dependency structures. |
Abstract | Then the best linearizations compatible with the relative order are selected by log-linear models. |
Abstract | The log-linear models incorporate three types of feature functions, including dependency relations, surface words and headwords. |
Introduction | The other is log-linear model with different syntactic and semantic features (Velldal and Oepen, 2005; Nakanishi et al., 2005; Cahill et al., 2007). |
Introduction | Compared with the n-gram model, the log-linear model is more powerful in that it can easily integrate a variety of features and tune feature weights to maximize the probability.
Introduction | This paper presents a general-purpose realizer based on log-linear models for directly linearizing dependency relations given dependency structures. |
Log-linear Models | We use log-linear models for selecting the sequence with the highest probability from all the possible linearizations of a subtree. |
Log-linear Models | 4.1 The Log-linear Model |
Log-linear Models | Log-linear models employ a set of feature functions to describe properties of the data, and a set of learned weights to determine the contribution of each feature. |
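As a reference point (a generic sketch, not the notation of any particular paper above), such a log-linear model scores a candidate $y$ for input $x$ with feature functions $f_k$ and learned weights $w_k$:
\[
p(y \mid x; \mathbf{w}) \;=\; \frac{\exp\!\left(\sum_{k} w_k f_k(x, y)\right)}{\sum_{y' \in \mathcal{Y}(x)} \exp\!\left(\sum_{k} w_k f_k(x, y')\right)},
\]
where $\mathcal{Y}(x)$ is the set of candidate linearizations.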
Argument Identification | where $p_\theta$ is a log-linear model normalized over the set $R_y$, with features described in Table 1.
Argument Identification | Inference Although our learning mechanism uses a local log-linear model, we perform inference globally on a per-frame basis by applying hard structural constraints. |
Discussion | We believe that the WSABIE EMBEDDING model performs better than the LOG-LINEAR EMBEDDING baseline (that uses the same input representation) because the former setting allows examples with different labels and confusion sets to share information; this is due to the fact that all labels live in the same label space, and a single projection matrix is shared across the examples to map the input features to this space. |
Discussion | Consequently, the WSABIE EMBEDDING model can share more information between different examples in the training data than the LOG-LINEAR EMBEDDING model. |
Discussion | Since the LOG-LINEAR WORDS model always performs better than the LOG-LINEAR EMBEDDING model, we conclude that the primary benefit does not come from the input embedding representation.
Experiments | The baselines use a log-linear model that models the following probability at training time: |
Experiments | For comparison with our model from §3, which we call WSABIE EMBEDDING, we implemented two baselines with the log-linear model. |
Experiments | We call this baseline LOG-LINEAR WORDS. |
Abstract | Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
Introduction | Log-linear models (a.k.a. maximum entropy models) are one of the most widely used probabilistic models in the field of natural language processing (NLP).
Introduction | Log-linear models have a major advantage over other |
Introduction | Kazama and Tsujii (2003) describe a method for training an L1-regularized log-linear model with a bound-constrained version of the BFGS algorithm (Nocedal, 1980).
Log-Linear Models | In this section, we briefly describe log-linear models used in NLP tasks and L1 regularization. |
Log-Linear Models | A log-linear model defines the following probabilistic distribution over possible structures y for input x:
Log-Linear Models | The weights of the features in a log-linear model are optimized in such a way that they maximize the regularized conditional log-likelihood of the training data: |
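Assuming standard notation not fixed by the excerpts above (weights $w_k$, feature functions $f_k$, training pairs $(x_i, y_i)$, and regularization constant $C$), the L1-regularized training objective just described can be sketched as:
\[
\mathcal{L}(\mathbf{w}) \;=\; \sum_{i} \log p(y_i \mid x_i; \mathbf{w}) \;-\; C \sum_{k} |w_k|,
\qquad
p(y \mid x; \mathbf{w}) \;=\; \frac{\exp\!\left(\sum_k w_k f_k(x, y)\right)}{\sum_{y'} \exp\!\left(\sum_k w_k f_k(x, y')\right)}.
\]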
Abstract | We investigate the influence of information status (IS) on constituent order in German, and integrate our findings into a log-linear surface realisation ranking model. |
Abstract | We build a log-linear model that incorporates these asymmetries for ranking German string realisations from input LFG F-structures.
Conclusions | By calculating strong asymmetries between pairs of IS labels, and establishing the most frequent syntactic characteristics of these asymmetries, we designed a new set of features for a log-linear ranking model. |
Generation Ranking | (2007), a log-linear model based on the Lexical Functional Grammar (LFG) Framework (Kaplan and Bresnan, 1982). |
Generation Ranking | (2007) describe a log-linear model that uses linguistically motivated features and improves over a simple trigram language model baseline. |
Generation Ranking | We take this log-linear model as our starting point.3 |
Generation Ranking Experiments | These are all automatically removed from the list of features to give a total of 130 new features for the log-linear ranking model. |
Generation Ranking Experiments | We train the log-linear ranking model on 7759 F-structures from the TIGER treebank. |
Generation Ranking Experiments | We tune the parameters of the log-linear model on a small development set of 63 sentences, and carry out the final evaluation on 261 unseen sentences. |
Experimental Evaluation | The features are combined in a log-linear way. |
Experimental Evaluation | In Table 5 we can see that the performance of the heuristic phrase model can be increased by 0.6 BLEU on TEST by filtering the phrase table to contain the same phrases as the count model and reoptimizing the log-linear model weights. |
Experimental Evaluation | Log-linear interpolation of the count model with the heuristic yields a further increase, showing an improvement of 1.3 BLEU on DEV and 1.4 BLEU on TEST over the baseline. |
Introduction | The translation process is implemented as a weighted log-linear combination of several models $h_m(e_1^I, s_1^K, f_1^J)$, including the logarithm of the phrase probability in source-to-target as well as in target-to-source direction.
Phrase Model Training | The log-linear interpolations $p_{\mathrm{int}}(\tilde{f} \mid \tilde{e})$ of the phrase translation probabilities are estimated as
Phrase Model Training | As a generalization of the fixed interpolation of the two phrase tables we also experimented with adding the two trained phrase probabilities as additional features to the log-linear framework. |
Phrase Model Training | With good log-linear feature weights, feature-wise combination should perform at least as well as fixed interpolation. |
Baselines | 2.1 Log-Linear Mixture |
Baselines | Log-linear translation model (TM) mixtures are of the form: |
Baselines | where $m$ ranges over IN and OUT, $p_m(\bar{e} \mid \bar{f})$ is an estimate from a component phrase table, and each $\lambda_m$ is a weight in the top-level log-linear model, set so as to maximize dev-set BLEU using minimum error rate training (Och, 2003).
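Based on the description above, the log-linear TM mixture presumably combines the component estimates as a weighted product (a sketch, with normalization omitted):
\[
p(\bar{e} \mid \bar{f}) \;\propto\; \prod_{m \in \{\mathrm{IN}, \mathrm{OUT}\}} p_m(\bar{e} \mid \bar{f})^{\lambda_m}.
\]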
Ensemble Decoding | In the typical log-linear SMT model, the posterior
Ensemble Decoding | log-linear mixture). |
Ensemble Decoding | Since in log-linear models, the model scores are not normalized to form probability distributions, the scores that different models assign to each phrase-pair may not be in the same scale. |
Experiments & Results 4.1 Experimental Setup | It was filtered to retain the top 20 translations for each source phrase using the TM part of the current log-linear model. |
Introduction | In addition to the basic approach of concatenation of in-domain and out-of-domain data, we also trained a log-linear mixture model (Foster and Kuhn, 2007) |
Related Work 5.1 Domain Adaptation | Two famous examples of such methods are linear mixtures and log-linear mixtures (Koehn and Schroeder, 2007; Civera and Juan, 2007; Foster and Kuhn, 2007) which were used as baselines and discussed in Section 2. |
Discriminative Synchronous Transduction | 3.1 A global log-linear model |
Discriminative Synchronous Transduction | Our log-linear translation model defines a conditional probability distribution over the target translations of a given source sentence. |
Discriminative Synchronous Transduction | Our findings echo those observed for latent variable log-linear models successfully used in monolingual parsing (Clark and Curran, 2007; Petrov et al., 2007). |
Discussion and Further Work | Such approaches have been shown to be effective in log-linear word-alignment models where only a small supervised corpus is available (Blunsom and Cohn, 2006).
Introduction | First, we develop a log-linear model of translation which is globally trained on a significant number of parallel sentences. |
A Generic Phrase Training Procedure | Note that under the log-linear model, applying a threshold for filtering is equivalent to comparing the “likelihood” ratio.
Conclusions | In this paper, the problem of extracting phrase translations is formulated as an information retrieval process implemented with a log-linear model aiming for balanced precision and recall.
Discussions | The generic phrase training algorithm follows an information retrieval perspective as in (Venugopal et al., 2003) but aims to improve both precision and recall with the trainable log-linear model. |
Discussions | Under the general framework, one can put as many features as possible together under the log-linear model to evaluate the quality of a phrase and a phrase pair.
Experimental Results | Our decoder is a phrase-based multi-stack implementation of the log-linear model similar to Pharaoh (Koehn et al., 2003). |
Experimental Results | Like other log-linear model based decoders, active features in our translation engine include translation models in two directions, lexicon weights in two directions, language model, lexicalized distortion models, sentence length penalty and other heuristics. |
Experimental Results | Since the translation engine implements a log-linear model, the discriminative training of feature weights in the decoder should be embedded in the whole end-to-end system jointly with the discriminative phrase table training process. |
Product of Experts | These features have to be included in estimating pkn-d, which has log-linear component models (Eq. |
Product of Experts | For these bigram or trigram overlap features, a similar log-linear model has to be normalized with a partition function, which considers the (unnormalized) scores of all possible target sentences, given the source sentence. |
QG for Paraphrase Modeling | We use log-linear models three times: for the configuration, the lexical semantics class, and the word.
QG for Paraphrase Modeling | (2007), we employ a 14-feature log-linear model over all logically possible combinations of the 14 WordNet relations (Miller, 1995). Similarly to Eq.
QG for Paraphrase Modeling | 14, we normalize this log-linear model based on the set of relations that are nonempty in WordNet for the word in question.
Abstract | We present a method to jointly learn features and weights directly from distributional data in a log-linear framework. |
Abstract | The model uses an Indian Buffet Process prior to learn the feature values used in the log-linear method, and is the first algorithm for learning phonological constraints without presupposing constraint structure. |
Introduction | These constraint-driven decisions can be modeled with a log-linear system. |
Introduction | We consider this question by examining the dominant framework in modern phonology, Optimality Theory (Prince and Smolensky, 1993, OT), implemented in a log-linear framework, MaxEnt OT (Goldwater and Johnson, 2003), with output forms’ probabilities based on a weighted sum of
Phonology and Optimality Theory 2.1 OT structure | In IBPOT, we use the log-linear EVAL developed by Goldwater and Johnson (2003) in their MaxEnt OT system.
The IBPOT Model | The weight vector w provides weights for both F and M. Probabilities of output forms are given by a log-linear function:
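In MaxEnt OT as described here, the probability of an output form would be a log-linear function of weighted constraint violations, roughly as follows (the violation counts $f_i(y, x)$ are an assumed notation):
\[
P(y \mid x) \;=\; \frac{\exp\!\left(-\sum_i w_i f_i(y, x)\right)}{\sum_{y'} \exp\!\left(-\sum_i w_i f_i(y', x)\right)}.
\]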
Data | While the character clustering stage is essentially performing proper noun coreference resolution, approximately 74% of references to characters in books come in the form of pronouns. To resolve this more difficult class at the scale of an entire book, we train a log-linear discriminative classifier only on the task of resolving pronominal anaphora (i.e., ignoring generic noun phrases such as the paint or the rascal).
Data | To manage the degrees of freedom in the model described in §4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books.
Experiments | A Basic persona model, which ablates author information but retains the same log-linear architecture; here, the n-vector is of size P + 1 and does not model author effects. |
Model | In order to separate out the effects that a character’s persona has on the words that are associated with them (as opposed to other factors, such as time period, genre, or author), we adopt a hierarchical Bayesian approach in which the words we observe are generated conditional on a combination of different effects captured in a log-linear (or “maximum entropy”) distribution. |
Model | This SAGE model can be understood as a log-linear distribution with three kinds of features (metadata, persona, and background).
Model | Notation: $P$, number of personas (hyperparameter); $D$, number of documents; $C_d$, number of characters in document $d$; $W_{d,c}$, number of (cluster, role) tuples for character $c$; $m_d$, metadata for document $d$ (ranges over $M$ authors); $\theta_d$, document $d$'s distribution over personas; $p_{d,c}$, character $c$'s persona; $j$, an index for a $\langle r, w \rangle$ tuple in the data; $w_j$, word cluster ID for tuple $j$; $r_j$, role for tuple $j \in$ {agent, patient, poss, pred}; $\eta$, coefficients for the log-linear language model; $\mu, \lambda$, Laplace mean and scale (for regularizing $\eta$); $\alpha$, Dirichlet concentration parameter.
Expected BLEU Training | We integrate the recurrent neural network language model as an additional feature into the standard log-linear framework of translation (Och, 2003). |
Expected BLEU Training | We summarize the weights of the recurrent neural network language model as $\theta = \{U, W, V\}$ and add the model as an additional feature to the log-linear translation model using the simplified notation $s_\theta(w_t) = s(w_t \mid w_1 \ldots w_{t-1}, h_{t-1})$:
Experiments | Log-linear weights are tuned with MERT. |
Experiments | Log-linear weights are estimated on the 2009 data set comprising 2525 sentences. |
Experiments | either lattices or the unique 100-best output of the phrase-based decoder and reestimate the log-linear weights by running a further iteration of MERT on the n-best list of the development set, augmented by scores corresponding to the neural network models.
Introduction | The expected BLEU objective provides an efficient way of achieving this for machine translation (Rosti et al., 2010; Rosti et al., 2011; He and Deng, 2012; Gao and He, 2013; Gao et al., 2014) instead of solely relying on traditional optimizers such as Minimum Error Rate Training (MERT) that only adjust the weighting of entire component models within the log-linear framework of machine translation (§3). |
Experiments | We evaluate the performance of adding new topic-related features to the log-linear model and compare the translation accuracy with the method in (Xiao et al., 2012). |
Introduction | We integrate topic similarity features in the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. |
Topic Similarity Model with Neural Network | The similarity scores are integrated into the standard log-linear model for making translation decisions. |
Topic Similarity Model with Neural Network | We incorporate the learned topic similarity scores into the standard log-linear framework for SMT. |
Topic Similarity Model with Neural Network | In addition to traditional SMT features, we add new topic-related features into the standard log-linear framework. |
A Joint Model for Two Formalisms | As is standard in such settings, the distribution will be log-linear in a set of features of these parses. |
A Joint Model for Two Formalisms | Instead, we assume that the distribution over $y_{\mathrm{CFG}}$ is a log-linear model with parameters $\theta_{\mathrm{CFG}}$ (i.e., a sub-vector of $\theta$), namely:
Evaluation Setup | In this setup, the model reduces to a normal log-linear model for the target formalism. |
Experiment and Analysis | It’s not surprising that Cahill’s model outperforms our log-linear model because it relies heavily on handcrafted rules optimized for the dataset. |
Features | Feature functions in log-linear models are designed to capture the characteristics of each derivation in the tree. |
Introduction | Word embedding is used as the input to learn translation confidence score, which is combined with commonly used features in the conventional log-linear model. |
Our Model | The difference between our model and the conventional log-linear model includes: |
Phrase Pair Embedding | Instead of integrating the sparse features directly into the log-linear model, we use them as the input to learn a phrase pair embedding. |
Phrase Pair Embedding | To train the neural network, we add the confidence scores to the conventional log-linear model as features. |
Related Work | Together with other commonly used features, the translation confidence score is integrated into a conventional log-linear model. |
Experimental Setup | Determining h for each predicate yields a regular log-linear binary classification model. |
Our Proposal: A Latent LC Approach | We address the task with a latent variable log-linear model, representing the LCs of the predicates. |
Our Proposal: A Latent LC Approach | The introduction of latent variables into the log-linear model leads to a non-convex objective function. |
Our Proposal: A Latent LC Approach | Once h has been fixed, the model collapses to a convex log-linear model. |
Markov Topic Regression - MTR | log-linear models with parameters $\lambda_k \in \mathbb{R}^M$, is
Markov Topic Regression - MTR | trained to predict $\beta_k^{(w_l)}$, for each word $w_l$ of a tag $s_k$: $\beta_k^{(w_l)} = \exp(f(w_l; \lambda_k))$ (2), where the log-linear function $f$ is: $f(w_l; \lambda_k) = \lambda_k^\top x_l$ (3)
Markov Topic Regression - MTR | labeled data, 712?, based on the log-linear model in Eq. |
Semi-Supervised Semantic Labeling | The $x$ is used as the input matrix of the $k$th log-linear model (corresponding to the $k$th semantic tag (topic)) to infer the $\beta$ hyper-parameter of MTR in Eq.
Translation Model Architecture | Our translation model is embedded in a log-linear model as is common for SMT, and treated as a single translation model in this log-linear combination. |
Translation Model Architecture | Log-linear weights are optimized using MERT (Och and Ney, 2003). |
Translation Model Architecture | Future work could involve merging our translation model framework with the online adaptation of other models, or the log-linear weights. |
Comparative Study | (4) The orientation probability is modeled in a log-linear framework using a set of $N$ feature functions $h_n(\cdot)$, $n = 1, \ldots, N$.
Comparative Study | Finally, in the log-linear framework (Equation 2) a new jump model is added which uses the reordered source sentence to calculate the cost. |
Tagging-style Reordering Model | The number of source words that have inconsistent labels is the penalty and is then added into the log-linear framework as a new feature. |
Translation System Overview | We model $\Pr(e_1^I \mid f_1^J)$ directly using a log-linear combination of several models (Och and Ney,
Inference | We then report the corresponding chains $C(a)$ as the system output. For learning, the gradient takes the standard form of the gradient of a log-linear model, a difference of expected feature counts under the gold annotation and under no annotation.
Introduction | We use a log-linear model that can be expressed as a factor graph. |
Models | Each unary factor $A_i$ has a log-linear form with features examining mention $i$, its selected antecedent $a_i$, and the document context $x$.
Models | The final log-linear model is given by the following formula: |
A semantic span can include one or more eus. | Following Och and Ney (2002), our model is framed as a log-linear model: |
A semantic span can include one or more eus. | encourage the decoder to generate transitional words and phrases; the score is utilized as an additional feature $h_k(e_s, f_t)$ in the log-linear model.
A semantic span can include one or more eus. | In general, according to formula (3), the translation quality based on the log-linear model is tightly related to the features chosen.
Conclusion | Our contributions can be summarized as: 1) the new translation rules are more discriminative and sensitive to cohesive information by converting the source string into a CSS-based tagged-flattened string; 2) the new additional features embedded in the log-linear model can encourage the decoder to produce transitional expressions.
Batch Models | Our goal is to assign each user of interest $v_i$ to a category based on $f$. Here we focus on a binary assignment into the categories Democratic (D) or Republican (R). The log-linear
Batch Models | We use log-linear models over reasonable alternatives such as perceptron or SVM, following the practice of a wide range of previous work in related areas (Smith, 2004; Liu et al., 2005; Poon et al., 2009), including text classification in social media (Van Durme, 2012b; Yang and Eisenstein, 2013).
Batch Models | The corresponding log-linear model is defined as: |
Experimental Setup | We experiment with log-linear models defined in Eq. |
A Log-Linear Model for Actions | Given a state $s = (\mathcal{E}, d, j, W)$, the space of possible next actions is defined by enumerating sub-spans of unused words in the current sentence (i.e., subspans of the $j$th sentence of $d$ not in $W$), and the possible commands and parameters in environment state $\mathcal{E}$. We model the policy distribution $p(a \mid s; \theta)$ over this action space in a log-linear fashion (Della Pietra et al., 1997; Lafferty et al., 2001), giving us the flexibility to incorporate a diverse range of features.
Abstract | We use a policy gradient algorithm to estimate the parameters of a log-linear model for action selection. |
Introduction | Our policy is modeled in a log-linear fashion, allowing us to incorporate features of both the instruction text and the environment. |
Reinforcement Learning | which is the derivative of a log-linear distribution. |
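For a log-linear policy $p(a \mid s; \theta) \propto e^{\theta \cdot \phi(s, a)}$, the derivative referred to here has the familiar feature-expectation form (a standard identity, not a formula quoted from the paper):
\[
\nabla_\theta \log p(a \mid s; \theta) \;=\; \phi(s, a) \;-\; \sum_{a'} p(a' \mid s; \theta)\, \phi(s, a').
\]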
Conclusion and Future Work | The consensus statistics are integrated into the conventional log-linear model as features. |
Experiments and Results | Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)). |
Features and Training | Therefore, we can alternatively update graph-based consensus features and feature weights in the log-linear model. |
Graph-based Translation Consensus | Our MT system with graph-based translation consensus adopts the conventional log-linear model. |
Abstract | We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. |
Introduction | Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability. |
Lexical-phonetic model | (2008), we parameterize these distributions with a log-linear model. |
Lexical-phonetic model | In modern phonetics and phonology, these generalizations are usually expressed as Optimality Theory constraints; log-linear models such as ours have previously been used to implement stochastic
Dependency Parsing | Given the training set $\{(x_i, y_i)\}_{i=1}^{N}$, parameter estimation for log-linear models generally revolves around optimization of a regularized conditional
Dependency Parsing | In this paper we use the dual exponentiated gradient (EG) descent, which is a particularly effective optimization algorithm for log-linear models (Collins et al., 2008).
Experiments | Some previous studies also found a log-linear relationship between unlabeled data (Suzuki and Isozaki, 2008; Suzuki et al., 2009; Bergsma et al., 2010; Pitler et al., 2010). |
Web-Derived Selectional Preference Features | The log-linear dependency parsing model is sensitive to inappropriately scaled features.
Collaborative Decoding | In our work, any Maximum A Posteriori (MAP) SMT model with log-linear formulation (Och, 2002) can be a qualified candidate for a baseline model. |
Collaborative Decoding | The requirement for a log-linear model aims to provide a natural way to integrate the new co-decoding features. |
Collaborative Decoding | Referring to the log-linear model formulation, the translation posterior P(e'|dk) can be computed as: |
Conclusion | In this paper, we present a framework of collaborative decoding, in which multiple MT decoders are coordinated to search for better translations by re-ranking partial hypotheses using augmented log-linear models with translation consensus -based features. |
Analysis | We want to further study what happens after we integrate the constraint feature (our SDB model and Marton and Resnik’s XP+) into the log-linear translation model.
Introduction | These constituent matching/violation counts are used as a feature in the decoder’s log-linear model and their weights are tuned via minimal error rate training (MERT) (Och, 2003). |
Introduction | Similar to previous methods, our SDB model is integrated into the decoder’s log-linear model as a feature so that we can inherit the idea of soft constraints. |
The Syntax-Driven Bracketing Model 3.1 The Model | new feature into the log-linear translation model: $P_{SDB}(b \mid T)$. This feature is computed by the SDB model described in equation (3) or equation (4), which estimates the probability that a source span is to be translated as a unit within particular syntactic contexts.
PCS Induction | The feature-based model replaces the emission distribution with a log-linear model, such that: |
PCS Induction | This locally normalized log-linear model can look at various aspects of the observation $x$, incorporating overlapping features of the observation.
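A locally normalized log-linear emission of this kind can be sketched as follows, with emission features $f(x, y)$ and weights $\theta$ assumed (not necessarily the paper's exact notation):
\[
p(x \mid y; \theta) \;=\; \frac{\exp\!\left(\theta^\top f(x, y)\right)}{\sum_{x'} \exp\!\left(\theta^\top f(x', y)\right)}.
\]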
PCS Induction | We adopted this state-of-the-art model because it makes it easy to experiment with various ways of incorporating our novel constraint feature into the log-linear emission model. |
Cohesive Decoding | This count becomes a feature in the decoder’s log-linear model, the weight of which is trained with MERT. |
Experiments | Weights for the log-linear model are set using MERT, as implemented by Venugopal and Vogel (2005). |
Experiments | Since adding features to the decoder’s log-linear model is straightforward, we also experiment with a combined system that uses both the cohesion constraint and a lexical reordering model. |
Experimental Setup | The decoder’s log-linear model includes a standard feature set. |
Experimental Setup | The decoder’s log-linear model is tuned with MERT (Och, 2003). |
Experimental Setup | Both the decoder’s log-linear model and the re-ranking models are trained on the same development set. |
Abstract | Our approach defines a log-linear model over latent extraction predicates, which select lists of entities from the web page. |
Approach | Given a query $x$ and a web page $w$, we define a log-linear distribution over all extraction predicates $z \in Z(w)$ as
Approach | To construct the log-linear model, we define a feature vector $\phi(x, w, z)$ for each query $x$, web page $w$, and extraction predicate $z$.
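Given the feature vector $\phi(x, w, z)$, the log-linear distribution over extraction predicates presumably takes the usual softmax form (the weight vector $\theta$ is an assumed notation):
\[
p(z \mid x, w; \theta) \;=\; \frac{\exp\!\left(\theta^\top \phi(x, w, z)\right)}{\sum_{z' \in Z(w)} \exp\!\left(\theta^\top \phi(x, w, z')\right)}.
\]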
Related Work | (2013) went beyond the log-linear model for SMT and proposed a novel additive neural networks based translation model, which overcomes some of the shortcomings suffered by the log-linear model: linearity and the lack of deep interpretation and representation in features.
Semi-Supervised Deep Auto-encoder Features Learning for SMT | Each translation rule in the phrase-based translation model has a set number of features that are combined in the log-linear model (Och and Ney, 2002), and our semi-supervised DAE features can also be combined in this model. |
Semi-Supervised Deep Auto-encoder Features Learning for SMT | To combine these learned features (DBN and DAE features) into the log-linear model, we need to eliminate the impact of the nonlinear learning mechanism.
Consensus Decoding Algorithms | The distribution $P(e \mid f)$ can be induced from a translation system’s features and weights by exponentiating with base $b$ to form a log-linear model:
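With feature functions $h_i$, weights $\lambda_i$, and base $b$ (notation assumed from the surrounding excerpts), the induced distribution can be sketched as:
\[
P(e \mid f) \;\propto\; b^{\sum_i \lambda_i h_i(e, f)}.
\]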
Experimental Results | The log-linear model weights were trained using MIRA, a margin-based optimization procedure that accommodates many features (Crammer and Singer, 2003; Chiang et al., 2008). |
Experimental Results | We tuned b, the base of the log-linear model, to optimize consensus decoding performance. |
Related Work | Both are close to our work; however, our model generates reordering features that are integrated into the log-linear translation model during decoding. |
Unified Linguistic Reordering Models | For models with syntactic reordering, we add two new features (i.e., one for the leftmost reordering model and the other for the rightmost reordering model) into the log-linear translation model in Eq. |
Unified Linguistic Reordering Models | For the semantic reordering models, we also add two new features into the log-linear translation model. |
Method | (2013a) propose two log-linear models, namely the Skip-gram and CBOW model, to efficiently induce word embeddings. |
Method | The Skip-gram model adopts log-linear classifiers to predict context words given the current word w(t) as input. |
Method | Then, log-linear classifiers are employed, taking the embedding as input and predict w(t)’s context words within a certain range, e.g. |
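The Skip-gram classifier described here is commonly written as a log-linear (softmax) model over the vocabulary $V$, with input and output embeddings $v_w$ and $v'_w$ (a standard formulation, not quoted from this excerpt):
\[
p(w_{t+j} \mid w_t) \;=\; \frac{\exp\!\left({v'_{w_{t+j}}}^{\!\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left({v'_{w}}^{\!\top} v_{w_t}\right)}.
\]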
Algorithm | Specifically, we modify the log-linear policy $p(a \mid s; q, \theta)$ by adding lookahead features $\phi(s, a, q)$ which complement the local features used in the previous model.
Background | A Log-Linear Parameterization The policy |
Background | function used for action selection is defined as a log-linear distribution over actions: $p(a \mid s; \theta) = \frac{e^{\theta \cdot \phi(s, a)}}{\sum_{a'} e^{\theta \cdot \phi(s, a')}}$
Background | where $\Pr(e \mid f)$ is the probability that $e$ is the translation of the given source string $f$. To model the posterior probability $\Pr(e \mid f)$, most of the state-of-the-art SMT systems utilize the log-linear model proposed by Och and Ney (2002), as follows,
Background | In this paper, $u$ denotes a log-linear model that has $M$ fixed features $\{h_1(f,e), \ldots, h_M(f,e)\}$, $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ denotes the $M$ parameters of $u$, and $u(\lambda)$ denotes an SMT system based on $u$ with parameters $\lambda$.
Background | In this paper, we use the term training set to emphasize the training of log-linear model. |
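With the $M$ feature functions $h_m$ and weights $\lambda_m$ defined above, the Och and Ney (2002) log-linear model referred to throughout these excerpts takes the standard form:
\[
\Pr(e \mid f) \;=\; \frac{\exp\!\left(\sum_{m=1}^{M} \lambda_m h_m(f, e)\right)}{\sum_{e'} \exp\!\left(\sum_{m=1}^{M} \lambda_m h_m(f, e')\right)}.
\]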
Learning | The noise model that $\phi_c$ parameterizes is a local log-linear model, so we follow the approach of Berg-Kirkpatrick et al.
Model | The fact that the parameterization is log-linear will ensure that, during the unsupervised learning process, updating the shape parameters $\phi_c$ is simple and feasible.
Results and Analysis | (2010), we use a regularization term in the optimization of the log-linear model parameters $\phi_c$ during the M-step.
Detailed generative story | This is a conditional log-linear model parameterized by $\phi$, where $\phi_k \sim \mathcal{N}(0, \sigma^2)$.
Detailed generative story | When the current symbol is the special end-of-string symbol #, the only allowed edits are insertion and substitution. We define the edit probability using a locally normalized log-linear model:
Experiments | We leave other hyperparameters fixed: 16 latent topics, and Gaussian priors $\mathcal{N}(0, 1)$ on all log-linear parameters.
Conclusions | Future work directions include investigating the impact of hierarchical phrases for our models as well as any gains from additional features in the log-linear decoding model. |
Experiments | The induced joint translation model can be used to recover $\arg\max_e p(e \mid f)$, as it is equal to $\arg\max_e p(e, f)$. We employ the induced probabilistic HR-SCFG $G$ as the backbone of a log-linear, feature-based translation model, with the derivation probability $p(D)$ under the grammar estimate being
Experiments | We train the feature weights under MERT and decode with the resulting log-linear model. |
Experimental Setup | But instead of using just the PMI scores of bilingual NE pairs, as in our work, they employed a feature-rich log-linear model to capture bilingual correlations. |
Experimental Setup | Parameters in their log-linear model require training with bilingually annotated data, which is not readily available. |
Related Work | (2010a) presented a supervised learning method for performing joint parsing and word alignment using log-linear models over parse trees and an ITG model over alignment. |
Building Dialog Trees from Instructions | Given a single instruction $i$ with category $a_i$, we use a log-linear model to represent the distribution
Understanding Initial Queries | We employ a log-linear model and try to maximize initial dialog state distribution over the space of all nodes in a dialog network: |
Understanding Query Refinements | Dialog State Update Model We use a log-linear model to maximize a dialog state distribution over the space of all nodes in a dialog network: |
Abstract | Och (2003) proposed using a log-linear model to incorporate multiple features for translation, and proposed a minimum error rate training (MERT) method to train the feature weights to optimize a desirable translation metric. |
Abstract | While the log-linear model itself is discriminative, the phrase and lexicon translation features, which are among the most important components of SMT, are derived from either generative models or heuristics (Koehn et al., 2003, Brown et al., 1993). |
Abstract | In that work, multiple features, most of which are derived from generative models, are incorporated into a log-linear model, and their relative weights are tuned discriminatively on a small tuning set.
Corpus Data and Baseline SMT | Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric. |
Incremental Topic-Based Adaptation | We add this feature to the log-linear translation model with its own weight, which is tuned with MERT. |
Introduction | Translation phrase pairs that originate in training conversations whose topic distribution is similar to that of the current conversation are given preference through a single similarity feature, which augments the standard phrase-based SMT log-linear model. |
Data and task | (2004) to train a log-linear model for projection. |
Data and task | Compared to the joint log-linear model of Burkett et al. |
Introduction | (2010a), our model can incorporate both monolingual and bilingual features in a log-linear framework. |
Experiments | One is the log-linear combination of TMs trained on each subcorpus (Koehn and Schroeder, 2007), with weights of each model tuned under minimal error rate training using MIRA. |
Introduction | Research on mixture models has considered both linear and log-linear mixtures. |
Introduction | (Koehn and Schroeder, 2007), instead, opted for combining the sub-models directly in the SMT log-linear framework. |
Learning and Inference | These features are incorporated as a log-linear distribution
Learning and Inference | For each word in a mention, we introduced 12 binary features $f$ for our featurized log-linear distribution (§3.1.2).
Model | This uses a log-linear distribution with partition function Z. |