Online Relative Margin Maximization for Statistical Machine Translation
Vladimir Eidelman, Yuval Marton, and Philip Resnik

Article Structure

Abstract

Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data.

Introduction

The desire to incorporate high-dimensional sparse feature representations into statistical machine translation (SMT) models has driven recent research away from Minimum Error Rate Training (MERT) (Och, 2003), and toward other discriminative methods that can optimize more features.

Learning in SMT

Given an input sentence in the source language $x \in X$, we want to produce a translation $y \in \mathcal{Y}(x)$ using a linear model parameterized by a weight vector $w$:
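
The display equation that followed this sentence did not survive extraction. As a hedged placeholder only, here is a minimal LaTeX sketch of the standard linear decoding rule such a sentence usually introduces, written with the score function s(x, y, d) over derivations d referred to elsewhere on this page; the paper's exact notation may differ:

    \hat{y} = \arg\max_{y \in \mathcal{Y}(x)} \; \max_{d \in \mathcal{D}(x)} \; s(x, y, d),
    \qquad s(x, y, d) = \mathbf{w}^{\top} \mathbf{f}(x, y, d)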

The Relative Margin Machine in SMT

3.1 Relative Margin Machine

Experiments

4.1 Setup

Additional Experiments

In order to explore the applicability of our approach to a wider range of languages, we also evaluated its performance on Arabic-English translation.

Discussion

The trend of the results, summarized as RM gain over other optimizers averaged over all test sets, is presented in Table 5.

Conclusions and Future Work

We have introduced RM, a novel online margin-based algorithm designed for optimizing high-dimensional feature spaces, which introduces constraints into a large-margin optimizer that bound the spread of the projection of the data while maximizing the margin.
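
To make the margin-plus-spread idea concrete, the following Python sketch shows an online update in that spirit. It is an illustration, not the paper's closed-form update: the function name, the hope/fear/worst feature vectors, and the exact step rules are assumptions; the roles of the spread step size D and the spread bound B follow the quantities quoted on this page, while the margin step size C is a generic MIRA-style aggressiveness parameter.

    import numpy as np

    def rm_style_update(w, f_hope, f_fear, f_worst, cost, C=0.01, D=0.01, B=1.0):
        """Illustrative online update: a MIRA-like margin step plus a step that
        bounds the spread of the projected data (hypothetical sketch)."""
        # Margin step: if the cost-augmented "fear" candidate violates the
        # margin against the "hope" candidate, move toward the hope features.
        d_margin = f_hope - f_fear
        hinge = cost - w.dot(d_margin)
        if hinge > 0:
            tau = min(C, hinge / (d_margin.dot(d_margin) + 1e-12))
            w = w + tau * d_margin
        # Spread step: if the projected distance between the hope candidate and
        # the worst-scoring candidate exceeds the bound B, take a step (capped
        # by the bound step size D) that shrinks that spread.
        d_spread = f_hope - f_worst
        spread = w.dot(d_spread)
        if spread > B:
            sigma = min(D, (spread - B) / (d_spread.dot(d_spread) + 1e-12))
            w = w - sigma * d_spread
        return w

    # Toy usage with three features.
    w = rm_style_update(np.zeros(3),
                        f_hope=np.array([1.0, 0.0, 1.0]),
                        f_fear=np.array([0.0, 1.0, 1.0]),
                        f_worst=np.array([-1.0, 2.0, 0.0]),
                        cost=0.5)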

Topics

feature set

Appears in 25 sentences as: feature set (15) Feature Sets (1) feature sets (5) feature setting (4) feature settings (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
    Page 1, “Abstract”
  2. Chinese-English translation experiments show that our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and high-dimensional (sparse) feature space (§4).
    Page 2, “Introduction”
  3. The instability of MERT in larger feature sets (Foster and Kuhn, 2009; Hopkins and May, 2011) has motivated many alternative tuning methods for SMT.
    Page 2, “Learning in SMT”
  4. To evaluate the advantage of explicitly accounting for the spread of the data, we conducted several experiments on two Chinese-English translation test sets, using two different feature sets in each.
    Page 6, “Experiments”
  5. We selected the bound step size D, based on performance on a held-out dev set, to be 0.01 for the basic feature set and 0.1 for the sparse feature set.
    Page 6, “Experiments”
  6. 4.2 Feature Sets
    Page 6, “Experiments”
  7. We experimented with a small (basic) feature set, and a large (sparse) feature set.
    Page 6, “Experiments”
  8. For the small feature set, we use 14 features, including a language model, 5 translation model features, penalties for unknown words, the glue rule, and rule arity.
    Page 6, “Experiments”
  9. For experiments with a larger feature set, we introduced additional lexical and non-lexical sparse Boolean features of the form commonly found in the literature (Chiang et al., 2009; Watanabe et al., 2007); an illustrative template sketch follows this list.
    Page 6, “Experiments”
  10. The results are especially notable for the basic feature setting — up to 1.2 BLEU and 4.6 TER improvement over MERT — since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al., 2008).
    Page 7, “Experiments”
  11. (footnote 5) In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
    Page 7, “Experiments”
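
The sparse Boolean templates mentioned in item 9 above are not enumerated on this page. As a purely illustrative Python example of what one such lexical template can look like (the feature names and pairing scheme are hypothetical, not the paper's actual templates):

    def word_pair_features(src_words, tgt_words):
        """Hypothetical sparse Boolean template: one indicator feature per
        (source word, target word) pair co-occurring in an applied rule."""
        feats = {}
        for s in src_words:
            for t in tgt_words:
                feats["wp:%s|%s" % (s, t)] = 1.0
        return feats

    # word_pair_features(["zhengfu"], ["government"])
    # -> {"wp:zhengfu|government": 1.0}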

BLEU

Appears in 18 sentences as: BLEU (23)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
    Page 1, “Abstract”
  2. We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus.
    Page 6, “Experiments”
  3. The bound constraint B was set to 1. The approximate sentence-level BLEU cost Δ is computed in a manner similar to (Chiang et al., 2009), namely, in the context of previous 1-best translations of the tuning set (a hedged sketch of this kind of pseudo-document BLEU appears after this list).
    Page 6, “Experiments”
  4. We explored alternative values for B, as well as scaling it by the current candidate’s cost, and found that the optimizer is fairly insensitive to these changes, resulting in only minor differences in BLEU .
    Page 6, “Experiments”
  5. As can be seen from the results in Table 3, our RM method was the best performer in all Chinese-English tests according to all measures — up to 1.9 BLEU and 6.6 TER over MIRA — even though we only optimized for BLEU. Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM.
    Page 7, “Experiments”
  6. The results are especially notable for the basic feature setting — up to 1.2 BLEU and 4.6 TER improvement over MERT — since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al., 2008).
    Page 7, “Experiments”
  7. (footnote 5) In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
    Page 7, “Experiments”
  8. In preliminary experiments with a smaller trigram LM, our RM method consistently yielded the highest scores in all Chinese-English tests — up to 1.6 BLEU and 6.4 TER from MIRA, the second best performer.
    Page 7, “Experiments”
  9. As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM.
    Page 7, “Additional Experiments”
  10. On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
    Page 7, “Additional Experiments”
  11. Interestingly, RM achieved substantially higher BLEU precision scores in all tests for both language pairs.
    Page 7, “Additional Experiments”
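
The sentence-level cost described in item 3 above (BLEU computed "in the context of previous 1-best translations of the tuning set") is the pseudo-document style of approximation associated with Chiang et al. (2009). The sketch below illustrates the general idea only; it is not the implementation used in the paper, and the decay constant and smoothing terms are assumptions.

    import math
    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    class PseudoDocBLEU:
        """Approximate sentence-level BLEU smoothed by an exponentially decayed
        'pseudo-document' of previous 1-best translations (illustrative sketch)."""

        def __init__(self, max_n=4, decay=0.9):
            self.max_n = max_n
            self.decay = decay
            self.match = [0.0] * max_n   # running clipped n-gram matches
            self.total = [0.0] * max_n   # running hypothesis n-gram counts
            self.hyp_len = 0.0
            self.ref_len = 0.0

        def _stats(self, hyp, ref):
            match, total = [], []
            for n in range(1, self.max_n + 1):
                h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
                match.append(sum(min(c, r[g]) for g, c in h.items()))
                total.append(max(len(hyp) - n + 1, 0))
            return match, total

        def score(self, hyp, ref):
            """Sentence BLEU of hyp against ref, in the pseudo-document context."""
            match, total = self._stats(hyp, ref)
            log_prec = 0.0
            for n in range(self.max_n):
                num = self.match[n] + match[n] + 1e-9
                den = self.total[n] + total[n] + 1e-9
                log_prec += math.log(num / den)
            hyp_len = self.hyp_len + len(hyp)
            ref_len = self.ref_len + len(ref)
            bp = math.exp(min(0.0, 1.0 - ref_len / max(hyp_len, 1e-9)))  # brevity penalty
            return bp * math.exp(log_prec / self.max_n)

        def update(self, hyp, ref):
            """Fold the current 1-best translation into the decayed context."""
            match, total = self._stats(hyp, ref)
            for n in range(self.max_n):
                self.match[n] = self.decay * (self.match[n] + match[n])
                self.total[n] = self.decay * (self.total[n] + total[n])
            self.hyp_len = self.decay * (self.hyp_len + len(hyp))
            self.ref_len = self.decay * (self.ref_len + len(ref))

Typical use during tuning would be to score each candidate for sentence i with score(), then call update() with the chosen 1-best translation before moving on to sentence i+1.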

TER

Appears in 11 sentences as: TER (13)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
    Page 1, “Abstract”
  2. As can be seen from the results in Table 3, our RM method was the best performer in all Chinese-English tests according to all measures — up to 1.9 BLEU and 6.6 TER over MIRA — even though we only optimized for BLEU. Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM.
    Page 7, “Experiments”
  3. The results are especially notable for the basic feature setting — up to 1.2 BLEU and 4.6 TER improvement over MERT — since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al., 2008).
    Page 7, “Experiments”
  4. (footnote 5) In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
    Page 7, “Experiments”
  5. In preliminary experiments with a smaller trigram LM, our RM method consistently yielded the highest scores in all Chinese-English tests — up to 1.6 BLEU and 6.4 TER from MIRA, the second best performer.
    Page 7, “Experiments”
  6. As can be seen in Table 4, in the smaller feature set, RM and MERT were the best performers, with the exception that on MT08, MIRA yielded somewhat better (+0.7) BLEU but a somewhat worse (-0.9) TER score than RM.
    Page 7, “Additional Experiments”
  7. On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
    Page 7, “Additional Experiments”
  8. RM’s loss was only up to 0.8 BLEU (0.7 TER) from MERT or MIRA, while its gains were up to 1.7 BLEU and 2.1 TER over MIRA.
    Page 7, “Discussion”
  9. Table 5 (RM gain over other optimizers, averaged over all test sets):
         Optimizer   Small set (BLEU / TER)   Large set (BLEU / TER)
         MERT        0.4 / 2.6                 -   / -
         MIRA        0.5 / 3.0                 1.4 / 4.3
         PRO         1.4 / 2.9                 2.0 / 1.7
         RAMPION     0.6 / 1.6                 1.2 / 2.8
    Page 8, “Discussion”
  10. Error Analysis: The inconclusive advantage of RM over MIRA (in BLEU vs. TER scores) on Arabic-English MT08 calls for a closer look.
    Page 8, “Discussion”
  11. Experimentation in statistical MT yielded significant improvements over several other state-of-the-art optimizers, especially in a high-dimensional feature space (up to 2 BLEU and 4.3 TER on average).
    Page 9, “Conclusions and Future Work”

Chinese-English

Appears in 7 sentences as: Chinese-English (7)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
    Page 1, “Abstract”
  2. Chinese-English translation experiments show that our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and high-dimensional (sparse) feature space (§4).
    Page 2, “Introduction”
  3. To evaluate the advantage of explicitly accounting for the spread of the data, we conducted several experiments on two Chinese-English translation test sets, using two different feature sets in each.
    Page 6, “Experiments”
  4. As can be seen from the results in Table 3, our RM method was the best performer in all Chinese-English tests according to all measures — up to 1.9 BLEU and 6.6 TER over MIRA — even though we only optimized for BLEU. Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM.
    Page 7, “Experiments”
  5. In preliminary experiments with a smaller trigram LM, our RM method consistently yielded the highest scores in all Chinese-English tests — up to 1.6 BLEU and 6.4 TER from MIRA, the second best performer.
    Page 7, “Experiments”
  6. In both Arabic-English feature sets, MIRA seems to take the second place, while RAMPION lags behind, unlike in Chinese-English (§4).
    Page 7, “Additional Experiments”
  7. Spread analysis: For RM, the average spread of the projected data in the Chinese-English small feature set was 0.9±3.6 for all tuning iterations, and 0.7±2.9 for the iteration with the highest decoder performance.
    Page 8, “Discussion”

feature space

Appears in 6 sentences as: feature space (3) feature spaces (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. However, as the dimension of the feature space increases, generalization becomes increasingly difficult.
    Page 1, “Introduction”
  2. This criterion performs well in practice at finding a linear separator in high-dimensional feature spaces (Tsochantaridis et al., 2004; Crammer et al., 2006).
    Page 1, “Introduction”
  3. Chinese-English translation experiments show that our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and high-dimensional (sparse) feature space (§4).
    Page 2, “Introduction”
  4. Online large-margin algorithms, such as MIRA, have also gained prominence in SMT, thanks to their ability to learn models in high-dimensional feature spaces (Watanabe et al., 2007; Chiang et al., 2009).
    Page 2, “Learning in SMT”
  5. We have introduced RM, a novel online margin-based algorithm designed for optimizing high-dimensional feature spaces, which introduces constraints into a large-margin optimizer that bound the spread of the projection of the data while maximizing the margin.
    Page 9, “Conclusions and Future Work”
  6. Experimentation in statistical MT yielded significant improvements over several other state-of-the-art optimizers, especially in a high-dimensional feature space (up to 2 BLEU and 4.3 TER on average).
    Page 9, “Conclusions and Future Work”

SVM

Appears in 5 sentences as: SVM (5)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. We focus on large-margin methods such as SVM (Joachims, 1998) and passive-aggressive algorithms such as MIRA.
    Page 1, “Introduction”
  2. It is maximized by minimizing the norm in SVM, or analogously, the proximity constraint in MIRA: $\arg\min_{\mathbf{w}} \lVert \mathbf{w} - \mathbf{w}_t \rVert^2$.
    Page 3, “The Relative Margin Machine in SMT”
  3. RMM was introduced as a generalization over SVM that incorporates both the margin constraint
    Page 3, “The Relative Margin Machine in SMT”
  4. Nonetheless, since structured RMM is a generalization of Structured SVM, which shares its underlying objective with MIRA, our intuition is that SMT should be able to benefit as well.
    Page 4, “The Relative Margin Machine in SMT”
  5. The dual in Equation (5) can be optimized using a cutting plane algorithm, an effective method for solving a relaxed optimization problem in the dual, used in Structured SVM, MIRA, and RMM (Tsochantaridis et al., 2004; Chiang, 2012; Shivaswamy and Jebara, 2009a).
    Page 5, “The Relative Margin Machine in SMT”

latent variables

Appears in 5 sentences as: latent variable (1) latent variables (4)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. Unfortunately, not all advances in machine learning are easy to apply to structured prediction problems such as SMT; the latter often involve latent variables and surrogate references, resulting in loss functions that have not been well explored in machine learning (Mcallester and Keshet, 2011; Gimpel and Smith, 2012).
    Page 2, “Introduction”
  2. The contributions of this paper include (1) introduction of a loss function for structured RMM in the SMT setting, with surrogate reference translations and latent variables; (2) an online gradient-based solver, RM, with a closed-form parameter update to optimize the relative margin loss; and (3) an efficient implementation that integrates well with the open source cdec SMT system (Dyer et al., 2010). In addition, (4) as our solution is not dependent on any specific QP solver, it can be easily incorporated into practically any gradient-based learning algorithm.
    Page 2, “Introduction”
  3. First, we introduce RMM (§3.1) and propose a latent structured relative margin objective which incorporates cost-augmented hypothesis selection and latent variables.
    Page 2, “Introduction”
  4. While many derivations $d \in \mathcal{D}(x)$ can produce a given translation, we are only able to observe $y$; thus we model $d$ as a latent variable.
    Page 2, “Learning in SMT”
  5. The closed-form online update for our relative margin solution accounts for surrogate references and latent variables .
    Page 9, “Conclusions and Future Work”

optimization problem

Appears in 5 sentences as: optimization problem (5)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. The usual presentation of MIRA’s optimization problem is given as a quadratic program:
    Page 2, “Learning in SMT”
  2. While solving the optimization problem relies on computing the margin between the correct output $y_i$ and $y'$, in SMT our decoder is often incapable of producing the reference translation, i.e.
    Page 3, “Learning in SMT”
  3. In this setting, the optimization problem becomes:
    Page 3, “Learning in SMT”
  4. The online latent structured soft relative margin optimization problem is then (a hedged sketch of a program of this general form appears after this list):
    Page 4, “The Relative Margin Machine in SMT”
  5. The dual in Equation (5) can be optimized using a cutting plane algorithm, an effective method for solving a relaxed optimization problem in the dual, used in Structured SVM, MIRA, and RMM (Tsochantaridis et al., 2004; Chiang, 2012; Shivaswamy and Jebara, 2009a).
    Page 5, “The Relative Margin Machine in SMT”
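
The quadratic programs referred to in items 1 and 4 above are not reproduced in this extract. As a hedged LaTeX sketch only, a program of the general form the quoted sentences describe would combine a MIRA-style proximity and margin term with a bound B on the spread between the hope candidate and the worst-scoring candidate; the exact constraints, slack handling, and notation in the paper may differ:

    \min_{\mathbf{w},\ \xi \ge 0,\ \delta \ge 0}\ \ \frac{1}{2}\lVert \mathbf{w} - \mathbf{w}_t \rVert^2 + C\,\xi + D\,\delta
    \text{s.t.}\quad s(x_i, y^{+}, d^{+}) - s(x_i, y', d') \ \ge\ \Delta_i(y') - \xi \qquad \forall (y', d')
    \hphantom{\text{s.t.}\quad} s(x_i, y^{+}, d^{+}) - s(x_i, y^{w}, d^{w}) \ \le\ B + \delta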

BLEU score

Appears in 4 sentences as: BLEU score (2) BLEU scores (2)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. (footnote 5) In the small feature set, RAMPION yielded similar best BLEU scores, but worse TER.
    Page 7, “Experiments”
  2. On the large feature set, RM is again the best performer, except, perhaps, a tied BLEU score with MIRA on MT08, but with a clear 1.8 TER gain.
    Page 7, “Additional Experiments”
  3. This correlates with our observation that RM’s overall BLEU score is negatively impacted by the BP, as the BLEU precision scores are noticeably higher.
    Page 9, “Discussion”
  4. We also notice that while PRO had the lowest BLEU scores in Chinese, it was competitive in Arabic with the highest number of features.
    Page 9, “Discussion”

machine learning

Appears in 4 sentences as: machine learning (5)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. In every SMT system, and in machine learning in general, the goal of learning is to find a
    Page 1, “Introduction”
  2. Now, recent advances in machine learning have shown that the generalization ability of these learners can be improved by utilizing second order information, as in the Second Order Perceptron (Cesa-Bianchi et al., 2005), Gaussian Margin Machines (Crammer et al., 2009b), confidence-weighted learning (Dredze and Crammer, 2008), AROW (Crammer et al., 2009a; Chiang, 2012) and Relative Margin Machines (RMM) (Shivaswamy and Jebara, 2009b).
    Page 1, “Introduction”
  3. Unfortunately, not all advances in machine learning are easy to apply to structured prediction problems such as SMT; the latter often involve latent variables and surrogate references, resulting in loss functions that have not been well explored in machine learning (Mcallester and Keshet, 2011; Gimpel and Smith, 2012).
    Page 2, “Introduction”
  4. RAMPION aims to address the disconnect between MT and machine learning by optimizing a structured ramp loss with a concave-convex procedure.
    Page 2, “Learning in SMT”

LM

Appears in 4 sentences as: LM (4)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. 4-gram LM 24M 600M —
    Page 6, “The Relative Margin Machine in SMT”
  2. We trained a 4-gram LM on the
    Page 6, “Experiments”
  3. In preliminary experiments with a smaller trigram LM, our RM method consistently yielded the highest scores in all Chinese-English tests — up to 1.6 BLEU and 6.4 TER from MIRA, the second best performer.
    Page 7, “Experiments”
  4. (footnote 6) In our preliminary experiments with the smaller trigram LM, MERT did better on MT05 in the smaller feature set, and MIRA had a small advantage in two cases.
    Page 7, “Discussion”

machine translation

Appears in 3 sentences as: machine translation (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation.
    Page 1, “Abstract”
  2. The desire to incorporate high-dimensional sparse feature representations into statistical machine translation (SMT) models has driven recent research away from Minimum Error Rate Training (MERT) (Och, 2003), and toward other discriminative methods that can optimize more features.
    Page 1, “Introduction”
  3. Finally, although motivated by statistical machine translation, RM is a gradient-based method that can easily be applied to other problems.
    Page 9, “Conclusions and Future Work”

NIST

Appears in 3 sentences as: NIST (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. For training we used the non-UN and non-HK Hansards portions of the NIST training corpora, which was segmented using the Stanford segmenter (Tseng et al., 2005).
    Page 6, “Experiments”
  2. We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus.
    Page 6, “Experiments”
  3. For training, we used the non-UN portion of the NIST training corpora, which was segmented using an HMM segmenter (Lee et al., 2003).
    Page 7, “Additional Experiments”

learning algorithm

Appears in 3 sentences as: learning algorithm (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. The contributions of this paper include (1) introduction of a loss function for structured RMM in the SMT setting, with surrogate reference translations and latent variables; (2) an online gradient-based solver, RM, with a closed-form parameter update to optimize the relative margin loss; and (3) an efficient implementation that integrates well with the open source cdec SMT system (Dyer et al., 2010). In addition, (4) as our solution is not dependent on any specific QP solver, it can be easily incorporated into practically any gradient-based learning algorithm.
    Page 2, “Introduction”
  2. After background discussion on learning in SMT (§2), we introduce a novel online learning algorithm for relative margin maximization suitable for SMT (§3).
    Page 2, “Introduction”
  3. We address the above-mentioned limitations by introducing a novel online learning algorithm for relative margin maximization, RM.
    Page 4, “The Relative Margin Machine in SMT”

statistical machine translation

Appears in 3 sentences as: statistical machine translation (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation.
    Page 1, “Abstract”
  2. The desire to incorporate high-dimensional sparse feature representations into statistical machine translation (SMT) models has driven recent research away from Minimum Error Rate Training (MERT) (Och, 2003), and toward other discriminative methods that can optimize more features.
    Page 1, “Introduction”
  3. Finally, although motivated by statistical machine translation, RM is a gradient-based method that can easily be applied to other problems.
    Page 9, “Conclusions and Future Work”

Iter

Appears in 3 sentences as: Iter (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. 17: for n ← 1 ... MaxIter do   (an illustrative sketch of this loop structure follows this list)
    Page 5, “The Relative Margin Machine in SMT”
  2. 27: for n ← 1 ... MaxIter do
    Page 5, “The Relative Margin Machine in SMT”
  3. 36: for n ← 1 ... MaxIter do
    Page 5, “The Relative Margin Machine in SMT”
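
For context, these three lines are loop headers from the paper's algorithm listing. The Python sketch below shows, purely as an illustration and not as the paper's actual Algorithm 1, where such a MaxIter-bounded inner loop typically sits in a cutting-plane-style online tuner: repeatedly pick the most violated constraint among the current candidates and apply one update until nothing is violated or MaxIter passes are used.

    def most_violated(w, candidates):
        """Return the candidate with the largest margin violation under w.
        Each candidate is (loss, delta_f), where delta_f maps feature name to
        f(hope) - f(candidate). Purely illustrative helper."""
        def violation(cand):
            loss, delta_f = cand
            return loss - sum(w.get(k, 0.0) * v for k, v in delta_f.items())
        return max(candidates, key=violation)

    def tune_sentence(w, candidates, update, max_iter=100, eps=1e-4):
        """Hypothetical cutting-plane-style inner loop ('for n <- 1 ... MaxIter do'):
        pick the most violated constraint, apply one update, repeat."""
        for _ in range(max_iter):
            loss, delta_f = most_violated(w, candidates)
            slack = loss - sum(w.get(k, 0.0) * v for k, v in delta_f.items())
            if slack <= eps:      # all constraints (nearly) satisfied
                break
            w = update(w, loss, delta_f)
        return w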

feature templates

Appears in 3 sentences as: feature templates (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. Table 2: Active sparse feature templates
    Page 7, “Experiments”
  2. These feature templates resulted in a total of 3.4 million possible features, of which only a fraction were active for the respective tuning set and optimizer, as shown in Table 2.
    Page 7, “Experiments”
  3. The sparse feature templates resulted here in a total of 4.9 million possible features, of which again only a fraction were active, as shown in Table 2.
    Page 7, “Additional Experiments”

weight vector

Appears in 3 sentences as: weight vector (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. distance between hypotheses when projected onto the line defined by the weight vector w.
    Page 2, “Introduction”
  2. Given an input sentence in the source language $x \in X$, we want to produce a translation $y \in \mathcal{Y}(x)$ using a linear model parameterized by a weight vector $w$:
    Page 2, “Learning in SMT”
  3. More formally, the spread is the distance between $y^{+}$ and the worst candidate $(y^{w}, d^{w}) \leftarrow \arg\min_{(y,d) \in \mathcal{Y}(x_i), \mathcal{D}(x_i)} s(x_i, y, d)$, after projecting both onto the line defined by the weight vector $w$. For each $y'$, this projection is conveniently given by $s(x_i, y', d)$; thus the spread is calculated as $\delta s(x_i, y^{+}, y^{w})$.
    Page 3, “The Relative Margin Machine in SMT”

word order

Appears in 3 sentences as: word order (3)
In Online Relative Margin Maximization for Statistical Machine Translation
  1. The categories were: function word drop, content word drop, syntactic error (with a reasonable meaning), semantic error (regardless of syntax), word order issues, and function word mistranslation and “hallucination”.
    Page 8, “Discussion”
  2. …noticeably had more word order and excess/wrong function word issues in the basic feature setting than any optimizer.
    Page 9, “Discussion”
  3. However, RM seemed to benefit the most from the sparse features, as its bad word order rate dropped close to MIRA, and its excess/wrong function word rate dropped below that of MIRA with sparse features (MIRA's rate actually doubled from its basic feature set).
    Page 9, “Discussion”
