Learning Ensembles of Structured Prediction Rules

We present a series of algorithms with theoretical guarantees for learning accurate ensembles of several structured prediction rules for which no prior knowledge is assumed.

We study the problem of learning accurate ensembles of structured prediction experts.

As in standard supervised learning problems, we assume that the learner receives a training sample S = ((x1,y1),...,(xm,ym)) E X X y ofm labeled points drawn i.i.d.

In this section, we present an online learning solution to the ensemble structured prediction problem just discussed.

In this section, we devise a boosting-style algorithm for our ensemble structured prediction problem.

We used a number of artificial and real-world data sets for our experiments.

We presented a broad analysis of the problem of ensemble structured prediction, including a series of algorithms with learning guarantees and extensive experiments.

Appears in 10 sentences as: loss function (6) loss functions (4)

In *Learning Ensembles of Structured Prediction Rules*

- We will assume that the loss function considered 1dmits an additive decomposition over the sub-;tructures, as is common in structured prediction.Page 1, “Introduction”
- The quality of the predictions is measured by a loss function L: 3?Page 2, “Learning scenario”
- —> R+ that can be decomposed as a sum of loss functions 6k: M, —> R+ over the substructure sets 32k, that is, for all y = (y1,...,yl) E ywith yl" E 32;, andy’ =Page 2, “Learning scenario”
- We will assume in all that follows that the loss function L is bounded: L(y,y’) g M for all 1This paper is a modified version of (Cortes et al., 2014a)Page 2, “Learning scenario”
- A prototypical example of such loss functions is the normalized Hamming loss LHam, which is the fraction of substructures for which two labels y and y’ disagree, thus in that case €k(yk,y’k) = %ka7gylk and M = 1.Page 3, “Learning scenario”
- These include the algorithm of (Takimoto and Warmuth, 2003) denoted by WMWP, which is an extension of the (randomized) weighted-majority (WM) algorithm of (Littlestone and Warmuth, 1994) to more general bounded loss functions combined with the Weight Pushing (WP) algorithm of (Mohri, 1997); and the Follow the Perturbed Leader (FPL) algorithm of (Kalai and Vempala, 2005).Page 3, “Online learning approach”
- However, since the loss function is additive in the substructures and the updates are multiplicative, it suffices to maintain instead a weight 7.075(6) per transition 6, following the updatePage 4, “Online learning approach”
- Second, the loss function we consider is the normalized Hamming loss over the substructures predictions, which does not match the multi-class losses for the variants of AdaBoost.2 Finally, the natural base hypotheses for this problem admit a structure that can be exploited to devise a more efficient solution, which of course was not part of the original considerations for the design of these variants of AdaBoost.Page 5, “Boosting-style algorithm”
- 2(Schapire and Singer, 1999) also present an algorithm using the Hamming loss for multi-class classification, but that is a Hamming loss over the set of classes and differs from the loss function relevant to our problem.Page 5, “Boosting-style algorithm”
- A natural extension of this work consists of devising new algorithms and providing learning guarantees specific to other loss functions such as the edit-distance.Page 10, “Conclusion”

See all papers in *Proc. ACL 2014* that mention loss function.

See all papers in *Proc. ACL* that mention loss function.

Back to top.

Appears in 8 sentences as: scoring function (7) scoring functions (2)

In *Learning Ensembles of Structured Prediction Rules*

- A collection of distributions 1P can also be used to define a deterministic prediction rule based on the scoring function approach.Page 4, “Online learning approach”
- The majority vote scoring function is defined byPage 4, “Online learning approach”
- The predictor CHEW“ returned by our boosting algorithm is based on a scoring function h: X x y —> R, which, as for standard ensemble algorithms such as AdaBoost,Ai/s a~convex combination of base scoring functions ht: h 2 23:1 atht, with at 2 0.Page 5, “Boosting-style algorithm”
- The base scoring functions used in our algorithm have the formPage 5, “Boosting-style algorithm”
- Thus, the~score assigned to y by the base scoring function ht is the number of positions at which y matches the prediction of path expert ht given input X. CHEW“ is defined as follows in terms of h or hts:Page 5, “Boosting-style algorithm”
- We remark that the analysis and algorithm presented in this section are also applicable with a scoring function that is the product of the scoresPage 5, “Boosting-style algorithm”
- where h is the corresponding scoring function .Page 7, “Boosting-style algorithm”
- Let h denote the scoring function returned by ESPBoost after T Z 1 rounds.Page 7, “Boosting-style algorithm”

See all papers in *Proc. ACL 2014* that mention scoring function.

See all papers in *Proc. ACL* that mention scoring function.

Back to top.

Appears in 3 sentences as: machine learning (3)

In *Learning Ensembles of Structured Prediction Rules*

- Ensemble methods are widely used in machine learning and have been shown to be often very effective (Breiman, 1996; Freund and Schapire, 1997; Smyth and Wolpert, 1999; MacKay, 1991; Freund et al., 2004).Page 1, “Introduction”
- These models may have been derived using other machine learning algorithms or they may be based onPage 1, “Introduction”
- Variants of the ensemble problem just formulated have been studied in the past in the natural language processing and machine learning literature.Page 2, “Introduction”

See all papers in *Proc. ACL 2014* that mention machine learning.

See all papers in *Proc. ACL* that mention machine learning.

Back to top.

Appears in 3 sentences as: probabilistic models (4)

In *Learning Ensembles of Structured Prediction Rules*

- Furthermore, these methods typically assume the use of probabilistic models , which is not a requirement in our learning scenario.Page 2, “Introduction”
- Other ensembles of probabilistic models have also been considered in text and speech processing by forming a product of probabilistic models via the intersection of lattices (Mohri et al., 2008), or a straightforward combination of the posteriors from probabilistic grammars trained using EM with different starting points (Petrov, 2010), or some other rather intricate techniques in speech recognition (Fiscus, 1997).Page 2, “Introduction”
- This can be used for example in the case where the experts are derived from probabilistic models .Page 6, “Boosting-style algorithm”

See all papers in *Proc. ACL 2014* that mention probabilistic models.

See all papers in *Proc. ACL* that mention probabilistic models.

Back to top.

Appears in 3 sentences as: Treebank (2) ~treebank/ (1)

In *Learning Ensembles of Structured Prediction Rules*

- 5.4 Penn Treebank data setPage 9, “Experiments”
- The Penn Treebank 2 data set is available through LDC license at http: / /WWW .Page 9, “Experiments”
- edu/ ~treebank/ and contains 251,854 sentences with a total of 6,080,493 tokens and 45 different parts-of-speech.Page 9, “Experiments”

See all papers in *Proc. ACL 2014* that mention Treebank.

See all papers in *Proc. ACL* that mention Treebank.

Back to top.