Semi-Supervised Convex Training for Dependency Parsing
Wang, Qin Iris and Schuurmans, Dale and Lin, Dekang

Article Structure

Abstract

We present a novel semi-supervised training algorithm for learning dependency parsers.

Introduction

Supervised learning algorithms still represent the state of the art approach for inferring dependency parsers from data (McDonald et al., 2005a; McDonald and Pereira, 2006; Wang et al., 2007).

Dependency Parsing Model

Given a sentence X = (x1, ..., xn) (where xi denotes each word in the sentence), we are interested in computing a directed dependency tree, Y, over X.
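
The scoring function itself is not quoted in this summary. As a minimal sketch, assuming the standard arc-factored form used in this line of work (McDonald et al., 2005a), the parser returns the highest-scoring projective tree over X:

    Y^{*} = \arg\max_{Y \in \mathcal{T}(X)} \sum_{(x_i \rightarrow x_j) \in Y} \mathbf{w}^{\top} \mathbf{f}(x_i, x_j)

Here \mathcal{T}(X) is the set of projective dependency trees over X, \mathbf{f}(x_i, x_j) is a feature vector for the arc x_i → x_j, and \mathbf{w} is the learned weight vector; this notation is ours, not necessarily the paper's.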

Supervised Structured Large Margin Training

Supervised structured large margin training approaches have been applied to parsing and produce promising results (Taskar et al., 2004; McDonald et al., 2005a; Wang et al., 2006).
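
The paper's supervised objective (its Equation (5)) is not reproduced in this summary. A minimal sketch of a standard margin-rescaled structured SVM objective of the kind cited above, in our own notation over L labeled sentences, is:

    \min_{\mathbf{w}} \; \frac{\beta}{2}\|\mathbf{w}\|^{2} + \sum_{i=1}^{L} \max_{Y \in \mathcal{T}(X_i)} \Big[ \Delta(Y_i, Y) - \mathbf{w}^{\top}\big(\mathbf{f}(X_i, Y_i) - \mathbf{f}(X_i, Y)\big) \Big]

where \Delta(Y_i, Y) counts the incorrect arcs in a candidate tree Y, \mathbf{f}(X_i, Y) sums the arc feature vectors of Y, and \beta controls regularization. Whether the paper uses exactly this margin-rescaled form is an assumption on our part.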

Semi-supervised Structured Large Margin Objective

The objective of a standard semi-supervised structured SVM is a combination of structured large margin losses on both labeled and unlabeled data.
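
In the same notation, this standard (non-convex) combination can be sketched as:

    \min_{\mathbf{w}} \; \sum_{i=1}^{L} \ell(\mathbf{w}; X_i, Y_i) + \lambda \sum_{j=1}^{U} \min_{Y_j} \ell(\mathbf{w}; X_j, Y_j)

where \ell is the structured large margin loss above, X_1, ..., X_U are the unlabeled sentences, and \lambda weights the unlabeled term (the weighting scheme is our assumption). The inner minimization over the unknown parses Y_j is what produces the non-convex "hat-loss" mentioned elsewhere in this summary.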

Semi-supervised Convex Training for Structured SVM

Although semi-supervised structured SVM learning has been an active research area, semi-supervised structured SVMs have not been used in many real applications to date.
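
The convex objective this section introduces (the paper's Equation (10)) is likewise not reproduced here. As a hedged reconstruction in our own notation, the unlabeled margin loss is replaced by a least squares loss over relaxed adjacency-matrix targets:

    \min_{\mathbf{w},\,\{Y_j\}} \; \sum_{i=1}^{L} \ell(\mathbf{w}; X_i, Y_i) + \lambda \sum_{j=1}^{U} \sum_{s,t} \big( Y_j(s,t) - \mathbf{w}^{\top}\mathbf{f}(x_{js}, x_{jt}) \big)^{2}

subject to each Y_j satisfying (a relaxation of) the dependency tree constraints. The hinge term is convex in \mathbf{w} and each squared term is jointly convex in (\mathbf{w}, Y_j), so the combined objective is convex; the precise form of the least squares term and of the constraint set is our reconstruction, not a quotation of the paper.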

Efficient Optimization Strategy

To solve the convex optimization problem shown in Equation (10), we used a simple stochastic gradient descent procedure.
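
Neither Equation (10) nor the update rules are reproduced in this summary. The sketch below is a heavily simplified, self-contained illustration of the alternating stochastic-gradient scheme described here and in the Step 1 / Step 2 excerpts quoted under "labeled data": the feature map, the greedy head selection used in place of exact projective parsing, the softmax re-estimation of the relaxed targets, and all hyper-parameters are our own illustrative choices, not the authors' implementation.

import numpy as np

def arc_features(sent, head, mod, dim=64):
    """Toy hashed feature vector for the arc head -> mod (illustrative only)."""
    v = np.zeros(dim)
    for f in (f"h={sent[head]}", f"m={sent[mod]}",
              f"hm={sent[head]}_{sent[mod]}", f"dist={head - mod}"):
        v[hash(f) % dim] += 1.0
    return v

def greedy_parse(sent, w):
    """Each word picks its highest-scoring head (stand-in for exact parsing)."""
    n = len(sent)
    return [max((h for h in range(n) if h != m),
                key=lambda h: w @ arc_features(sent, h, m))
            for m in range(1, n)]

def train(labeled, unlabeled, dim=64, epochs=5, eta=0.05, lam=0.1, beta=0.01):
    w = np.zeros(dim)
    # Relaxed targets: for each unlabeled word, a distribution over candidate heads.
    targets = [np.full((len(s) - 1, len(s)), 1.0 / len(s)) for s in unlabeled]
    for _ in range(epochs):
        # Step 1: perceptron-style subgradient steps on the labeled margin loss
        # (the loss-augmented inference of a true hinge loss is omitted here).
        for sent, gold in labeled:
            pred = greedy_parse(sent, w)
            for m, (g, p) in enumerate(zip(gold, pred), start=1):
                if g != p:
                    w += eta * (arc_features(sent, g, m) - arc_features(sent, p, m))
            w -= eta * beta * w
        # Step 2: on each unlabeled sentence, take gradient steps on the
        # least-squares loss, then re-estimate the relaxed head distributions.
        # (Self-arcs are not excluded in this toy sketch.)
        for sent, Y in zip(unlabeled, targets):
            for m in range(1, len(sent)):
                for h in range(len(sent)):
                    s = w @ arc_features(sent, h, m)
                    w -= eta * lam * (s - Y[m - 1, h]) * arc_features(sent, h, m)
                scores = np.array([w @ arc_features(sent, h, m)
                                   for h in range(len(sent))])
                Y[m - 1] = np.exp(scores - scores.max())
                Y[m - 1] /= Y[m - 1].sum()
    return w

# Toy usage: index 0 is an artificial ROOT token; gold[i] is the head of word i+1.
labeled = [(["<ROOT>", "John", "saw", "Mary"], [2, 0, 2])]
unlabeled = [["<ROOT>", "Mary", "slept"]]
w = train(labeled, unlabeled)
print(greedy_parse(["<ROOT>", "John", "saw", "Mary"], w))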

Experimental Results

Given a convex approach to semi-supervised structured large margin training, and an efficient training algorithm for achieving a global optimum, we now investigate its effectiveness for dependency parsing.

Conclusion and Future Work

In this paper, we have presented a novel algorithm for semi-supervised structured large margin training.

Topics

semi-supervised

Appears in 39 sentences as: Semi-supervised (3) semi-supervised (38)
In Semi-Supervised Convex Training for Dependency Parsing
  1. We present a novel semi-supervised training algorithm for learning dependency parsers.
    Page 1, “Abstract”
  2. By combining a supervised large margin loss with an unsupervised least squares loss, a discriminative, convex, semi-supervised learning algorithm can be obtained that is applicable to large-scale problems.
    Page 1, “Abstract”
  3. heavy dependence on annotated corpora—many researchers have investigated semi-supervised learning techniques that can take both labeled and unlabeled training data as input.
    Page 1, “Introduction”
  4. Unfortunately, although significant recent progress has been made in the area of semi-supervised learning, the performance of semi-supervised learning algorithms still falls far short of expectations, particularly in challenging real-world tasks such as natural language parsing or machine translation.
    Page 1, “Introduction”
  5. A large number of distinct approaches to semi-supervised training algorithms have been investigated in the literature (Bennett and Demiriz, 1998; Zhu et al., 2003; Altun et al., 2005; Mann and McCallum, 2007).
    Page 1, “Introduction”
  6. Among the most prominent approaches are self-training, generative models, semi-supervised support vector machines (S3VM), graph-based algorithms and multi-view algorithms (Zhu, 2005).
    Page 1, “Introduction”
  7. Self-training is a commonly used technique for semi-supervised learning that has been applied to several natural language processing tasks (Yarowsky, 1995; Charniak, 1997; Steedman et al., 2003).
    Page 1, “Introduction”
  8. Consequently, most previous work that has attempted semi-supervised or unsupervised approaches to parsing has not produced results beyond the state of the art supervised results (Klein and Manning, 2002; Klein and Manning, 2004).
    Page 2, “Introduction”
  9. As a result, many researchers have put effort into developing algorithms for semi-supervised SVMs (S3VMs) (Bennett and Demiriz, 1998; Altun et al., 2005).
    Page 2, “Introduction”
  10. In particular, they present an algorithm for multi-class unsupervised and semi-supervised SVM learning, which relaxes the original non-convex objective into a close convex approximation, thereby allowing a global solution to be obtained.
    Page 2, “Introduction”
  11. Thus, in this paper, we focus on semi-supervised language learning, where we can make use of both labeled and unlabeled data.
    Page 2, “Introduction”

dependency parsing

Appears in 20 sentences as: dependency parser (1) dependency parsers (5) Dependency Parsing (3) dependency parsing (12)
In Semi-Supervised Convex Training for Dependency Parsing
  1. We present a novel semi-supervised training algorithm for learning dependency parsers.
    Page 1, “Abstract”
  2. To demonstrate the benefits of this approach, we apply the technique to learning dependency parsers from combined labeled and unlabeled corpora.
    Page 1, “Abstract”
  3. Supervised learning algorithms still represent the state of the art approach for inferring dependency parsers from data (McDonald et al., 2005a; McDonald and Pereira, 2006; Wang et al., 2007).
    Page 1, “Introduction”
  4. As we will demonstrate below, this approach admits an efficient training procedure that can find a global minimum, and, perhaps surprisingly, can systematically improve the accuracy of supervised training approaches for learning dependency parsers.
    Page 2, “Introduction”
  5. We apply the resulting semi-supervised convex objective to dependency parsing, and obtain significant improvement over the corresponding supervised structured SVM.
    Page 3, “Introduction”
  6. Finally, we apply this algorithm to dependency parsing and show improved dependency parsing accuracy for both Chinese and English.
    Page 3, “Introduction”
  7. This formulation is sufficiently general to capture most dependency parsing models, including probabilistic dependency models (Eisner, 1996; Wang et al., 2005) as well as non-probabilistic models (McDonald et al., 2005a).
    Page 3, “Dependency Parsing Model”
  8. This approach corresponds to the training problem posed in (McDonald et al., 2005a) and has yielded the best published results for English dependency parsing.
    Page 4, “Supervised Structured Large Margin Training”
  9. This procedure works efficiently on the task of training a dependency parser.
    Page 6, “Efficient Optimization Strategy”
  10. Given a convex approach to semi-supervised structured large margin training, and an efficient training algorithm for achieving a global optimum, we now investigate its effectiveness for dependency parsing.
    Page 6, “Experimental Results”
  11. We applied the resulting algorithm to learn dependency parsers for both English and Chinese.
    Page 6, “Experimental Results”

unlabeled data

Appears in 18 sentences as: unlabeled data (19)
In Semi-Supervised Convex Training for Dependency Parsing
  1. Following the common theme of “more data is better data”, we also use both a limited labeled corpus and a plentiful unlabeled data resource.
    Page 1, “Introduction”
  2. McClosky et al. (2006a) successfully applied self-training to parsing by exploiting available unlabeled data, and obtained remarkable results when the same technique was applied to parser adaptation (McClosky et al., 2006b).
    Page 2, “Introduction”
  3. However, the standard objective of an S3VM is non-convex on the unlabeled data, thus requiring sophisticated global optimization heuristics to obtain reasonable solutions.
    Page 2, “Introduction”
  4. We simply replace the non-convex loss on unlabeled data with an alternative loss that is jointly convex with respect to both the model parameters and (the encoding of) the self-trained prediction targets.
    Page 2, “Introduction”
  5. More specifically, for the loss on the unlabeled data part, we substitute the original unsupervised structured SVM loss with a least squares loss, but keep constraints on the inferred prediction targets, which avoids trivialization.
    Page 2, “Introduction”
  6. This loss function has the advantage that the entire training objective on both the labeled and unlabeled data now becomes convex, since it consists of a convex structured large margin loss on labeled data and a convex least squares loss on unlabeled data.
    Page 2, “Introduction”
  7. Thus, in this paper, we focus on semi-supervised language learning, where we can make use of both labeled and unlabeled data.
    Page 2, “Introduction”
  8. In particular, we investigate a semi-supervised approach for structured large margin training, where the objective is a combination of two convex functions, the structured large margin loss on labeled data and the least squares loss on unlabeled data.
    Page 2, “Introduction”
  9. The objective of a standard semi-supervised structured SVM is a combination of structured large margin losses on both labeled and unlabeled data.
    Page 4, “Semi-supervised Structured Large Margin Objective”
  10. We introduce an efficient approximation—least squares loss—for the structured large margin loss on unlabeled data below.
    Page 4, “Semi-supervised Structured Large Margin Objective”
  11. for structured large margin training, whose objective is a combination of two convex terms: the supervised structured large margin loss on labeled data and the cheap least squares loss on unlabeled data.
    Page 5, “Semi-supervised Convex Training for Structured SVM”

dependency tree

Appears in 9 sentences as: dependency tree (6) Dependency trees (1) dependency trees (2)
In Semi-Supervised Convex Training for Dependency Parsing
  1. Figure 1: A dependency tree
    Page 3, “Introduction”
  2. Given a sentence X = (x1, ..., xn) (where xi denotes each word in the sentence), we are interested in computing a directed dependency tree, Y, over X.
    Page 3, “Dependency Parsing Model”
  3. We assume that a directed dependency tree Y consists of ordered pairs (xi → xj) of words in X such that each word appears in at least one pair and each word has in-degree at most one.
    Page 3, “Dependency Parsing Model”
  4. Dependency trees are assumed to be projective here, which means that if there is an arc (xi → xj), then xi is an ancestor of all the words between xi and xj.
    Page 3, “Dependency Parsing Model”
  5. We assume all the dependency trees are projective in our work (just as some other researchers do), although in the real world, most languages are non-projective.
    Page 3, “Supervised Structured Large Margin Training”
  6. We represent a dependency tree as a k x k adjacency matrix.
    Page 4, “Supervised Structured Large Margin Training”
  7. As mentioned in Section 3, a dependency tree Yj is represented as an adjacency matrix.
    Page 5, “Semi-supervised Convex Training for Structured SVM”
  8. Thus we need to enforce some constraints in the adjacency matrix to make sure that each Yj satisfies the dependency tree constraints.
    Page 5, “Semi-supervised Convex Training for Structured SVM”
  9. For experiment on English, we used the English Penn Treebank (PTB) (Marcus et al., 1993) and the constituency structures were converted to dependency trees using the same rules as (Yamada and Matsumoto, 2003).
    Page 6, “Experimental Results”
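
Items 6 through 8 above note that a dependency tree Yj is encoded as a k × k adjacency matrix subject to tree constraints. The sketch below is our own illustration of that encoding and of the constraints mentioned (a single artificial root, in-degree at most one, acyclicity, projectivity); it is not code from the paper.

import numpy as np

def heads_to_matrix(heads):
    """heads[m-1] is the head index of word m; Y[h, m] = 1 iff h heads m."""
    k = len(heads) + 1
    Y = np.zeros((k, k), dtype=int)
    for m, h in enumerate(heads, start=1):
        Y[h, m] = 1
    return Y

def is_projective_tree(Y):
    k = Y.shape[0]
    if not all(Y[:, m].sum() == 1 for m in range(1, k)):   # in-degree exactly one
        return False
    if Y[:, 0].sum() != 0:                                  # the root has no head
        return False
    # Acyclicity: every word must reach the root by following head links.
    for m in range(1, k):
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = int(np.argmax(Y[:, cur]))
    # Projectivity (sketch): no two arcs may cross.
    arcs = [(min(h, m), max(h, m)) for h in range(k) for m in range(k) if Y[h, m]]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:
                return False
    return True

Y = heads_to_matrix([2, 0, 2])      # "John saw Mary": "saw" heads "John" and "Mary"
print(is_projective_tree(Y))        # True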

labeled data

Appears in 7 sentences as: labeled data (7)
In Semi-Supervised Convex Training for Dependency Parsing
  1. However, a key drawback of supervised training algorithms is their dependence on labeled data, which is usually very difficult to obtain.
    Page 1, “Introduction”
  2. This loss function has the advantage that the entire training objective on both the labeled and unlabeled data now becomes convex, since it consists of a convex structured large margin loss on labeled data and a convex least squares loss on unlabeled data.
    Page 2, “Introduction”
  3. In particular, we investigate a semi-supervised approach for structured large margin training, where the objective is a combination of two convex functions, the structured large margin loss on labeled data and the least squares loss on unlabeled data.
    Page 2, “Introduction”
  4. for structured large margin training, whose objective is a combination of two convex terms: the supervised structured large margin loss on labeled data and the cheap least squares loss on unlabeled data.
    Page 5, “Semi-supervised Convex Training for Structured SVM”
  5. By combining the convex structured SVM loss on labeled data (shown in Equation (5)) and the convex least squares loss on unlabeled data (shown in Equation (8)), we obtain a semi-supervised structured large margin loss
    Page 5, “Semi-supervised Convex Training for Structured SVM”
  6. • Step 2: based on the learned parameter weights from the labeled data, alternately update the weight vector and Yj on each unlabeled sentence:
    Page 6, “Efficient Optimization Strategy”
  7. One obvious direction is to use the whole Penn Treebank as labeled data and use some other unannotated data source as unlabeled data for semi-supervised training.
    Page 8, “Conclusion and Future Work”

SVM

Appears in 7 sentences as: SVM (7)
In Semi-Supervised Convex Training for Dependency Parsing
  1. In particular, they present an algorithm for multi-class unsupervised and semi-supervised SVM learning, which relaxes the original non-convex objective into a close convex approximation, thereby allowing a global solution to be obtained.
    Page 2, “Introduction”
  2. More specifically, for the loss on the unlabeled data part, we substitute the original unsupervised structured SVM loss with a least squares loss, but keep constraints on the inferred prediction targets, which avoids trivialization.
    Page 2, “Introduction”
  3. We apply the resulting semi-supervised convex objective to dependency parsing, and obtain significant improvement over the corresponding supervised structured SVM.
    Page 3, “Introduction”
  4. The objective of a standard semi-supervised structured SVM is a combination of structured large margin losses on both labeled and unlabeled data.
    Page 4, “Semi-supervised Structured Large Margin Objective”
  5. Although semi-supervised structured SVM learning has been an active research area, semi-supervised structured SVMs have not been used in many real applications to date.
    Page 4, “Semi-supervised Convex Training for Structured SVM”
  6. By combining the convex structured SVM loss on labeled data (shown in Equation (5)) and the convex least squares loss on unlabeled data (shown in Equation (8)), we obtain a semi-supervised structured large margin loss
    Page 5, “Semi-supervised Convex Training for Structured SVM”
  7. Unlike previously proposed approaches, we introduce a convex objective for the semi-supervised learning algorithm by combining a convex structured SVM loss and a convex least squares loss.
    Page 8, “Conclusion and Future Work”

learning algorithm

Appears in 6 sentences as: learning algorithm (3) learning algorithms (3)
In Semi-Supervised Convex Training for Dependency Parsing
  1. By combining a supervised large margin loss with an unsupervised least squares loss, a discriminative, convex, semi-supervised learning algorithm can be obtained that is applicable to large-scale problems.
    Page 1, “Abstract”
  2. Supervised learning algorithms still represent the state of the art approach for inferring dependency parsers from data (McDonald et al., 2005a; McDonald and Pereira, 2006; Wang et al., 2007).
    Page 1, “Introduction”
  3. Unfortunately, although significant recent progress has been made in the area of semi-supervised learning, the performance of semi-supervised learning algorithms still falls far short of expectations, particularly in challenging real-world tasks such as natural language parsing or machine translation.
    Page 1, “Introduction”
  4. The basic idea is to bootstrap a supervised learning algorithm by alternating between inferring the missing label information and retraining.
    Page 2, “Introduction”
  5. They also show connections between these algorithms and other related machine learning algorithms.
    Page 2, “Introduction”
  6. Unlike previously proposed approaches, we introduce a convex objective for the semi-supervised learning algorithm by combining a convex structured SVM loss and a convex least squares loss.
    Page 8, “Conclusion and Future Work”

loss function

Appears in 5 sentences as: loss function (4) loss functions (1)
In Semi-Supervised Convex Training for Dependency Parsing
  1. Instead of devising various techniques for coping with non-convex loss functions, we approach the problem from a different perspective.
    Page 2, “Introduction”
  2. Although using a least squares loss function for classification appears misguided, there is a precedent for just this approach in the early pattern recognition literature (Duda et al., 2000).
    Page 2, “Introduction”
  3. This loss function has the advantage that the entire training objective on both the labeled and unlabeled data now becomes convex, since it consists of a convex structured large margin loss on labeled data and a convex least squares loss on unlabeled data.
    Page 2, “Introduction”
  4. The resulting loss function has a hat shape (usually called hat-loss), which is non-convex.
    Page 4, “Semi-supervised Structured Large Margin Objective”
  5. These positive results are somewhat surprising since a very simple loss function was used on
    Page 7, “Experimental Results”

Treebank

Appears in 4 sentences as: Treebank (3) treebank (1)
In Semi-Supervised Convex Training for Dependency Parsing
  1. For experiment on English, we used the English Penn Treebank (PTB) (Marcus et al., 1993) and the constituency structures were converted to dependency trees using the same rules as (Yamada and Matsumoto, 2003).
    Page 6, “Experimental Results”
  2. For Chinese, we experimented on the Penn Chinese Treebank 4.0 (CTB4) (Palmer et al., 2004) and we used the rules in (Bikel, 2004) for conversion.
    Page 6, “Experimental Results”
  3. We evaluate parsing accuracy by comparing the directed dependency links in the parser output against the directed links in the treebank.
    Page 7, “Experimental Results”
  4. One obvious direction is to use the whole Penn Treebank as labeled data and use some other unannotated data source as unlabeled data for semi-supervised training.
    Page 8, “Conclusion and Future Work”
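
The evaluation described in item 3 above (comparing directed dependency links against the treebank) amounts to an unlabeled attachment accuracy. A minimal sketch follows, with our own function name and without any punctuation-handling details, which this summary does not specify.

def dependency_accuracy(gold_heads, pred_heads):
    """Both arguments are lists of per-sentence head sequences."""
    correct = total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total if total else 0.0

print(dependency_accuracy([[2, 0, 2]], [[2, 0, 3]]))  # 2 of 3 links correct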

natural language

Appears in 3 sentences as: natural language (3)
In Semi-Supervised Convex Training for Dependency Parsing
  1. Unfortunately, although significant recent progress has been made in the area of semi-supervised learning, the performance of semi-supervised learning algorithms still falls far short of expectations, particularly in challenging real-world tasks such as natural language parsing or machine translation.
    Page 1, “Introduction”
  2. Self-training is a commonly used technique for semi-supervised learning that has been applied to several natural language processing tasks (Yarowsky, 1995; Charniak, 1997; Steedman et al., 2003).
    Page 2, “Introduction”
  3. Another direction is to apply the semi-supervised algorithm to other natural language problems, such as machine translation, topic segmentation and chunking.
    Page 8, “Conclusion and Future Work”

state of the art

Appears in 3 sentences as: state of the art (3)
In Semi-Supervised Convex Training for Dependency Parsing
  1. Supervised learning algorithms still represent the state of the art approach for inferring dependency parsers from data (McDonald et al., 2005a; McDonald and Pereira, 2006; Wang et al., 2007).
    Page 1, “Introduction”
  2. Consequently, most previous work that has attempted semi-supervised or unsupervised approaches to parsing has not produced results beyond the state of the art supervised results (Klein and Manning, 2002; Klein and Manning, 2004).
    Page 2, “Introduction”
  3. In recent years, SVMs have demonstrated state of the art results in many supervised learning tasks.
    Page 2, “Introduction”
