Word Representations: A Simple and General Method for Semi-Supervised Learning
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio

Article Structure

Abstract

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features.

Introduction

By using unlabelled data to reduce data sparsity in the labeled training data, semi-supervised approaches improve generalization accuracy.

Distributional representations

Distributional word representations are based upon a cooccurrence matrix F of size W × C, where W is the vocabulary size, each row F_w is the initial representation of word w, and each column F_c is some context.
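
For concreteness, here is a minimal sketch of building such a cooccurrence matrix from a tokenized corpus, using a symmetric word window as the context (documents or syntactic relations are other common choices of column). The window size and data structures are illustrative assumptions, not prescriptions from the paper.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Return F as a nested dict: F[word][context_word] = count.

    Rows play the role of F_w (the initial representation of word w);
    columns play the role of F_c (a context, here a neighboring word).
    """
    F = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    F[w][sent[j]] += 1
    return F
```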

Clustering-based word representations

Another type of word representation is to induce a clustering over words.

Distributed representations

Another approach to word representation is to learn a distributed representation.

Supervised evaluation tasks

We evaluate the hypothesis that one can take an existing, near state-of-the-art, supervised NLP system, and improve its accuracy by including word representations as word features.
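
In practice, this recipe amounts to keeping the baseline system's features and appending extra per-token features derived from the induced word representations. The sketch below shows one plausible way to do that for a token-level tagger; the feature names, Brown-cluster prefix lengths, and case handling are illustrative assumptions rather than the paper's exact configuration.

```python
def word_representation_features(token, brown_clusters=None, embeddings=None,
                                 prefix_lengths=(4, 6, 10, 20)):
    """Return extra features for one token, to be merged into the baseline feature set."""
    feats = {}
    word = token.lower()
    if brown_clusters and word in brown_clusters:
        bits = brown_clusters[word]                     # the cluster's bit-string path, e.g. "1100101110"
        for p in prefix_lengths:
            feats[f"brown_prefix{p}={bits[:p]}"] = 1.0  # binary indicator feature
    if embeddings is not None and word in embeddings:
        for k, value in enumerate(embeddings[word]):
            feats[f"emb_{k}"] = float(value)            # real-valued feature, one per dimension
    return feats
```

These features are simply added to the existing per-token feature set; the rest of the supervised pipeline is left unchanged.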

Unlabeled Data

Unlabeled data is used for inducing the word representations.

Topics

embeddings

Appears in 51 sentences as: Embeddings (2) embeddings (54)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking.
    Page 1, “Abstract”
  2. word embeddings using unsupervised approaches.
    Page 2, “Introduction”
  3. (2009) about Collobert and Weston (2008) embeddings, given training improvements that we describe in Section 7.1.
    Page 2, “Introduction”
  4. Distributed word representations are called word embeddings.
    Page 3, “Distributed representations”
  5. Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008).
    Page 3, “Distributed representations”
  6. 4.1 Collobert and Weston (2008) embeddings
    Page 3, “Distributed representations”
  7. The model concatenates the learned embeddings of the n words, giving e(w1) ⊕ · · · ⊕ e(wn).
    Page 3, “Distributed representations” (see the code sketch after this list)
  8. • We had a separate learning rate for the embeddings and for the neural network weights.
    Page 4, “Distributed representations”
  9. We found that the embeddings should have a learning rate generally 1000–32000 times higher than the neural network weights.
    Page 4, “Distributed representations”
  10. 4.2 HLBL embeddings
    Page 4, “Distributed representations”
  11. Given an n-gram, the model concatenates the embeddings of the first n − 1 words, and learns a linear model to predict the embedding of the last word.
    Page 4, “Distributed representations”
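
Items 7–9 above sketch the Collobert and Weston (2008) training setup: concatenate the embeddings of an n-gram, score the result with a single-hidden-layer network, and train with a ranking loss against a corrupted n-gram, using a much larger learning rate for the embeddings than for the network weights. Below is a minimal NumPy sketch of that setup; the dimensions, initialization, and learning rates are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10_000, 50, 5, 100          # vocab size, embedding dim, n-gram length, hidden units
E  = rng.normal(0.0, 0.01, (V, d))       # embedding lookup table
W1 = rng.normal(0.0, 0.01, (h, n * d))   # hidden-layer weights
b1 = np.zeros(h)
w2 = rng.normal(0.0, 0.01, h)            # output weights (scalar score)
lr_net, lr_emb = 1e-3, 1.0               # embeddings get a much larger learning rate

def score(ngram):
    """s(x): concatenate e(w1) .. e(wn) and pass through one hidden layer."""
    e_x = E[list(ngram)].reshape(-1)
    a = np.tanh(W1 @ e_x + b1)
    return float(w2 @ a), e_x, a

def train_step(ngram):
    """One stochastic update of the margin ranking loss max(0, 1 - s(x) + s(x_corrupted))."""
    global W1, b1, w2
    noise = list(ngram[:-1]) + [int(rng.integers(V))]   # corrupt the last word
    s_pos, e_pos, a_pos = score(ngram)
    s_neg, e_neg, a_neg = score(noise)
    loss = max(0.0, 1.0 - s_pos + s_neg)
    if loss == 0.0:
        return loss
    grads = []                                          # d(loss)/d(s) is -1 (observed), +1 (corrupted)
    for sign, e_x, a, words in ((-1.0, e_pos, a_pos, list(ngram)),
                                (+1.0, e_neg, a_neg, noise)):
        g_a = sign * w2 * (1.0 - a ** 2)                # back through tanh
        grads.append((sign * a, np.outer(g_a, e_x), g_a, W1.T @ g_a, words))
    for g_w2, g_W1, g_b1, g_e, words in grads:
        w2 -= lr_net * g_w2
        W1 -= lr_net * g_W1
        b1 -= lr_net * g_b1
        for i, w in enumerate(words):                   # per-word embedding rows, larger step size
            E[w] -= lr_emb * g_e[i * d:(i + 1) * d]
    return loss
```

Training loops over n-grams drawn from the unlabeled corpus; after training, the rows of E are the word embeddings used as features.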

word representations

Appears in 46 sentences as: Word representation (1) word representation (9) Word Representations (1) word representations (38)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features.
    Page 1, “Abstract”
  2. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines.
    Page 1, “Abstract”
  3. We find further improvements by combining different word representations.
    Page 1, “Abstract”
  4. A word representation is a mathematical object associated with each word, often a vector.
    Page 1, “Introduction”
  5. These limitations of one-hot word representations have prompted researchers to investigate unsupervised methods for inducing word representations over large unlabeled corpora.
    Page 1, “Introduction”
  6. One common approach to inducing unsupervised word representation is to use clustering, perhaps hierarchical.
    Page 1, “Introduction”
  7. Unsupervised word representations have been used in previous NLP work, and have demonstrated improvements in generalization accuracy on a variety of tasks.
    Page 2, “Introduction”
  8. But different word representations have never been systematically compared in a controlled way.
    Page 2, “Introduction”
  9. In this work, we compare different techniques for inducing word representations, evaluating them on the tasks of named entity recognition (NER) and chunking.
    Page 2, “Introduction”
  10. Distributional word representations are based upon a cooccurrence matrix F of size W × C, where W is the vocabulary size, each row F_w is the initial representation of word w, and each column F_c is some context.
    Page 2, “Distributional representations”
  11. Hyperspace Analogue to Language (HAL) is another early distributional approach (Lund et al., 1995; Lund & Burgess, 1996) to inducing word representations.
    Page 2, “Distributional representations”

NER

Appears in 24 sentences as: NER (24)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking.
    Page 1, “Abstract”
  2. In this work, we compare different techniques for inducing word representations, evaluating them on the tasks of named entity recognition (NER) and chunking.
    Page 2, “Introduction”
  3. It is not well-understood what settings are appropriate to induce distributional word representations for structured prediction tasks (like parsing and MT) and sequence labeling tasks (like chunking and NER).
    Page 2, “Distributional representations”
  4. Brown clusters have been used successfully in a variety of NLP applications: NER (Miller et al., 2004; Liang, 2005; Ratinov & Roth, 2009), PCFG parsing (Candito & Crabbe, 2009), dependency parsing (Koo et al., 2008; Suzuki et al., 2009), and semantic dependency parsing (Zhao et al., 2009).
    Page 3, “Clustering-based word representations”
  5. Lin and Wu (2009) find that the representations that are good for NER are poor for search query classification, and vice versa.
    Page 4, “Supervised evaluation tasks”
  6. We apply clustering and distributed representations to NER and chunking, which allows us to compare our semi-supervised models to those of Ando and Zhang (2005) and Suzuki and Isozaki (2008).
    Page 4, “Supervised evaluation tasks”
  7. NER is typically treated as a sequence prediction problem.
    Page 5, “Supervised evaluation tasks”
  8. The standard evaluation benchmark for NER is the CoNLL03 shared task dataset drawn from the Reuters newswire.
    Page 5, “Supervised evaluation tasks”
  9. These postprocessing steps will adversely affect all NER models across the board, nonetheless allowing us to compare different models in a controlled manner.
    Page 6, “Supervised evaluation tasks”
  10. For this reason, NER results that use RCV1 word representations are a form of transductive learning.
    Page 6, “Unlabeled Data”
  11. (b) NER results.
    Page 7, “Unlabeled Data”

word embeddings

Appears in 17 sentences as: Word Embeddings (1) Word embeddings (1) word embeddings (15)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. word embeddings using unsupervised approaches.
    Page 2, “Introduction”
  2. Distributed word representations are called word embeddings.
    Page 3, “Distributed representations”
  3. Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008).
    Page 3, “Distributed representations”
  4. The word embeddings also required a scaling hyperparameter, as described in Section 7.2.
    Page 5, “Supervised evaluation tasks”
  5. For rare words, which are typically updated only 143 times per epoch, and given that our embedding learning rate was typically 1e-6 or 1e-7, this means that rare word embeddings will be concentrated around zero, instead of spread out randomly.
    Page 6, “Unlabeled Data”
  6. 7.2 Scaling of Word Embeddings
    Page 7, “Unlabeled Data”
  7. The word embeddings, however, are real numbers that are not necessarily in a bounded range.
    Page 7, “Unlabeled Data”
  8. If the range of the word embeddings is too large, they will exert more influence than the binary features.
    Page 7, “Unlabeled Data” (see the scaling sketch after this list)
  9. There are capacity controls for the word representations: number of Brown clusters, and number of dimensions of the word embeddings.
    Page 7, “Unlabeled Data”
  10. (2009), we hypothesized solely on the basis of the HLBL NER curve that higher-dimensional word embeddings would give higher accuracy.
    Page 7, “Unlabeled Data”
  11. For NER, the C&W curve is almost flat, and we were surprised to find that even 25-dimensional C&W word embeddings work so well.
    Page 7, “Unlabeled Data”
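
Items 6–8 above motivate the scaling step: because the embeddings are unbounded real values used alongside binary indicator features, they are rescaled by a tuned hyperparameter. A minimal sketch, assuming the target standard deviation is chosen on the development set (the default value below is illustrative, not the paper's reported optimum):

```python
import numpy as np

def scale_embeddings(E: np.ndarray, target_std: float = 0.1) -> np.ndarray:
    """Return a copy of the embedding matrix rescaled so its overall std equals target_std."""
    return E * (target_std / E.std())
```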

semi-supervised

Appears in 11 sentences as: Semi-supervised (1) semi-supervised (10)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. By using unlabelled data to reduce data sparsity in the labeled training data, semi-supervised approaches improve generalization accuracy.
    Page 1, “Introduction”
  2. Semi-supervised models such as Ando and Zhang (2005), Suzuki and Isozaki (2008), and Suzuki et al.
    Page 1, “Introduction”
  3. It can be tricky and time-consuming to adapt an existing supervised NLP system to use these semi-supervised techniques.
    Page 1, “Introduction”
  4. It is preferable to use a simple and general method to adapt existing supervised NLP systems to be semi-supervised.
    Page 1, “Introduction”
  5. This technique for turning a supervised approach into a semi-supervised one is general and task-agnostic.
    Page 4, “Supervised evaluation tasks”
  6. We apply clustering and distributed representations to NER and chunking, which allows us to compare our semi-supervised models to those of Ando and Zhang (2005) and Suzuki and Isozaki (2008).
    Page 4, “Supervised evaluation tasks”
  7. Ando and Zhang (2005) present a semi-supervised learning algorithm called alternating structure optimization (ASO).
    Page 9, “Unlabeled Data”
  8. Suzuki and Isozaki (2008) present a semi-supervised extension of CRFs.
    Page 9, “Unlabeled Data”
  9. (2009), they extend their semi-supervised approach to more general conditional models.)
    Page 9, “Unlabeled Data”
  10. One of the advantages of the semi-supervised learning approach that we use is that it is simpler and more general than that of Ando and Zhang (2005) and Suzuki and Isozaki (2008).
    Page 9, “Unlabeled Data”
  11. The disadvantage, however, is that accuracy might not be as high as a semi-supervised method that includes task-specific information
    Page 9, “Unlabeled Data”

distributed representation

Appears in 8 sentences as: distributed representation (3) distributed representations (2) distributional representation (1) distributional representations (2)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. LSA (Dumais et al., 1988; Landauer et al., 1998), LSI, and LDA (Blei et al., 2003) induce distributional representations over F in which each column is a document context.
    Page 2, “Distributional representations”
  2. However, like all the works cited above, Sahlgren (2006) only uses distributional representation to improve existing systems for one-shot classification tasks, such as IR, WSD, semantic knowledge tests, and text categorization.
    Page 2, “Distributional representations”
  3. Previous research has achieved repeated successes on these tasks using clustering representations (Section 3) and distributed representations (Section 4), so we focus on these representations in our work.
    Page 2, “Distributional representations”
  4. Another approach to word representation is to learn a distributed representation.
    Page 3, “Distributed representations”
  5. (Not to be confused with distributional representations.)
    Page 3, “Distributed representations”
  6. A distributed representation is dense, low-dimensional, and real-valued.
    Page 3, “Distributed representations”
  7. A distributed representation is compact, in the sense that it can represent an exponential number of clusters in the number of dimensions.
    Page 3, “Distributed representations”
  8. We apply clustering and distributed representations to NER and chunking, which allows us to compare our semi-supervised models to those of Ando and Zhang (2005) and Suzuki and Isozaki (2008).
    Page 4, “Supervised evaluation tasks”
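
A quick worked illustration of the compactness claim in item 7 (the sign-pattern view below is just one way to make the point, not the paper's own argument): a d-dimensional real-valued representation already carves the space into exponentially many regions, whereas a hard clustering with k clusters distinguishes only k categories.

```latex
\bigl(\operatorname{sign}(x_1), \ldots, \operatorname{sign}(x_d)\bigr) \in \{-,+\}^{d}
\quad\Longrightarrow\quad
2^{d} \text{ distinct patterns, e.g. } 2^{25} = 33{,}554{,}432 \text{ for } d = 25 .
```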

language models

Appears in 8 sentences as: language model (3) language modeling (1) language models (4)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. Neural language models (Bengio et al., 2001; Schwenk & Gauvain, 2002; Mnih & Hinton, 2007; Collobert & Weston, 2008), on the other hand, induce dense real-valued low-dimensional
    Page 1, “Introduction”
  2. (See Bengio (2008) for a more complete list of references on neural language models.)
    Page 2, “Introduction”
  3. So it is a class-based bigram language model .
    Page 3, “Clustering-based word representations”
  4. Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling.
    Page 3, “Clustering-based word representations”
  5. Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008).
    Page 3, “Distributed representations”
  6. Historically, training and testing of neural language models have been slow, scaling as the size of the vocabulary for each model computation (Bengio et al., 2001; Bengio et al., 2003).
    Page 3, “Distributed representations”
  7. Collobert and Weston (2008) presented a neural language model that could be trained over billions of words, because the gradient of the loss was computed stochastically over a small sample of possible outputs, in a spirit similar to Bengio and Senécal (2003).
    Page 3, “Distributed representations”
  8. These auxiliary tasks are sometimes specific to the supervised task, and sometimes general language modeling tasks like “predict the missing word”.
    Page 9, “Unlabeled Data”

unlabeled data

Appears in 8 sentences as: Unlabeled data (1) unlabeled data (6) unlabelled data (1)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. By using unlabelled data to reduce data sparsity in the labeled training data, semi-supervised approaches improve generalization accuracy.
    Page 1, “Introduction”
  2. Unlabeled data is used for inducing the word representations.
    Page 6, “Unlabeled Data”
  3. (2009), we found that all word representations performed better on the supervised task when they were induced on the clean unlabeled data, both embeddings and Brown clusters.
    Page 6, “Unlabeled Data”
  4. Note that cleaning is applied only to the unlabeled data, not to the labeled data used in the supervised tasks.
    Page 6, “Unlabeled Data”
  5. (Figure 3, x-axis label) Frequency of word in unlabeled data
    Page 8, “Unlabeled Data”
  6. (Figure 3, x-axis label, second panel) Frequency of word in unlabeled data
    Page 8, “Unlabeled Data”
  7. Figure 3: For word tokens that have different frequency in the unlabeled data, what is the total number of per-token errors incurred on the test set?
    Page 8, “Unlabeled Data”
  8. Figure 3 shows the total number of per-token errors incurred on the test set, depending upon the frequency of the word token in the unlabeled data.
    Page 8, “Unlabeled Data”

development set

Appears in 7 sentences as: development set (7)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. training partition sentences, and evaluated their F1 on the development set.
    Page 5, “Supervised evaluation tasks”
  2. After each epoch over the training set, we measured the accuracy of the model on the development set.
    Page 5, “Supervised evaluation tasks”
  3. Training was stopped after the accuracy on the development set did not improve for 10 epochs, generally about 50–80 epochs total.
    Page 5, “Supervised evaluation tasks” (see the early-stopping sketch after this list)
  4. The epoch that performed best on the development set was chosen as the final model.
    Page 5, “Supervised evaluation tasks”
  5. Unlike in our chunking experiments, after we chose the best model on the development set, we used that model on the test set too.
    Page 5, “Supervised evaluation tasks”
  6. (In chunking, after finding the best hyperparameters on the development set, we would combine the dev and training sets, train a model over this combined set, and then evaluate on test.)
    Page 5, “Supervised evaluation tasks”
  7. The training set contains 204K words (14K sentences, 946 documents), the test set contains 46K words (3.5K sentences, 231 documents), and the development set contains 51K words (3.3K sentences, 216 documents).
    Page 5, “Supervised evaluation tasks”
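
Items 2–4 above describe a standard patience-based early-stopping rule. A small sketch of that rule, with the training and dev-evaluation routines left as placeholder callables (these names are assumptions, not the authors' code):

```python
def train_with_early_stopping(train_one_epoch, eval_dev, patience=10, max_epochs=1000):
    """Stop once dev accuracy has not improved for `patience` consecutive epochs."""
    best_score, best_model, since_best = float("-inf"), None, 0
    for _ in range(max_epochs):
        model = train_one_epoch()           # one pass over the training set
        score = eval_dev(model)             # accuracy / F1 on the development set
        if score > best_score:
            best_score, best_model, since_best = score, model, 0
        else:
            since_best += 1
        if since_best >= patience:          # "did not improve for 10 epochs"
            break
    return best_model                       # the epoch that performed best on dev
```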

CRF

Appears in 6 sentences as: CRF (5) CRF++ (1)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. However, the CRF chunker in Huang and Yates (2009), which uses their HMM word clusters as extra features, achieves F1 lower than
    Page 3, “Clustering-based word representations”
  2. a baseline CRF chunker (Sha & Pereira, 2003).
    Page 3, “Clustering-based word representations”
  3. The linear CRF chunker of Sha and Pereira (2003) is a standard near-state-of-the-art baseline chunker.
    Page 4, “Supervised evaluation tasks”
  4. In fact, many off-the-shelf CRF implementations now replicate Sha and Pereira (2003), including their choice of feature set:
    Page 4, “Supervised evaluation tasks”
  5. • CRF++ by Taku Kudo (http://crfpp.
    Page 4, “Supervised evaluation tasks”
  6. Table 1: Feature templates used in the CRF chunker.
    Page 5, “Supervised evaluation tasks”

hyperparameter

Appears in 5 sentences as: hyperparameter (3) hyperparameters (3)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. After choosing hyperparameters to maximize the dev F1, we would retrain the model using these hyperparameters on the full 8936-sentence training set, and evaluate on test.
    Page 5, “Supervised evaluation tasks”
  2. One hyperparameter was l2-regularization sigma, which for most models was optimal at 2 or 3.2.
    Page 5, “Supervised evaluation tasks”
  3. The word embeddings also required a scaling hyperparameter, as described in Section 7.2.
    Page 5, “Supervised evaluation tasks”
  4. (In chunking, after finding the best hyperparameters on the development set, we would combine the dev and training sets, train a model over this combined set, and then evaluate on test.)
    Page 5, “Supervised evaluation tasks”
  5. We can scale the embeddings by a hyperparameter, to control their standard deviation.
    Page 7, “Unlabeled Data”

n-gram

Appears in 5 sentences as: n-gram (5)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. For each training update, we read an n-gram x = (w1, …, wn).
    Page 3, “Distributed representations”
  2. We also create a corrupted or noise n-gram x̃.
    Page 3, “Distributed representations”
  3. • We corrupt the last word of each n-gram.
    Page 4, “Distributed representations”
  4. Given an n-gram, the model concatenates the embeddings of the first n − 1 words, and learns a linear model to predict the embedding of the last word.
    Page 4, “Distributed representations” (see the sketch after this list)
  5. n-gram is corrupted.
    Page 4, “Distributed representations”
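
Item 4 above summarizes the HLBL prediction step. A minimal sketch of just that step: concatenate the embeddings of the first n − 1 words, predict the embedding of the last word with a linear map, and score candidate words log-bilinearly. The full HLBL model also uses a hierarchy over the vocabulary, which is omitted here; names and shapes are assumptions for illustration.

```python
import numpy as np

def predict_last_embedding(E: np.ndarray, C: np.ndarray, ngram) -> np.ndarray:
    """E: (V, d) embedding table; C: (d, (n-1)*d) linear predictor over the concatenated context."""
    context = E[list(ngram[:-1])].reshape(-1)   # e(w1) ++ ... ++ e(w_{n-1})
    return C @ context                          # predicted embedding of the last word

def score_candidates(E: np.ndarray, b: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """One log-bilinear score per vocabulary word: s_w = r_hat . e(w) + b_w."""
    return E @ r_hat + b
```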

neural network

Appears in 5 sentences as: neural network (4) neural networks (1)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008).
    Page 3, “Distributed representations”
  2. We predict a score s(x) for x by passing e(x) through a single hidden layer neural network.
    Page 4, “Distributed representations”
  3. We minimize this loss stochastically over the n-grams in the corpus, doing gradient descent simultaneously over the neural network parameters and the embedding lookup table.
    Page 4, “Distributed representations”
  4. • We had a separate learning rate for the embeddings and for the neural network weights.
    Page 4, “Distributed representations”
  5. We found that the embeddings should have a learning rate generally 1000–32000 times higher than the neural network weights.
    Page 4, “Distributed representations”

bigram

Appears in 4 sentences as: bigram (3) bigrams (1)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992).
    Page 3, “Clustering-based word representations”
  2. So it is a class-based bigram language model.
    Page 3, “Clustering-based word representations”
  3. One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context.
    Page 3, “Clustering-based word representations”
  4. (1998) presents algorithms for inducing hierarchical clusterings based upon word bigram and trigram statistics.
    Page 3, “Clustering-based word representations”
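
As a worked complement to items 1–2 above, Brown clustering can be written as a class-based bigram language model; the equations below restate the standard formulation from Brown et al. (1992), not anything specific to this paper.

```latex
% Each word type w is assigned to exactly one class C(w):
P(w_i \mid w_{i-1}) \;=\; P\bigl(C(w_i) \mid C(w_{i-1})\bigr)\, P\bigl(w_i \mid C(w_i)\bigr)

% Maximizing corpus likelihood under this model is equivalent, up to a constant
% unigram-entropy term, to maximizing the average mutual information of adjacent classes:
I(C) \;=\; \sum_{c,\,c'} P(c, c') \,\log \frac{P(c, c')}{P(c)\, P(c')}
```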

POS tagging

Appears in 3 sentences as: POS tagging (2) POS tags (1)
In Word Representations: A Simple and General Method for Semi-Supervised Learning
  1. Ushioda (1996) presents an extension to the Brown clustering algorithm, and learns hierarchical clusterings of words as well as phrases, which they apply to POS tagging.
    Page 3, “Clustering-based word representations”
  2. Li and McCallum (2005) use an HMM-LDA model to improve POS tagging and Chinese Word Segmentation.
    Page 3, “Clustering-based word representations”
  3. (2009) use an HMM to assign POS tags to words, which in turn improves the accuracy of the PCFG-based Hebrew parser.
    Page 3, “Clustering-based word representations”
