Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
Huang, Fei and Yates, Alexander

Article Structure

Abstract

Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks.

Introduction

Data sparsity and high dimensionality are the twin curses of statistical natural language processing (NLP).

Smoothing Natural Language Sequences

To smooth a dataset is to find an approximation of it that retains the important patterns of the original data while hiding the noise or other complicating factors.

Experiments

We tested the following hypotheses in our experiments:

Related Work

To our knowledge, only one previous system —the REALM system for sparse information extraction— ...

Conclusion and Future Work

Our study of smoothing techniques demonstrates that by aggregating information across many unannotated examples, it is possible to find accurate distributional representations that can provide highly informative features to supervised sequence labelers.

Topics

distributional representations

Appears in 18 sentences as: distributional representation (4) distributional representations (14)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. We demonstrate that distributional representations of word types, trained on unannotated text, can be used to improve performance on rare words.
    Page 1, “Abstract”
  2. We investigate the use of distributional representations , which model the probability distribution of a word’s context, as techniques for finding smoothed representations of word sequences.
    Page 1, “Introduction”
  3. That is, we use the distributional representations to share information across unannotated examples of the same word type.
    Page 1, “Introduction”
  4. We then compute features of the distributional representations , and provide them as input to our supervised sequence labelers.
    Page 1, “Introduction”
  5. We provide empirical evidence that shows how distributional representations improve sequence-labeling in the face of data sparsity.
    Page 1, “Introduction”
  6. Importantly, we seek distributional representations that will provide features that are common in both training and test data, to avoid data sparsity.
    Page 2, “Smoothing Natural Language Sequences”
  7. In the next three sections, we develop three techniques for smoothing text using distributional representations .
    Page 2, “Smoothing Natural Language Sequences”
  8. This gives greater weight to words with more idiosyncratic distributions and may improve the informativeness of a distributional representation .
    Page 2, “Smoothing Natural Language Sequences”
  9. To supply a sequence-labeling algorithm with information from these distributional representations , we compute real-valued features of the context distributions.
    Page 3, “Smoothing Natural Language Sequences”
  10. Any patterns learned for the more common “red lamp” will then also apply to the less common “magenta tablecloth.” Our second distributional representation aggregates information from multiple context words by grouping together the distributions P(X_{i-1} = v | X_i = w) and P(X_{i-1} = v' | X_i = w) if v and v' appear together with many of the same words w.
    Page 3, “Smoothing Natural Language Sequences”
  11. Latent variable language models (LVLMs) can be used to produce just such a distributional representation .
    Page 3, “Smoothing Natural Language Sequences”
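
The context-distribution idea quoted in items 2, 3, and 9 above can be made concrete with a short sketch. This is an illustrative reconstruction, not the authors' code: it counts left and right neighbors for each word type over unannotated sentences and normalizes the counts into P(X_{i-1} = v | X_i = w) and P(X_{i+1} = v | X_i = w).

```python
from collections import Counter, defaultdict

def context_distributions(sentences):
    """Estimate P(X_{i-1} = v | X_i = w) and P(X_{i+1} = v | X_i = w)
    for every word type w from unannotated sentences (lists of tokens)."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            w = padded[i]
            left[w][padded[i - 1]] += 1
            right[w][padded[i + 1]] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {v: c / total for v, c in counts.items()}

    return ({w: normalize(c) for w, c in left.items()},
            {w: normalize(c) for w, c in right.items()})

# Each distribution becomes a set of real-valued features, one per context word v.
left_ctx, right_ctx = context_distributions(
    ["Researchers test reformulated gasolines on newer engines .".split()])
print(left_ctx["reformulated"])   # {'test': 1.0}
```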

POS tagging

Appears in 16 sentences as: POS tagger (1) POS tagging (9) POS tags (6)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. effects of our smoothing techniques on two sequence-labeling tasks, POS tagging and chunking, to answer the following: I.
    Page 1, “Introduction”
  2. Our best smoothing technique improves a POS tagger by 11% on OOV words, and a chunker by an impressive 21% on OOV words.
    Page 1, “Introduction”
  3. We investigate the use of smoothing in two test systems, conditional random field (CRF) models for POS tagging and chunking.
    Page 4, “Experiments”
  4. Our baseline CRF system for POS tagging follows the model described by Lafferty et al.
    Page 4, “Experiments”
  5. In addition to the transition, word-level, and orthographic features, we include features relating automatically-generated POS tags and the chunk labels.
    Page 4, “Experiments”
  6. For the tagging experiments, we train and test using the gold standard POS tags contained in the Penn Treebank.
    Page 4, “Experiments”
  7. For the chunking experiments, we train and test with POS tags that are automatically generated by a standard tagger (Brill, 1994).
    Page 4, “Experiments”
  8. We tested the accuracy of our models for chunking and POS tagging on section 20 of the Penn Treebank, which corresponds to the test set from the CoNLL 2000 task.
    Page 4, “Experiments”
  9. For our POS tagging experiments, we measured the accuracy of the tagger on “rare” words, or words that appear at most twice in the training data.
    Page 5, “Experiments”
  10. While several systems have achieved slightly higher accuracy on supervised POS tagging , they are usually trained on larger training sets.
    Page 5, “Experiments”
  11. For our experiment on domain adaptation, we focus on NP chunking and POS tagging , and we use the labeled training data from the CoNLL 2000 shared task as before.
    Page 6, “Experiments”
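
Item 9 above evaluates tagging accuracy on “rare” words, defined as word types appearing at most twice in the labeled training data (OOV words appear zero times). A hedged sketch of that evaluation; the function name and arguments are illustrative:

```python
from collections import Counter

def rare_word_accuracy(train_tokens, test_tokens, gold_tags, pred_tags, max_freq=2):
    """Tagging accuracy restricted to test tokens whose word type occurs at
    most `max_freq` times in the training data (max_freq=0 gives OOV-only)."""
    freq = Counter(train_tokens)
    kept = [(g, p) for w, g, p in zip(test_tokens, gold_tags, pred_tags)
            if freq[w] <= max_freq]
    return sum(g == p for g, p in kept) / len(kept) if kept else float("nan")
```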

CRF

Appears in 11 sentences as: CRF (13)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. We investigate the use of smoothing in two test systems, conditional random field ( CRF ) models for POS tagging and chunking.
    Page 4, “Experiments”
  2. Finally, we train the CRF model on the annotated training set and apply it to the test set.
    Page 4, “Experiments”
  3. We use an open source CRF software package designed by Sunita Sarawagi and William W. Cohen to implement our CRF models. We use a set of boolean features listed in Table 1.
    Page 4, “Experiments”
  4. Our baseline CRF system for POS tagging follows the model described by Lafferty et al.
    Page 4, “Experiments”
  5. CRF Feature Set
    Page 4, “Experiments”
  6. Table 1: Features used in our CRF systems.
    Page 4, “Experiments”
  7. We found that including such features does improve chunking F1 by approximately 2%, but it also significantly slows down CRF training.
    Page 4, “Experiments”
  8. Table 3: Chunking F1: our HMM-smoothed chunker outperforms the baseline CRF chunker by 0.21 on chunks that begin with OOV words, and 0.10 on chunks that begin with rare words.
    Page 5, “Experiments”
  9. Table 4: On biochemistry journal data from the OANC, our HMM-smoothed NP chunker outperforms the baseline CRF chunker by 0.12 (F1) on chunks that begin with OOV words, and by 0.05 (F1) on all chunks.
    Page 6, “Experiments”
  10. Our complete system consists of two learned components, a supervised CRF system and an unsupervised smoothing model.
    Page 7, “Experiments”
  11. To measure the sample complexity of the supervised CRF, we use the same experimental setup as in the chunking experiment on WSJ text, but we vary the amount of labeled data available to the CRF .
    Page 7, “Experiments”
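
The CRF systems above consume boolean transition, word-level, and orthographic features (the paper's Table 1, not reproduced here). The sketch below shows the general shape of such a feature function in the per-token dictionary format that common CRF toolkits accept; the specific templates are illustrative, not the paper's exact list:

```python
def token_features(sent, i):
    """Illustrative boolean features for token i of a sentence (list of words)."""
    w = sent[i]
    return {
        "word=" + w.lower(): 1,
        "is_capitalized": int(w[0].isupper()),
        "has_digit": int(any(ch.isdigit() for ch in w)),
        "has_hyphen": int("-" in w),
        "suffix3=" + w[-3:].lower(): 1,
        "prev_word=" + (sent[i - 1].lower() if i > 0 else "<s>"): 1,
        "next_word=" + (sent[i + 1].lower() if i + 1 < len(sent) else "</s>"): 1,
    }
```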

sequence labeler

Appears in 11 sentences as: sequence labeler (5) sequence labelers (4) sequence labeling (3)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Furthermore, our system improves significantly over a baseline system when applied to text from a different domain, and it reduces the sample complexity of sequence labeling .
    Page 1, “Abstract”
  2. We then compute features of the distributional representations, and provide them as input to our supervised sequence labelers .
    Page 1, “Introduction”
  3. In particular, for every word x_i in a sequence, we provide the sequence labeler with a set of features of the left and right contexts indexed by v ∈ V: F_v^left(x_i) = P(X_{i-1} = v | X_i = x_i) and F_v^right(x_i) = P(X_{i+1} = v | X_i = x_i). For example, the left context for “reformulated” in our example above would contain a nonzero probability for the word “of.” Using these features, a sequence labeler can learn patterns such as: if x_i has a high probability of following “of,” it is a good candidate for the start of a noun phrase.
    Page 3, “Smoothing Natural Language Sequences”
  4. After experimenting with different choices for the number of dimensions to reduce our vectors to, we choose a value of 10 dimensions as the one that maximizes the performance of our supervised sequence labelers on held-out data.
    Page 3, “Smoothing Natural Language Sequences”
  5. The output of this process is an integer (ranging from 1 to S) for every word x_i in the corpus; we include a new boolean feature for each possible value of y_i in our sequence labelers.
    Page 3, “Smoothing Natural Language Sequences”
  6. Smoothing can improve the performance of a supervised sequence labeling system on words that are rare or nonexistent in the training data.
    Page 4, “Experiments”
  7. A supervised sequence labeler achieves greater accuracy on new domains with smoothing.
    Page 4, “Experiments”
  8. A supervised sequence labeler has a better sample complexity with smoothing.
    Page 4, “Experiments”
  9. Following the CoNLL shared task from 2000, we use sections 15-18 of the Penn Treebank for our labeled training data for the supervised sequence labeler in all experiments (Tjong et al., 2000).
    Page 4, “Experiments”
  10. Our study of smoothing techniques demonstrates that by aggregating information across many unannotated examples, it is possible to find accurate distributional representations that can provide highly informative features to supervised sequence labelers .
    Page 8, “Conclusion and Future Work”
  11. These features help improve sequence labeling performance on rare word types, on domains that differ from the training set, and on smaller training sets.
    Page 8, “Conclusion and Future Work”
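
Items 3 to 5 above describe how the smoothed representations feed the labeler: real-valued left/right context probabilities, optionally reduced to 10 dimensions, plus one boolean feature per decoded HMM state. The sketch below assumes a truncated SVD for the dimensionality reduction (the excerpts above do not name the method, so treat that choice as an assumption), and all names are illustrative:

```python
import numpy as np

def reduce_context_vectors(dist_by_word, vocab, k=10):
    """Project each word's context distribution onto k dimensions via a
    truncated SVD (assumed here; the paper settles on k=10 on held-out data)."""
    words = list(dist_by_word)
    M = np.array([[dist_by_word[w].get(v, 0.0) for v in vocab] for w in words])
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return {w: U[i, :k] * S[:k] for i, w in enumerate(words)}

def smoothed_token_features(word, reduced, hmm_state, k=10):
    """Distributional features for one token: k real-valued context dimensions
    plus a boolean indicator for its decoded HMM latent state."""
    vec = reduced.get(word, np.zeros(k))
    feats = {f"ctx_dim_{j}": float(x) for j, x in enumerate(vec)}
    feats[f"hmm_state={hmm_state}"] = 1
    return feats
```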

Penn Treebank

Appears in 9 sentences as: Penn Treebank (10)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. For these experiments, we use the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993).
    Page 4, “Experiments”
  2. Following the CoNLL shared task from 2000, we use sections 15-18 of the Penn Treebank for our labeled training data for the supervised sequence labeler in all experiments (Tjong et al., 2000).
    Page 4, “Experiments”
  3. For the tagging experiments, we train and test using the gold standard POS tags contained in the Penn Treebank .
    Page 4, “Experiments”
  4. We tested the accuracy of our models for chunking and POS tagging on section 20 of the Penn Treebank , which corresponds to the test set from the CoNLL 2000 task.
    Page 4, “Experiments”
  5. Our distributional representations are trained on sections 2-22 of the Penn Treebank .
    Page 4, “Experiments”
  6. We used sections 15-18 of the Penn Treebank as our labeled training set, including the gold standard POS tags.
    Page 6, “Experiments”
  7. We use our best-performing smoothing model, the HMM, and train it on sections 13 through 19 of the Penn Treebank , plus the written portion of the OANC that contains journal articles from biochemistry (40,727 sentences).
    Page 6, “Experiments”
  8. : 40,000 manually tagged sentences from the Penn Treebank for our labeled training data, and all of the unlabeled text from the Penn Treebank plus their MEDLINE corpus of 71,306 sentences to train our HMM.
    Page 6, “Experiments”
  9. At minimum, we use the text available in the labeled training and test sets, and then add random subsets of the Penn Treebank , sections 2-22.
    Page 7, “Experiments”
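
For readers reconstructing the experimental setup, the Treebank usage quoted above can be collected into one place. This is only a summary of the excerpts, expressed as a mapping for convenience:

```python
# Penn Treebank (WSJ) usage, as described in the excerpts above.
PTB_USAGE = {
    "labeled_train":         "sections 15-18 (CoNLL-2000 training set, gold POS tags)",
    "test":                  "section 20 (CoNLL-2000 test set)",
    "unlabeled_smoothing":   "sections 2-22 (training the distributional representations)",
    "domain_adaptation_hmm": "sections 13-19 plus OANC biochemistry (40,727 sentences)",
}
```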

data sparsity

Appears in 8 sentences as: Data sparsity (1) data sparsity (7)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks.
    Page 1, “Abstract”
  2. Data sparsity and high dimensionality are the twin curses of statistical natural language processing (NLP).
    Page 1, “Introduction”
  3. The negative effects of data sparsity have been well-documented in the NLP literature.
    Page 1, “Introduction”
  4. Our technique is particularly well-suited to handling data sparsity because it is possible to improve performance on rare words by supplementing the training data with additional unannotated text containing more examples of the rare words.
    Page 1, “Introduction”
  5. We provide empirical evidence that shows how distributional representations improve sequence-labeling in the face of data sparsity .
    Page 1, “Introduction”
  6. For supervised sequence-labeling problems in NLP, the most important “complicating factor” that we seek to avoid through smoothing is the data sparsity associated with word-based representations.
    Page 2, “Smoothing Natural Language Sequences”
  7. Importantly, we seek distributional representations that will provide features that are common in both training and test data, to avoid data sparsity .
    Page 2, “Smoothing Natural Language Sequences”
  8. Sophisticated smoothing techniques like modified Kneser-Ney and Katz smoothing (Chen and Goodman, 1996) smooth together the predictions of unigram, bi-gram, trigram, and potentially higher n-gram sequences to obtain accurate probability estimates in the face of data sparsity .
    Page 8, “Related Work”
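
The sparsity the authors describe is easy to quantify: measure how much of the test data consists of word types that the labeled training set never (or rarely) contains. A small sketch with illustrative names:

```python
from collections import Counter

def sparsity_report(train_tokens, test_tokens):
    """Token-level OOV rate and rare-type rate of a test set w.r.t. training data."""
    freq = Counter(train_tokens)
    oov = sum(1 for w in test_tokens if freq[w] == 0) / len(test_tokens)
    rare = sum(1 for w in test_tokens if freq[w] <= 2) / len(test_tokens)
    return {"oov_token_rate": oov, "rare_token_rate": rare}
```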

baseline system

Appears in 7 sentences as: baseline system (6) baseline system’s (1)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Furthermore, our system improves significantly over a baseline system when applied to text from a different domain, and it reduces the sample complexity of sequence labeling.
    Page 1, “Abstract”
  2. As expected, the drop-off in the baseline system’s performance from all words to rare words is impressive for both tasks.
    Page 5, “Experiments”
  3. in F1 over the baseline system on all words, it in fact outperforms our baseline NP chunker on the WSJ data.
    Page 7, “Experiments”
  4. This chunker achieves 0.91 F1 on OANC data, and 0.93 F1 on WSJ data, outperforming the baseline system in both cases.
    Page 7, “Experiments”
  5. On rare chunks, the smoothed system reaches 0.78 F1 using only 87 labeled training sentences, a level that the baseline system never reaches, even with 6933 labeled training sentences.
    Page 7, “Experiments”
  6. With 434 labeled sentences, the smoothed system reaches 0.88 F1, which the baseline system does not reach until it has 5200 labeled samples.
    Page 7, “Experiments”
  7. However, the smoothed system requires 25,000 more sentences before it outperforms the baseline system on all chunks.
    Page 7, “Experiments”

latent variable

Appears in 7 sentences as: Latent Variable (1) Latent variable (1) latent variable (3) latent variables (2)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. 2.3 Latent Variable Language Model Representation
    Page 3, “Smoothing Natural Language Sequences”
  2. Latent variable language models (LVLMs) can be used to produce just such a distributional representation.
    Page 3, “Smoothing Natural Language Sequences”
  3. We use Hidden Markov Models (HMMs) as the main example in the discussion and as the LVLMs in our experiments, but the smoothing technique can be generalized to other forms of LVLMs, such as factorial HMMs and latent variable maximum entropy models (Ghahramani and Jordan, 1997; Smith and Eisner, 2005).
    Page 3, “Smoothing Natural Language Sequences”
  4. An HMM is a generative probabilistic model that generates each word x_i in the corpus conditioned on a latent variable Y_i.
    Page 3, “Smoothing Natural Language Sequences”
  5. Each Y_i in the model takes on integral values from 1 to S, and each one is generated by the latent variable for the preceding word, Y_{i-1}.
    Page 3, “Smoothing Natural Language Sequences”
  6. Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models.
    Page 8, “Related Work”
  7. Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables , to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005).
    Page 8, “Related Work”
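
Items 4 and 5 above give the HMM's generative story: each word x_i is emitted from a latent state Y_i in {1, ..., S}, and Y_i is generated from Y_{i-1}. Once such a model is trained, a Viterbi pass recovers the most likely state for each word, which is what the “sequence labeler” excerpt turns into a boolean feature. A minimal numpy Viterbi sketch, assuming start probabilities `pi[s]`, transitions `A[s, s']`, and emissions `B[s, w]` over integer-coded words (for example, as estimated by the EM sketch under “unlabeled data” below):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely latent-state sequence for integer observations `obs`,
    given start probs pi[s], transitions A[s, s'], and emissions B[s, w]."""
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))                      # best log-score ending in state s
    back = np.zeros((T, S), dtype=int)            # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)    # scores[s_prev, s]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                             # one state index per word
```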

domain adaptation

Appears in 6 sentences as: Domain Adaptation (1) domain adaptation (5)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. 3.3 Domain Adaptation
    Page 6, “Experiments”
  2. For our experiment on domain adaptation , we focus on NP chunking and POS tagging, and we use the labeled training data from the CoNLL 2000 shared task as before.
    Page 6, “Experiments”
  3. (2006): the semi-supervised Alternating Structural Optimization (ASO) technique and the Structural Correspondence Learning (SCL) technique for domain adaptation .
    Page 6, “Experiments”
  4. One of the benefits of our smoothing technique is that it allows for domain adaptation , a topic that has received a great deal of attention from the NLP community recently.
    Page 8, “Related Work”
  5. HMM-smoothing improves on the most closely related work, the Structural Correspondence Learning technique for domain adaptation (Blitzer et al., 2006), in experiments.
    Page 8, “Related Work”
  6. One particularly promising area for further study is the combination of smoothing and instance weighting techniques for domain adaptation .
    Page 8, “Conclusion and Future Work”

labeled data

Appears in 6 sentences as: labeled data (6)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. In chunking, there is a clear trend toward larger increases in performance as words become rarer in the labeled data set, from a 0.02 improvement on words of frequency 2, to an improvement of 0.21 on OOV words.
    Page 5, “Experiments”
  2. To measure the sample complexity of the supervised CRF, we use the same experimental setup as in the chunking experiment on WSJ text, but we vary the amount of labeled data available to the CRF.
    Page 7, “Experiments”
  3. Thus smoothing is optimizing performance for the case where unlabeled data is plentiful and labeled data is scarce, as we would hope.
    Page 7, “Experiments”
  4. Several researchers have previously studied methods for using unlabeled data for tagging and chunking, either alone or as a supplement to labeled data .
    Page 8, “Related Work”
  5. Our technique lets the HMM find parameters that maximize cross-entropy, and then uses labeled data to learn the best mapping from the HMM categories to the POS categories.
    Page 8, “Related Work”
  6. Our technique uses unlabeled training data from the target domain, and is thus applicable more generally, including in web processing, where the domain and vocabulary is highly variable, and it is extremely difficult to obtain labeled data that is representative of the test distribution.
    Page 8, “Related Work”

CoNLL

Appears in 5 sentences as: CoNLL (5)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Following the CoNLL shared task from 2000, we use sections 15-18 of the Penn Treebank for our labeled training data for the supervised sequence labeler in all experiments (Tjong et al., 2000).
    Page 4, “Experiments”
  2. We tested the accuracy of our models for chunking and POS tagging on section 20 of the Penn Treebank, which corresponds to the test set from the CoNLL 2000 task.
    Page 4, “Experiments”
  3. The chunker’s accuracy is roughly in the middle of the range of results for the original CoNLL 2000 shared task (Tjong et al., 2000) .
    Page 5, “Experiments”
  4. For our experiment on domain adaptation, we focus on NP chunking and POS tagging, and we use the labeled training data from the CoNLL 2000 shared task as before.
    Page 6, “Experiments”
  5. Ando and Zhang develop a semi-supervised chunker that outperforms purely supervised approaches on the CoNLL 2000 dataset (Ando and Zhang, 2005).
    Page 8, “Related Work”

unlabeled data

Appears in 5 sentences as: unlabeled data (5)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Using Expectation-Maximization (Dempster et al., 1977), it is possible to estimate the distributions P(x_i | y_i) and P(y_i | y_{i-1}) from unlabeled data.
    Page 3, “Smoothing Natural Language Sequences”
  2. No peak in performance is reached, so further improvements are possible with more unlabeled data.
    Page 7, “Experiments”
  3. Thus smoothing is optimizing performance for the case where unlabeled data is plentiful and labeled data is scarce, as we would hope.
    Page 7, “Experiments”
  4. Several researchers have previously studied methods for using unlabeled data for tagging and chunking, either alone or as a supplement to labeled data.
    Page 8, “Related Work”
  5. Unlike these systems, our efforts are aimed at using unlabeled data to find distributional representations that work well on rare terms, making the supervised systems more applicable to other domains and decreasing their sample complexity.
    Page 8, “Related Work”
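
Item 1 above says the HMM distributions P(x_i | y_i) and P(y_i | y_{i-1}) are estimated from unlabeled text with Expectation-Maximization. Below is an unscaled Baum-Welch sketch for a single integer-coded sequence; it is a didactic reconstruction (a real implementation would rescale or work in log space to avoid underflow on long corpora), not the authors' training code:

```python
import numpy as np

def baum_welch(obs, S, V, iters=10, seed=0):
    """EM for a discrete HMM: learns pi[s], A[s, s'] = P(y_i = s' | y_{i-1} = s)
    and B[s, w] = P(x_i = w | y_i = s) from one unlabeled sequence `obs`."""
    rng = np.random.default_rng(seed)
    pi = np.full(S, 1.0 / S)
    A = rng.dirichlet(np.ones(S), size=S)
    B = rng.dirichlet(np.ones(V), size=S)
    T = len(obs)
    for _ in range(iters):
        # E-step: forward and backward passes (unscaled, for clarity only).
        alpha = np.zeros((T, S))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta = np.zeros((T, S))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)          # P(y_t = s | obs)
        xi = np.zeros((S, S))
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi += x / x.sum()                              # P(y_t = s, y_{t+1} = s' | obs)
        # M-step: re-estimate the three distributions from expected counts.
        pi = gamma[0]
        A = xi / xi.sum(axis=1, keepdims=True)
        B = np.zeros((S, V))
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```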

language models

Appears in 5 sentences as: Language Model (1) language model (1) language modeling (1) language models (2)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. 2.3 Latent Variable Language Model Representation
    Page 3, “Smoothing Natural Language Sequences”
  2. Latent variable language models (LVLMs) can be used to produce just such a distributional representation.
    Page 3, “Smoothing Natural Language Sequences”
  3. Sparsity for low-order contexts has recently spurred interest in using latent variables to represent distributions over contexts in language models .
    Page 8, “Related Work”
  4. While n-gram models have traditionally dominated in language modeling, two recent efforts develop latent-variable probabilistic models that rival and even surpass n-gram models in accuracy (Blitzer et al., 2005; Mnih and Hinton, 2007).
    Page 8, “Related Work”
  5. Several authors investigate neural network models that learn not just one latent state, but rather a vector of latent variables, to represent each word in a language model (Bengio et al., 2003; Emami et al., 2003; Morin and Bengio, 2005).
    Page 8, “Related Work”

semi-supervised

Appears in 4 sentences as: semi-supervised (4)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. (2006): the semi-supervised Alternating Structural Optimization (ASO) technique and the Structural Correspondence Learning (SCL) technique for domain adaptation.
    Page 6, “Experiments”
  2. Ando and Zhang develop a semi-supervised chunker that outperforms purely supervised approaches on the CoNLL 2000 dataset (Ando and Zhang, 2005).
    Page 8, “Related Work”
  3. Recent projects in semi-supervised (Toutanova and Johnson, 2007) and unsupervised (Biemann et al., 2007; Smith and Eisner, 2005) tagging also show significant progress.
    Page 8, “Related Work”
  4. HMMs have been used many times for POS tagging and chunking, in supervised, semi-supervised , and in unsupervised settings (Banko and Moore, 2004; Goldwater and Griffiths, 2007; Johnson, 2007; Zhou, 2004).
    Page 8, “Related Work”

n-gram

Appears in 4 sentences as: n-gram (4)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Smoothing in NLP usually refers to the problem of smoothing n-gram models.
    Page 8, “Related Work”
  2. Sophisticated smoothing techniques like modified Kneser-Ney and Katz smoothing (Chen and Goodman, 1996) smooth together the predictions of unigram, bi-gram, trigram, and potentially higher n-gram sequences to obtain accurate probability estimates in the face of data sparsity.
    Page 8, “Related Work”
  3. While n-gram models have traditionally dominated in language modeling, two recent efforts develop latent-variable probabilistic models that rival and even surpass n-gram models in accuracy (Blitzer et al., 2005; Mnih and Hinton, 2007).
    Page 8, “Related Work”
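
For contrast with the distributional smoothing studied in the paper, the classical n-gram smoothers mentioned above interpolate estimates of different orders. The toy below is a fixed-weight linear interpolation of bigram and unigram maximum-likelihood estimates; it is deliberately much simpler than modified Kneser-Ney or Katz smoothing and is not the paper's technique:

```python
from collections import Counter

class InterpolatedBigram:
    """P(w | prev) = lam * P_ML(w | prev) + (1 - lam) * P_ML(w)."""

    def __init__(self, sentences, lam=0.7):
        self.lam = lam
        self.uni, self.bi = Counter(), Counter()
        for sent in sentences:
            toks = ["<s>"] + sent
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))
        self.total = sum(self.uni.values())

    def prob(self, prev, w):
        p_uni = self.uni[w] / self.total
        p_bi = self.bi[(prev, w)] / self.uni[prev] if self.uni[prev] else 0.0
        return self.lam * p_bi + (1 - self.lam) * p_uni
```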

probability distribution

Appears in 3 sentences as: probability distribution (3)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. We investigate the use of distributional representations, which model the probability distribution of a word’s context, as techniques for finding smoothed representations of word sequences.
    Page 1, “Introduction”
  2. If V is the vocabulary, or the set of word types, and X is a sequence of random variables over V, the left and right context of X_i = v may each be represented as a probability distribution over V: P(X_{i-1} | X_i = v) and P(X_{i+1} | X_i = v), respectively.
    Page 2, “Smoothing Natural Language Sequences”
  3. We then normalize each vector to form a probability distribution .
    Page 2, “Smoothing Natural Language Sequences”

shared task

Appears in 3 sentences as: shared task (3)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Following the CoNLL shared task from 2000, we use sections 15-18 of the Penn Treebank for our labeled training data for the supervised sequence labeler in all experiments (Tjong et al., 2000).
    Page 4, “Experiments”
  2. The chunker’s accuracy is roughly in the middle of the range of results for the original CoNLL 2000 shared task (Tjong et al., 2000) .
    Page 5, “Experiments”
  3. For our experiment on domain adaptation, we focus on NP chunking and POS tagging, and we use the labeled training data from the CoNLL 2000 shared task as before.
    Page 6, “Experiments”

learning algorithm

Appears in 3 sentences as: learning algorithm (3)
In Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
  1. Formally, we define the smoothing task as follows: let D = {(x, z) | x is a word sequence, z is a label sequence} be a labeled dataset of word sequences, and let M be a machine learning algorithm that will learn a function f to predict the correct labels.
    Page 2, “Smoothing Natural Language Sequences”
  2. As an example, consider the string “Researchers test reformulated gasolines on newer engines.” In a common dataset for NP chunking, the word “reformulated” never appears in the training data, but appears four times in the test set as part of the NP “reformulated gasolines.” Thus, a learning algorithm supplied with word-level features would
    Page 2, “Smoothing Natural Language Sequences”
  3. In particular, we seek to represent each word by a distribution over its contexts, and then provide the learning algorithm with features computed from this distribution.
    Page 2, “Smoothing Natural Language Sequences”
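
Putting the three excerpts above together: the smoothing task takes a labeled dataset D = {(x, z)} and a learner M, and replaces or augments word-identity features with features computed from each word's context distribution. The schematic below reuses the hypothetical `context_distributions` and `token_features` helpers sketched earlier, so it illustrates the pipeline shape rather than the authors' system:

```python
def build_smoothed_dataset(labeled_sents, unlabeled_sents):
    """Return [(features, labels), ...] where each token gets baseline boolean
    features plus real-valued left/right context-distribution features."""
    left_ctx, right_ctx = context_distributions(unlabeled_sents)   # earlier sketch
    dataset = []
    for words, labels in labeled_sents:
        feats = []
        for i, w in enumerate(words):
            f = token_features(words, i)                           # earlier sketch
            f.update({"left:" + v: p for v, p in left_ctx.get(w, {}).items()})
            f.update({"right:" + v: p for v, p in right_ctx.get(w, {}).items()})
            feats.append(f)
        dataset.append((feats, labels))
    return dataset   # hand this to any sequence learner M, e.g. a CRF toolkit
```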
