Tagging The Web: Building A Robust Web Tagger with Neural Network
Ma, Ji and Zhang, Yue and Zhu, Jingbo

Article Structure

Abstract

In this paper, we address the problem of web-domain POS tagging using a two-phase approach.

Introduction

Analysing and extracting useful information from the web has become an increasingly important research direction for the NLP community, where many tasks require part-of-speech (POS) tagging as a fundamental preprocessing step.

Learning from Web Text

Unsupervised learning is often used for training encoders that convert the input data to abstract representations.

Neural Network for POS Disambiguation

We integrate the learned WRRBM into a neural network, which serves as a scorer for POS disambiguation.

Easy-first POS tagging with Neural Network

The neural network proposed in Section 3 is used for POS disambiguation by the easy-first POS tagger.

Experiments

5.1 Setup

Related Work

Learning representations has been intensively studied in computer vision tasks (Bengio et al., 2007; Lee et al., 2009a).

Conclusion

We built a web-domain POS tagger using a two-phase approach.

Topics

POS tagging

Appears in 20 sentences as: POS tag (1) POS tagger (6) POS taggers (1) POS tagging (11) POS tags (2)
In Tagging The Web: Building A Robust Web Tagger with Neural Network
  1. In this paper, we address the problem of web-domain POS tagging using a two-phase approach.
    Page 1, “Abstract”
  2. The representation is integrated as features into a neural network that serves as a scorer for an easy-first POS tagger.
    Page 1, “Abstract”
  3. However, state-of-the-art POS taggers in the literature (Collins, 2002; Shen et al., 2007) are mainly optimized on the Penn Treebank (PTB), and when shifted to web data, tagging accuracies drop significantly (Petrov and McDonald, 2012).
    Page 1, “Introduction”
  4. We integrate the learned encoder with a set of well-established features for POS tagging (Ratnaparkhi, 1996; Collins, 2002) in a single neural network, which is applied as a scorer to an easy-first POS tagger.
    Page 1, “Introduction”
  5. We choose the easy-first tagging approach since it has been demonstrated to give higher accuracies than the standard left-to-right POS tagger (Shen et al., 2007; Ma et al., 2013).
    Page 1, “Introduction”
  6. This may partly be due to the fact that unlike computer vision tasks, the input structure of POS tagging or other sequential labelling tasks is relatively simple, and a single nonlinear layer is enough to model the interactions within the input (Wang and Manning, 2013).
    Page 3, “Learning from Web Text”
  7. The main challenge in designing the neural network structure is: on the one hand, we hope that the model can take advantage of the information provided by the learned WRRBM, which reflects general properties of web texts, so that the model generalizes well in the web domain; on the other hand, we also hope to improve the model’s discriminative power by utilizing well-established POS tagging features, such as those of Ratnaparkhi (1996).
    Page 3, “Neural Network for POS Disambiguation”
  8. Under the output layer, the network consists of two modules: the web-feature module, which incorporates knowledge from the pre-trained WRRBM, and the sparse-feature module, which makes use of other POS tagging features.
    Page 3, “Neural Network for POS Disambiguation”
  9. For POS tagging, we found that a simple linear layer yields satisfactory accuracies.
    Page 4, “Neural Network for POS Disambiguation”
  10. The web-feature and sparse-feature modules are combined by a linear output layer, as shown in the upper part of Figure 1. The value of each unit in this layer denotes the score of the corresponding POS tag.
    Page 4, “Neural Network for POS Disambiguation”
  11. In some circumstances, a probability distribution over POS tags may be a preferable form of output.
    Page 4, “Neural Network for POS Disambiguation”
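
The excerpts above describe a network whose web-feature and sparse-feature modules are combined by a linear output layer, each output unit scoring one POS tag, with a probability distribution over tags as an optional alternative output. A minimal sketch of that combination; all variable names here are hypothetical, not the paper's notation:

```python
import math

def pos_scores(h_web, f_sparse, W_web, W_sparse, b):
    """Linear output layer over the two modules: each returned value is
    the score of one POS tag.  h_web stands in for the web-feature
    module's hidden output, f_sparse for the sparse-feature module's
    output; names are illustrative."""
    def matvec(W, x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    web = matvec(W_web, h_web)
    sparse = matvec(W_sparse, f_sparse)
    return [w + s + bias for w, s, bias in zip(web, sparse, b)]

def pos_distribution(scores):
    """Optionally convert raw tag scores into a probability
    distribution over POS tags via a softmax."""
    m = max(scores)                           # shift by max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The softmax step is only needed when a distribution, rather than raw scores, is the desired output form.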

See all papers in Proc. ACL 2014 that mention POS tagging.

See all papers in Proc. ACL that mention POS tagging.

neural network

Appears in 15 sentences as: neural network (13) neural networks (2)
  1. The representation is integrated as features into a neural network that serves as a scorer for an easy-first POS tagger.
    Page 1, “Abstract”
  2. Parameters of the neural network are trained using guided learning in the second phase.
    Page 1, “Abstract”
  3. We integrate the learned encoder with a set of well-established features for POS tagging (Ratnaparkhi, 1996; Collins, 2002) in a single neural network, which is applied as a scorer to an easy-first POS tagger.
    Page 1, “Introduction”
  4. To our knowledge, we are the first to investigate guided learning for neural networks.
    Page 1, “Introduction”
  5. We integrate the learned WRRBM into a neural network, which serves as a scorer for POS disambiguation.
    Page 3, “Neural Network for POS Disambiguation”
  6. The main challenge in designing the neural network structure is: on the one hand, we hope that the model can take advantage of the information provided by the learned WRRBM, which reflects general properties of web texts, so that the model generalizes well in the web domain; on the other hand, we also hope to improve the model’s discriminative power by utilizing well-established POS tagging features, such as those of Ratnaparkhi (1996).
    Page 3, “Neural Network for POS Disambiguation”
  7. Our approach is to leverage the two sources of information in one neural network by combining them through a shared output layer, as shown in Figure 1.
    Page 3, “Neural Network for POS Disambiguation”
  8. Figure 1: The proposed neural network.
    Page 3, “Neural Network for POS Disambiguation”
  9. The neural network proposed in Section 3 is used for POS disambiguation by the easy-first POS tagger.
    Page 4, “Easy-first POS tagging with Neural Network”
  10. At each step, the algorithm adopts a scorer, the neural network in our case, to assign a score to each possible word-tag pair (w, t), and then selects the highest-scoring one (ŵ, t̂) to tag (i.e., tag ŵ with t̂).
    Page 4, “Easy-first POS tagging with Neural Network”
  11. While previous work (Shen et al., 2007; Zhang and Clark, 2011; Goldberg and Elhadad, 2010) applies guided learning to train a linear classifier by using variants of the perceptron algorithm, we are the first to combine guided learning with a neural network, by using a margin loss and a modified back-propagation algorithm.
    Page 5, “Easy-first POS tagging with Neural Network”
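
Item 10 above summarizes the easy-first strategy: score every possible word-tag pair and commit the single highest-scoring one first, repeating until every word is tagged. A runnable sketch of that loop, with the neural-network scorer abstracted into a `score` callback; all names are hypothetical:

```python
def easy_first_tag(words, candidate_tags, score):
    """Easy-first tagging sketch: repeatedly score every untagged word
    paired with every candidate tag, then commit the highest-scoring
    (word, tag) pair.  `score` stands in for the neural-network scorer
    and may consult the tags committed so far."""
    tags = {}                          # position -> committed tag
    untagged = set(range(len(words)))
    while untagged:
        i, t = max(((i, t) for i in untagged for t in candidate_tags),
                   key=lambda pair: score(words, tags, *pair))
        tags[i] = t                    # tag the "easiest" word first
        untagged.remove(i)
    return [tags[i] for i in range(len(words))]
```

Because the scorer sees the partial tagging, later decisions can condition on earlier, more confident ones, which is the motivation for the easy-first order.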

unlabelled data

Appears in 11 sentences as: unlabelled data (11)
  1. The problem we face here can be considered as a special case of domain adaptation, where we have access to labelled data on the source domain (PTB) and unlabelled data on the target domain (web data).
    Page 1, “Introduction”
  2. The idea of learning representations from unlabelled data and then fine-tuning a model with such representations according to some supervised criterion has been studied before (Turian et al., 2010; Collobert et al., 2011; Glorot et al., 2011).
    Page 1, “Introduction”
  3. In addition to labelled data, a large amount of unlabelled data on the web domain is also provided.
    Page 5, “Experiments”
  4. about labelled and unlabelled data are summarized in Table 1 and Table 2, respectively.
    Page 5, “Experiments”
  5. unlabelled data.
    Page 5, “Experiments”
  6. (2010), we also lowercased all the unlabelled data and removed those sentences that contain less than 90% a-z letters.
    Page 6, “Experiments”
  7. The data sets are generated by first concatenating all the cleaned unlabelled data, then selecting sentences evenly across the concatenated file.
    Page 6, “Experiments”
  8. Table 5: Effect of unlabelled data.
    Page 8, “Experiments”
  9. (2011) propose to learn representations from the mixture of both source and target domain unlabelled data to improve cross-domain sentiment classification.
    Page 8, “Related Work”
  10. Such high-dimensional input gives rise to high computational cost, and it is not clear whether those approaches can be applied to large-scale unlabelled data, with hundreds of millions of training examples.
    Page 8, “Related Work”
  11. The new representations are induced based on the auxiliary tasks defined on unlabelled data together with a dimensionality reduction technique.
    Page 9, “Related Work”
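
Items 6 and 7 above describe the unlabelled-data cleaning: lowercasing and discarding sentences with fewer than 90% a-z letters. A sketch of that filter; the exact counting convention (ignoring whitespace) is an assumption, not taken from the paper:

```python
def clean_unlabelled(sentences, min_alpha_ratio=0.9):
    """Sketch of the cleaning step: lowercase every sentence and drop
    those in which fewer than `min_alpha_ratio` of the non-space
    characters are the letters a-z."""
    kept = []
    for sent in sentences:
        sent = sent.lower()
        chars = [c for c in sent if not c.isspace()]
        if chars:
            ratio = sum('a' <= c <= 'z' for c in chars) / len(chars)
            if ratio >= min_alpha_ratio:
                kept.append(sent)
    return kept
```

Sentences dominated by digits, URLs, or markup are filtered out this way before the WRRBM is trained on the remaining text.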

n-gram

Appears in 8 sentences as: N-gram (1) n-gram (7)
  1. The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders.
    Page 2, “Learning from Web Text”
  2. Let V(j) represent the j-th visible variable of the WRRBM, which is a one-hot vector whose length equals the vocabulary size. Then V(j) = wk means that the j-th word in the n-gram is wk.
    Page 2, “Learning from Web Text”
  3. The input for this module is the word n-gram (wi−l, ..., wi, ..., wi+r).
    Page 3, “Neural Network for POS Disambiguation”
  4. representations of the input n-gram.
    Page 4, “Neural Network for POS Disambiguation”
  5. For each data set, we investigate an extensive set of combinations of hyper-parameters: the n-gram window (l, r) in {(1,1), (2,1), (1,2), (2,2)}; the hidden layer size in {200, 300, 400}; the learning rate in {0.1, 0.01, 0.001}.
    Page 6, “Experiments”
  6. 5.3.2 Word and N-gram Representation
    Page 7, “Experiments”
  7. By contrast, using n-gram representations improves the performance on both oov and non-oov.
    Page 8, “Experiments”
  8. While those approaches mainly explore token-level representations (word or character embeddings), using WRRBM is able to utilize both word and n-gram representations.
    Page 8, “Related Work”

embeddings

Appears in 7 sentences as: embeddings (8)
  1. This result illustrates that the ngram-level knowledge captures more complex interactions of the web text, which cannot be recovered by using only word embeddings.
    Page 7, “Experiments”
  2. (2012), who found that using both the word embeddings and the hidden units of a trigram WRRBM as additional features for a CRF chunker yields larger improvements than using word embeddings only.
    Page 7, “Experiments”
  3. (2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling.
    Page 8, “Related Work”
  4. (2013) induce bilingual word embeddings for word alignment.
    Page 8, “Related Work”
  5. (2013) investigate Chinese character embeddings for joint word segmentation and POS tagging.
    Page 8, “Related Work”
  6. While those approaches mainly explore token-level representations (word or character embeddings), using WRRBM is able to utilize both word and n-gram representations.
    Page 8, “Related Work”
  7. In particular, we both use a nonlinear layer to model complex relations underlying word embeddings.
    Page 9, “Related Work”

word representations

Appears in 7 sentences as: Word Representation (1) word representation (1) word representations (5)
  1. We utilize the Word Representation RBM (WRRBM) factorization proposed by Dahl et al.
    Page 2, “Learning from Web Text”
  2. The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders.
    Page 2, “Learning from Web Text”
  3. The weight matrices {W(1), ..., W(n)} can be trained using a Metropolis-Hastings-based CD variant, and the learned word representations also capture certain syntactic information; see Dahl et al. (2012).
    Page 3, “Learning from Web Text”
  4. We can choose to use only the word representations of the learned WRRBM.
    Page 3, “Neural Network for POS Disambiguation”
  5. As mentioned in Section 3.1, the knowledge learned from the WRRBM can be investigated incrementally, using word representation, which corresponds to initializing only the projection layer of the web-feature module with the projection matrix of the learned WRRBM, or ngram-level representation, which corresponds to initializing both the projection and sigmoid layers of the web-feature module by the learned WRRBM.
    Page 6, “Experiments”
  6. “word” and “ngram” denote using word representations and n-gram representations, respectively.
    Page 7, “Experiments”
  7. From Figures 2, 3 and 4, we can see that adopting the ngram-level representation consistently achieves better performance compared with using word representations only (“word-fixed” vs “ngram-fixed”, “word-adjusted” vs “ngram-adjusted”).
    Page 7, “Experiments”
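
The sentences above describe the WRRBM's key idea: a single word representation table shared across positions, with position-dependent weight matrices W(1), ..., W(n) distinguishing word orders. A sketch of how the hidden n-gram representation could be computed under those assumptions; the names and the sigmoid form follow standard RBM conventions, not the paper's exact notation:

```python
import math

def wrrbm_hidden(ngram, embed, W, b):
    """Compute sigmoid hidden-unit activations for an n-gram.
    `embed` is the shared word projection table (word -> vector),
    reused at every position, while W[j] is the position-dependent
    weight matrix that distinguishes word orders; `b` is the hidden
    bias vector."""
    pre = list(b)
    for j, word in enumerate(ngram):
        x = embed[word]                      # shared projection at every position
        for h in range(len(pre)):            # position-dependent weights W[j]
            pre[h] += sum(W[j][h][d] * x[d] for d in range(len(x)))
    return [1.0 / (1.0 + math.exp(-v)) for v in pre]
```

Sharing `embed` across positions is what lets the model learn one representation per word, while the per-position W[j] matrices keep "dog bites man" distinct from "man bites dog".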

word embeddings

Appears in 5 sentences as: word embeddings (6)
  1. This result illustrates that the ngram-level knowledge captures more complex interactions of the web text, which cannot be recovered by using only word embeddings.
    Page 7, “Experiments”
  2. (2012), who found that using both the word embeddings and the hidden units of a trigram WRRBM as additional features for a CRF chunker yields larger improvements than using word embeddings only.
    Page 7, “Experiments”
  3. (2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling.
    Page 8, “Related Work”
  4. (2013) induce bilingual word embeddings for word alignment.
    Page 8, “Related Work”
  5. In particular, we both use a nonlinear layer to model complex relations underlying word embeddings.
    Page 9, “Related Work”

development sets

Appears in 4 sentences as: development set (1) development sets (3)
  1. While emails and weblogs are used as the development sets, reviews, news groups and Yahoo!Answers are used as the final test sets.
    Page 5, “Experiments”
  2. All these parameters are selected according to the averaged accuracy on the development set .
    Page 6, “Experiments”
  3. Experimental results under the 4 combined settings on the development sets are illustrated in Figures 2, 3 and 4.
    Page 6, “Experiments”
  4. Tagging performance and lexicon coverages of each data set on the development sets are shown in Table 5.
    Page 8, “Experiments”

hidden layer

Appears in 4 sentences as: hidden layer (3) hidden layers (1)
  1. The affine form of E with respect to v and h implies that the visible variables are conditionally independent of each other given the hidden layer units, and vice versa.
    Page 2, “Learning from Web Text”
  2. For each position j, there is a weight matrix W(j) ∈ R^{H×D}, which is used to model the interaction between the hidden layer and the word projection in position j.
    Page 2, “Learning from Web Text”
  3. The web-feature module, shown in the lower left part of Figure 1, consists of an input layer and two hidden layers.
    Page 3, “Neural Network for POS Disambiguation”
  4. For each data set, we investigate an extensive set of combinations of hyper-parameters: the n-gram window (l, r) in {(1,1), (2,1), (1,2), (2,2)}; the hidden layer size in {200, 300, 400}; the learning rate in {0.1, 0.01, 0.001}.
    Page 6, “Experiments”

labelled data

Appears in 4 sentences as: labelled data (4)
  1. The problem we face here can be considered as a special case of domain adaptation, where we have access to labelled data on the source domain (PTB) and unlabelled data on the target domain (web data).
    Page 1, “Introduction”
  2. The data set consists of labelled data for both the source (Wall Street Journal portion of the Penn Treebank) and target (web) domains.
    Page 5, “Experiments”
  3. Participants are not allowed to use web-domain labelled data for training.
    Page 5, “Experiments”
  4. In addition to labelled data, a large amount of unlabelled data on the web domain is also provided.
    Page 5, “Experiments”

gold standard

Appears in 3 sentences as: gold standard (3)
  1. The training algorithm repeats for several iterations over the training data, which is a set of sentences labelled with gold standard POS tags.
    Page 4, “Easy-first POS tagging with Neural Network”
  2. (w, t) denotes the word-tag pair that has the highest model score among those that are inconsistent with the gold standard, while (w̄, t̄)
    Page 4, “Easy-first POS tagging with Neural Network”
  3. denotes the one that has the highest model score among those that are consistent with the gold standard.
    Page 4, “Easy-first POS tagging with Neural Network”
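
Items 2 and 3 above define the two pivotal word-tag pairs for guided learning: the highest-scoring pair inconsistent with the gold standard and the highest-scoring consistent one. A margin loss over them might look like the following sketch; the margin value of 1.0 is an assumption, not taken from the paper:

```python
def margin_loss(best_incorrect_score, best_correct_score, margin=1.0):
    """Sketch of a margin loss for guided learning: the best
    gold-consistent word-tag pair should outscore the best
    gold-inconsistent one by at least `margin`; any shortfall
    becomes the loss to back-propagate."""
    return max(0.0, margin + best_incorrect_score - best_correct_score)
```

When the loss is zero the model already separates the two pairs by the margin and no update is needed; otherwise the gradient pushes the consistent pair's score up and the inconsistent pair's score down.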

in-domain

Appears in 3 sentences as: in-domain (3)
  1. While most previous work focuses on in-domain sequential labelling or cross-domain classification tasks, we are the first to learn representations for web-domain structured prediction.
    Page 1, “Introduction”
  2. Our results suggest that while both strategies improve in-domain tagging accuracies, keeping the learned representation unchanged consistently results in better cross-domain accuracies.
    Page 1, “Introduction”
  3. (2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling.
    Page 8, “Related Work”

sequential labelling

Appears in 3 sentences as: sequential labelling (3)
  1. While most previous work focuses on in-domain sequential labelling or cross-domain classification tasks, we are the first to learn representations for web-domain structured prediction.
    Page 1, “Introduction”
  2. This may partly be due to the fact that unlike computer vision tasks, the input structure of POS tagging or other sequential labelling tasks is relatively simple, and a single nonlinear layer is enough to model the interactions within the input (Wang and Manning, 2013).
    Page 3, “Learning from Web Text”
  3. Regarding using neural networks for sequential labelling, our approach shares similarity with that of Collobert et al.
    Page 9, “Related Work”

shared task

Appears in 3 sentences as: shared task (3)
  1. Experiments on the SANCL 2012 shared task show that our approach achieves 93.15% average tagging accuracy, which is the best accuracy reported so far on this data set, higher than those given by ensembled syntactic parsers.
    Page 1, “Abstract”
  2. We conduct experiments on the official data set provided by the SANCL 2012 shared task (Petrov and McDonald, 2012).
    Page 1, “Introduction”
  3. Our experiments are conducted on the data set provided by the SANCL 2012 shared task, which aims at building a single robust syntactic analysis system across the web domain.
    Page 5, “Experiments”

syntactic parsers

Appears in 3 sentences as: syntactic parsers (2) syntactic parsing (1)
  1. Experiments on the SANCL 2012 shared task show that our approach achieves 93.15% average tagging accuracy, which is the best accuracy reported so far on this data set, higher than those given by ensembled syntactic parsers.
    Page 1, “Abstract”
  2. set, higher than those given by ensembled syntactic parsers.
    Page 2, “Introduction”
  3. For future work, we would like to investigate the two-phase approach to more challenging tasks, such as web domain syntactic parsing.
    Page 9, “Conclusion”
