Phrase Clustering for Discriminative Learning
Lin, Dekang and Wu, Xiaoyun

Article Structure

Abstract

We present a simple and scalable algorithm for clustering tens of millions of phrases and use the resulting clusters as features in discriminative classifiers.

Introduction

Over the past decade, supervised learning algorithms have gained widespread acceptance in natural language processing (NLP).

Distributed K-Means clustering

K-Means clustering (MacQueen 1967) is one of the simplest and most well-known clustering algorithms.

Named Entity Recognition

Named entity recognition (NER) is one of the first steps in many applications of information extraction, information retrieval, question answering and other applications of NLP.

Query Classification

We now look at the use of phrasal clusters in a very different application: query classification.

Discussion and Related Work

In earlier work on semi-supervised learning, e.g., (Blum and Mitchell 1998), the classifiers learned from unlabeled data were used directly.

Conclusions

We presented a simple and scalable algorithm to cluster tens of millions of phrases and we used the resulting clusters as features in discriminative classifiers.

Topics

feature vectors

Appears in 13 sentences as: feature vector (5) feature vectors (7) features vectors (1)
In Phrase Clustering for Discriminative Learning
  1. Given a set of elements represented as feature vectors and a number, k, of desired clusters, the K-Means algorithm consists of the following steps:
    Page 2, “Distributed K-Means clustering”
  2. Before describing our parallel implementation of the K-Means algorithm, we first describe the phrases to be clustered and how their feature vectors are constructed.
    Page 2, “Distributed K-Means clustering”
  3. Following previous approaches to distributional clustering of words, we represent the contexts of a phrase as a feature vector.
    Page 2, “Distributed K-Means clustering”
  4. The context feature vector of a phrase is constructed by first aggregating the frequency counts of the words in the context windows of different instances of the
    Page 2, “Distributed K-Means clustering”
  5. Given two feature vectors, we compute the similarity between two vectors as the cosine function of the angle between the vectors.
    Page 3, “Distributed K-Means clustering”
  6. We impose an upper limit on the number of instances of each phrase when constructing its feature vector.
    Page 3, “Distributed K-Means clustering”
  7. Second, it reduces the variance in the sizes of the feature vectors of the phrases.
    Page 3, “Distributed K-Means clustering”
  8. The keys of intermediate pairs are cluster ids and the values are feature vectors of elements assigned to the corresponding cluster.
    Page 3, “Distributed K-Means clustering”
  9. Furthermore, when summing up a large number of features vectors, numerical underflow becomes a potential problem.
    Page 3, “Distributed K-Means clustering”
  10. In a naive implementation of Step ii of K-Means, one would compute the similarities between a feature vector and all the centroids in order to find the closest one.
    Page 3, “Distributed K-Means clustering”
  11. Fortunately, the high-dimensional and sparse nature of our feature vectors can also be exploited.
    Page 3, “Distributed K-Means clustering”
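Excerpts 3 to 6 and 11 above describe how a phrase's context feature vector is built (with a cap on the number of instances) and compared by cosine similarity, exploiting sparsity. The following is a minimal sketch under assumed representations; the helper names, the instance format, and the window handling are illustrative, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def context_vector(instances, window=3, max_instances=1000):
    """Aggregate context-word counts over at most max_instances instances.

    Each instance is assumed to be a (left_context_words, right_context_words)
    pair; only the `window` words nearest the phrase on each side are counted.
    """
    vec = Counter()
    for left, right in instances[:max_instances]:
        for w in left[-window:]:
            vec[w] += 1
        for w in right[:window]:
            vec[w] += 1
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (dicts)."""
    if len(u) > len(v):          # iterate over the smaller vector
        u, v = v, u
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Iterating over the smaller of the two sparse vectors when computing the dot product is one simple way to exploit the high-dimensional, sparse nature of the vectors mentioned in excerpt 11.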

CoNLL

Appears in 11 sentences as: CoNLL (11)
In Phrase Clustering for Discriminative Learning
  1. Our NER system achieves the best current result on the widely used CoNLL benchmark.
    Page 1, “Abstract”
  2. Our named entity recognition system achieves an F1-score of 90.90 on the CoNLL 2003 English data set, which is about 1 point higher than the previous best result.
    Page 2, “Introduction”
  3. The CoNLL 2003 Shared Task (Tjong Kim Sang and Meulder 2003) offered a standard experimental platform for NER.
    Page 4, “Named Entity Recognition”
  4. The CoNLL data set consists of news articles from Reuters.
    Page 4, “Named Entity Recognition”
  5. We adopted the same evaluation criteria as the CoNLL 2003 Shared Task.
    Page 4, “Named Entity Recognition”
  6. Run K-Means clustering on the phrases that appeared in the CoNLL training data to obtain K centroids.
    Page 4, “Named Entity Recognition”
  7. Part-of-speech tags were used in the top-ranked systems in CoNLL 2003, as well as in many follow up studies that used the data set (Ando and Zhang 2005; Suzuki and Isozaki 2008).
    Page 5, “Named Entity Recognition”
  8. Table 4 summarizes the evaluation results for our NER system and compares it with the two best results on the data set in the literature, as well as the top-3 systems in CoNLL 2003.
    Page 5, “Named Entity Recognition”
  9. Table 4 CoNLL NER test set results
    Page 6, “Named Entity Recognition”
  10. The Top CoNLL 2003 systems all employed gazetteers or other types of specialized resources (e.g., lists of words that tend to co-occur with certain named entity types) in addition to part-of-speech tags.
    Page 6, “Named Entity Recognition”
  11. Our system achieved the best current result on the CoNLL NER data set.
    Page 8, “Conclusions”

named entity

Appears in 11 sentences as: Named entity (1) named entity (10)
In Phrase Clustering for Discriminative Learning
  1. To demonstrate the power and generality of this approach, we apply the method in two very different applications: named entity recognition and query classification.
    Page 1, “Abstract”
  2. They have become the workhorse in almost all subareas and components of NLP, including part-of-speech tagging, chunking, named entity recognition and parsing.
    Page 1, “Introduction”
  3. This method has been shown to be quite successful in named entity recognition (Miller et al.
    Page 1, “Introduction”
  4. We demonstrate the advantages of phrase-based clusters over word-based ones with experimental results from two distinct application domains: named entity recognition and query classification.
    Page 2, “Introduction”
  5. Our named entity recognition system achieves an F1-score of 90.90 on the CoNLL 2003 English data set, which is about 1 point higher than the previous best result.
    Page 2, “Introduction”
  6. For example, in the named entity recognition task, categorical clusters are more successful, whereas in query categorization, the topical clusters are much more beneficial.
    Page 3, “Distributed K-Means clustering”
  7. Named entity recognition (NER) is one of the first steps in many applications of information extraction, information retrieval, question answering and other applications of NLP.
    Page 4, “Named Entity Recognition”
  8. Here, s denotes a position in the input sequence; ys is a label that indicates whether the token at position s is a named entity as well as its type; w
    Page 4, “Named Entity Recognition”
  9. The Top CoNLL 2003 systems all employed gazetteers or other types of specialized resources (e.g., lists of words that tend to co-occur with certain named entity types) in addition to part-of-speech tags.
    Page 6, “Named Entity Recognition”
  10. The named entity problem in Section 3 and the query classification problem in Section 4 exemplify the two scenarios.
    Page 8, “Discussion and Related Work”
  11. We demonstrated the power and generality of this approach on two very different applications: named entity recognition and query classification.
    Page 8, “Conclusions”

NER

Appears in 10 sentences as: NER (10)
In Phrase Clustering for Discriminative Learning
  1. Our NER system achieves the best current result on the widely used CoNLL benchmark.
    Page 1, “Abstract”
  2. Named entity recognition (NER) is one of the first steps in many applications of information extraction, information retrieval, question answering and other applications of NLP.
    Page 4, “Named Entity Recognition”
  3. 2001) is one of the most competitive NER algorithms.
    Page 4, “Named Entity Recognition”
  4. The CoNLL 2003 Shared Task (Tjong Kim Sang and Meulder 2003) offered a standard experimental platform for NER.
    Page 4, “Named Entity Recognition”
  5. NER systems often have global features to capture discourse-level regularities (Chieu and Ng 2003).
    Page 5, “Named Entity Recognition”
  6. An important advantage of not needing a POS tagger as a preprocessor is that the system is much easier to adapt to other languages, since training a tagger often requires a larger amount of more extensively annotated data than the training data for NER.
    Page 5, “Named Entity Recognition”
  7. We used hard clustering with 1-word context windows for NER.
    Page 5, “Named Entity Recognition”
  8. Table 4 summarizes the evaluation results for our NER system and compares it with the two best results on the data set in the literature, as well as the top-3 systems in CoNLL 2003.
    Page 5, “Named Entity Recognition”
  9. Table 4 CoNLL NER test set results
    Page 6, “Named Entity Recognition”
  10. Our system achieved the best current result on the CoNLL NER data set.
    Page 8, “Conclusions”

unlabeled data

Appears in 7 sentences as: unlabeled data (7)
In Phrase Clustering for Discriminative Learning
  1. two-stage strategy: first create word clusters with unlabeled data and then use the clusters as features in supervised training.
    Page 1, “Introduction”
  2. In earlier work on semi-supervised learning, e.g., (Blum and Mitchell 1998), the classifiers learned from unlabeled data were used directly.
    Page 8, “Discussion and Related Work”
  3. Recent research shows that it is better to use whatever is learned from the unlabeled data as features in a discriminative classifier.
    Page 8, “Discussion and Related Work”
  4. Wong and Ng (2007) and Suzuki and Isozaki (2008) are similar in that they run a baseline discriminative classifier on unlabeled data to generate pseudo examples, which are then used to train a different type of classifier for the same problem.
    Page 8, “Discussion and Related Work”
  5. Ando and Zhang (2005) defined an objective function that combines the original problem on the labeled data with a set of auxiliary problems on unlabeled data.
    Page 8, “Discussion and Related Work”
  6. The combined objective function is then alternatingly optimized with the labeled and unlabeled data.
    Page 8, “Discussion and Related Work”
  7. This training regime puts pressure on the discriminative learner to exploit the structures uncovered from the unlabeled data.
    Page 8, “Discussion and Related Work”

objective function

Appears in 7 sentences as: objective function (7)
In Phrase Clustering for Discriminative Learning
  1. The learning algorithm then optimizes a regularized, convex objective function that is expressed in terms of these features.
    Page 1, “Introduction”
  2. distributed clustering algorithm with a similar objective function as the Brown algorithm.
    Page 2, “Introduction”
  3. We made a small modification to the objective function for logistic regression to take into account the prior distribution and to use 50% as a uniform decision boundary for all the classes.
    Page 7, “Query Classification”
  4. When training the classifier for a class with p positive examples out of a total of n examples, we change the objective function to:
    Page 7, “Query Classification”
  5. We suspect that such features make the optimization of the objective function much more difficult.
    Page 7, “Query Classification”
  6. Ando and Zhang (2005) defined an objective function that combines the original problem on the labeled data with a set of auxiliary problems on unlabeled data.
    Page 8, “Discussion and Related Work”
  7. The combined objective function is then alternatingly optimized with the labeled and unlabeled data.
    Page 8, “Discussion and Related Work”

CRF

Appears in 7 sentences as: CRF (7)
In Phrase Clustering for Discriminative Learning
  1. Conditional Random Fields (CRF) (Lafferty et.
    Page 4, “Named Entity Recognition”
  2. We employed a linear chain CRF with L2 regularization as the baseline algorithm to which we added phrase cluster features.
    Page 4, “Named Entity Recognition”
  3. The features in our baseline CRF classifier are a subset of the conventional features.
    Page 4, “Named Entity Recognition”
  4. If a phrase belonging to cluster c is found at positions b to e (inclusive), we add the following features to the CRF classifier:
    Page 5, “Named Entity Recognition”
  5. We did not observe a similar phenomenon with our CRF.
    Page 5, “Named Entity Recognition”
  6. We include all words as features and rely on the regularized CRF to select from them.
    Page 5, “Named Entity Recognition”
  7. Baseline CRF (Sec.
    Page 6, “Named Entity Recognition”
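Excerpt 4 above describes the cluster features fired when a phrase span is found. A hedged sketch of that feature extraction follows; the B/I-style feature naming and the function name are illustrative assumptions, not the paper's exact feature template.

```python
def cluster_span_features(c, b, e):
    """Map each token position in b..e (inclusive) to the cluster
    features fired there: one feature marks the start of the span,
    one marks each inside position."""
    feats = {b: [f"cluster-{c}-B"]}
    for pos in range(b + 1, e + 1):
        feats.setdefault(pos, []).append(f"cluster-{c}-I")
    return feats
```

For example, a phrase in cluster 7 spanning positions 2 to 4 fires a start feature at position 2 and inside features at positions 3 and 4.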

labeled data

Appears in 5 sentences as: labeled data (5)
In Phrase Clustering for Discriminative Learning
  1. While the labeled data is generally very costly to obtain, there is a vast amount of unlabeled textual data freely available on the web.
    Page 1, “Introduction”
  2. Under this approach, even if a word is not found in the training data, it may still fire cluster-based features as long as it shares cluster assignments with some words in the labeled data.
    Page 1, “Introduction”
  3. Since the clusters are obtained without any labeled data, they may not correspond directly to concepts that are useful for decision making in the problem domain.
    Page 1, “Introduction”
  4. Ando and Zhang (2005) defined an objective function that combines the original problem on the labeled data with a set of auxiliary problems on unlabeled data.
    Page 8, “Discussion and Related Work”
  5. One is to leverage a large amount of unsupervised data to train an adequate classifier with a small amount of labeled data.
    Page 8, “Discussion and Related Work”

logistic regression

Appears in 5 sentences as: logistic regression (5)
In Phrase Clustering for Discriminative Learning
  1. For each query class, we train a logistic regression classifier (Vapnik 1999) with L2 regularization.
    Page 6, “Query Classification”
  2. Given an input x, represented as a vector of m features (x1, x2, ..., xm), a logistic regression classifier with parameter vector w = (w1, w2, ..., wm) computes the posterior probability of the output y, which is either 1 or -1, as p(y | x) = 1 / (1 + exp(-y w · x))
    Page 7, “Query Classification”
  3. We made a small modification to the objective function for logistic regression to take into account the prior distribution and to use 50% as a uniform decision boundary for all the classes.
    Page 7, “Query Classification”
  4. Normally, training a logistic regression classifier amounts to solving:
    Page 7, “Query Classification”
  5. For each of these clusters, we sum the cluster’s similarity to all the phrases in the query and select the top-N as features for the logistic regression classifier (N=150 in our experiments).
    Page 7, “Query Classification”
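The posterior in excerpt 2, p(y | x) = 1 / (1 + exp(-y w · x)) for y in {1, -1}, can be sketched directly. This is only an illustration of the formula with dense vectors, not the paper's trainer or its modified objective.

```python
from math import exp

def posterior(w, x, y):
    """Posterior probability of label y in {1, -1} given feature vector x
    under a logistic regression model with weights w."""
    wx = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-y * wx))
```

Note that the two label probabilities sum to one, and a zero-score input gets probability 0.5 for either label.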

clusterings

Appears in 5 sentences as: clusterings (5)
In Phrase Clustering for Discriminative Learning
  1. We can easily use multiple clusterings in feature extraction.
    Page 5, “Named Entity Recognition”
  2. When we extract features from multiple clusterings , the selection of the top-N clusters is done separately for each clustering.
    Page 7, “Query Classification”
  3. The best result is achieved with multiple phrasal clusterings .
    Page 7, “Query Classification”
  4. One advantage of the two-stage approach is that the same clusterings may be used for different problems or different components of the same system.
    Page 8, “Discussion and Related Work”
  5. One nagging issue with K-Means clustering is how to set k. We show that this question may not need to be answered because we can use clusterings with different k’s at the same time and let the discriminative classifier cherry-pick the clusters at different granularities according to the supervised data.
    Page 8, “Discussion and Related Work”
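Excerpt 2 above notes that top-N cluster selection is done separately for each clustering. A hedged sketch of that selection, assuming a phrase-to-cluster similarity function `sim` (e.g., cosine to the centroid) that is not specified here:

```python
def top_n_cluster_features(query_phrases, clusterings, sim, n=150):
    """For each clustering, score every cluster by its summed similarity
    to the query's phrases and keep the n best as features; selection is
    done separately per clustering, then the results are concatenated."""
    features = []
    for clustering in clusterings:
        scored = sorted(
            clustering,
            key=lambda c: sum(sim(p, c) for p in query_phrases),
            reverse=True,
        )
        features.extend(scored[:n])
    return features
```

Because each clustering contributes its own top-N, clusters of different granularities (different k's) can all appear as features, and the discriminative classifier weighs them against the supervised data.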

learning algorithm

Appears in 4 sentences as: learning algorithm (2) learning algorithms (2)
In Phrase Clustering for Discriminative Learning
  1. Over the past decade, supervised learning algorithms have gained widespread acceptance in natural language processing (NLP).
    Page 1, “Introduction”
  2. The learning algorithm then optimizes a regularized, convex objective function that is expressed in terms of these features.
    Page 1, “Introduction”
  3. However, the supervised learning algorithms can typically identify useful clusters and assign proper weights to them, effectively adapting the clusters to the domain.
    Page 1, “Introduction”
  4. In this paper, we present a semi-supervised learning algorithm that goes a step further.
    Page 1, “Introduction”

natural language

Appears in 4 sentences as: natural language (4)
In Phrase Clustering for Discriminative Learning
  1. Over the past decade, supervised learning algorithms have gained widespread acceptance in natural language processing (NLP).
    Page 1, “Introduction”
  2. The long-tailed distribution of natural language words implies that most of the word types will be either unseen or seen very few times in the labeled training data, even if the data set is a relatively large one (e.g., the Penn Treebank).
    Page 1, “Introduction”
  3. Out of context, natural language words are often ambiguous.
    Page 1, “Introduction”
  4. For natural language words and
    Page 3, “Distributed K-Means clustering”

best result

Appears in 4 sentences as: best result (3) best results (1)
In Phrase Clustering for Discriminative Learning
  1. Our named entity recognition system achieves an F1-score of 90.90 on the CoNLL 2003 English data set, which is about 1 point higher than the previous best result.
    Page 2, “Introduction”
  2. Table 4 summarizes the evaluation results for our NER system and compares it with the two best results on the data set in the literature, as well as the top-3 systems in CoNLL 2003.
    Page 5, “Named Entity Recognition”
  3. The best F-score of 90.90, which is about 1 point higher than the previous best result, is obtained with a combination of clusters.
    Page 6, “Named Entity Recognition”
  4. The best result is achieved with multiple phrasal clusterings.
    Page 7, “Query Classification”

part-of-speech

Appears in 4 sentences as: Part-of-speech (1) part-of-speech (3)
In Phrase Clustering for Discriminative Learning
  1. They have become the workhorse in almost all subareas and components of NLP, including part-of-speech tagging, chunking, named entity recognition and parsing.
    Page 1, “Introduction”
  2. Part-of-speech tags were used in the top-ranked systems in CoNLL 2003, as well as in many follow up studies that used the data set (Ando and Zhang 2005; Suzuki and Isozaki 2008).
    Page 5, “Named Entity Recognition”
  3. LDC refers to the clusters created with the smaller LDC corpus and +pos indicates the use of part-of-speech tags as features.
    Page 5, “Named Entity Recognition”
  4. The Top CoNLL 2003 systems all employed gazetteers or other types of specialized resources (e.g., lists of words that tend to co-occur with certain named entity types) in addition to part-of-speech tags.
    Page 6, “Named Entity Recognition”

part-of-speech tags

Appears in 4 sentences as: part-of-speech tagging (1) Part-of-speech tags (1) part-of-speech tags (2)
In Phrase Clustering for Discriminative Learning
  1. They have become the workhorse in almost all subareas and components of NLP, including part-of-speech tagging, chunking, named entity recognition and parsing.
    Page 1, “Introduction”
  2. Part-of-speech tags were used in the top-ranked systems in CoNLL 2003, as well as in many follow up studies that used the data set (Ando and Zhang 2005; Suzuki and Isozaki 2008).
    Page 5, “Named Entity Recognition”
  3. LDC refers to the clusters created with the smaller LDC corpus and +pos indicates the use of part-of-speech tags as features.
    Page 5, “Named Entity Recognition”
  4. The Top CoNLL 2003 systems all employed gazetteers or other types of specialized resources (e.g., lists of words that tend to co-occur with certain named entity types) in addition to part-of-speech tags.
    Page 6, “Named Entity Recognition”

phrase-based

Appears in 3 sentences as: phrase-based (3)
In Phrase Clustering for Discriminative Learning
  1. With phrase-based clustering, “Land of Odds” is grouped with many names that are labeled as company names, which is a strong indication that it is a company name as well.
    Page 1, “Introduction”
  2. The disambiguation power of phrases is also evidenced by the improvements of phrase-based machine translation systems (Koehn et.
    Page 1, “Introduction”
  3. We demonstrate the advantages of phrase-based clusters over word-based ones with experimental results from two distinct application domains: named entity recognition and query classification.
    Page 2, “Introduction”

semi-supervised

Appears in 3 sentences as: semi-supervised (3)
In Phrase Clustering for Discriminative Learning
  1. In this paper, we present a semi-supervised learning algorithm that goes a step further.
    Page 1, “Introduction”
  2. In earlier work on semi-supervised learning, e.g., (Blum and Mitchell 1998), the classifiers learned from unlabeled data were used directly.
    Page 8, “Discussion and Related Work”
  3. There are two main scenarios that motivate semi-supervised learning.
    Page 8, “Discussion and Related Work”

bag-of-words

Appears in 3 sentences as: bag-of-words (3)
In Phrase Clustering for Discriminative Learning
  1. The baseline system uses only the words in the queries as features (the bag-of-words representation), treating the query classification problem as a typical text categorization problem.
    Page 7, “Query Classification”
  2. Here, bow indicates the use of bag-of-words features; WN refers to word clusters of size N; and PN refers to phrase clusters of size N. All the clusters are soft clusters created with the web corpus using 3-word context windows.
    Page 7, “Query Classification”
  3. The bag-of-words features alone have dismal performance.
    Page 7, “Query Classification”
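The bag-of-words baseline in excerpt 1 represents a query only by the counts of its words. A minimal sketch, assuming a simple whitespace tokenization that the excerpt does not specify:

```python
from collections import Counter

def bag_of_words(query):
    """Lowercase, split on whitespace, and count words."""
    return Counter(query.lower().split())
```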
