Experiments and Evaluations | We first describe our experimental settings and define the metrics used to evaluate induced soft clusterings of verb classes.
Experiments and Evaluations | This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al. |
Experiments and Evaluations | Korhonen et al. (2003) evaluated hard clusterings based on a gold standard with multiple classes per verb.
Introduction | Moreover, to the best of our knowledge, none of the following approaches attempt to quantitatively evaluate soft clusterings of verb classes induced by polysemy-aware unsupervised approaches (Korhonen et al., 2003; Lapata and Brew, 2004; Li and Brew, 2007; Schulte im Walde et al., 2008). |
Analysis | Figure 4 shows the MAP transition Dirichlet hyperparameters of the CLUST model when trained
Analysis | Finally, we examine the relationship between the induced clusters and language families in Table 3, for the trigram consonant vs. vowel CLUST model with 20 clusters. |
Experiments | EM 93.37 74.59
Experiments | SYMM 95.99 80.72
Experiments | MERGE 97.14 86.13
Experiments | CLUST 98.85 89.37
Experiments | EM 94.50 74.53
Experiments | SYMM 96.18 78.13
Experiments | MERGE 97.66 86.47
Experiments | CLUST 98.55 89.07
Experiments | EM 92.93 78.26
Experiments | SYMM 95.90 79.04
Experiments | MERGE 96.06 83.78
Experiments | CLUST 97.03 85.79
Experiments | Finally, we consider the full version of our model, CLUST, with 20 language clusters.
Results | Figure 3: Confusion matrix for CLUST (left) and EM (right). |
Results | Both MERGE and CLUST break symmetries over tags by way of the asymmetric posterior over transition Dirichlet parameters. |
Abstract | The resulting clusterings are then used in training partially class-based language models.
Distributed Clustering | The clusterings generated in each iteration as well as the initial clustering are stored as the set of words in each cluster, the total number of occurrences of each cluster in the training corpus, and the list of words preceding each cluster.
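Distributed Clustering | A minimal sketch of one way to hold this per-iteration state (the names and layout here are illustrative, not taken from the paper); all three stored quantities can be built in a single pass over the corpus bigrams:

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass
class ClusteringState:
    """The three stored quantities named in the text."""
    words_in_cluster: dict   # cluster id -> set of member words
    cluster_count: Counter   # cluster id -> total occurrences in the corpus
    predecessors: dict       # cluster id -> Counter of words preceding it

def build_state(bigrams, word2cluster):
    """bigrams: iterable of (prev_word, word) pairs from the training corpus."""
    words_in_cluster = defaultdict(set)
    cluster_count = Counter()
    predecessors = defaultdict(Counter)
    for prev, word in bigrams:
        c = word2cluster[word]
        words_in_cluster[c].add(word)
        cluster_count[c] += 1
        predecessors[c][prev] += 1
    return ClusteringState(dict(words_in_cluster), cluster_count, dict(predecessors))
```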
Distributed Clustering | The quality of class-based models trained using the resulting clusterings did not differ noticeably from those trained using clusterings for which the full vocabulary was considered in each iteration. |
Experiments | We trained a number of predictive class-based language models on different Arabic and English corpora using clusterings trained on the complete data of the same corpus. |
Experiments | For the first experiment we trained predictive class-based 5-gram models using clusterings with 64, 128, 256, and 512 clusters on the en target data.
Discussion and Related Work | One advantage of the two-stage approach is that the same clusterings may be used for different problems or different components of the same system. |
Discussion and Related Work | One nagging issue with K-Means clustering is how to set k. We show that this question may not need to be answered because we can use clusterings with different k’s at the same time and let the discriminative classifier cherry-pick the clusters at different granularities according to the supervised data.
Named Entity Recognition | We can easily use multiple clusterings in feature extraction. |
Query Classification | When we extract features from multiple clusterings, the selection of the top-N clusters is done separately for each clustering.
Query Classification | The best result is achieved with multiple phrasal clusterings.
Evaluation | We evaluate our methods in two quantitative ways by measuring the degree to which we recover two different sets of gold-standard clusterings.
Evaluation | To measure the similarity between the two clusterings of movie characters, gold clusters Q and induced latent persona clusters C, we calculate the variation of information (Meila, 2007): VI(C, Q) = H(C|Q) + H(Q|C).
Evaluation | VI measures the information-theoretic distance between the two clusterings: a lower value means greater similarity, and VI = 0 if they are identical.
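Evaluation | A minimal sketch of VI for hard clusterings, assuming each clustering is given as a flat list of labels over the same items (the function name is ours):

```python
import math
from collections import Counter

def variation_of_information(c_labels, q_labels):
    """VI(C, Q) = H(C|Q) + H(Q|C); 0 iff the partitions are identical."""
    n = len(c_labels)
    pc, pq = Counter(c_labels), Counter(q_labels)
    joint = Counter(zip(c_labels, q_labels))
    vi = 0.0
    for (c, q), n_cq in joint.items():
        p = n_cq / n
        # accumulate -p(c,q) * [log p(c,q)/p(q) + log p(c,q)/p(c)] over cells
        vi -= p * (math.log(p / (pq[q] / n)) + math.log(p / (pc[c] / n)))
    return vi

# variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]) == 0.0  (identical partitions)
```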
Experimental Setup | We run four different clusterings for each base set size (except for the large sets, see below). |
Experimental Setup | The unique-event clusterings are motivated by the fact that in the Dupont-Rosenfeld model, frequent events are handled by discounted ML estimates.
Experimental Setup | As we will see below, rare-event clusterings perform better than all-event clusterings.
Results | When comparing all-event and unique-event clusterings, a clear tendency is apparent.
Clustering phrase pairs directly using the K-means algorithm | Using multiple word clusterings simultaneously, each based on a different number of classes, could turn this global, hard tradeoff into a local, soft one, informed by the number of phrase pair instances available for a given granularity. |
Clustering phrase pairs directly using the K-means algorithm | In the same fashion, we can incorporate multiple tagging schemes (e.g., word clusterings of different granularities) into the same feature vector.
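Clustering phrase pairs directly using the K-means algorithm | One way to realize such a combined feature vector is to emit one indicator feature per clustering, so a learner can weight coarse and fine granularities independently (a sketch; the feature-template names are invented for illustration):

```python
def cluster_features(word, clusterings):
    """clusterings: dict mapping a scheme name (e.g. 'k64', 'k512')
    to a word -> cluster-id dict."""
    feats = []
    for name, mapping in clusterings.items():
        cid = mapping.get(word, "<unk>")   # back off for out-of-vocabulary words
        feats.append(f"{name}={cid}")      # e.g. "k64=17", "k512=203"
    return feats
```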
Experiments | Figure 1 (left) shows the performance of the distributional clustering model (‘Clust’) and its morphology-sensitive extension (‘Clust-morph’) according to this score for varying values of N = 1, ..., 36 (the number of Penn Treebank POS tags, used for the ‘POS’ models, is 36).
Experiments | For ‘Clust’, we see a comfortably wide plateau of nearly identical scores from N = 7, ...
Inference | For each pair of predicates, we search for the clusterings that maximize the sum of the log-probability and the negated penalty term.
Introduction | For predicates present in both sides of a bitext, we guide models in both languages to prefer clusterings which maximize agreement between predicate argument structures predicted for each aligned predicate pair. |
Monolingual Model | Now, when the parameters and argument key clusterings are chosen, we can summarize the remainder of the generative story as follows.
Problem Definition | The objective of this work is to improve argument key clusterings by inducing them simultaneously in two languages. |
Experiments | Finally, we consider mention properties derived from unsupervised clusterings; these properties are designed to target semantic properties of nominals that should behave more like the oracle features than the phi features do.
Experiments | We consider clusterings that take as input pairs (n, r) of a noun head n and a string r which contains the semantic role of n (or some approximation thereof) conjoined with its governor.
Experiments | We use four different clusterings in our experiments, each with twenty clusters: dependency-parse-derived NAIVEBAYES clusters, semantic-role-derived CONDITIONAL clusters, SRL-derived NAIVEBAYES clusters generating a NOVERB token when r cannot be determined, and SRL-derived NAIVEBAYES clusters with all pronoun tuples discarded.
Related Work | Their system could be extended to handle property information as we do, but our system has many other advantages, such as freedom from a pre-specified list of entity types, the ability to use multiple input clusterings, and discriminative projection of clusters.
Experimental Setup | In the True Clusterings setting, we use the annotations to create perfect partitions of the DAs for input to the system; in the System Clusterings setting, we employ a hierarchical agglomerative clustering algorithm used for this task in (Wang and Cardie, 2011).
Results | Table 3 indicates that, with both true clusterings and system clusterings, our system trained on out-of-domain data achieves comparable performance with the same system trained on in-domain data.
Results | We randomly select 15 decision and 15 problem DA clusters (true clusterings).
Consensus Clustering | Our model gives a distribution over phylogenies p (given observations x and learned parameters Θ), and thus gives a posterior distribution over clusterings e, which can be used to answer various queries.
Consensus Clustering | More similar clusterings achieve larger R, with R(e', e) = 1 iff e' = e. In all cases, 0 ≤ R(e', e) = R(e, e') ≤ 1.
Consensus Clustering | As explained above, the s_ij are coreference probabilities that can be estimated from a sample of clusterings e.
Experiments | For PHYLO, the entity clustering is the result of (1) training the model using EM, (2) sampling from the posterior to obtain a distribution over clusterings, and (3) finding a consensus clustering.
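Experiments | A sketch of steps (2)-(3) under simplifying assumptions: estimate the pairwise coreference probabilities s_ij from posterior samples, then link pairs whose probability clears a threshold (the paper's actual consensus criterion may differ; all names here are ours):

```python
import itertools
from collections import defaultdict

def coref_probabilities(samples):
    """samples: list of clusterings, each a dict mention -> cluster id.
    Returns s[i, j]: fraction of samples in which i and j share a cluster."""
    mentions = list(samples[0])
    s = defaultdict(float)
    for clustering in samples:
        for i, j in itertools.combinations(mentions, 2):
            if clustering[i] == clustering[j]:
                s[i, j] += 1.0 / len(samples)
    return s

def consensus(samples, threshold=0.5):
    """Greedy transitive closure over pairs with s_ij >= threshold."""
    s = coref_probabilities(samples)
    parent = {m: m for m in samples[0]}
    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), p in s.items():
        if p >= threshold:
            parent[find(i)] = find(j)
    return {m: find(m) for m in parent}
```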
Background 2.1 Dependency parsing | By using prefixes of various lengths, we can produce clusterings of different granularities (Miller et al., 2004). |
Feature design | Following Miller et al. (2004), we use prefixes of the Brown cluster hierarchy to produce clusterings of varying granularity.
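Feature design | A minimal sketch of such prefix features, assuming a map from words to their bit-string paths in the Brown hierarchy; the particular prefix lengths are illustrative:

```python
def brown_prefix_features(word, brown_paths, lengths=(4, 6, 10, 20)):
    """brown_paths: word -> bit-string path, e.g. 'dog' -> '10110001011'.
    Shorter prefixes give coarser clusterings, longer ones finer."""
    path = brown_paths.get(word)
    if path is None:
        return ["brown=<unk>"]
    return [f"brown{k}={path[:k]}" for k in lengths if len(path) >= k]
```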
Feature design | One possible explanation is that the clusterings generated by the Brown algorithm can be noisy or only weakly relevant to syntax; thus, the clusters are best exploited when “anchored” to words or parts of speech. |
Clustering Methods, Evaluation Metrics and Experimental Setup | We make use of these processes in all our experiments and systematically compute cluster labelling and feature maximisation on the output clusterings.
Clustering Methods, Evaluation Metrics and Experimental Setup | As we shall see, this permits distinguishing between clusterings with similar F-measure but lower “linguistic plausibility” (cf. |
Clustering Methods, Evaluation Metrics and Experimental Setup | Following (Sun et al., 2010), we use modified purity (mPUR), weighted class accuracy (ACC), and F-measure to evaluate the clusterings produced.
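Clustering Methods, Evaluation Metrics and Experimental Setup | A sketch of mPUR under the usual reading of Sun et al. (2010): each cluster is credited with its prevalent gold class, and clusters whose prevalent class covers only one verb count as errors (the exact normalization may differ in the paper):

```python
from collections import Counter

def modified_purity(clusters, gold):
    """clusters: cluster id -> list of verbs; gold: verb -> gold class."""
    n_verbs = sum(len(verbs) for verbs in clusters.values())
    correct = 0
    for verbs in clusters.values():
        prevalent = Counter(gold[v] for v in verbs).most_common(1)[0][1]
        if prevalent >= 2:   # singleton prevalent classes are treated as errors
            correct += prevalent
    return correct / n_verbs
```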
Word Clustering | This joint minimization for the clusterings for both languages clearly has no benefit since the two terms of the objective are independent. |
Word Clustering | Using this weighted vocabulary alignment, we state an objective that encourages clusterings to have high average mutual information when alignment links are followed; that is, on average how much information does knowing the cluster of a word x ∈ Σ impart about the clustering of y ∈ Ω, and vice versa?
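Word Clustering | The averaged quantity can be sketched as the mutual information between source- and target-side cluster variables under the link-weighted alignment distribution (a sketch of the idea, not the paper's exact objective; all names are ours):

```python
import math
from collections import Counter

def alignment_mutual_information(links, src_cluster, tgt_cluster):
    """links: iterable of (src_word, tgt_word, weight) alignment links."""
    joint, total = Counter(), 0.0
    for x, y, w in links:
        joint[src_cluster[x], tgt_cluster[y]] += w
        total += w
    px, py = Counter(), Counter()
    for (cx, cy), w in joint.items():
        px[cx] += w / total
        py[cy] += w / total
    return sum((w / total) * math.log((w / total) / (px[cx] * py[cy]))
               for (cx, cy), w in joint.items())
```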
Word Clustering | We compare two different clusterings of a two-sentence Arabic-English parallel corpus (the English half of the corpus contains the same sentence, twice, while the Arabic half has two variants with the same meaning). |
Related Work 2.1 WordNet-based Approach | The key idea is to use the latent clusterings to take the place of WordNet semantic classes. |
Related Work 2.1 WordNet-based Approach | The latent clusterings are automatically derived from distributional data using the EM algorithm.
Related Work 2.1 WordNet-based Approach | Recently, more sophisticated methods for SP have been developed based on topic models, where the latent variables (topics) take the place of semantic classes and distributional clusterings (Seaghdha, 2010; Ritter et al., 2010).