Post-Retrieval Clustering Using Third-Order Similarity Measures
Jose G. Moreno, Gaël Dias and Guillaume Cleuziou

Article Structure

Abstract

Post-retrieval clustering is the task of clustering Web search results.

Introduction

Post-retrieval clustering (PRC), also known as search results clustering or ephemeral clustering, is the task of clustering Web search results.

Polythetic Post-Retrieval Clustering

K-means is a geometric clustering algorithm (Lloyd, 1982).
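As a quick refresher on how Lloyd's algorithm operates, here is a minimal sketch that alternates the assignment and centroid-update steps (function and variable names are ours, for illustration only):

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and update steps."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
               else centroids[j] for j, cl in enumerate(clusters)]
        if new == centroids:  # assignments stabilized: converged
            break
        centroids = new
    return centroids, clusters
```

The geometric character of the algorithm is visible in both steps: distances and means are computed directly in the feature space, which is precisely what breaks down when objects are short text snippets rather than points.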

Stopping Criterion

Once clustering has been performed, the best number of clusters still remains to be selected.
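The paper's specific criterion models the evolution of the objective function, but the general shape of such a decision can be illustrated with a generic elbow-style heuristic. This is an illustrative stand-in, not the authors' actual criterion, and the threshold is an arbitrary assumption:

```python
def pick_k(q_values, flat=0.10):
    """Generic elbow heuristic: given objective values Q(K) for a range
    of K, return the K after which the relative improvement in Q falls
    below `flat`.  Illustrative only -- the paper instead models the
    evolution of its adapted objective function to choose K."""
    ks = sorted(q_values)
    for prev, cur in zip(ks, ks[1:]):
        drop = (q_values[prev] - q_values[cur]) / q_values[prev]
        if drop < flat:  # improvement has levelled off
            return prev
    return ks[-1]
```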

Evaluation

Evaluating PRC systems is a difficult task as stated in (Carpineto et al., 2009).

Conclusions

In this paper, we proposed a new PRC approach which (1) is based on the adaptation of the K-means algorithm to third-order similarity measures and (2) proposes a coherent stopping criterion.

Topics

similarity measure

Appears in 12 sentences as: similarity measure (7) similarity measures (7)
In Post-Retrieval Clustering Using Third-Order Similarity Measures
  1. Within this context, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for NLP tasks.
    Page 1, “Abstract”
  2. feature vectors are hard to define in small collections of short text fragments (Timonen, 2013), (2) existing second-order similarity measures such as the cosine are unadapted to capture the semantic similarity between small texts, (3) Latent Semantic Analysis has evidenced inconclusive results (Osinski and Weiss, 2005) and (4) the labeling process is a surprisingly hard extra task (Carpineto et al., 2009).
    Page 1, “Introduction”
  3. For that purpose, we propose a new methodology that adapts the classical K-means algorithm to a third-order similarity measure initially developed for Topic Segmentation (Dias et al., 2007).
    Page 1, “Introduction”
  4. Within the context of PRC, the K-means algorithm needs to be adapted to integrate third-order similarity measures (Mihalcea et al., 2006; Dias et al., 2007).
    Page 2, “Polythetic Post-Retrieval Clustering”
  5. Third-order similarity measures, also called weighted second-order similarity measures, do not rely on exact matches of word features as classical second-order similarity measures do (e.g., the cosine metric), but rather evaluate similarity based on related matches.
    Page 2, “Polythetic Post-Retrieval Clustering”
  6. In this paper, we propose to use the third-order similarity measure called InfoSimba introduced in (Dias et al., 2007) for Topic Segmentation and implement its simplified version S3 in Equation 2.
    Page 2, “Polythetic Post-Retrieval Clustering”
  7. Given two Web snippets X_i and X_j, their similarity is evaluated by the similarity of their constituents based on any symmetric similarity measure S(., .), where W_ik (resp.
    Page 2, “Polythetic Post-Retrieval Clustering”
  8. A direct consequence of the change in similarity measure is the definition of a new objective function Q_S3 to ensure convergence.
    Page 2, “Polythetic Post-Retrieval Clustering”
  9. So, for each word w ∈ V and any symmetric similarity measure S(., .)
    Page 2, “Polythetic Post-Retrieval Clustering”
  10. In particular, p is the size of the word feature vectors representing both Web snippets and centroids (p = 2..5), K is the number of clusters to be found (K = 2..10) and S(W_ik, W_jl) is the collocation measure integrated in the InfoSimba similarity measure.
    Page 4, “Evaluation”
  11. In this paper, we proposed a new PRC approach which (1) is based on the adaptation of the K-means algorithm to third-order similarity measures and (2) proposes a coherent stopping criterion.
    Page 5, “Conclusions”
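The contrast drawn in items 4 and 5 between second- and third-order measures can be made concrete. The sketch below is our reading of a simplified InfoSimba-style score: every pair of word features across the two snippets contributes, weighted by their vector weights and a word-level similarity S(., .). All names and the exact normalization are assumptions, not the paper's definitions:

```python
def s3(xi, xj, word_sim):
    """Third-order (weighted second-order) similarity between two
    snippets.  Unlike the cosine, no exact word overlap is required:
    related word pairs contribute through word_sim.
    xi, xj : lists of (word, weight) pairs of equal length p.
    word_sim(w1, w2) -> float, any symmetric word-level similarity."""
    p = len(xi)
    total = sum(a * b * word_sim(w1, w2)
                for w1, a in xi for w2, b in xj)
    return total / (p * p)
```

With a word_sim that scores "car" and "auto" as related, two snippets sharing no literal words can still obtain a non-zero similarity, which is exactly the behaviour item 5 describes as "related matches".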

See all papers in Proc. ACL 2013 that mention similarity measure.

See all papers in Proc. ACL that mention similarity measure.

gold standard

Appears in 5 sentences as: gold standard (5)
In Post-Retrieval Clustering Using Third-Order Similarity Measures
  1. Results obtained with the definition of a new stopping criterion over the ODP-239 and the MORESQUE gold standard datasets evidence that our proposal outperforms all reported text-based approaches.
    Page 1, “Abstract”
  2. Second, we propose a comparative evaluation with existing state-of-the-art algorithms over gold standard datasets and recent clustering evaluation metrics.
    Page 4, “Evaluation”
  3. In order to perform this task, we evaluate performance based on the Fbs measure defined in (Amigo et al., 2009) over the ODP-239 gold standard dataset proposed in (Carpineto and Romano, 2010).
    Page 4, “Evaluation”
  4. We propose comparative experiments over two gold standard datasets (ODP-239 (Carpineto and Romano, 2010) and MORESQUE (Di Marco and Navigli, 2013)) for STC (Zamir and Etzioni, 1998), LINGO (Osinski and Weiss, 2005), OPTIMSRC (Carpineto and Romano, 2010) and the Bisecting Incremental K-means (BIK) which may be seen as a baseline for the polythetic paradigm.
    Page 5, “Evaluation”
  5. Results evidenced clear improvements over the evaluated state-of-the-art text-based approaches for two gold standard datasets.
    Page 5, “Conclusions”

See all papers in Proc. ACL 2013 that mention gold standard.

See all papers in Proc. ACL that mention gold standard.

feature vectors

Appears in 4 sentences as: feature vectors (4)
In Post-Retrieval Clustering Using Third-Order Similarity Measures
  1. On the other hand, the polythetic approach, whose main idea is to represent Web snippets as word feature vectors, has received less attention, the only relevant work being (Osinski and Weiss, 2005).
    Page 1, “Introduction”
  2. feature vectors are hard to define in small collections of short text fragments (Timonen, 2013), (2) existing second-order similarity measures such as the cosine are unadapted to capture the semantic similarity between small texts, (3) Latent Semantic Analysis has evidenced inconclusive results (Osinski and Weiss, 2005) and (4) the labeling process is a surprisingly hard extra task (Carpineto et al., 2009).
    Page 1, “Introduction”
  3. Before the clustering process takes place, Web snippets are represented as word feature vectors.
    Page 4, “Evaluation”
  4. In particular, p is the size of the word feature vectors representing both Web snippets and centroids (p = 2..5), K is the number of clusters to be found (K = 2..10) and S(W_ik, W_jl) is the collocation measure integrated in the InfoSimba similarity measure.
    Page 4, “Evaluation”

See all papers in Proc. ACL 2013 that mention feature vectors.

See all papers in Proc. ACL that mention feature vectors.

objective function

Appears in 4 sentences as: objective function (4)
In Post-Retrieval Clustering Using Third-Order Similarity Measures
  1. Finally, the evolution of the objective function of the adapted K-means is modeled to automatically define the “best” number of clusters.
    Page 1, “Introduction”
  2. To ensure convergence, an objective function Q is defined which decreases at each processing step.
    Page 2, “Polythetic Post-Retrieval Clustering”
  3. The classical objective function is defined in Equation (1), where ω_k is a cluster labeled k, x_i ∈ ω_k is an object in the cluster, m_k is the centroid of the cluster ω_k, and E(., .) is the Euclidean distance.
    Page 2, “Polythetic Post-Retrieval Clustering”
  4. A direct consequence of the change in similarity measure is the definition of a new objective function Q_S3 to ensure convergence.
    Page 2, “Polythetic Post-Retrieval Clustering”
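The classical objective of item 3 can be computed directly; a minimal sketch (names are ours, and we use the squared Euclidean distance, the standard form that the assignment and update steps provably decrease):

```python
def objective(clusters, centroids):
    """Classical K-means objective: sum over clusters omega_k of the
    squared Euclidean distance between each member x_i and the
    cluster centroid m_k."""
    q = 0.0
    for m, members in zip(centroids, clusters):
        for x in members:
            q += sum((a - b) ** 2 for a, b in zip(x, m))
    return q
```

Replacing this distance-based sum with one built on a third-order similarity is what forces the redefinition described in item 4.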

See all papers in Proc. ACL 2013 that mention objective function.

See all papers in Proc. ACL that mention objective function.
