Semi-Supervised SimHash for Efficient Document Similarity Search
Jiang, Qixia and Sun, Maosong

Article Structure

Abstract

Searching documents that are similar to a query document is an important component in modern information retrieval.

Introduction

Document Similarity Search (DSS) is the task of finding documents similar to a query document in a text corpus or on the web.

Background and Related Works

Suppose we are given a set of N documents, X = {x_i | x_i ∈ R^M}.

Semi-Supervised SimHash

In this section, we present our hashing method, named Semi-Supervised SimHash (S3H).

Experiments

We use two datasets, 20 Newsgroups and the Open Directory Project (ODP), in our experiments.

The direction is determined by concatenating w L times.

Topics

Semi-Supervised

Appears in 8 sentences as: Semi-Supervised (8)
In Semi-Supervised SimHash for Efficient Document Similarity Search
  1. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search.
    Page 1, “Abstract”
  2. Motivated by this, some supervised methods have been proposed to derive effective hash functions from prior knowledge, e.g., Spectral Hashing (Weiss et al., 2009) and Semi-Supervised Hashing (SSH) (Wang et al., 2010a).
    Page 2, “Introduction”
  3. This paper proposes a novel (semi-)supervised hashing method, Semi-Supervised SimHash (S3H), for high-dimensional data similarity search.
    Page 2, “Introduction”
  4. In Section 3, we describe our proposed Semi-Supervised SimHash (S3H).
    Page 2, “Introduction”
  5. 2.3 Semi-Supervised Hashing
    Page 2, “Background and Related Works”
  6. Semi-Supervised Hashing (SSH) (Wang et al., 2010a) is recently proposed to incorporate prior knowledge for better hashing.
    Page 2, “Background and Related Works”
  7. In this section, we present our hashing method, named Semi-Supervised SimHash (S3H).
    Page 3, “Semi-Supervised SimHash”
  8. We have proposed a novel supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search.
    Page 8, “The direction is determined by concatenating w L times.”
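The scheme these snippets describe, fixed random projection directions applied to re-weighted features, can be sketched as follows. This is a minimal illustration under assumed notation, not the authors' released implementation; the toy data and all names here are hypothetical.

```python
import numpy as np

def simhash_fingerprint(x, D, w):
    """L-bit fingerprint: sign of the weighted data projected onto fixed
    random directions D (M x L). The weight vector w relocates the data
    before hashing, which is the role learned weights play in S3H."""
    projections = (w * x) @ D                   # L real-valued projections
    return (projections >= 0).astype(np.uint8)  # binarize to {0,1}^L

def hamming(f1, f2):
    """Similarity search compares fingerprints by Hamming distance."""
    return int(np.sum(f1 != f2))

rng = np.random.default_rng(0)
M, L = 1000, 64                        # feature dimension, code length
D = rng.standard_normal((M, L))        # fixed random projection directions
w = np.ones(M)                         # feature weights (learned in S3H)

x1 = rng.random(M)                     # a document's feature vector
x2 = x1 + 0.01 * rng.random(M)         # near-duplicate of x1
x3 = rng.random(M)                     # unrelated document

f1, f2, f3 = (simhash_fingerprint(x, D, w) for x in (x1, x2, x3))
# near-duplicates agree on almost all bits; unrelated documents differ on many
assert hamming(f1, f2) < hamming(f1, f3)
```

Because the directions D are fixed, only the M weights in w need to be learned, which is what distinguishes this setup from learning L separate hash functions.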


feature weights

Appears in 5 sentences as: feature weight (1) feature weights (4)
In Semi-Supervised SimHash for Efficient Document Similarity Search
  1. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes.
    Page 1, “Abstract”
  2. Unlike SSH, which tries to find a sequence of hash functions, S3H fixes the random projection directions and seeks the optimal feature weights from prior knowledge to relocate the objects such that similar objects have similar fingerprints.
    Page 2, “Introduction”
  3. where w ∈ R^M is the feature weight vector to be determined and d_l is the l-th column of the matrix D.
    Page 3, “Semi-Supervised SimHash”
  4. Instead, S3H seeks the optimal feature weights via L-BFGS, which is still efficient even for very high-dimensional data.
    Page 7, “The direction is determined by concatenating w L times.”
  5. S3H learns the optimal feature weights from prior knowledge to relocate the data such that similar objects have similar fingerprints.
    Page 8, “The direction is determined by concatenating w L times.”
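Snippet 4 notes that the weights are found with L-BFGS, a quasi-Newton method whose memory and per-iteration cost scale linearly with the dimension M, which is what makes it practical for high-dimensional text features. A sketch using SciPy's L-BFGS-B solver on a stand-in smooth objective (the paper's actual S3H objective, combining empirical accuracy and entropy terms, is not reproduced here):

```python
import numpy as np
from scipy.optimize import minimize

M = 10_000                              # high-dimensional feature space
rng = np.random.default_rng(1)
target = rng.standard_normal(M)         # hypothetical optimum

def objective(w):
    # L-BFGS only needs the objective value and its gradient;
    # this quadratic is a placeholder for the S3H objective.
    diff = w - target
    return 0.5 * diff @ diff, diff      # (value, analytic gradient)

res = minimize(objective, x0=np.zeros(M), jac=True, method="L-BFGS-B")
```

Returning the gradient alongside the value (`jac=True`) avoids finite-difference approximations, which would cost M extra function evaluations per step at this dimension.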


labeled data

Appears in 5 sentences as: labeled data (5)
In Semi-Supervised SimHash for Efficient Document Similarity Search
  1. This is implemented by maximizing the empirical accuracy on the prior knowledge (labeled data) and the entropy of hash functions (estimated over labeled and unlabeled data).
    Page 2, “Introduction”
  2. Let X_L = {(x_1, c_1), ..., (x_u, c_u)} be the labeled data, c ∈ {1, ..., C}, x ∈ R^M, and X_U = {x_{u+1}, ..., x_N} the unlabeled data.
    Page 3, “Semi-Supervised SimHash”
  3. Given the labeled data X_L, we construct two sets, the attraction set Θ_a and the repulsion set Θ_r.
    Page 3, “Semi-Supervised SimHash”
  4. Furthermore, we also hope to maximize the empirical accuracy on the labeled data Θ_a and Θ_r and
    Page 3, “Semi-Supervised SimHash”
  5. This is implemented by maximizing the empirical accuracy on labeled data together with the entropy of hash functions.
    Page 8, “The direction is determined by concatenating w L times.”
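The attraction/repulsion construction of snippet 3 can be sketched as follows (a hypothetical minimal version: same-class pairs attract, different-class pairs repel; the paper's exact pair-selection details are not shown in these snippets):

```python
from itertools import combinations

def build_pair_sets(labeled):
    """Split all pairs of labeled documents into an attraction set
    (same class) and a repulsion set (different classes).
    `labeled` is a list of (doc_id, class_label) tuples."""
    attract, repel = set(), set()
    for (i, ci), (j, cj) in combinations(labeled, 2):
        (attract if ci == cj else repel).add((i, j))
    return attract, repel

labeled = [(0, "sports"), (1, "sports"), (2, "politics")]
attract, repel = build_pair_sets(labeled)
# attract == {(0, 1)}; repel == {(0, 2), (1, 2)}
```

The empirical-accuracy term then rewards hash codes that agree on pairs in the attraction set and disagree on pairs in the repulsion set.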


LDA

Appears in 4 sentences as: LDA (3) LDA’s (1)
In Semi-Supervised SimHash for Efficient Document Similarity Search
  1. Clearly, Equation (12) is analogous to Linear Discriminant Analysis (LDA) (Duda et al., 2000), except for the following differences: 1) measurement.
    Page 5, “Semi-Supervised SimHash”
  2. S3H uses similarity while LDA uses distance.
    Page 5, “Semi-Supervised SimHash”
  3. As a result, the objective function of S3H is just the reciprocal of LDA's.
    Page 5, “Semi-Supervised SimHash”
  4. LDA seeks the best separative direction in the original attribute space.
    Page 5, “Semi-Supervised SimHash”
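Snippets 2–3 can be sketched symbolically as follows. The notation is illustrative only (S_b, S_w are LDA's between- and within-class scatter matrices; the tilded matrices stand in for similarity-based analogues), and this is not the paper's exact Equation (12):

```latex
% LDA (distance-based): maximize between-class over within-class scatter
J_{\mathrm{LDA}}(\mathbf{w}) \;=\;
  \frac{\mathbf{w}^{\top} S_b \,\mathbf{w}}{\mathbf{w}^{\top} S_w \,\mathbf{w}}
% S3H (similarity-based): per the snippet, the objective takes the
% reciprocal form, i.e., within-class (attraction) similarity over
% between-class (repulsion) similarity
J_{\mathrm{S3H}}(\mathbf{w}) \;\propto\;
  \frac{\mathbf{w}^{\top} \tilde{S}_w \,\mathbf{w}}{\mathbf{w}^{\top} \tilde{S}_b \,\mathbf{w}}
```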


SVD

Appears in 3 sentences as: SVD (3)
In Semi-Supervised SimHash for Efficient Document Similarity Search
  1. empirical accuracy on prior knowledge and maximum entropy by finding the top L eigenvectors of an extended covariance matrix via PCA or SVD.
    Page 3, “Background and Related Works”
  2. However, despite the potential problems of numerical stability, SVD requires massive computational space and O(M^3) computational time, where M is the feature dimension, which limits its use for high-dimensional data (Trefethen et al., 1997).
    Page 3, “Background and Related Works”
  3. This is mainly because SSH requires SVD to find the optimal hash functions, which is computationally expensive.
    Page 7, “The direction is determined by concatenating w L times.”


unlabeled data

Appears in 3 sentences as: unlabeled data (3)
In Semi-Supervised SimHash for Efficient Document Similarity Search
  1. This is implemented by maximizing the empirical accuracy on the prior knowledge (labeled data) and the entropy of hash functions (estimated over labeled and unlabeled data).
    Page 2, “Introduction”
  2. Let X_L = {(x_1, c_1), ..., (x_u, c_u)} be the labeled data, c ∈ {1, ..., C}, x ∈ R^M, and X_U = {x_{u+1}, ..., x_N} the unlabeled data.
    Page 3, “Semi-Supervised SimHash”
  3. the labeled and unlabeled data, X_L and X_U.
    Page 4, “Semi-Supervised SimHash”
