Two-Stage Hashing for Fast Document Retrieval
Li, Hao and Liu, Wei and Ji, Heng

Article Structure

Abstract

This work fulfills sublinear time Nearest Neighbor Search (NNS) in massive-scale document collections.

Introduction

A Nearest Neighbor Search (NNS) task aims at searching for top K objects (e. g., documents) which are most similar, based on predefined similarity metrics, to a given query object in an existing dataset.

Background and Terminology

o Binary Codes: A bit (a single bit is “0” or “1”) sequence assigned to represent a data object.

Document Retrieval with Hashing

In this section, we first provide an overview of applying hashing techniques to a document retrieval task, and then introduce two unsupervised hashing algorithms: LSH acts as a neighbor-candidate filter, while ITQ works towards precise reranking over the candidate pool returned by LSH.

Experiments

Data and Evaluations

Conclusion

In this paper, we proposed a novel two-stage unsuperVised hashing framework for efficient and effective nearest neighbor search in massive docu-

Topics

reranking

Appears in 11 sentences as: rerank (2) Reranking (1) reranking (8)
In Two-Stage Hashing for Fast Document Retrieval
  1. LSH accounts for neighbor candidate pruning, while ITQ provides an efficient and effective reranking over the neighbor pool captured by LSH.
    Page 1, “Abstract”
  2. In this section, we first provide an overview of applying hashing techniques to a document retrieval task, and then introduce two unsupervised hashing algorithms: LSH acts as a neighbor-candidate filter, while ITQ works towards precise reranking over the candidate pool returned by LSH.
    Page 2, “Document Retrieval with Hashing”
  3. Hamming Reranking
    Page 3, “Document Retrieval with Hashing”
  4. In this framework, LSH accounts for neighbor candidate pruning, while ITQ provides an efficient and effective reranking over the neighbor pool captured by LSH.
    Page 3, “Document Retrieval with Hashing”
  5. Therefore, a high hash lookup success rate is attained by the LSH stage, while maintaining high search precision due to the ITQ reranking stage.
    Page 4, “Document Retrieval with Hashing”
  6. Another crucial observation is that with ITQ reranking , a small number of LSH hash tables is needed in the pruning step.
    Page 4, “Experiments”
  7. Since the LSH pruning time can be ignored, the search time of the two-stage hashing scheme equals to the time of hamming distance reranking in ITQ codes for all candidates produced from LSH pruning step, e.g., LSH(48bits, 4 tables) +
    Page 4, “Experiments”
  8. 2 (f) shows the ITQ data reranking percentage for different LSH bit lengths and table numbers.
    Page 5, “Experiments”
  9. As the LSH bit length increases or the hash table number decreases, a lower percentage of the candidates will be selected for reranking , and thus costs less search time.
    Page 5, “Experiments”
  10. Generally, higher rerank percentage results in better top-K NNS Precision.
    Page 5, “Experiments”
  11. The reason of the lower performance with large K is that some true neighbors with the same topic label do not share high term similarities and may be filtered out in the LSH step when the rerank percentage is low.
    Page 5, “Experiments”

See all papers in Proc. ACL 2014 that mention reranking.

See all papers in Proc. ACL that mention reranking.

Back to top.

Cosine similarity

Appears in 6 sentences as: Cosine similarity (5) cosine similarity (1)
In Two-Stage Hashing for Fast Document Retrieval
  1. for example, one can store 250 million documents with 1.9G memory using only 64 bits for each document while a large news corpus such as the English Gigaword fifth edition1 stores 10 million documents in a 26G hard drive; 2) the time efficiency of manipulating binary codes, for example, computing the hamming distance between a pair of binary codes is several orders of magnitude faster than computing the real-valued cosine similarity over a pair of document vectors.
    Page 1, “Introduction”
  2. Given a query document vector q, we use the Cosine similarity measure to evaluate the similarity between q and a document a: in a dataset:
    Page 2, “Document Retrieval with Hashing”
  3. However, such a brute-force search does not scale to massive datasets since the search time complexity for each query is 0(n); additionally, the computational cost spent on Cosine similarity calculation is also nontrivial.
    Page 2, “Document Retrieval with Hashing”
  4. (1) It tries to preserve the Cosine similarity of the original data with a probabilistic guarantee (Charikar, 2002).
    Page 3, “Document Retrieval with Hashing”
  5. We first evaluate the quality of term vectors and ITQ binary codes by conducting the whole list Cosine similarity ranking and hamming distance ranking, respectively.
    Page 4, “Experiments”
  6. For each query document, the top-K candidate documents with highest Cosine similarity scores and shortest hamming distances are returned, then we calculate the average precision for each K. Fig.
    Page 4, “Experiments”

See all papers in Proc. ACL 2014 that mention Cosine similarity.

See all papers in Proc. ACL that mention Cosine similarity.

Back to top.

similarity measures

Appears in 4 sentences as: similarity measure (1) similarity measures (3)
In Two-Stage Hashing for Fast Document Retrieval
  1. Furthermore, we make the hashing framework applicable to combine different similarity measures in NNS.
    Page 2, “Introduction”
  2. Given a query document vector q, we use the Cosine similarity measure to evaluate the similarity between q and a document a: in a dataset:
    Page 2, “Document Retrieval with Hashing”
  3. Enable a hybrid hashing scheme combining two similarity measures .
    Page 4, “Document Retrieval with Hashing”
  4. Moreover, our approach can combine two similarity measures in a hybrid hashing scheme, which is beneficial to comprehensively modeling the document similarity.
    Page 5, “Conclusion”

See all papers in Proc. ACL 2014 that mention similarity measures.

See all papers in Proc. ACL that mention similarity measures.

Back to top.