A Fast and Accurate Method for Approximate String Search

This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows.

This paper addresses the following problem, referred to as approximate string search.

Approximate string search has been studied by many researchers.

As an example of approximate string search, we consider candidate generation in spelling correction.

In this section, we introduce how to efficiently perform top k candidate generation.

We have experimentally evaluated our approach in spelling error correction of queries in web search.

In this paper, we have proposed a new method for approximate string search, including spelling error correction, which is both accurate and efficient.

Appears in 13 sentences as: Word Pair (1) word pair (8) Word Pairs (1) word pairs (4)

In *A Fast and Accurate Method for Approximate String Search*

- Figure 1: Example of rule extraction from word pairPage 3, “Model for Candidate Generation”
- If we can apply a set of rules to transform the misspelled word mm to a correct word we in the vocabulary, then we call the rule set a “transformation” for the word pair mm and we.Page 3, “Model for Candidate Generation”
- Note that for a given word pair , it is likely that there are multiple possible transformations for it.Page 3, “Model for Candidate Generation”
- Without loss of generality, we set the maximum number of rules applicable to a word pair to be a fixed number.Page 3, “Model for Candidate Generation”
- As a result, the number of possible transformations for a word pair is finite, and usually limited.Page 3, “Model for Candidate Generation”
- Given word pair (wm, we), let R(wm, we) denote one transformation (a set of rules) that can rewritePage 3, “Model for Candidate Generation”
- This is not a trivial problem, however, because the “true” transformation R* for each word pair and is not given in the training data.Page 4, “Model for Candidate Generation”
- At each step, we first find the best transformation for each word pair based on the current parameters A05)Page 4, “Model for Candidate Generation”
- 5.1 Word Pair MiningPage 6, “Experimental Results”
- Table 1 shows some examples of the mined word pairs .Page 6, “Experimental Results”
- Table 1: Examples of Word PairsPage 6, “Experimental Results”

See all papers in *Proc. ACL 2011* that mention word pair.

See all papers in *Proc. ACL* that mention word pair.

Back to top.

Appears in 6 sentences as: conditional probabilities (1) conditional probability (5)

In *A Fast and Accurate Method for Approximate String Search*

- The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word.Page 1, “Abstract”
- The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction given the misspelled word.Page 2, “Introduction”
- We define the conditional probability distribution of we and R(wm, we) given mm as the following log linear model:Page 4, “Model for Candidate Generation”
- E V is a correction of The objective of training would be to maximize the conditional probability P0122, R(wfn, over the training data.Page 4, “Model for Candidate Generation”
- In this paper, we assume that the transformation that actually generates the correction among all the possible transformations is the one that can give the maximum conditional probability ; the exactly same criterion is also used for fast prediction.Page 4, “Model for Candidate Generation”
- We then choose the sum as a ranking score, which is equivalent to ranking candidates based on their largest conditional probabilities .Page 4, “Model for Candidate Generation”

See all papers in *Proc. ACL 2011* that mention conditional probability.

See all papers in *Proc. ACL* that mention conditional probability.

Back to top.

Appears in 5 sentences as: edit distance (6)

In *A Fast and Accurate Method for Approximate String Search*

- The use of edit distance is a typical example, which exploits operations of character deletion, insertion and substitution.Page 2, “Related Work”
- Some methods generate candidates within a fixed range of edit distance or different ranges for strings with different lengths (Li et al., 2006; Whitelaw et al., 2009).Page 2, “Related Work”
- Other methods make use of weighted edit distance to enhance the representation power of edit distance (Ristad and Yianilos, 1998; Oncina and Sebban, 2005; McCal-lum et al., 2005; Ahmad and Kondrak, 2005).Page 2, “Related Work”
- Conventional edit distance does not take in consideration context information.Page 2, “Related Work”
- For example, people tend to misspell “c” to “s” or “k” depending on contexts, and a straightforward application of edit distance cannot deal with the problem.Page 2, “Related Work”

See all papers in *Proc. ACL 2011* that mention edit distance.

See all papers in *Proc. ACL* that mention edit distance.

Back to top.

Appears in 4 sentences as: generative model (4)

In *A Fast and Accurate Method for Approximate String Search*

- In spelling error correction, Brill and Moore (2000) proposed employing a generative model for candidate generation and a hierarchy of trie structures for fast candidate retrieval.Page 1, “Introduction”
- For example, Brill and Moore (2000) developed a generative model including contextual substitution rules; and Toutanova and Moore (2002) further improved the model by adding pronunciation factors into the model.Page 2, “Related Work”
- Two representative methods were used as baselines: the generative model proposed by (Brill and Moore, 2000) referred to as generative and the logistic regression model proposed by (Okazaki et al., 2008)Page 6, “Experimental Results”
- Usually a discriminative model works better than a generative model , and that seems to be what happens with small k’s.Page 6, “Experimental Results”

See all papers in *Proc. ACL 2011* that mention generative model.

See all papers in *Proc. ACL* that mention generative model.

Back to top.

Appears in 4 sentences as: logistic regression (4)

In *A Fast and Accurate Method for Approximate String Search*

- (2008) proposed using a logistic regression model for approximate dictionary matching.Page 1, “Introduction”
- (2008) utilized substring substitution rules and incorporated the rules into a L1-regularized logistic regression model.Page 2, “Related Work”
- Two representative methods were used as baselines: the generative model proposed by (Brill and Moore, 2000) referred to as generative and the logistic regression model proposed by (Okazaki et al., 2008)Page 6, “Experimental Results”
- When using their method for ranking, we used outputs of the logistic regression model as rank scores.Page 6, “Experimental Results”

See all papers in *Proc. ACL 2011* that mention logistic regression.

See all papers in *Proc. ACL* that mention logistic regression.

Back to top.

Appears in 4 sentences as: regression model (4)

In *A Fast and Accurate Method for Approximate String Search*

- (2008) proposed using a logistic regression model for approximate dictionary matching.Page 1, “Introduction”
- (2008) utilized substring substitution rules and incorporated the rules into a L1-regularized logistic regression model .Page 2, “Related Work”
- Two representative methods were used as baselines: the generative model proposed by (Brill and Moore, 2000) referred to as generative and the logistic regression model proposed by (Okazaki et al., 2008)Page 6, “Experimental Results”
- When using their method for ranking, we used outputs of the logistic regression model as rank scores.Page 6, “Experimental Results”

See all papers in *Proc. ACL 2011* that mention regression model.

See all papers in *Proc. ACL* that mention regression model.

Back to top.

Appears in 3 sentences as: probability distribution (3)

In *A Fast and Accurate Method for Approximate String Search*

- The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word.Page 1, “Abstract”
- The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction given the misspelled word.Page 2, “Introduction”
- We define the conditional probability distribution of we and R(wm, we) given mm as the following log linear model:Page 4, “Model for Candidate Generation”

See all papers in *Proc. ACL 2011* that mention probability distribution.

See all papers in *Proc. ACL* that mention probability distribution.

Back to top.