Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition

We present a novel approach to deciding whether two sentences hold a paraphrase relationship.

The problem of modeling paraphrase relationships between natural language utterances (McKeown, 1979) has recently attracted interest.

Since our task is a classification problem, we require our model to provide an estimate of the posterior probability of the relationship (i.e., “paraphrase,” denoted p, or “not paraphrase,” denoted n), given the pair of sentences. Here, pQ denotes model probabilities, c is a relationship class (p or n), and s1 and s2 are the two sentences.
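
As a minimal sketch of this classification step, assume the posterior over the two classes is obtained by Bayes' rule from a class prior and a class-conditional score for the sentence pair. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def class_posterior(log_prior, log_likelihood):
    """Return p(c | s1, s2) for each class c, given log p(c) and
    log p(s1, s2 | c), by normalizing the joint in log space."""
    joint = {c: log_prior[c] + log_likelihood[c] for c in log_prior}
    top = max(joint.values())
    # log-sum-exp keeps the normalization stable for very small likelihoods
    log_z = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {c: math.exp(v - log_z) for c, v in joint.items()}

# Toy usage: equal priors, the "p" model scores the pair twice as high.
post = class_posterior(
    {"p": math.log(0.5), "n": math.log(0.5)},
    {"p": math.log(0.2), "n": math.log(0.1)},
)
predicted = max(post, key=post.get)
```

Working in log space is a design choice for the sketch: sentence-level likelihoods are typically tiny, so multiplying raw probabilities would underflow.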

Here, we turn to the models Gp and Gn in detail.

In all our experiments, we have used the Microsoft Research Paraphrase Corpus (Dolan et al., 2004; Quirk et al., 2004).

Here we present our experimental evaluation using pQ.

Incorporating structural alignment and surface overlap features inside a single model can make exact inference infeasible.

There is a growing body of research that uses the MSRPC (Dolan et al., 2004; Quirk et al., 2004) to build models of paraphrase.

In this paper, we have presented a probabilistic model of paraphrase incorporating syntax, lexical semantics, and hidden loose alignments between two sentences’ trees.

Appears in 12 sentences as: SVM (13)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- Dolan and Brockett (2005) remark that this corpus was created semiautomatically by first training an SVM classifier on a disjoint annotated 10,000 sentence pair dataset and then applying the SVM on an unseen 49,375 sentence pair corpus, with its output probabilities skewed towards over-identification, i.e., towards generating some false paraphrases. Page 5, “Data and Task”
- The SVM was trained to classify positive and negative examples of paraphrase using SVMlight (Joachims, 1999). Metaparameters, tuned on the development data, were the regularization constant and the degree of the polynomial kernel (chosen in [10^-5, 10^2] and 1 to 5, respectively). Page 6, “Experimental Evaluation”
- It is unsurprising that the SVM performs very well on the MSRPC because of the corpus creation process (see Sec. 4) where an SVM was applied as well, with very similar features and a skewed decision process (Dolan and Brockett, 2005). Page 6, “Experimental Evaluation”
- 2 shows performance achieved by the baseline SVM and variations on pQ on the test set. Page 7, “Experimental Evaluation”
- The development data-selected pQ achieves higher recall by 1 point than Wan et al.’s SVM, but has precision 2 points worse. Page 7, “Experimental Evaluation”
- the more intricate QG to the straightforward SVM. Page 7, “Experimental Evaluation”
- LR (like the QG) provides a probability distribution, but uses surface features (like the SVM). Page 8, “Product of Experts”
- 2; this model is on par with the SVM, though trading recall in favor of precision. Page 8, “Product of Experts”
- We view it as a probabilistic simulation of the SVM more suitable for combination with the QG. Page 8, “Product of Experts”
- This accuracy is significant over pQ under a paired t-test (p < 0.04), but is not significant over the SVM. Page 8, “Product of Experts”

Appears in 11 sentences as: lexical semantic (1) lexical semantics (10)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- The model cleanly incorporates both syntax and lexical semantics using quasi-synchronous dependency grammars (Smith and Eisner, 2006). Page 1, “Abstract”
- Because dependency syntax is still only a crude approximation to semantic structure, we augment the model with a lexical semantics component, based on WordNet (Miller, 1995), that models how words are probabilistically altered in generating a paraphrase. Page 1, “Introduction”
- This combination of loose syntax and lexical semantics is similar to the “Jeopardy” model of Wang et al. Page 1, “Introduction”
- (2007) in treating the correspondences as latent variables, and in using a WordNet-based lexical semantics model to generate the target words. Page 2, “QG for Paraphrase Modeling”
- We use log-linear models three times: for the configuration, the lexical semantics class, and the word. Page 4, “QG for Paraphrase Modeling”
- WordNet relation(s): The model next chooses a lexical semantics relation between s_x(i) and the yet-to-be-chosen word t_i (line 12). Page 4, “QG for Paraphrase Modeling”
- Word: Finally, the target word is randomly chosen from among the set of words that bear the lexical semantic relationship just chosen (line 13). Page 4, “QG for Paraphrase Modeling”
- (2007) designed pkid as an interpolation between a log-linear lexical semantics model and a word model. Page 4, “QG for Paraphrase Modeling”
- We removed the lexical semantics component of the QG, and disallowed the syntactic configurations one by one, to investigate which components of pQ contribute to system performance. Page 7, “Experimental Evaluation”
- The lexical semantics component is critical, as seen by the drop in accuracy from the table (without this component, pQ behaves almost like the “all p” baseline). Page 7, “Experimental Evaluation”
- In this paper, we have presented a probabilistic model of paraphrase incorporating syntax, lexical semantics, and hidden loose alignments between two sentences’ trees. Page 8, “Conclusion”

Appears in 7 sentences as: log-linear (7)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- We use log-linear models three times: for the configuration, the lexical semantics class, and the word. Page 4, “QG for Paraphrase Modeling”
- (2007), we employ a 14-feature log-linear model over all logically possible combinations of the 14 WordNet relations (Miller, 1995). Page 4, “QG for Paraphrase Modeling”
- Similarly to Eq. 14, we normalize this log-linear model based on the set of relations that are nonempty in WordNet for the word s_x(i). Page 4, “QG for Paraphrase Modeling”
- (2007) designed pkid as an interpolation between a log-linear lexical semantics model and a word model. Page 4, “QG for Paraphrase Modeling”
- 14, and the weights of the various features in the log-linear model for the lexical-semantics model. Page 5, “QG for Paraphrase Modeling”
- These features have to be included in estimating pkid, which has log-linear component models (Eq. Page 7, “Product of Experts”
- For these bigram or trigram overlap features, a similar log-linear model has to be normalized with a partition function, which considers the (unnormalized) scores of all possible target sentences, given the source sentence. Page 7, “Product of Experts”
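
The role of the partition function in these sentences can be illustrated with a small sketch. It assumes a generic log-linear model whose normalizer sums only over an explicitly supplied candidate set (for example, the relations that are nonempty for a given word). The feature names, weights, and helper name are hypothetical and do not reproduce the paper's feature set.

```python
import math

def log_linear_distribution(weights, candidates):
    """Normalize exp(w . f(y)) over the supplied candidate outcomes only.

    weights:    dict mapping feature name -> weight
    candidates: dict mapping outcome -> dict of feature name -> value
    """
    if not candidates:
        raise ValueError("need at least one candidate outcome")
    scores = {
        y: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for y, feats in candidates.items()
    }
    top = max(scores.values())
    # Partition function over the restricted candidate set (shifted for stability)
    z = sum(math.exp(s - top) for s in scores.values())
    return {y: math.exp(s - top) / z for y, s in scores.items()}

# Toy usage with two hypothetical relation indicators.
dist = log_linear_distribution(
    {"identical": 2.0, "synonym": 1.0},
    {"identical": {"identical": 1.0}, "synonym": {"synonym": 1.0}},
)
```

The sketch also shows why n-gram overlap features are awkward in such a model: the candidate set would have to range over all possible target sentences rather than a short list of relations.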

Appears in 7 sentences as: WordNet (7)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- Because dependency syntax is still only a crude approximation to semantic structure, we augment the model with a lexical semantics component, based on WordNet (Miller, 1995), that models how words are probabilistically altered in generating a paraphrase. Page 1, “Introduction”
- WordNet relation(s): The model next chooses a lexical semantics relation between s_x(i) and the yet-to-be-chosen word t_i (line 12). Page 4, “QG for Paraphrase Modeling”
- (2007), we employ a 14-feature log-linear model over all logically possible combinations of the 14 WordNet relations (Miller, 1995). Page 4, “QG for Paraphrase Modeling”
- Similarly to Eq. 14, we normalize this log-linear model based on the set of relations that are nonempty in WordNet for the word s_x(i). Page 4, “QG for Paraphrase Modeling”
- html#wn) to WordNet (Miller, 1995) for lemmatization information. Page 6, “Experimental Evaluation”
- We also tried ablating the WordNet relations, and observed that the “identical-word” feature hurt the model the most. Page 7, “Experimental Evaluation”
- This is accomplished by eliminating lines 12 and 13 from the definition of pm and redefining pword to be the unigram word distribution estimated from the Gigaword corpus, as in G0, without the help of WordNet. Page 7, “Experimental Evaluation”

Appears in 6 sentences as: probabilistic model (5) probability model (1)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- This syntactic framework represents a major departure from useful and popular surface similarity features, and the latter are difficult to incorporate into our probabilistic model. Page 1, “Introduction”
- We introduce our probabilistic model in §2. Page 1, “Introduction”
- For the present, consider it a specially-defined probabilistic model that generates sentences with a specific property, like “paraphrases s,” when c = p.) Given s, Gc generates the other sentence in the pair, s′. Page 2, “Probabilistic Model”
- It is never used for parsing or for generation; it is only used as a component in the generative probability model presented in §2 (Eq. Page 5, “QG for Paraphrase Modeling”
- It is quite promising that a linguistically-motivated probabilistic model comes so close to a string-similarity baseline, without incorporating string-local phrases. Page 7, “Experimental Evaluation”
- In this paper, we have presented a probabilistic model of paraphrase incorporating syntax, lexical semantics, and hidden loose alignments between two sentences’ trees. Page 8, “Conclusion”

Appears in 6 sentences as: unigram (7)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- Here α_w is the Good-Turing unigram probability estimate of a word w from the Gigaword corpus (Graff, 2003). Page 4, “QG for Paraphrase Modeling”
- As noted, the distributions pm, the word unigram weights in Eq. Page 5, “QG for Paraphrase Modeling”
- Notice the high lexical overlap between the two sentences (unigram overlap of 100% in one direction and 72% in the other). Page 6, “Data and Task”
- 19 is another true paraphrase pair with much lower lexical overlap (unigram overlap of 50% in one direction and 30% in the other). Page 6, “Data and Task”
- (2006), using features calculated directly from s1 and s2 without recourse to any hidden structure: proportion of word unigram matches, proportion of lemmatized unigram matches, BLEU score (Papineni et al., 2001), BLEU score on lemmatized tokens, F measure (Turian et al., 2003), difference of sentence length, and proportion of dependency relation overlap. Page 6, “Experimental Evaluation”
- This is accomplished by eliminating lines 12 and 13 from the definition of pm and redefining pword to be the unigram word distribution estimated from the Gigaword corpus, as in G0, without the help of WordNet. Page 7, “Experimental Evaluation”

Appears in 5 sentences as: log-linear model (4) log-linear models (1)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- We use log-linear models three times: for the configuration, the lexical semantics class, and the word. Page 4, “QG for Paraphrase Modeling”
- (2007), we employ a 14-feature log-linear model over all logically possible combinations of the 14 WordNet relations (Miller, 1995). Page 4, “QG for Paraphrase Modeling”
- Similarly to Eq. 14, we normalize this log-linear model based on the set of relations that are nonempty in WordNet for the word s_x(i). Page 4, “QG for Paraphrase Modeling”
- 14, and the weights of the various features in the log-linear model for the lexical-semantics model. Page 5, “QG for Paraphrase Modeling”
- For these bigram or trigram overlap features, a similar log-linear model has to be normalized with a partition function, which considers the (unnormalized) scores of all possible target sentences, given the source sentence. Page 7, “Product of Experts”

Appears in 5 sentences as: Treebank (1) treebank (4)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- (2005), trained on sections 2-21 of the WSJ Penn Treebank, transformed to dependency trees following Yamada and Matsumoto (2003). Page 3, “QG for Paraphrase Modeling”
- (The same treebank data were also used to estimate many of the parameters of our model, as discussed in the text.) Page 3, “QG for Paraphrase Modeling”
- 4 is estimated in our model using the transformed treebank (see footnote 4). Page 3, “QG for Paraphrase Modeling”
- We estimate the distributions over dependency labels, POS tags, and named entity classes using the transformed treebank (footnote 4). Page 5, “QG for Paraphrase Modeling”
- 15, and the parameters of the base grammar are fixed using the treebank (see footnote 4) and the Gigaword corpus. Page 5, “QG for Paraphrase Modeling”

Appears in 4 sentences as: POS tag (1) POS tagging (1) POS tags (3)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- For unobserved cases, the conditional probability is estimated by backing off to the parent POS tag and child direction. Page 3, “QG for Paraphrase Modeling”
- We estimate the distributions over dependency labels, POS tags, and named entity classes using the transformed treebank (footnote 4). Page 5, “QG for Paraphrase Modeling”
- The parameters θ to be learned include the class priors, the conditional distributions of the dependency labels given the various configurations, the POS tags given POS tags, the NE tags given NE Page 5, “QG for Paraphrase Modeling”
- model is approximate, because we used different preprocessing tools: MXPOST for POS tagging (Ratnaparkhi, 1996), MSTParser for parsing (McDonald et al., 2005), and Dan Bikel’s interface (http://www. Page 6, “Experimental Evaluation”

Appears in 3 sentences as: dependency tree (1) dependency trees (2)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- In this paper, we adopt a model that posits correspondence between the words in the two sentences, defining it in loose syntactic terms: if two sentences are paraphrases, we expect their dependency trees to align closely, though some divergences are also expected, with some more likely than others. Page 1, “Introduction”
- A dependency tree on a sequence w = (w_1, ..., w_k) is a mapping of indices of words to indices of syntactic parents, τ_p : {1, ..., k} → {0, ..., k}, and a mapping of indices of words to dependency relation types in L, τ_ℓ : {1, ..., k} → L. Page 3, “QG for Paraphrase Modeling”
- (2005), trained on sections 2-21 of the WSJ Penn Treebank, transformed to dependency trees following Yamada and Matsumoto (2003). Page 3, “QG for Paraphrase Modeling”
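
The dependency tree definition quoted in this list can be made concrete with a small sketch that stores the parent mapping and the label mapping as dictionaries keyed by word index, with 0 standing for the root. The class name, field names, and the acyclicity check are illustrative assumptions and are not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class DependencyTree:
    words: list   # words[i - 1] is word i of the sentence
    parent: dict  # word index (1..k) -> parent index (0..k); 0 is the root
    label: dict   # word index (1..k) -> dependency relation type

    def is_well_formed(self):
        """Check that both mappings cover 1..k and every word reaches the root."""
        k = len(self.words)
        indices = set(range(1, k + 1))
        if set(self.parent) != indices or set(self.label) != indices:
            return False
        for i in indices:
            seen = set()
            j = i
            while j != 0:
                # A repeated index means a cycle; an unknown index is out of range.
                if j in seen or j not in indices:
                    return False
                seen.add(j)
                j = self.parent[j]
        return True

# Toy usage: "bark" heads both other words and attaches to the root.
tree = DependencyTree(
    words=["dogs", "bark", "loudly"],
    parent={1: 2, 2: 0, 3: 2},
    label={1: "subj", 2: "root", 3: "adv"},
)
```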

Appears in 3 sentences as: Dynamic Programming (1) dynamic programming (2)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- We next describe a dynamic programming solution for calculating p(τ^t | Gp(τ^s)). Page 3, “QG for Paraphrase Modeling”
- 3.3 Dynamic Programming. Page 3, “QG for Paraphrase Modeling”
- Thus every word generated under G0 aligns to null, and we can simplify the dynamic programming algorithm that scores a tree τ^s under G0: Page 5, “QG for Paraphrase Modeling”

Appears in 3 sentences as: logistic regression (3)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- Furthermore, using a product of experts (Hinton, 2002), we combine the model with a complementary logistic regression model based on state-of-the-art lexical overlap features. Page 1, “Abstract”
- We use a product of experts (Hinton, 2002) to bring together a logistic regression classifier built from n-gram overlap features and our syntactic model. Page 1, “Introduction”
- Probabilistic Lexical Overlap Model: We devised a logistic regression (LR) model incorporating 18 simple features, computed directly from s1 and s2, without modeling any hidden correspondence. Page 8, “Product of Experts”
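
A product of experts multiplies the probabilities that each expert assigns to a class and renormalizes. The sketch below assumes an unweighted product of two class distributions. The function name and the example numbers are hypothetical, and the paper's actual combination and training procedure may differ.

```python
def product_of_experts(p_a, p_b):
    """Combine two distributions over the same classes by multiplying
    their probabilities class by class and renormalizing."""
    if set(p_a) != set(p_b):
        raise ValueError("experts must score the same classes")
    unnormalized = {c: p_a[c] * p_b[c] for c in p_a}
    z = sum(unnormalized.values())
    if z == 0.0:
        raise ValueError("experts assign zero joint mass to every class")
    return {c: v / z for c, v in unnormalized.items()}

# Toy usage: a syntactic expert and a lexical-overlap expert that both lean
# towards "p"; the product is more confident than either one alone.
combined = product_of_experts({"p": 0.6, "n": 0.4}, {"p": 0.7, "n": 0.3})
```

Because the combination is a product, a class that either expert considers very unlikely ends up with low combined probability, which is why both experts need to output proper probability distributions.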

Appears in 3 sentences as: n-gram (4)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- Although paraphrase identification is defined in semantic terms, it is usually solved using statistical classifiers based on shallow lexical, n-gram, and syntactic “overlap” features. Page 1, “Introduction”
- We use a product of experts (Hinton, 2002) to bring together a logistic regression classifier built from n-gram overlap features and our syntactic model. Page 1, “Introduction”
- The features are of the form precision_n (number of n-gram matches divided by the number of n-grams in s1), recall_n (number of n-gram matches divided by the number of n-grams in s2) and F_n (harmonic mean of the previous two features), where 1 ≤ n ≤ 3. Page 8, “Product of Experts”
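
The overlap features described in the last item can be sketched as follows. The sketch assumes whitespace-tokenized input and counts matches with clipped (multiset) intersection, which the quoted sentence does not specify. The function names and feature keys are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_features(s1, s2, max_n=3):
    """precision_n, recall_n and their harmonic mean f_n for 1 <= n <= max_n."""
    feats = {}
    for n in range(1, max_n + 1):
        c1, c2 = Counter(ngrams(s1, n)), Counter(ngrams(s2, n))
        # Assumption: a repeated n-gram matches at most as often as it
        # occurs in the other sentence (multiset intersection).
        matches = sum((c1 & c2).values())
        total1, total2 = sum(c1.values()), sum(c2.values())
        precision = matches / total1 if total1 else 0.0
        recall = matches / total2 if total2 else 0.0
        denom = precision + recall
        feats[f"precision_{n}"] = precision
        feats[f"recall_{n}"] = recall
        feats[f"f_{n}"] = 2 * precision * recall / denom if denom else 0.0
    return feats

# Toy usage on two three-word sentences.
feats = overlap_features("the cat sat".split(), "the cat ran".split())
```

For one tokenization this yields nine values (three per n-gram order); sentences shorter than n simply contribute zeros for that order.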

Appears in 3 sentences as: named entity (5)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- In addition to assuming that dependency parse trees for s and t are observable, we also assume each word w_i comes with POS and named entity tags. Page 3, “QG for Paraphrase Modeling”
- Dependency label, POS, and named entity class: The newly generated target word’s dependency label, POS, and named entity class are drawn from multinomial distributions plab, ppos, and pne that condition, respectively, on the configuration and the POS and named entity class of the aligned source-tree word s_x(i) (lines 9-11). Page 4, “QG for Paraphrase Modeling”
- We estimate the distributions over dependency labels, POS tags, and named entity classes using the transformed treebank (footnote 4). Page 5, “QG for Paraphrase Modeling”

Appears in 3 sentences as: semantic relationship (2) semantics relation (1)

In *Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition*

- WordNet relation(s): The model next chooses a lexical semantics relation between s_x(i) and the yet-to-be-chosen word t_i (line 12). Page 4, “QG for Paraphrase Modeling”
- Word: Finally, the target word is randomly chosen from among the set of words that bear the lexical semantic relationship just chosen (line 13). Page 4, “QG for Paraphrase Modeling”
- We have shown that this model is competitive for determining whether there exists a semantic relationship between them, and can be improved by principled combination with more standard lexical overlap approaches. Page 8, “Conclusion”
