Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
Morin, Emmanuel and Hazem, Amir

Article Structure

Abstract

The main work in bilingual lexicon extraction from comparable corpora is based on the implicit hypothesis that corpora are balanced.

Introduction

The bilingual lexicon extraction task from bilingual corpora was initially addressed by using parallel corpora (i.e.

Bilingual Lexicon Extraction

In this section, we first describe the standard approach that deals with the task of bilingual lexicon extraction from comparable corpora.

Linguistic Resources

In this section, we outline the different textual resources used for our experiments: the comparable corpora, the bilingual dictionary and the terminology reference lists.

Experiments and Results

In this section, we present experiments to evaluate the influence of comparable corpus size and prediction models on the quality of bilingual terminology extraction.

Conclusion

In this paper, we have studied how an unbalanced specialized comparable corpus could influence the quality of the bilingual lexicon extraction.

Topics

co-occurrence

Appears in 13 sentences as: (1) co-occurrence (14)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. For each word $i$ of the source and the target languages, we obtain a context vector $v_i$ which gathers the set of co-occurrence words $j$ associated with the number of times that $j$ and $i$ occur together, $cooc(i, j)$.
    Page 2, “Bilingual Lexicon Extraction”
  2. One way to deal with this problem is to reestimate co-occurrence counts by a prediction function (Hazem and Morin, 2013).
    Page 3, “Bilingual Lexicon Extraction”
  3. This consists in assigning to each observed co-occurrence count of a small comparable corpus a new value learned beforehand from a large training corpus.
    Page 3, “Bilingual Lexicon Extraction”
  4. In order to make co-occurrence counts more discriminant and in the same way as Hazem and Morin (2013), one strategy consists in addressing this problem through regression: given training corpora of small and large size (abundant in the general domain), we predict word co-occurrence counts in order to make them more reliable.
    Page 3, “Bilingual Lexicon Extraction”
  5. We then apply the resulting regression function to each word co-occurrence count as a preprocessing step of the standard approach.
    Page 3, “Bilingual Lexicon Extraction”
  6. We use regression analysis to describe the relationship between word co-occurrence counts in a large corpus (the response variable) and word co-occurrence counts in a small corpus (the predictor variable).
    Page 3, “Bilingual Lexicon Extraction”
  7. As we cannot claim that the prediction of word co-occurrence counts is a linear problem, we consider in addition to the simple linear regression
    Page 3, “Bilingual Lexicon Extraction”
  8. model (Lin), a generalized linear model which is the logistic regression model (Logit) and non linear regression models such as polynomial regression model (Polyn) of order n. Given an input vector $x \in \mathbb{R}^m$, where $x_1, \ldots, x_m$ represent features, we find a prediction $\hat{y} \in \mathbb{R}^m$ for the co-occurrence count of a couple of words $y \in \mathbb{R}$ using one of the regression models presented below:
    Page 4, “Bilingual Lexicon Extraction”
  9. Let us denote by $f$ the regression function and by $cooc(w_i, w_j)$ the co-occurrence count of the words $w_i$ and $w_j$.
    Page 4, “Bilingual Lexicon Extraction”
  10. The aim of this experiment is twofold: first, we want to evaluate the usefulness of predicting word co-occurrence counts and second, we want to find out whether it is more appropriate to apply prediction to the source side, the target side or both sides of the bilingual comparable corpora.
    Page 7, “Experiments and Results”
  11. We applied the same regression function to all co-occurrence counts, whereas learning separate models for low and high frequencies would have been more appropriate.
    Page 7, “Experiments and Results”
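
The sentences above describe two steps: building a context vector of co-occurrence counts for each word, then re-estimating each count with a regression function learned from paired small-corpus and large-corpus counts. A minimal Python sketch of both steps follows. The window size, the toy sentence, the toy training pairs and all function names are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter, defaultdict


def context_vectors(tokens, window=3):
    """Map each word i to a Counter holding cooc(i, j) for words j in a symmetric window."""
    vectors = defaultdict(Counter)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for k in range(lo, hi):
            if k != pos:
                vectors[word][tokens[k]] += 1
    return vectors


def fit_linear(small_counts, large_counts):
    """Least-squares fit of large = a * small + b (simple linear regression)."""
    n = len(small_counts)
    mean_x = sum(small_counts) / n
    mean_y = sum(large_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in small_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(small_counts, large_counts))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def reestimate(vectors, a, b):
    """Replace every observed count c by the predicted value f(c) = a * c + b."""
    return {w: {j: a * c + b for j, c in ctx.items()}
            for w, ctx in vectors.items()}


# Toy data (assumed): one short sentence and four invented training pairs.
tokens = "breast cancer screening detects breast cancer early".split()
vectors = context_vectors(tokens, window=1)
a, b = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])
predicted = reestimate(vectors, a, b)
```

In this sketch the re-estimated vectors would then replace the raw counts as a preprocessing step before the standard approach is applied.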

See all papers in Proc. ACL 2014 that mention co-occurrence.


regression models

Appears in 13 sentences as: (1) regression model (7) Regression Models (1) regression models (9)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. Moreover, we have introduced a regression model that boosts the observations of word co-occurrences used in the context-based projection method.
    Page 1, “Abstract”
  2. To make them more reliable, our second contribution is to contrast different regression models in order to boost the observations of word co-occurrences.
    Page 1, “Introduction”
  3. We then present an extension of this approach based on regression models.
    Page 2, “Bilingual Lexicon Extraction”
  4. First, while they experimented with the linear regression model, we propose to contrast different regression models.
    Page 3, “Bilingual Lexicon Extraction”
  5. As most regression models have already been described in great detail (Christensen, 1997; Agresti, 2007), the derivation of most models is only briefly introduced in this work.
    Page 3, “Bilingual Lexicon Extraction”
  6. model (Lin), a generalized linear model which is the logistic regression model (Logit) and non linear regression models such as polynomial regression model (Polyn) of order n. Given an input vector $x \in \mathbb{R}^m$, where $x_1, \ldots, x_m$ represent features, we find a prediction $\hat{y} \in \mathbb{R}^m$ for the co-occurrence count of a couple of words $y \in \mathbb{R}$ using one of the regression models presented below:
    Page 4, “Bilingual Lexicon Extraction”
  7. Table 6: Results (MAP %) of the standard approach using different regression models on the balanced breast cancer and diabetes corpora
    Page 7, “Experiments and Results”
  8. 4.2.1 Regression Models Comparison
    Page 7, “Experiments and Results”
  9. We contrast the simple linear regression model (Lin) with the second and the third order polynomial regressions (Poly2 and Poly3) and the logistic regression model (Logit).
    Page 7, “Experiments and Results”
  10. can notice that except for the Logit model, all the regression models outperform the baseline (No prediction).
    Page 7, “Experiments and Results”
  11. That said, the gain of regression models is not significant.
    Page 7, “Experiments and Results”
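
To make the contrast between Lin, Poly2 and Poly3 concrete, the sketch below fits polynomials of order 1, 2 and 3 to the same paired counts and compares their fit. It is an illustration only: the training pairs are invented, the mean squared error criterion is an assumption of this sketch (the paper reports MAP on the extraction task instead), and the Logit model is left out.

```python
def fit_polynomial(xs, ys, order):
    """Least-squares polynomial fit; returns coefficients [c0, c1, ..., c_order]."""
    n = order + 1
    # Normal equations (X^T X) c = X^T y, with X[i][k] = xs[i] ** k.
    A = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (rhs[r] - tail) / A[r][r]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def mean_squared_error(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented (small-corpus count, large-corpus count) pairs following x**2 + 1.
small = [1, 2, 3, 4, 5, 6]
large = [2, 5, 10, 17, 26, 37]

models = {"Lin": 1, "Poly2": 2, "Poly3": 3}
errors = {name: mean_squared_error(fit_polynomial(small, large, order), small, large)
          for name, order in models.items()}
```

On this toy data the linear model leaves a visible residual while the polynomial models fit almost exactly. That outcome is a property of the invented data and says nothing about the paper's corpora.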


best result

Appears in 5 sentences as: best result (4) best results (3)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. We chose the balanced corpora where the standard approach has shown the best results in the previous experiment, namely [breast cancer corpus 12] and [diabetes corpus 7].
    Page 7, “Experiments and Results”
  2. We can see that the best results are obtained by the Sourcepred approach for both comparable corpora.
    Page 8, “Experiments and Results”
  3. We can also notice that the Balanced + Prediction approach slightly outperforms the baseline while the Unbalanced + Prediction approach gives the best results.
    Page 8, “Experiments and Results”
  4. Thus, the MAP goes up from 29.6% (best result on the balanced corpora) to 42.3% (best result on the unbalanced corpora) in the breast cancer domain, and from 16.5% to 26.0% in the diabetes domain.
    Page 9, “Conclusion”
  5. Here, the MAP goes up from 42.3% (best result on the unbalanced corpora) to 46.9% (best result on the unbalanced corpora with prediction) in the breast cancer domain, and from 26.0% to 29.8% in the diabetes domain.
    Page 9, “Conclusion”
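
The MAP figures quoted in the last two sentences can be turned into absolute gains with a few lines of arithmetic. The sketch below only restates the numbers given above; the dictionary keys and the function name are assumptions chosen for readability.

```python
# MAP (%) values quoted in the conclusion, per domain and setting.
map_scores = {
    "breast cancer": {"balanced": 29.6, "unbalanced": 42.3,
                      "unbalanced+prediction": 46.9},
    "diabetes": {"balanced": 16.5, "unbalanced": 26.0,
                 "unbalanced+prediction": 29.8},
}


def gains(scores):
    """Absolute MAP gains, in percentage points, between successive settings."""
    return {
        "unbalanced vs balanced":
            round(scores["unbalanced"] - scores["balanced"], 1),
        "prediction vs unbalanced":
            round(scores["unbalanced+prediction"] - scores["unbalanced"], 1),
    }


for domain, scores in map_scores.items():
    print(domain, gains(scores))
```

Read this way, moving from balanced to unbalanced corpora accounts for the larger share of the improvement in both domains, and prediction adds a smaller gain on top.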


linear regression

Appears in 5 sentences as: linear regression (5)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. First, while they experimented with the linear regression model, we propose to contrast different regression models.
    Page 3, “Bilingual Lexicon Extraction”
  2. As we cannot claim that the prediction of word co-occurrence counts is a linear problem, we consider in addition to the simple linear regression
    Page 3, “Bilingual Lexicon Extraction”
  3. model (Lin), a generalized linear model which is the logistic regression model (Logit) and non linear regression models such as polynomial regression model (Polyn) of order n. Given an input vector $x \in \mathbb{R}^m$, where $x_1, \ldots, x_m$ represent features, we find a prediction $\hat{y} \in \mathbb{R}^m$ for the co-occurrence count of a couple of words $y \in \mathbb{R}$ using one of the regression models presented below:
    Page 4, “Bilingual Lexicon Extraction”
  4. We contrast the simple linear regression model (Lin) with the second and the third order polynomial regressions (Poly2 and Poly3) and the logistic regression model (Logit).
    Page 7, “Experiments and Results”
  5. In this experiment, we chose to use the linear regression model (Lin) for the prediction part.
    Page 8, “Experiments and Results”


parallel corpora

Appears in 5 sentences as: parallel corpora (5)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. The bilingual lexicon extraction task from bilingual corpora was initially addressed by using parallel corpora (i.e.
    Page 1, “Introduction”
  2. However, despite good results in the compilation of bilingual lexicons, parallel corpora are scarce resources, especially for technical domains and for language pairs not involving English.
    Page 1, “Introduction”
  3. ung (2004), who range bilingual corpora from parallel corpora to quasi-comparable corpora going through comparable corpora, there is a continuum from parallel to comparable corpora (i.e.
    Page 1, “Introduction”
  4. For instance, the historical context-based projection method (Fung, 1995; Rapp, 1995), known as the standard approach, dedicated to this task seems implicitly to lead to work with balanced comparable corpora in the same way as for parallel corpora (i.e.
    Page 1, “Introduction”
  5. As McEnery and Xiao (2007, p. 21) observe, a specialized comparable corpus is built as balanced by analogy with a parallel corpus: “Therefore, in relation to parallel corpora, it is more likely for comparable corpora to be designed as general balanced corpora”.
    Page 3, “Bilingual Lexicon Extraction”


logistic regression

Appears in 3 sentences as: logistic regression (3)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. model (Lin), a generalized linear model which is the logistic regression model (Logit) and non linear regression models such as polynomial regression model (Polyn) of order n. Given an input vector $x \in \mathbb{R}^m$, where $x_1, \ldots, x_m$ represent features, we find a prediction $\hat{y} \in \mathbb{R}^m$ for the co-occurrence count of a couple of words $y \in \mathbb{R}$ using one of the regression models presented below:
    Page 4, “Bilingual Lexicon Extraction”
  2. We contrast the simple linear regression model (Lin) with the second and the third order polynomial regressions (Poly2 and Poly3) and the logistic regression model (Logit).
    Page 7, “Experiments and Results”
  3. This suggests that both linear and polynomial regressions are suitable as a preprocessing step of the standard approach, while the logistic regression seems to be inappropriate according to the results shown in Table 6.
    Page 7, “Experiments and Results”


significantly outperforms

Appears in 3 sentences as: significantly outperforms (3)
In Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
  1. We can see that the Unbalanced approach significantly outperforms the baseline (Balanced).
    Page 8, “Experiments and Results”
  2. We can also notice that the prediction model applied to the balanced corpus (Balanced + Prediction) slightly outperforms the baseline while the Unbalanced + Prediction approach significantly outperforms the three other approaches (moreover, the variations observed with the Unbalanced approach are lower than those observed with the Unbalanced + Prediction approach).
    Page 8, “Experiments and Results”
  3. As for the previous experiment, we can see that the Unbalanced approach significantly outperforms the Balanced approach.
    Page 8, “Experiments and Results”
