SciSurf: Index of 'Multilingual Harvesting of Cross-Cultural Stereotypes'

Multilingual Harvesting of Cross-Cultural Stereotypes

Veale, Tony and Hao, Yanfen and Li, Guofu

Published in Proc. ACL, 2008

Article Structure

Abstract

People rarely articulate explicitly what a native speaker of a language is already assumed to know.

Introduction

Direct perception of our environment is just one of the ways we can acquire knowledge of the world.

Related Work

Text-based approaches to knowledge acquisition range from the ambitiously comprehensive, in which an entire text or resource is fully parsed and analyzed in depth, to the surgically precise, in which highly-specific text patterns are used to eke out correspondingly specific relationships from a large corpus.

Harvesting Knowledge from Similes: English and Chinese

Because similes are containers of culturally-received knowledge, we can reasonably expect the most commonly used similes to vary significantly from language to language, especially when those languages correspond to very different cultures.

Tagging and Mapping of Similes

In each case, the harvesting processes for English and for Chinese allow us to acquire stereotypi-

Empirical Evaluation: Simile-derived Representations

Stereotypes persist in language and culture because they are, more often than not, cognitively useful: by emphasizing the most salient aspects of a concept, a stereotype acts as a dense conceptual description that is easily communicated, widely shared, and which supports rapid inference.

Conclusions

Knowledge-acquisition from texts can be a process fraught with complexity: such texts - especially web-based texts - are frequently under-determined and vague; highly ambiguous, both lexically and structurally; and dense with figures of speech, hyperbolae and irony.

Topics

WordNet

Appears in 14 sentences as: WordNet (14) WordNet’s (1)

In Multilingual Harvesting of Cross-Cultural Stereotypes

(1999), in which each of the textual glosses in WordNet (Fellbaum, 1998) is linguistically analyzed to yield a sense-tagged logical form, is an example of the former approach.
Page 2, “Related Work”
Almuhareb and Poesio go on to demonstrate that the values and attributes that are found for word-concepts on the web yield a sufficiently rich representation for these word-concepts to be automatically clustered into a form resembling that assigned by WordNet (see Fellbaum, 1998).
Page 3, “Related Work”
Veale and Hao (2007) use the Google API in conjunction with Princeton WordNet (Fellbaum, 1998) as the basis of their harvesting system.
Page 3, “Harvesting Knowledge from Similes: English and Chinese”
They first extracted a list of antonymous adjectives, such as “hot” or “cold”, from WordNet, the intuition being that explicit similes will tend to exploit properties that occupy an exemplary point on a scale.
Page 3, “Harvesting Knowledge from Similes: English and Chinese”
To harvest a comparable body of Chinese similes from the web, we also use the Google API, in conjunction with both WordNet and HowNet (Dong and Dong, 2006).
Page 4, “Harvesting Knowledge from Similes: English and Chinese”
We employ the same two-phase design as Veale and Hao: an initial set of Chinese adjectives is extracted from HowNet, with the stipulation that their English translations (as given by HowNet) are also categorized as adjectives in WordNet.
Page 4, “Harvesting Knowledge from Similes: English and Chinese”
In the case of English similes, Veale and Hao (2007) describe how two English similes “as A as N1” and “as A as N2” will be mutually disambiguating if N1 and N2 are synonyms in WordNet, or if some sense of N1 is a hypernym or hyponym of some sense of N2 in WordNet.
Page 5, “Tagging and Mapping of Similes”
For instance, though HowNet has a much shallower hierarchical organization than WordNet, it compensates by encapsulating the meaning of different word senses using simple logical formulae of semantic primitives, or sememes, that are derived from the meaning of common Chinese characters.
Page 6, “Tagging and Mapping of Similes”
WordNet and HowNet thus offer two complementary levels or granularities of generalization that can be exploited as the context demands.
Page 6, “Tagging and Mapping of Similes”
Unlike WordNet, HowNet organizes its adjectival senses hierarchically, allowing one to obtain a weaker form of a given description by climbing the hierarchy, or to obtain a stronger form by descending the hierarchy from a particular sense.
Page 6, “Tagging and Mapping of Similes”
Almuhareb and Poesio (2004) used as their experimental basis a sampling of 214 English nouns from 13 of WordNet’s upper-level semantic categories, and proceeded to harvest adjectival features for these noun-concepts from the web using the textual pattern “[a | an | the] * C [is | was]”.
Page 7, “Empirical Evaluation: Simile-derived Representations”
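
The textual pattern quoted in the last entry above, “[a | an | the] * C [is | was]”, can be approximated offline with a regular expression. The following is a minimal sketch, not the authors' implementation: it assumes the wildcard stands for exactly one word, and the function names and the sample text are hypothetical.

```python
import re


def attribute_pattern(concept: str) -> re.Pattern:
    """Compile a regex for '[a|an|the] * C [is|was]'.

    Assumption: the '*' wildcard is treated as exactly one word.
    """
    return re.compile(
        r"\b(?:a|an|the)\s+(\w+)\s+" + re.escape(concept) + r"\s+(?:is|was)\b",
        re.IGNORECASE,
    )


def harvest(concept: str, text: str) -> list[str]:
    """Return the words that fill the wildcard slot for this concept."""
    return [m.group(1).lower() for m in attribute_pattern(concept).finditer(text)]


# Hypothetical sample text standing in for web search snippets.
sample = "The fierce lion is hungry. A lazy lion was asleep. An old tiger is here."
print(harvest("lion", sample))  # ['fierce', 'lazy']
```

A real harvester would run such a pattern over web query results and then filter the captured words, since the wildcard slot can also match words that are not adjectives.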

Chinese words

Appears in 5 sentences as: Chinese word (2) Chinese words (3)

In Multilingual Harvesting of Cross-Cultural Stereotypes

HowNet is a bilingual lexical ontology that associates English and Chinese word labels with an underlying set of approximately 100,000 lexical concepts.
Page 4, “Harvesting Knowledge from Similes: English and Chinese”
Thus, the Chinese word “著名” can translate as “celebrated”, “famous”, “well-known” and “reputable”, but all four of these possible senses, given by celebrated|著名, famous|著名, well-known|著名
Page 5, “Tagging and Mapping of Similes”
For while the words used in any given simile are likely to be ambiguous (in the case of one-character Chinese words, highly so), it would seem unlikely that an incorrect translation of a web simile would also be found on the web.
Page 5, “Tagging and Mapping of Similes”
One significant reason for this kind of omission is not cultural difference, but obviousness: many Chinese words are multi-character gestalts of different ideas (see Packard, 2000), so that these ideas form an explicit part of the orthography of a lexical concept.
Page 6, “Tagging and Mapping of Similes”
We simply look for pairs of HowNet senses of the form Noun|XYZ and Adj|X, where X and XYZ are Chinese words and the simile “as Adj as a|an Noun” is found in the English simile set.
Page 6, “Tagging and Mapping of Similes”
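
The sense-pair lookup described in the last entry above can be sketched as a nested filter over bilingual labels. This is an illustrative sketch only: it assumes that the notation Noun|XYZ and Adj|X means the adjective's Chinese label X occurs inside the noun's Chinese label XYZ, and every entry in the toy data is hypothetical rather than taken from HowNet or from the harvested simile set.

```python
def obvious_pairs(noun_senses, adj_senses, english_similes):
    """Find (adjective, noun) pairs of the form Adj|X and Noun|XYZ.

    noun_senses / adj_senses: lists of (english_label, chinese_label).
    english_similes: set of (adjective, noun) pairs, one per simile
    "as Adj as a|an Noun".
    Assumption: X must be a proper substring of XYZ.
    """
    pairs = []
    for noun_en, noun_zh in noun_senses:
        for adj_en, adj_zh in adj_senses:
            overlaps = adj_zh in noun_zh and adj_zh != noun_zh
            if overlaps and (adj_en, noun_en) in english_similes:
                pairs.append((adj_en, noun_en, adj_zh, noun_zh))
    return pairs


# Hypothetical toy data, for illustration only.
nouns = [("freezer", "冷库"), ("blackboard", "黑板"), ("snow", "雪")]
adjs = [("cold", "冷"), ("black", "黑"), ("white", "白")]
similes = {("cold", "freezer"), ("white", "snow")}

print(obvious_pairs(nouns, adjs, similes))
```

In this toy data only the cold/freezer pair satisfies both conditions: black/blackboard overlaps orthographically but has no English simile, while white/snow has a simile but no shared character.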

feature set

Appears in 4 sentences as: feature set (4) feature sets (1)

In Multilingual Harvesting of Cross-Cultural Stereotypes

As noted by the latter authors, this results in a much smaller yet more diagnostic feature set for each concept.
Page 3, “Related Work”
Suspecting that a noisy feature set had contributed to the apparent drop in performance, these authors then proceed to apply a variety of noise filters to reduce the set of feature values to 51,345, which in turn leads to an improved cluster purity measure of 62.7%.
Page 7, “Empirical Evaluation: Simile-derived Representations”
In experiment 2, we see a similar ratio of feature quantities before filtering; after some initial filtering, Almuhareb and Poesio reduce their feature set to just under 10 times the size of the simile-derived feature set.
Page 8, “Empirical Evaluation: Simile-derived Representations”
First, the feature representations do not need to be hand-filtered and noise-free to be effective; we see from the above results that the raw values extracted from the simile pattern prove slightly more effective than filtered feature sets used by Almuhareb and Poesio.
Page 8, “Empirical Evaluation: Simile-derived Representations”
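
The entries above compare feature sets by cluster purity (for example, the 62.7% figure). The excerpts do not define the measure, so the sketch below assumes the standard definition: each cluster is credited with its most frequent gold category, and the credits are summed and divided by the number of items. The function name and the toy data are hypothetical.

```python
from collections import Counter


def cluster_purity(assignments, gold):
    """Purity under the standard definition (an assumption here).

    assignments: dict mapping item -> cluster id.
    gold: dict mapping item -> gold category.
    """
    if not assignments:
        raise ValueError("no items to score")
    # Group the gold labels of the items that fall in each cluster.
    clusters = {}
    for item, cluster_id in assignments.items():
        clusters.setdefault(cluster_id, []).append(gold[item])
    # Credit each cluster with the size of its majority category.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in clusters.values()
    )
    return majority_total / len(assignments)


# Hypothetical toy clustering of six nouns into two clusters.
assignments = {"dog": 0, "cat": 0, "car": 0, "bus": 1, "train": 1, "horse": 1}
gold = {"dog": "animal", "cat": "animal", "horse": "animal",
        "car": "vehicle", "bus": "vehicle", "train": "vehicle"}
print(round(cluster_purity(assignments, gold), 3))  # 0.667
```

Each toy cluster holds two items of its majority category and one stray, so four of the six items are credited.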

logical form

Appears in 3 sentences as: logical form (3)

In Multilingual Harvesting of Cross-Cultural Stereotypes

(1999), in which each of the textual glosses in WordNet (Fellbaum, 1998) is linguistically analyzed to yield a sense-tagged logical form, is an example of the former approach.
Page 2, “Related Work”
and reputable|著名, are associated with the same logical form in HowNet, which defines them as a specialization of ReputationValue|名声值.
Page 5, “Tagging and Mapping of Similes”
This allows us to safely identify “著名” with this logical form.
Page 5, “Tagging and Mapping of Similes”
