Resolving Entity Morphs in Censored Data
Huang, Hongzhao and Wen, Zhen and Yu, Dian and Ji, Heng and Sun, Yizhou and Han, Jiawei and Li, He

Article Structure

Abstract

In some societies, internet users have to create information morphs (e.g.

Introduction

Language constantly evolves to maximize communicative success and expressive power in daily social interactions.

Approach Overview

Morph Query

Target Candidate Identification

The general goal of the first step is to identify a list of target candidates for each morph query from the comparable corpora including Sina Weibo, Chinese News websites and English Twitter.

Target Candidate Ranking

Next, we propose a learning-to-rank framework that ranks target candidates using novel features at several levels, drawn from surface, semantic and social analysis.

Experiments

Next, we present the experiment under various settings shown in Table 3, and the impacts of cross source and cross genre information.

Related Work

To analyze social media behavior under active censorship, (Bamman et al., 2012) automatically discovered politically sensitive terms from Chinese tweets based on message deletion analysis.

Conclusion and Future Work

To the best of our knowledge, this is the first work on resolving implicit information morphs from data under active censorship.

Topics

similarity measures

Appears in 14 sentences as: Similarity Measurements (1) similarity measurements (2) similarity measures (11)
In Resolving Entity Morphs in Censored Data
  1. Then we propose various novel similarity measurements including surface features, meta-path based semantic features and social correlation features and combine them in a learning-to-rank framework.
    Page 1, “Abstract”
  2. We propose two new similarity measures, as well as integrating temporal information into
    Page 2, “Introduction”
  3. the similarity measures to generate global semantic features.
    Page 2, “Introduction”
  4. We first extract surface features between the morph and the candidate using orthographic similarity measures that are commonly used in entity coreference resolution (e.g.
    Page 3, “Target Candidate Ranking”
  5. 4.2.3 Meta-Path-Based Semantic Similarity Measurements
    Page 4, “Target Candidate Ranking”
  6. We then adopt meta-path-based similarity measures (Sun et al., 2011a; Sun et al., 2011b), which are defined over heterogeneous networks to extract semantic features.
    Page 4, “Target Candidate Ranking”
  7. For the determined meta-paths, we extract semantic features using the similarity measures proposed in (Sun et al., 2011a; Hsiung et al., 2005).
    Page 4, “Target Candidate Ranking”
  8. We now list several meta-path-based similarity measures below.
    Page 4, “Target Candidate Ranking”
  9. Beyond the above similarity measures, we also propose using a cosine-similarity-style normalization method to modify the common neighbor and pairwise random walk measures, so that the morph node and the target candidate node are both strongly connected and similarly popular.
    Page 5, “Target Candidate Ranking”
  10. The above similarity measures can also be applied to homogeneous networks that do not differentiate the neighbor types.
    Page 5, “Target Candidate Ranking”
  11. into similarity measures to generate global semantic features.
    Page 5, “Target Candidate Ranking”
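
The cosine-similarity-style normalization mentioned in sentence 9 can be illustrated with a small sketch. The excerpt does not give the paper's exact formula, so the definition below (shared-neighbor weight products divided by the product of the two nodes' neighbor-vector norms) is an assumption, and the `Neighbors` layout and function names are hypothetical.

```python
import math
from typing import Dict

# Hypothetical layout: neighbor id -> co-occurrence weight for one node.
Neighbors = Dict[str, float]


def common_neighbors(m: Neighbors, e: Neighbors) -> float:
    """Unnormalized score: sum of weight products over shared neighbors."""
    return sum(w * e[n] for n, w in m.items() if n in e)


def normalized_common_neighbors(m: Neighbors, e: Neighbors) -> float:
    """Assumed cosine-style normalization of the common neighbor score.

    Dividing by both norms penalizes a candidate that shares neighbors
    with the morph only because it is connected to almost everything.
    """
    norm_m = math.sqrt(sum(w * w for w in m.values()))
    norm_e = math.sqrt(sum(w * w for w in e.values()))
    if norm_m == 0.0 or norm_e == 0.0:
        return 0.0
    return common_neighbors(m, e) / (norm_m * norm_e)


morph = {"a": 1.0, "b": 1.0}
similar_candidate = {"a": 1.0, "b": 1.0}
popular_candidate = {n: 1.0 for n in "abcdefgh"}  # shares a, b plus six others
```

In this toy data both candidates have the same unnormalized score, but the normalized score favors the candidate whose popularity matches the morph's.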

See all papers in Proc. ACL 2013 that mention similarity measures.

social media

Appears in 11 sentences as: social media (13)
In Resolving Entity Morphs in Censored Data
  1. The proliferation of online social media significantly expedites this evolution, as new phrases triggered by social events may be disseminated rapidly in social media.
    Page 1, “Introduction”
  2. To automatically analyze such fast-evolving language in social media, new computational models are needed.
    Page 1, “Introduction”
  3. We believe that successful resolution of morphs is a crucial step for automated understanding of the fast-evolving social media language, which is important for social media marketing (Barwise and Meehan, 2010).
    Page 1, “Introduction”
  4. However, morph resolution in social media is challenging due to the following reasons.
    Page 2, “Introduction”
  5. Thus, the co-occurrence of a morph and its target is quite low in the vast amount of information in social media.
    Page 2, “Introduction”
  6. We detect target candidates by exploiting the dynamics of social media to extract the temporal distribution of entities, based on the assumption that the popularity of an individual is correlated between censored and uncensored text within a certain time window.
    Page 2, “Introduction”
  7. Unfortunately, the state-of-the-art techniques for these tasks still perform poorly on social media in terms of both accuracy and coverage of important information; these sophisticated semantic links all had a negative impact on target ranking performance.
    Page 4, “Target Candidate Ranking”
  8. In contrast, users are less restricted in some other uncensored social media such as Twitter.
    Page 5, “Target Candidate Ranking”
  9. Because of such social correlation, close social neighbors in social media such as Twitter and Weibo may post similar information or share similar opinions.
    Page 6, “Target Candidate Ranking”
  10. As shown in Table 9 (K is the number of predefined topics), PLSA is not quite effective, mainly because traditional topic modeling approaches do not perform well on short texts from social media.
    Page 8, “Experiments”
  11. To analyze social media behavior under active censorship, (Bamman et al., 2012) automatically discovered politically sensitive terms from Chinese tweets based on message deletion analysis.
    Page 9, “Related Work”
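
The time-window assumption in sentence 6 can be sketched as a simple filter. The excerpt does not state the window length or the paper's exact procedure, so the 7-day default, the data layout, and the function name below are illustrative assumptions only.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Set


def candidates_in_window(
    morph_dates: Iterable[date],
    entity_dates: Dict[str, List[date]],
    window_days: int = 7,  # assumed value, not taken from the source
) -> Set[str]:
    """Keep entities mentioned in uncensored text near a morph occurrence.

    morph_dates: dates on which the morph appears in censored text.
    entity_dates: entity name -> dates on which it appears in uncensored text.
    """
    window = timedelta(days=window_days)
    morph_list = list(morph_dates)
    return {
        entity
        for entity, dates in entity_dates.items()
        if any(abs(d - m) <= window for d in dates for m in morph_list)
    }


morph_seen = [date(2012, 3, 10)]
mentions = {
    "Entity A": [date(2012, 3, 12)],  # two days after the morph
    "Entity B": [date(2012, 5, 1)],   # well outside the window
}
selected = candidates_in_window(morph_seen, mentions)
```

A filter of this kind narrows the candidate pool before any ranking is applied, which matters because not every named entity in the sources can be considered.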

named entities

Appears in 5 sentences as: Named entities (1) named entities (2) named entity (2)
In Resolving Entity Morphs in Censored Data
  1. However, obviously we cannot consider all of the named entities in these sources as target candidates due to the sheer volume of information.
    Page 3, “Target Candidate Identification”
  2. In addition, morphs are not limited to named entity forms.
    Page 3, “Target Candidate Identification”
  3. Then we apply a hierarchical Hidden Markov Model (HMM) based Chinese lexical analyzer ICTCLAS (Zhang et al., 2003) to extract named entities, noun phrases and events.
    Page 4, “Target Candidate Ranking”
  4. Named entities which co-occur at least 6 times with a morph query in the same topic are selected as its target candidates.
    Page 8, “Experiments”
  5. Other similar research lines are the TAC-KBP Entity Linking (EL) (Ji et al., 2010; Ji et al., 2011), which links a named entity in news and web documents to an appropriate knowledge base (KB) entry, the task of mining name translation pairs from comparable corpora (Udupa et al., 2009; Ji, 2009; Fung and Yee, 1998; Rapp, 1999; Shao and Ng, 2004; Hassan et al., 2007) and the link prediction problem (Adamic and Adar, 2001; Liben-Nowell and Kleinberg, 2003; Sun et al., 2011b;
    Page 9, “Related Work”

noun phrases

Appears in 4 sentences as: Noun Phrases (1) noun phrases (3)
In Resolving Entity Morphs in Censored Data
  1. Then we apply a hierarchical Hidden Markov Model (HMM) based Chinese lexical analyzer ICTCLAS (Zhang et al., 2003) to extract named entities, noun phrases and events.
    Page 4, “Target Candidate Ranking”
  2. Therefore we limited the types of vertices to: Morph (M), Entity (E), which includes target candidates, Event (EV), and NonEntity Noun Phrases (NP); and used co-occurrence as the edge type.
    Page 4, “Target Candidate Ranking”
  3. We extract entities, events, and nonentity noun phrases that occur in more than one tweet as neighbors.
    Page 4, “Target Candidate Ranking”
  4. entities, events, nonentity noun phrases).
    Page 4, “Target Candidate Ranking”

semantic similarity

Appears in 4 sentences as: semantic similarities (1) Semantic Similarity (1) semantic similarity (2)
In Resolving Entity Morphs in Censored Data
  1. We model social user behaviors and use social correlation to assist in measuring semantic similarities because the users who posted a morph and its corresponding target tend to share similar interests and opinions.
    Page 2, “Introduction”
  2. 4.2.3 Meta-Path-Based Semantic Similarity Measurements
    Page 4, “Target Candidate Ranking”
  3. In this paper we exploit cross-genre information and social correlation to measure semantic similarity.
    Page 9, “Related Work”
  4. Both the meta-path-based and the social-correlation-based semantic similarity measurements prove powerful and complementary.
    Page 9, “Conclusion and Future Work”

co-occurrence

Appears in 3 sentences as: co-occurrence (3)
In Resolving Entity Morphs in Censored Data
  1. Thus, the co-occurrence of a morph and its target is quite low in the vast amount of information in social media.
    Page 2, “Introduction”
  2. After applying the same annotation techniques as tweets for uncensored data sets, sentence-level co-occurrence relations are extracted and integrated into the network as shown in Figure 3.
    Page 6, “Target Candidate Ranking”
  3. Retweets and redundant web documents are filtered to ensure more reliable frequency counting of co-occurrence relations.
    Page 6, “Experiments”
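
The sentence-level counting described in sentences 2 and 3 can be sketched as follows. The `Doc` record, the exact-duplicate test used to stand in for "redundant web documents", and the function name are hypothetical; the source does not specify how redundancy is detected.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, NamedTuple, Tuple


class Doc(NamedTuple):
    """Hypothetical record: one tweet or web document after annotation."""
    is_retweet: bool
    sentences: Tuple[Tuple[str, ...], ...]  # annotated terms per sentence


def cooccurrence_counts(docs: Iterable[Doc]) -> Counter:
    """Count sentence-level co-occurrences, skipping retweets and duplicates."""
    seen = set()
    counts: Counter = Counter()
    for doc in docs:
        # Assumed filter: drop retweets and exact repeats of earlier content.
        if doc.is_retweet or doc.sentences in seen:
            continue
        seen.add(doc.sentences)
        for sentence in doc.sentences:
            # Each unordered pair of distinct terms counts once per sentence.
            for a, b in combinations(sorted(set(sentence)), 2):
                counts[(a, b)] += 1
    return counts


original = Doc(False, (("morph", "entity1"), ("morph", "entity2", "entity1")))
docs = [
    original,
    Doc(True, (("morph", "entity1"),)),  # retweet, ignored
    original,                            # redundant copy, ignored
]
counts = cooccurrence_counts(docs)
```

Without the filtering step, the retweet and the repeated document would inflate the pair counts that later serve as edge frequencies in the network.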

topic modeling

Appears in 3 sentences as: topic modeling (3)
In Resolving Entity Morphs in Censored Data
  1. For comparison, we also attempted a topic modeling approach to detect target candidates, as shown in section 5.3.
    Page 3, “Target Candidate Identification”
  2. We also attempted using a topic modeling approach to detect target candidates.
    Page 8, “Experiments”
  3. As shown in Table 9 (K is the number of predefined topics), PLSA is not quite effective, mainly because traditional topic modeling approaches do not perform well on short texts from social media.
    Page 8, “Experiments”
