Event Discovery in Social Media Feeds
Benson, Edward and Haghighi, Aria and Barzilay, Regina

Article Structure

Abstract

We present a novel method for record extraction from social streams such as Twitter.

Introduction

We propose a method for discovering event records from social media feeds such as Twitter.

Related Work

A large number of information extraction approaches exploit redundancy in text collections to improve their accuracy and reduce the need for manually annotated data (Agichtein and Gravano, 2000; Yangarber et al., 2000; Zhu et al., 2009; Mintz et al., 2009a; Yao et al., 2010b; Hasegawa et al., 2004; Shinyama and Sekine, 2006).

Problem Formulation

Here we describe the key latent and observed random variables of our problem.

Model

Our model can be represented as a factor graph which takes the form,

Inference

Our goal is to predict a set of records R. Ideally we would like to compute P(R|x), marginalizing out the nuisance variables A and y.
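
Written out, the marginalization described above takes the generic form below. This is only the standard sum over nuisance variables, assuming A and y range over discrete values; it is a sketch of what "marginalizing out" means here, not necessarily the exact expression used in the paper.

```latex
P(R \mid x) \;=\; \sum_{A} \sum_{y} P(R, A, y \mid x)
```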

Evaluation Setup

Data: We apply our approach to construct a database of concerts in New York City.

Evaluation

The evaluation of record construction is challenging because many induced music events discussed

Conclusion

We presented a novel model for record extraction from social media streams such as Twitter.

Topics

CRF

Appears in 11 sentences as: +CRF (2) CRF (11)
In Event Discovery in Social Media Feeds
  1. We bias local decisions made by the CRF to be consistent with canonical record values, thereby facilitating consistency within an event cluster.
    Page 2, “Introduction”
  2. The sequence labeling factor is similar to a standard sequence CRF (Lafferty et al., 2001), where the potential over a message label sequence decomposes
    Page 3, “Model”
  3. The weights of the CRF component of our model, θ_SEQ, are the only weights learned at training time, using a distant supervision process described in Section 6.
    Page 5, “Model”
  4. Since a uniform initialization of all factors is a saddle-point of the objective, we opt to initialize the q(y) factors with the marginals obtained using just the CRF parameters, accomplished by running forwards-backwards on all messages using only the
    Page 6, “Inference”
  5. To do so, we run the CRF component of our model (φ_SEQ) over the corpus and extract, for each ℓ, all spans that have a token-level probability of being labeled ℓ greater than λ = 0.1.
    Page 7, “Inference”
  6. [Figure legend] +LowThresh +CRF +List Our Work
    Page 8, “Evaluation Setup”
  7. The CRF lines terminate because of low record yield.
    Page 8, “Evaluation Setup”
  8. Our List Baseline labels messages by finding string overlaps against a list of musical artists and venues scraped from web data (the same lists used as features in our CRF component).
    Page 8, “Evaluation Setup”
  9. The CRF Baseline is most similar to Mann and Yarowsky (2005)’s CRF Voting method and uses the maximum likelihood CRF labeling of each message.
    Page 8, “Evaluation Setup”
  10. [Figure legend] +LowThresh +CRF +List Our Work, Our Work+Con
    Page 9, “Evaluation”
  11. The CRF and hard-constrained consensus lines terminate because of low record yield.
    Page 9, “Evaluation”
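
The candidate extraction step quoted in item 5 (keeping spans whose token-level probability for a field exceeds 0.1) might be sketched as follows. This is a minimal illustration, not the authors' code: the function name `candidate_spans`, the dictionary layout of the marginals, and the choice to treat a span as a maximal run of consecutive above-threshold tokens are all assumptions, and the example message is invented.

```python
from typing import Dict, List, Tuple


def candidate_spans(
    tokens: List[str],
    marginals: List[Dict[str, float]],
    label: str,
    threshold: float = 0.1,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each maximal run of tokens whose
    marginal probability of `label` is strictly greater than `threshold`.

    `marginals[i]` maps label names to the token-level probability for
    token i (hypothetical format; e.g. forward-backward output).
    """
    spans = []
    start = None
    for i, probs in enumerate(marginals):
        if probs.get(label, 0.0) > threshold:
            if start is None:
                start = i  # open a new run
        elif start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None  # close the current run
    if start is not None:  # run extends to the end of the message
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Invented message and marginals, for illustration only.
tokens = ["seeing", "The", "Examples", "at", "Sample", "Hall"]
marginals = [
    {"ARTIST": 0.02},
    {"ARTIST": 0.60},
    {"ARTIST": 0.40},
    {"ARTIST": 0.01, "VENUE": 0.05},
    {"VENUE": 0.70},
    {"VENUE": 0.50},
]
artist_spans = candidate_spans(tokens, marginals, "ARTIST")
venue_spans = candidate_spans(tokens, marginals, "VENUE")
```

A low threshold of this kind favors recall: weakly supported spans are kept as candidates so that later stages can decide among them.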

See all papers in Proc. ACL 2011 that mention CRF.

social media

Appears in 7 sentences as: Social Media (1) Social media (2) social media (4)
In Event Discovery in Social Media Feeds
  1. We propose a method for discovering event records from social media feeds such as Twitter.
    Page 1, “Introduction”
  2. Social media messages are often short, make heavy use of colloquial language, and require situational context for interpretation (see examples in Figure 1).
    Page 1, “Introduction”
  3. Social Media Feeds
    Page 1, “Introduction”
  4. These properties of social media streams make existing extraction techniques significantly less effective.
    Page 1, “Introduction”
  5. Social media is a natural place to discover new events missed by curation, but mentioned online by someone planning to attend.
    Page 1, “Introduction”
  6. While our experiments utilized binary relations, we believe our general approach should be useful for n-ary relation recovery in the social media domain.
    Page 9, “Evaluation”
  7. We presented a novel model for record extraction from social media streams such as Twitter.
    Page 9, “Conclusion”

sequence labeling

Appears in 4 sentences as: Sequence Labeling (1) sequence labeling (3)
In Event Discovery in Social Media Feeds
  1. Message Labels (y): We assume that each message has a sequence labeling, where the labels consist of the record fields (e.g., ARTIST and VENUE) as well as a NONE label denoting the token does not correspond to any domain field.
    Page 3, “Problem Formulation”
  2. 4.1 Sequence Labeling Factor
    Page 3, “Model”
  3. The sequence labeling factor is similar to a standard sequence CRF (Lafferty et al., 2001), where the potential over a message label sequence decomposes
    Page 3, “Model”
  4. As mentioned in Section 4, the only learned parameters in our model are those associated with the sequence labeling factor φ_SEQ.
    Page 7, “Evaluation Setup”

distant supervision

Appears in 3 sentences as: distant supervision (2) “distant supervision” (1)
In Event Discovery in Social Media Feeds
  1. Our work also relates to recent approaches for relation extraction with distant supervision (Mintz et al., 2009b; Bunescu and Mooney, 2007; Yao et al., 2010a).
    Page 2, “Related Work”
  2. The weights of the CRF component of our model, θ_SEQ, are the only weights learned at training time, using a distant supervision process described in Section 6.
    Page 5, “Model”
  3. While it is possible to train these parameters via direct annotation of messages with label sequences, we opted instead to use a simple approach where message tokens from the training weekend are labeled via their intersection with gold records, often called “distant supervision” (Mintz et al., 2009b).
    Page 7, “Evaluation Setup”
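
The labeling scheme quoted in item 3, where message tokens are labeled through their overlap with gold records, could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `distant_labels`, the record format (a dictionary from field name to value string), and the use of case-insensitive exact token matching are hypothetical choices, and the example message and record are invented.

```python
from typing import Dict, List


def distant_labels(
    tokens: List[str],
    gold_records: List[Dict[str, str]],
    none_label: str = "NONE",
) -> List[str]:
    """Label each token with the record field whose gold value it matches.

    A token sequence is labeled with a field when it equals, ignoring
    case, the whitespace-split tokens of that field's value in some gold
    record. All other tokens receive `none_label`.
    """
    labels = [none_label] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for record in gold_records:
        for field, value in record.items():
            value_tokens = value.lower().split()
            n = len(value_tokens)
            if n == 0:
                continue  # skip empty field values
            # Slide a window over the message looking for the value.
            for i in range(len(tokens) - n + 1):
                if lowered[i:i + n] == value_tokens:
                    for j in range(i, i + n):
                        labels[j] = field
    return labels


# Invented message and gold record, for illustration only.
message = "seeing The Examples at Sample Hall tonight".split()
records = [{"ARTIST": "The Examples", "VENUE": "Sample Hall"}]
labels = distant_labels(message, records)
```

Labels produced this way are noisy, since a match against a gold record does not guarantee the message is actually about that event, but they avoid the cost of annotating label sequences by hand.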
