Introduction | To enable NLP tools to better understand Twitter feeds, we propose the task of linking a tweet to a news article that is relevant to it, thereby augmenting the context of the tweet.
Introduction | For example, we want to supplement the implicit context of the above tweet with a news article such as the one entitled:
Introduction | To create a gold standard dataset, we download tweets spanning 18 days, each with a URL linking to a CNN or NYTIMES news article, as well as all the CNN and NYTIMES news articles published during that period.
Task and Data | The task is: given the text of a tweet, a system aims to find the most relevant news article.
Task and Data | For gold standard data, we harvest all the tweets that have a single URL link to a CNN or NYTIMES news article, dated from January 11 to January 27, 2013.
Task and Data | In evaluation, we consider this URL-referenced news article as the gold standard (the most relevant document for the tweet) and remove the URL from the text of the tweet.
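The gold-standard construction step above can be sketched in a few lines; the function name, the article identifier, and the URL regex are illustrative assumptions, not details from the paper, and real tweet URLs (shorteners, trailing punctuation) would need more careful handling.

```python
import re

# Simple illustration: any http(s) URL in the tweet is treated as the
# gold-standard link and stripped from the tweet text so the system
# cannot trivially recover the linked article.
URL_PATTERN = re.compile(r"https?://\S+")

def make_gold_pair(tweet_text: str, linked_article_id: str):
    """Return (tweet text with URLs removed, gold article id)."""
    stripped = URL_PATTERN.sub("", tweet_text)
    # Collapse whitespace left behind by the removed URL.
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return stripped, linked_article_id

# Hypothetical example tweet and article id.
text, gold = make_gold_pair(
    "Huge storm hits the east coast http://cnn.it/abc123",
    "cnn-2013-01-15-storm")
```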
Abstract | One motivating example of its application is for increasing user engagement around news articles by suggesting relevant comparable questions, such as “is Beyonce a better singer than Madonna?”, for the user to answer.
Comparable Question Mining | Input: a news article. Output: a sorted list of comparable questions.
1: Identify all target named entities (NEs) in the article.
2: Infer the distribution of LDA topics for the article.
3: For each comparable relation R in the database, compute its relevance score as the similarity between the topic distributions of R and the article.
4: Rank all the relations by their relevance scores and pick the top M as relevant.
5: for each relevant relation R, in order of relevance ranking, do
6:   Filter out all the target NEs that do not pass the single-entity classifier for R.
7:   Generate all possible NE pairs from those that passed the single-entity classifier.
8:   Filter out all the generated NE pairs that do not pass the entity-pair classifier for R.
9:   Pick the top N pairs with positive classification scores as qualified for generation.
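The mining pipeline above can be sketched as follows. The relation database, the two classifiers, and the topic distributions are stand-in assumptions (the paper uses trained per-relation classifiers and an LDA model); cosine similarity is one plausible choice for the topic-distribution similarity in step 3.

```python
from itertools import combinations
from math import sqrt

def cosine(p, q):
    """Similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def mine_questions(article_nes, article_topics, relations, top_m=5, top_n=3):
    """relations: dicts with 'topics' (list), 'single_clf', 'pair_clf'.

    Returns, per relevant relation, its top-N qualified NE pairs.
    """
    # Steps 3-4: score relations by topic similarity, keep the top M.
    ranked = sorted(relations,
                    key=lambda r: cosine(r["topics"], article_topics),
                    reverse=True)[:top_m]
    results = []
    for rel in ranked:                                        # step 5
        # Step 6: keep NEs accepted by the single-entity classifier.
        nes = [e for e in article_nes if rel["single_clf"](e)]
        # Steps 7-8: form pairs, keep those with a positive pair score.
        scored = [(p, rel["pair_clf"](p)) for p in combinations(nes, 2)]
        kept = [(p, s) for p, s in scored if s > 0]
        kept.sort(key=lambda x: x[1], reverse=True)           # step 9
        results.append((rel, [p for p, _ in kept[:top_n]]))
    return results
```

A toy relation with trivial classifiers is enough to exercise the pipeline end to end; in practice the classifiers would be learned models.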
Introduction | In this paper we propose a new way to increase user engagement around news articles, namely suggesting questions related to the viewed article for the user to answer.
Introduction | Sadly, fun and engaging comparative questions are typically not found within the text of news articles.
Introduction | However, it is highly unlikely that such sources will contain enough relevant questions for any news article due to typical sparseness issues as well as differences in interests between askers in CQA sites and news reporters. |
Motivation and Algorithmic Overview | Given a news article, our algorithm generates a set of comparable questions for the article from question templates, e.g.
Online Question Generation | The online part of our automatic generation algorithm takes as input a news article and generates concrete comparable questions for it. |
Experiments | We extracted a set of news articles and corresponding user comments from Yahoo! |
Experiments | We then run our summarization algorithm on the instantiated graph to produce a summary for each news article.
Experiments | In addition, each news article and corresponding set of comments were presented to three human annotators. |
Framework | Depending on the summarization application, the input collection can refer to the set of documents (e.g., newswire) related to a particular topic, as in standard summarization; in other scenarios (e.g., user-generated content), it is a collection of comments associated with a news article or a blog post, etc.
Introduction | On the other hand, in the case of user-generated content (say, comments on a news article), even though the text is short, one is faced with a different set of problems: volume (popular articles generate more than 10,000 comments), noise (most comments are vacuous, linguistically deficient, and tangential to the article), and redundancy (similar views are expressed by multiple commenters).
Introduction | We then conduct experiments on two corpora: the DUC 2004 corpus and a corpus of user comments on news articles.
Experiment settings | All six are large collections with 50 news articles, so this baseline is significantly different from a random baseline.
Headline generation | Our approach takes as input, for training, a corpus of news articles organized in news collections. |
Headline generation | Algorithm 2 EXTRACTPATTERNS(n, E): n is the list of sentences in a news article.
Introduction | For some applications it is important to understand, given a collection of related news articles and re- |
Related work | Most headline generation work in the past has focused on the problem of single-document summarization: given the main passage of a single news article, generate a very short summary of the article.
Related work | Filippova (2010) reports a system that is very close to our settings: the input is a collection of related news articles, and the system generates a headline that describes the main event.
Conclusions & Future Work | els do not seem to be effective in the context of a news-article search task, they are a good indicator of effectiveness in the context of web search.
Evaluation | Robust04 is composed of 528,155 news articles coming from three newspapers and the FBIS.
Evaluation | We hypothesize that the heterogeneous nature of the web makes it possible to model very different topics covering several aspects of the query, while news articles are contributions focused on a single subject.
Evaluation | Although topics coming from news articles may be limited, they benefit from the rich vocabulary of professional writers who are trained to avoid repetition. |
Document-level Parsing Approaches | For example, this is true for 75% of cases in our development set containing 20 news articles from RST-DT and for 79% of cases in our development set containing 20 how-to-do manuals from the Instructional corpus.
Introduction | While previous approaches have been tested on only one corpus, we evaluate our approach on texts from two very different genres: news articles and instructional how-to-do manuals. |
Related work | They evaluate their approach on the RST-DT corpus (Carlson et al., 2002) of news articles.
Experiments | We processed news articles from Xinhua News, which publishes news in both English and Chinese, for the entire year of 2008; these articles were also used by Kim et al.
Experiments | The English corpus consists of 100,746 news articles, and the Chinese corpus consists of 88,031 news articles.
Experiments | • D0: All news articles are used.
Cross Language Text Categorization | The data we work with consists of comparable corpora of news articles in English and Italian. |
Cross Language Text Categorization | Each news article is annotated with one of the four categories: culture_and_school, tourism, quality_of_life, made_in_italy.
Introduction | To test the usefulness of etymological information we work with comparable collections of news articles in English and Italian, whose articles are assigned one of four categories: culture_and_school, tourism, quality_of_life, made_in_italy.
Data | The TR-CONLL corpus (Leidner, 2008) contains 946 REUTERS news articles published in August 1996. |
Error Analysis | An instance of California in a baseball-related news article is incorrectly predicted to be the town California, Pennsylvania. |
Introduction | ically recorded travel costs on the shaping of empires (Scheidel et al., 2012), and systems that convey the geographic content in news articles (Teitler et al., 2008; Sankaranarayanan et al., 2009) and microblogs (Gelernter and Mushegian, 2011). |