Crowdsourcing Interaction Logs to Understand Text Reuse from the Web
Martin Potthast, Matthias Hagen, Michael Völske, and Benno Stein

Article Structure

Abstract

We report on the construction of the Webis text reuse corpus 2012 for advanced research on text reuse.

Introduction

The web has become one of the most common sources for text reuse.

Corpus Construction

Two data sets form the basis for constructing our corpus, namely (1) a set of topics to write about and (2) a set of web pages to consult when researching a given topic.

Corpus Analysis

This section presents selected results of a preliminary corpus analysis.

Summary and Outlook

This paper details the construction of the Webis text reuse corpus 2012 (Webis-TRC-12), a new corpus for text reuse research that has been created entirely manually on a large scale.

Topics