We report on the construction of the Webis text reuse corpus 2012 for advanced research on text reuse.
The web has become one of the most common sources for text reuse.
Two data sets form the basis for constructing our corpus: (1) a set of topics to write about and (2) a set of web pages to use when researching a given topic.
This section presents selected results of a preliminary corpus analysis.
This paper details the construction of the Webis text reuse corpus 2012 (Webis-TRC-12), a new corpus for text reuse research that has been created entirely manually on a large scale.