Wikipedia Revision History


Wikimedia provides an XML dump of the complete revision history for all Wikipedia articles. This dataset contains a version of that data from April 2010. It does not contain the full text of the revisions, only their metadata, such as language, timestamp, and article.

  • Name: publicdata:samples.wikipedia
  • Number of rows: 314M


Field name Type Description
revision_id string These are unique across all revisions to all pages in a particular language and increase with time. Sorting the revisions to a page by revision_id will yield them in chronological order.
contributor_ip string Typically, either contributor_ip or (contributor_id and contributor_username) will be set. IP information is unavailable for edits from registered accounts. A (very) small fraction of edits have neither contributor_ip nor (contributor_id and contributor_username). They show up on Wikipedia as "(Username or IP removed)".
contributor_id integer See contributor_ip.
contributor_username string See contributor_ip.
timestamp integer Unix time: seconds since the epoch (1970-01-01 00:00:00 UTC).
is_bot boolean A special flag that some of Wikipedia's more active bots voluntarily set.
is_minor boolean Corresponds to the "Minor Edit" checkbox on Wikipedia's edit page.
reversion_id integer If this edit is a reversion to a previous edit, this field records the revision_id that was reverted to. If the same article text occurred multiple times, this points to the earliest such revision. Only revisions with more than fifty characters are considered for this field, to avoid labeling multiple blankings as reversions.
comment string Optional user-supplied description of the edit. Section edits are, by default, prefixed with "/* Section Name */ ".
title string Title of the page as displayed on the page (not in the URL). Always starts with a capital letter and may begin with a namespace (e.g. "Talk:", "User:", "User Talk:").
id integer A unique ID for the article that was revised. These correspond to the order in which articles were created, except for the first several thousand IDs, which are issued in alphabetical order.
language string Empty in the current dataset.
num_characters integer The length of the article after the revision was applied.
is_redirect boolean Versions of the XML dump later than ca. 200908 (August 2009) may include a redirection marker.
wp_namespace integer Wikipedia segments its pages into namespaces (e.g. "Talk", "User"). The values are listed below.

MEDIA            = 202; // -2 in WP XML; remapped because values here must be >= 0
SPECIAL          = 201; // -1 in WP XML; remapped because values here must be >= 0
MAIN             = 0;
TALK             = 1;
USER             = 2;
USER_TALK        = 3;
WIKIPEDIA        = 4;
IMAGE            = 6;  // Has since been renamed to "File" in WP XML.
IMAGE_TALK       = 7;  // Equivalent to "File talk".
MEDIAWIKI        = 8;
TEMPLATE         = 10;
HELP             = 12;
HELP_TALK        = 13;
CATEGORY         = 14;
PORTAL           = 100;
PORTAL_TALK      = 101;
WIKIPROJECT      = 102;
REFERENCE        = 104;
BOOK             = 108;
BOOK_TALK        = 109;
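
As an illustration of how the fields above might be interpreted, here is a minimal Python sketch. It assumes each row has already been loaded as a plain dict keyed by field name; the helper names (WpNamespace, namespace_name, contributor_kind, revision_time) and the labels "anonymous", "registered", and "removed" are hypothetical and are not part of the dataset. The namespace values are copied from the list above, and any value not in that list is reported as unknown rather than guessed.

```python
from datetime import datetime, timezone
from enum import IntEnum


class WpNamespace(IntEnum):
    """wp_namespace values, copied from the list above."""
    MAIN = 0
    TALK = 1
    USER = 2
    USER_TALK = 3
    WIKIPEDIA = 4
    IMAGE = 6        # since renamed to "File" in the WP XML
    IMAGE_TALK = 7   # equivalent to "File talk"
    MEDIAWIKI = 8
    TEMPLATE = 10
    HELP = 12
    HELP_TALK = 13
    CATEGORY = 14
    PORTAL = 100
    PORTAL_TALK = 101
    WIKIPROJECT = 102
    REFERENCE = 104
    BOOK = 108
    BOOK_TALK = 109
    SPECIAL = 201    # -1 in the WP XML
    MEDIA = 202      # -2 in the WP XML


def namespace_name(value):
    """Return the namespace name, or None if the value is not in the list."""
    try:
        return WpNamespace(value).name
    except ValueError:
        return None


def contributor_kind(row):
    """Classify an edit by which contributor_* fields are set.

    The three labels are illustrative names for the cases described
    under contributor_ip; they do not appear in the dataset itself.
    """
    if row.get("contributor_ip"):
        return "anonymous"
    if row.get("contributor_id") is not None and row.get("contributor_username"):
        return "registered"
    return "removed"  # shown on Wikipedia as "(Username or IP removed)"


def revision_time(row):
    """Convert the Unix-time timestamp field to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)


if __name__ == "__main__":
    # A made-up row for demonstration only; not taken from the dataset.
    row = {
        "contributor_ip": "192.0.2.1",
        "timestamp": 1270080000,
        "wp_namespace": 3,
    }
    print(namespace_name(row["wp_namespace"]))
    print(contributor_kind(row))
    print(revision_time(row).isoformat())
```

Using an IntEnum keeps the stored integers comparable with the raw column values, so a filter such as row["wp_namespace"] == WpNamespace.MAIN works without conversion.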