Transforming Data

Sometimes you can prepare your data for BigQuery with a simple action, such as exporting it in JSON format, but sometimes you need a more complex transformation. For example, your data might be stored in XML, and preparing it would require more than a simple query. This topic describes common scenarios where you might need to transform your data before loading it into BigQuery.

Transforming XML data

Large datasets are often represented using XML. BigQuery doesn't support directly loading XML files, but XML files can be easily converted to an equivalent JSON format or flat CSV structure.

When dealing with large XML files, avoid DOM-based parsers, which attempt to load the entire file into memory as a tree. Instead, use a pull-based or SAX-style parser, which processes the file incrementally and is far more memory-efficient. Python makes this kind of parsing straightforward.

The following examples show a sample of a Wikipedia revision history dataset (about 100 GB of XML), followed by Python code that parses the data efficiently.

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org
    /xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
      <siteinfo>
        <sitename>Wikipedia</sitename>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.20wmf10</generator>
        <case>first-letter</case>
        ….
      </siteinfo>
      <page>
        <title>AccessibleComputing</title>
        <id>10</id>
        <revision>
          <id>233192</id>
          <contributor>
            <id>99</id>
          </contributor>
        </revision>
        <revision>
          …
        </revision>
      </page>
    </mediawiki>
    import sys
    from lxml import etree

    NS = '{http://www.mediawiki.org/xml/export-0.7/}'

    # Open the file and loop through all <page> elements.
    xml_file = open(sys.argv[1], 'rb')
    for _, element in etree.iterparse(xml_file, tag=NS + 'page'):
      title = element.find(NS + 'title').text
      page_id = element.find(NS + 'id').text
      wp_namespace = element.find(NS + 'ns')
      for revision_element in element.iter(NS + 'revision'):
        # A revision always has an ID.
        revision_id = revision_element.find(NS + 'id').text
        # The comment and contributor ID are optional.
        comment = revision_element.find(NS + 'comment')
        if comment is not None and comment.text is not None:
          comment_text = comment.text
        contributor = revision_element.find(NS + 'contributor')
        if contributor is not None:
          contributor_id = contributor.find(NS + 'id')
          if contributor_id is not None:
            contributor_id_text = contributor_id.text
      # Free the memory held by the fully processed <page> element so
      # iterparse doesn't accumulate the whole tree.
      element.clear()
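Once each record has been extracted, it can be written out as newline-delimited JSON, the JSON layout BigQuery expects when loading. The sketch below is illustrative rather than part of the original example: the `revision_to_json` helper and the field names are assumptions about how you might shape the output rows.

    import json

    def revision_to_json(title, page_id, revision_id,
                         comment=None, contributor_id=None):
        """Serialize one parsed revision as a single JSON row (one line)."""
        row = {'title': title,
               'id': int(page_id),
               'revision_id': int(revision_id)}
        # Optional fields are omitted entirely rather than emitted as null.
        if comment is not None:
            row['comment'] = comment
        if contributor_id is not None:
            row['contributor_id'] = int(contributor_id)
        # json.dumps produces no newlines for a flat dict, so each row
        # occupies exactly one line of the output file.
        return json.dumps(row)

Inside the parsing loop, each revision would be appended to the output file with a trailing newline, for example `out_file.write(revision_to_json(title, page_id, revision_id) + '\n')`, producing a file ready to load into BigQuery.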
