GDELT HathiTrust and Internet Archive Book Data

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public datasets using the web UI, the command-line tool, or by making calls to the BigQuery REST API using client libraries for languages such as Java, .NET, or Python.

Currently, BigQuery public datasets are stored in the US multi-region location. When you query a public dataset, supply the --location=US flag on the command line, choose US as the processing location in the BigQuery web UI, or specify the location property in the jobReference section of the job resource when you use the API. Because the public datasets are stored in the US, you cannot write public data query results to a table in another region, and you cannot join tables in public datasets with tables in another region.
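As a rough illustration of the API route, the sketch below uses the google-cloud-bigquery Python client to run a query with the US processing location. It assumes the package is installed and that application default credentials are configured. The helper name run_query and the constant PUBLIC_DATASET_LOCATION are illustrative and not part of any library.

```python
# Public datasets live in the US multi-region, so jobs are sent there.
PUBLIC_DATASET_LOCATION = "US"


def run_query(sql, use_legacy_sql=False, location=PUBLIC_DATASET_LOCATION):
    """Run a query and return its rows as a list of dicts.

    Assumes google-cloud-bigquery is installed and credentials are set up.
    """
    # Imported lazily so this module loads even without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(use_legacy_sql=use_legacy_sql)
    # `location` tells BigQuery where to run the job.
    job = client.query(sql, job_config=job_config, location=location)
    return [dict(row.items()) for row in job.result()]
```

Passing use_legacy_sql=True selects the legacy SQL dialect for queries written in that syntax.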

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. If you plan to go beyond the free tier, make sure that billing is enabled for your project.

    Learn how to enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Enable the API

Dataset overview

This dataset contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes). These collections have been processed using the GDELT Global Knowledge Graph and are available in Google BigQuery. More than a billion pages stretching back 215 years have been examined to compile a list of all people, organizations, and other names; the full text has been geocoded so that the locations it mentions can be mapped; and more than 4,500 emotions and themes have been compiled. All of this computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries.

Internet Archive data includes the complete full text of all Internet Archive books published 1800-1922 and all books in the American Libraries collection for which English-language full text was available using the search “collection:(americana)”. For HathiTrust, all English-language public domain books published 1800-2015 were provided by HathiTrust as part of a special research extract. Only public domain volumes are included.

You can start exploring the HathiTrust and Internet Archive book collections in the BigQuery console.

Go to the HathiTrust dataset

Go to the Internet Archive dataset

For more detailed information on how to work with this data, see GDELT’s detailed announcement article.

Sample queries

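Here are some examples of SQL queries you can run on this data in BigQuery. The queries use BigQuery legacy SQL syntax (TABLE_QUERY and bracketed table names), so select legacy SQL as the query dialect when you run them.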

Find an author

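This is a basic query to find an author (Walt Whitman) with full-text results (the Internet Archive has full-text data up through 1922). The columns selected here (DocumentIdentifier, BookMeta_Creator, BookMeta_FullText) are one reasonable choice; adjust the list to the metadata you need.

SELECT
  DocumentIdentifier,
  BookMeta_Creator,
  BookMeta_FullText
FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1819" AND "2014"'))
WHERE
  BookMeta_Creator CONTAINS "Walt Whitman"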
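To make the TABLE_QUERY filter in the query above more concrete, here is a small Python sketch. It mimics the filter by extracting a four-digit year from each table id and comparing it as a string, then builds the query text for an arbitrary author and year range. The function names, the sample table ids, and the selected column list are assumptions made for illustration. The actual table names in the dataset are not confirmed here.

```python
import re

YEAR_PATTERN = re.compile(r"(\d{4})")


def select_tables(table_ids, first_year, last_year):
    """Mimic the TABLE_QUERY filter: keep tables whose id contains a
    four-digit year between first_year and last_year (string comparison)."""
    selected = []
    for table_id in table_ids:
        match = YEAR_PATTERN.search(table_id)
        # Ids without a four-digit run are skipped, like a NULL extract.
        if match and first_year <= match.group(1) <= last_year:
            selected.append(table_id)
    return selected


def author_query(author, first_year, last_year):
    """Build legacy SQL text for an author search over a year range."""
    if '"' in author:
        raise ValueError("author must not contain double quotes")
    return (
        "SELECT DocumentIdentifier, BookMeta_Creator, BookMeta_FullText\n"
        "FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], "
        "'REGEXP_EXTRACT(table_id, r\"(\\d{4})\") "
        f"BETWEEN \"{first_year}\" AND \"{last_year}\"'))\n"
        f"WHERE BookMeta_Creator CONTAINS \"{author}\""
    )
```

For example, author_query("Walt Whitman", "1819", "2014") yields query text equivalent to the sample above, and changing the two year strings narrows or widens the set of tables scanned.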

Using similar queries, you can select a set of books for different kinds of analysis, including sentiment analysis.

Sentiment analysis

This sample query shows how you can apply sentiment analysis to large quantities of text quickly using BigQuery. The analysis uses a made-up dictionary of nine words and associated scores to calculate tone for all the available full-text books in the Internet Archive dataset published in 1922.

SELECT
  DocumentIdentifier,
  (TotalMatchingWords/TotWordCount*100) ToneIntensity,
  (SumToneScore/TotalMatchingWords) ToneScore
FROM (
  SELECT
    DocumentIdentifier,
    MAX(TotWordCount) TotWordCount,
    SUM(ThisWordCount) TotalMatchingWords,
    SUM(ThisWordScore) SumToneScore
  FROM (
    SELECT
      a.DocumentIdentifier DocumentIdentifier,
      a.totwordcount TotWordCount,
      a.word Word,
      a.COUNT ThisWordCount,
      b.Score ThisWordScore
    FROM (
      SELECT
        DocumentIdentifier,
        word,
        COUNT(*) AS COUNT,
        totwordcount
      FROM (
        SELECT
          DocumentIdentifier,
          SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText),'[^a-z]', ' '), ' ') AS word,
          COUNT(SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText),'[^a-z]', ' '), ' ')) AS totwordcount
        FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1922" AND "1922"')) )
      GROUP BY
        DocumentIdentifier,
        word,
        totwordcount ) a
    JOIN (
      SELECT
        Word, Score
      FROM
        [gdelt-bq:extra.toytonelookup] ) b
    ON
      a.word = b.Word )
  GROUP BY
    DocumentIdentifier )
ORDER BY
  ToneScore DESC
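The arithmetic in the query above can be easier to follow on a small scale. The following Python sketch reproduces the same steps for in-memory text: normalize, split into words, count, match against a lexicon, then compute the two ratios. The lexicon here is a hypothetical stand-in, since the contents of the toy lookup table are not listed in this document, and the function names are illustrative.

```python
import re
from collections import Counter

# Hypothetical stand-in for [gdelt-bq:extra.toytonelookup]; the real table's
# words and scores are not reproduced here.
TOY_LEXICON = {"good": 1.0, "happy": 2.0, "bad": -1.0, "terrible": -2.0}


def tone(full_text, lexicon=TOY_LEXICON):
    """Return (tone_intensity, tone_score) for one book, or None when no
    lexicon word occurs (the inner join drops such books)."""
    # Same normalisation as the query: lowercase, non-letters become spaces.
    # Note: split() discards empty tokens, which may differ from SPLIT.
    words = re.sub(r"[^a-z]", " ", full_text.lower()).split()
    counts = Counter(words)  # GROUP BY word
    matched = {w: n for w, n in counts.items() if w in lexicon}  # JOIN
    total_matching = sum(matched.values())  # SUM(ThisWordCount)
    if total_matching == 0:
        return None
    # As written, the query adds each matched word's score once per distinct
    # word, not once per occurrence.
    sum_tone = sum(lexicon[w] for w in matched)  # SUM(ThisWordScore)
    tone_intensity = total_matching / len(words) * 100
    tone_score = sum_tone / total_matching
    return tone_intensity, tone_score


def rank_by_tone(books):
    """Map {identifier: text} to [(identifier, intensity, score)], ordered
    by tone score descending like ORDER BY ToneScore DESC."""
    rows = []
    for identifier, text in books.items():
        result = tone(text)
        if result is not None:
            rows.append((identifier, result[0], result[1]))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Tone intensity is the share of a book's words that appear in the lexicon, expressed as a percentage, while tone score divides the summed lexicon scores by the number of matching word occurrences.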

About the data

Dataset Source: GDELT (processed from HathiTrust and Internet Archive public domain collections)

Category: Media

Use: This dataset is publicly available for anyone to use under the terms provided by the Dataset Source, and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

View in BigQuery: HathiTrust Book Collection, Internet Archive Book Collection
