GDELT HathiTrust and Internet Archive Book Data

This dataset contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes). These collections have been processed using the GDELT Global Knowledge Graph and are available in Google BigQuery. More than a billion pages stretching back 215 years were examined to compile a list of all people, organizations, and other names; the full text was geocoded so that it can be mapped; and more than 4,500 emotions and themes were compiled. This computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries.

Internet Archive data includes the complete full text of all Internet Archive books published 1800-1922 and all books in the American Libraries collection for which English-language full text was available using the search “collection:(americana)”. For HathiTrust, all English-language public domain books published 1800-2015 were provided by HathiTrust as part of a special research extract. Only public domain volumes are included.

You can start exploring the HathiTrust and Internet Archive book collections in the BigQuery console.

Go to the HathiTrust dataset

Go to the Internet Archive dataset

For more information on how to work with this data, see GDELT’s detailed announcement article.

Sample queries

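Here are some examples of SQL queries you can run on this data in BigQuery. They are written in BigQuery legacy SQL (note TABLE_QUERY, CONTAINS, GROUP EACH BY, and the bracketed table names), so run them with the legacy SQL dialect selected.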

Find an author

This basic query finds books by an author (Walt Whitman) in the Internet Archive tables and returns each book's title, creator, and year. The Internet Archive tables include full text for books published through 1922.

SELECT
  BookMeta_Title,
  BookMeta_Creator,
  BookMeta_Year
FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1819" AND "2014"'))
WHERE
  BookMeta_Creator CONTAINS "Walt Whitman"

Similar queries can select a set of books for further analysis, such as sentiment analysis.
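
The TABLE_QUERY filter in the query above picks which yearly tables to read by extracting a four-digit year from each table ID and comparing it as a string against the range bounds. The following Python sketch imitates that selection logic locally so the behavior is easier to see. It is only an illustration: the `select_tables` function and the example table IDs are hypothetical and are not read from the dataset.

```python
import re

def select_tables(table_ids, first_year, last_year):
    """Return the table IDs whose first four-digit run lies in [first_year, last_year]."""
    selected = []
    for table_id in table_ids:
        match = re.search(r"(\d{4})", table_id)
        # IDs without a four-digit run yield no year and are skipped,
        # much as a NULL from REGEXP_EXTRACT fails the BETWEEN test.
        if match and first_year <= match.group(1) <= last_year:
            selected.append(table_id)
    return selected

# Hypothetical table IDs, for illustration only.
example_ids = ["1818", "1819", "1922", "2014", "2015"]
print(select_tables(example_ids, "1819", "2014"))
```

Because the years are compared as strings, the comparison is only reliable when both bounds are written as four-digit values, as they are in the query.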

Sentiment analysis

This sample query shows how to apply sentiment analysis to large quantities of text using BigQuery. It uses a made-up dictionary of nine words and associated scores (the gdelt-bq:extra.toytonelookup table) to calculate tone for every full-text book in the Internet Archive dataset published in 1922.

SELECT
  DocumentIdentifier,
  TotWordCount,
  TotalMatchingWords,
  SumToneScore,
  (TotalMatchingWords/TotWordCount*100) ToneIntensity,
  (SumToneScore/TotalMatchingWords) ToneScore
FROM (
  SELECT
    DocumentIdentifier,
    MAX(TotWordCount) TotWordCount,
    SUM(ThisWordCount) TotalMatchingWords,
    SUM(ThisWordScore) SumToneScore
  FROM (
    SELECT
      a.DocumentIdentifier DocumentIdentifier,
      a.totwordcount TotWordCount,
      a.word Word,
      a.COUNT ThisWordCount,
      b.Score ThisWordScore
    FROM (
      SELECT
        DocumentIdentifier,
        word,
        COUNT(*) AS COUNT,
        totwordcount
      FROM (
        SELECT
          DocumentIdentifier,
          SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText),'[^a-z]', ' '), ' ') AS word,
          COUNT(SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText),'[^a-z]', ' '), ' ')) AS totwordcount
        FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1922" AND "1922"')) )
      GROUP EACH BY
        DocumentIdentifier,
        word,
        totwordcount ) a
    JOIN EACH (
      SELECT
        Word,
        Score
      FROM
        [gdelt-bq:extra.toytonelookup] ) b
    ON
      a.word = b.Word )
  GROUP EACH BY
    DocumentIdentifier )
ORDER BY
  ToneScore DESC
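
To make the arithmetic in the query easier to follow, here is a minimal Python sketch of the same calculation for a single text. It is an illustration rather than part of the dataset tooling. The `tone` function and the `TOY_TONE` words and scores are invented here and are not the contents of the toytonelookup table, and the tokenization drops empty tokens, which may differ slightly from how the query's SPLIT counts words.

```python
import re
from collections import Counter

# Hypothetical stand-in for the lookup table; these words and scores are invented.
TOY_TONE = {"good": 1.0, "happy": 2.0, "bad": -1.0, "terrible": -2.0}

def tone(full_text, lookup=TOY_TONE):
    """Compute the query's output columns for one text, or None if no word matches."""
    # Lowercase, turn every character outside a-z into a space, split on whitespace.
    words = re.sub(r"[^a-z]", " ", full_text.lower()).split()
    counts = Counter(words)
    # Equivalent of the inner join: keep only words present in the lookup.
    matched = {word: n for word, n in counts.items() if word in lookup}
    total_matching = sum(matched.values())
    if total_matching == 0:
        return None  # an inner join would drop this document
    # As in the query, each distinct matched word contributes its score once.
    sum_tone = sum(lookup[word] for word in matched)
    return {
        "TotWordCount": len(words),
        "TotalMatchingWords": total_matching,
        "SumToneScore": sum_tone,
        "ToneIntensity": total_matching / len(words) * 100,
        "ToneScore": sum_tone / total_matching,
    }

result = tone("A good day, a good and happy day; nothing bad.")
print(result)
```

As in the query, SumToneScore adds each matched word's score once per distinct word, while TotalMatchingWords counts every occurrence. Multiplying each score by its count before summing would be a reasonable variant if an occurrence-weighted average is wanted.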

About the data

Dataset Source: GDELT (processed from HathiTrust and Internet Archive public domain collections)

Category: Media

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://gdeltproject.org/about.html — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

View in BigQuery: HathiTrust Book Collection, Internet Archive Book Collection
