USA Contagious Disease Data

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI, the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

Currently, BigQuery public datasets are stored in the US multi-region location. When you query a public dataset, supply the --location=US flag on the command line, choose US as the processing location in the BigQuery web UI, or specify the location property in the jobReference section of the job resource when you use the API. Because the public datasets are stored in the US, you cannot write public data query results to a table in another region, and you cannot join tables in public datasets with tables in another region.

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.

    Enable the API

Dataset overview

This public data is published by the US Department of Health and Human Services and includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013. The data set consists of eight important vaccine-preventable contagious diseases: diphtheria, hepatitis A, measles, mumps, pertussis, polio, rubella and smallpox.

You can start exploring this data in the BigQuery console:

Go to the USA Contagious Disease Dataset

Sample queries

Here are some examples of SQL queries you can run on this data in BigQuery.

These samples use BigQuery’s support for standard SQL. Use the #standardSQL tag to let BigQuery know you want to use standard SQL. For more information about the #standardSQL prefix, see Setting a query prefix.

Diseases by year

Web UI

#standardSQL
WITH
  --
  -- Group diseases cases per year.
  --
  disease_cases_per_year AS (
  SELECT
    disease,
    -- The most significant four digits of 'epi_week' are the year.
    DIV(epi_week, 100) AS year,
    -- 'cases' contains some NA values that are encoded as '\N'. Convert those to zero.
    SUM(IFNULL(SAFE_CAST(cases AS INT64), 0)) AS total_cases
  FROM
    lookerdata.cdc.project_tycho_reports
  GROUP BY
    1,
    2),
  --
  -- For each disease and year, this 'rank' indicates for how many years
  -- prior cases have been reported, inclusive of this year.
  --
  disease_cases_per_year_with_reporting_history AS (
  SELECT
    *,
    RANK() OVER (PARTITION BY disease ORDER BY year) AS z___rank
  FROM
    disease_cases_per_year),
  --
  -- For each year, identify the minimum years of prior cases that
  -- have been reported for any disease, inclusive of this year.
  --
  disease_cases_per_year_with_reporting_history_and_min AS (
  SELECT
    *,
    MIN(z___rank) OVER (PARTITION BY year) AS z___min_rank
  FROM
      disease_cases_per_year_with_reporting_history)
  --
  -- Return the yearly disease cases ordered by total cases.
  --
SELECT
  *
FROM
  disease_cases_per_year_with_reporting_history_and_min
ORDER BY
  total_cases,
  year,
  disease

Command-line

bq query --use_legacy_sql=false '
WITH
  --
  -- Group diseases cases per year.
  --
  disease_cases_per_year AS (
  SELECT
    disease,
    -- The most significant four digits of 'epi_week' are the year.
    DIV(epi_week, 100) AS year,
    -- 'cases' contains some NA values that are encoded as '\N'. Convert those to zero.
    SUM(IFNULL(SAFE_CAST(cases AS INT64), 0)) AS total_cases
  FROM
    lookerdata.cdc.project_tycho_reports
  GROUP BY
    1,
    2),
  --
  -- For each disease and year, this 'rank' indicates for how many years
  -- prior cases have been reported, inclusive of this year.
  --
  disease_cases_per_year_with_reporting_history AS (
  SELECT
    *,
    RANK() OVER (PARTITION BY disease ORDER BY year) AS z___rank
  FROM
    disease_cases_per_year),
  --
  -- For each year, identify the minimum years of prior cases that
  -- have been reported for any disease, inclusive of this year.
  --
  disease_cases_per_year_with_reporting_history_and_min AS (
  SELECT
    *,
    MIN(z___rank) OVER (PARTITION BY year) AS z___min_rank
  FROM
      disease_cases_per_year_with_reporting_history)
  --
  -- Return the yearly disease cases ordered by total cases.
  --
SELECT
  *
FROM
  disease_cases_per_year_with_reporting_history_and_min
ORDER BY
  total_cases,
  year,
  disease'

Sample results are shown here:

+-------------+------+-------------+----------+--------------+
|   disease   | year | total_cases | z___rank | z___min_rank |
+-------------+------+-------------+----------+--------------+
| RUBELLA     | 2002 |           4 |       37 |           35 |
| MEASLES     | 2002 |          14 |       75 |           35 |
| SMALLPOX    | 1952 |          16 |       25 |           15 |
| RUBELLA     | 2001 |          18 |       36 |           34 |
| SMALLPOX    | 1951 |          19 |       24 |           14 |
| SMALLPOX    | 1950 |          42 |       23 |           13 |
| POLIO       | 1967 |          43 |       40 |            2 |
| POLIO       | 1965 |          55 |       38 |           38 |
| POLIO       | 1968 |          56 |       41 |            1 |
| MEASLES     | 2000 |          56 |       73 |           33 |
| MEASLES     | 1998 |          58 |       71 |           31 |
...
+-------------+------+-------------+----------+--------------+

Comparing Mumps outbreak in California and Connecticut

Mumps shows the same seasonal pattern coast to coast, in both California and Connecticut, during the 1970 outbreak.

Web UI

#standardSQL
  --
  -- Group state cases per week, restricting to only mumps in California and
  -- Connecticut in 1970.
  --
SELECT
  state,
  epi_week AS week,
  -- 'cases' contains some NA values that are encoded as '\N'. Convert those to zero.
  SUM(IFNULL(SAFE_CAST(cases AS INT64), 0)) AS total_cases
FROM
  lookerdata.cdc.project_tycho_reports
WHERE
  disease = 'MUMPS'
  -- The most significant four digits of 'epi_week' are the year.
  AND DIV(epi_week, 100) = 1970
  AND state IN ('CA', 'CT')
GROUP BY
  state,
  week
ORDER BY
  week,
  state,
  total_cases

Command-line

bq query --use_legacy_sql=false '
--
-- Group state cases per week, restricting to only mumps in California and
-- Connecticut in 1970.
--
SELECT
  state,
  epi_week AS week,
  -- 'cases' contains some NA values that are encoded as '\N'. Convert those to zero.
  SUM(IFNULL(SAFE_CAST(cases AS INT64), 0)) AS total_cases
FROM
  lookerdata.cdc.project_tycho_reports
WHERE
  disease = "MUMPS"
  -- The most significant four digits of 'epi_week' are the year.
  AND DIV(epi_week, 100) = 1970
  AND state IN ("CA", "CT")
GROUP BY
  state,
  week
ORDER BY
  week,
  state,
  total_cases'

Sample results are shown here:

+-------+--------+-------------+
| state |  week  | total_cases |
+-------+--------+-------------+
| CA    | 197001 |          66 |
| CT    | 197001 |          71 |
| CA    | 197002 |         120 |
| CT    | 197002 |         125 |
| CA    | 197003 |         116 |
| CT    | 197003 |         184 |
| CA    | 197004 |         189 |
| CT    | 197004 |         114 |
| CA    | 197005 |         125 |
| CT    | 197005 |         142 |
...
+-------+--------+-------------+

About the data

Dataset Source: Data.gov

Category: Health

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

View in BigQuery: Go to the USA Contagious Diseasae Dataset

Was this page helpful? Let us know how we did:

Send feedback about...