Medicare Data

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.

    Enable the API

Dataset overview

This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarizes the utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers. The dataset includes the following data.

  • Common inpatient and outpatient services
  • All physician and other supplier procedures and services
  • All Part D prescriptions.

Providers determine what they will charge for items, services, and procedures provided to patients and these charges are the amount that providers bill for an item, service, or procedure.

You can start exploring this data in the BigQuery console:

Go to Medicare Dataset

Sample queries

Here are some examples of SQL queries you can run on this data in BigQuery.

These samples use BigQuery’s support for standard SQL. Use the #standardSQL tag to let BigQuery know you want to use standard SQL. For more information about the #standardSQL prefix, see Setting a query prefix.

What is the total number of medications prescribed in each state?

The following query computes the total number of claims made in each state. The total claim count for the top five states is shown in the output table.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  nppes_provider_state AS state,
  ROUND(SUM(total_claim_count) / 1e6) AS total_claim_count_millions
FROM
  `bigquery-public-data.medicare.part_d_prescriber_2014`
GROUP BY
  state
ORDER BY
  total_claim_count_millions DESC
LIMIT
  5;

Command-line

bq query --use_legacy_sql=false '
SELECT
  nppes_provider_state AS state,
  ROUND(SUM(total_claim_count) / 1e6) AS total_claim_count_millions
FROM
  `bigquery-public-data.medicare.part_d_prescriber_2014`
GROUP BY
  state
ORDER BY
  total_claim_count_millions DESC
LIMIT
  5;'

The results are shown here:

+-------+----------------------------+
| state | total_claim_count_millions |
+-------+----------------------------+
|  CA   | 116.0                      |
|  FL   | 91.0                       |
|  NY   | 80.0                       |
|  TX   | 76.0                       |
|  PA   | 63.0                       |
+-------+----------------------------+

What is the most prescribed medication in each state?

This query builds on the previous query to find the medication with the highest claim count for each state.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  A.state,
  drug_name,
  total_claim_count,
  day_supply,
  ROUND(total_cost_millions) AS total_cost_millions
FROM (
  SELECT
    generic_name AS drug_name,
    nppes_provider_state AS state,
    ROUND(SUM(total_claim_count)) AS total_claim_count,
    ROUND(SUM(total_day_supply)) AS day_supply,
    ROUND(SUM(total_drug_cost)) / 1e6 AS total_cost_millions
  FROM
    `bigquery-public-data.medicare.part_d_prescriber_2014`
  GROUP BY
    state,
    drug_name) A
INNER JOIN (
  SELECT
    state,
    MAX(total_claim_count) AS max_total_claim_count
  FROM (
    SELECT
      nppes_provider_state AS state,
      ROUND(SUM(total_claim_count)) AS total_claim_count
    FROM
      `bigquery-public-data.medicare.part_d_prescriber_2014`
    GROUP BY
      state,
      generic_name)
  GROUP BY
    state) B
ON
  A.state = B.state
  AND A.total_claim_count = B.max_total_claim_count
ORDER BY
  A.total_claim_count DESC
LIMIT
  5;

Command-line

bq query --use_legacy_sql=false '
SELECT
  A.state,
  drug_name,
  total_claim_count,
  day_supply,
  ROUND(total_cost_millions) AS total_cost_millions
FROM (
  SELECT
    generic_name AS drug_name,
    nppes_provider_state AS state,
    ROUND(SUM(total_claim_count)) AS total_claim_count,
    ROUND(SUM(total_day_supply)) AS day_supply,
    ROUND(SUM(total_drug_cost)) / 1e6 AS total_cost_millions
  FROM
    `bigquery-public-data.medicare.part_d_prescriber_2014`
  GROUP BY
    state,
    drug_name) A
INNER JOIN (
  SELECT
    state,
    MAX(total_claim_count) AS max_total_claim_count
  FROM (
    SELECT
      nppes_provider_state AS state,
      ROUND(SUM(total_claim_count)) AS total_claim_count
    FROM
      `bigquery-public-data.medicare.part_d_prescriber_2014`
    GROUP BY
      state,
      generic_name)
  GROUP BY
    state) B
ON
  A.state = B.state
  AND A.total_claim_count = B.max_total_claim_count
ORDER BY
  A.total_claim_count DESC
LIMIT
  5;'

The top five states are shown in the output, along with the medication name, total cost and total claim count.

+-------+---------------------------+-------------------+--------------+---------------------+
| state |         drug_name         | total_claim_count |  day_supply  | total_cost_millions |
+-------+---------------------------+-------------------+--------------+---------------------+
| CA    | LEVOTHYROXINE SODIUM      |         3845087.0 |  2.1174738E8 |                78.0 |
| FL    | LEVOTHYROXINE SODIUM      |         2982612.0 |  1.6338837E8 |                64.0 |
| TX    | HYDROCODONE/ACETAMINOPHEN |         2834059.0 |  6.0410516E7 |                63.0 |
| NY    | AMLODIPINE BESYLATE       |         2609896.0 | 1.23226481E8 |                21.0 |
| PA    | LEVOTHYROXINE SODIUM      |         2353845.0 | 1.09168537E8 |                44.0 |
+-------+---------------------------+-------------------+--------------+---------------------+

Stratified Random Sampling of the dataset

The following query demonstrates how you can stratify a dataset and draw a random sample from each strata. In this case, the cross product of provider_state and provider_city defines the list of strata. We assign a RANK() order number per strata after ordering by a random number. You can specify the number of samples per strata by editing the filter: WHERE samples_per_strata <= 10.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  *
FROM (
  SELECT
    provider_state,
    provider_city,
    ROUND(average_covered_charges,2),
    ROUND(average_total_payments,2),
    ROUND(average_medicare_payments,2),
    RANK() OVER(PARTITION BY provider_state, provider_city ORDER BY random_number) AS samples_per_strata
  FROM (
    SELECT
      provider_state,
      provider_city,
      average_covered_charges,
      average_total_payments,
      average_medicare_payments,
      RAND() AS random_number
    FROM
     `bigquery-public-data.medicare.inpatient_charges_2014`) )
WHERE
  samples_per_strata <= 10
LIMIT
  5;

Command-line

bq query --use_legacy_sql=false '
SELECT
  *
FROM (
  SELECT
    provider_state,
    provider_city,
    ROUND(average_covered_charges,2),
    ROUND(average_total_payments,2),
    ROUND(average_medicare_payments,2),
    RANK() OVER(PARTITION BY provider_state, provider_city ORDER BY random_number) AS samples_per_strata
  FROM (
    SELECT
      provider_state,
      provider_city,
      average_covered_charges,
      average_total_payments,
      average_medicare_payments,
      RAND() AS random_number
    FROM
     `bigquery-public-data.medicare.inpatient_charges_2014`) )
WHERE
  samples_per_strata <= 10
LIMIT
  5;'

The results are shown here:

+----------------+---------------+-----------+----------+----------+--------------------+
| provider_state | provider_city |    f0_    |   f1_    |   f2_    | samples_per_strata |
+----------------+---------------+-----------+----------+----------+--------------------+
| AL             | GADSDEN       | 100478.64 |  7359.86 |  6324.43 |                  3 |
| AL             | GADSDEN       |  57587.79 |  6003.93 |  5309.07 |                  1 |
| AL             | GADSDEN       |  47132.23 |   4039.9 |  2634.94 |                  5 |
| AL             | GADSDEN       |  55738.92 |  6098.92 |  5264.46 |                  2 |
| AL             | GADSDEN       |  582430.0 | 59699.62 | 58705.15 |                  4 |
+----------------+---------------+-----------+----------+----------+--------------------+

What is the average cost for inpatient and outpatient treatment in each city and state?

To find the average inpatient and outpatient cost by provider in each city and state for 2014, we need two tables: the inpatient 2014 table (inpatient_charges_2014) and the outpatient 2014 table (outpatient_charges_2014).

The query includes the following steps.

  1. Aggregate the cost in each table by provider_id. This is done by taking a weighted average of the costs for each procedure done by the provider.

  2. Join on provider_id, provider_city, and provider_state.

  3. Order by the combined average cost of inpatient and outpatient procedures.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  OP.provider_state AS State,
  OP.provider_city AS City,
  OP.provider_id AS Provider_ID,
  ROUND(OP.average_OP_cost) AS Average_OP_Cost,
  ROUND(IP.average_IP_cost) AS Average_IP_Cost,
  ROUND(OP.average_OP_cost + IP.average_IP_cost) AS Combined_Average_Cost
FROM (
  SELECT
    provider_state,
    provider_city,
    provider_id,
    SUM(average_total_payments*outpatient_services)/SUM(outpatient_services) AS average_OP_cost
  FROM
    `bigquery-public-data.medicare.outpatient_charges_2014`
  GROUP BY
    provider_state,
    provider_city,
    provider_id ) AS OP
INNER JOIN (
  SELECT
    provider_state,
    provider_city,
    provider_id,
    SUM(average_medicare_payments*total_discharges)/SUM(total_discharges) AS average_IP_cost
  FROM
    `bigquery-public-data.medicare.inpatient_charges_2014`
  GROUP BY
    provider_state,
    provider_city,
    provider_id ) AS IP
ON
  OP.provider_id = IP.provider_id
  AND OP.provider_state = IP.provider_state
  AND OP.provider_city = IP.provider_city
ORDER BY
  combined_average_cost DESC
LIMIT
  10;

Command-line

bq query --use_legacy_sql=false '
SELECT
  OP.provider_state AS State,
  OP.provider_city AS City,
  OP.provider_id AS Provider_ID,
  ROUND(OP.average_OP_cost) AS Average_OP_Cost,
  ROUND(IP.average_IP_cost) AS Average_IP_Cost,
  ROUND(OP.average_OP_cost + IP.average_IP_cost) AS Combined_Average_Cost
FROM (
  SELECT
    provider_state,
    provider_city,
    provider_id,
    SUM(average_total_payments*outpatient_services)/SUM(outpatient_services) AS average_OP_cost
  FROM
    `bigquery-public-data.medicare.outpatient_charges_2014`
  GROUP BY
    provider_state,
    provider_city,
    provider_id ) AS OP
INNER JOIN (
  SELECT
    provider_state,
    provider_city,
    provider_id,
    SUM(average_medicare_payments*total_discharges)/SUM(total_discharges) AS average_IP_cost
  FROM
    `bigquery-public-data.medicare.inpatient_charges_2014`
  GROUP BY
    provider_state,
    provider_city,
    provider_id ) AS IP
ON
  OP.provider_id = IP.provider_id
  AND OP.provider_state = IP.provider_state
  AND OP.provider_city = IP.provider_city
ORDER BY
  combined_average_cost DESC
LIMIT
  10;'

The results are shown here:

+-------+---------------+-------------+-----------------+-----------------+-----------------------+
| State |     City      | Provider_ID | Average_OP_Cost | Average_IP_Cost | Combined_Average_Cost |
+-------+---------------+-------------+-----------------+-----------------+-----------------------+
| IN    | MISHAWAKA     |      150177 |           399.0 |        102521.0 |              102920.0 |
| MI    | WARREN        |      230264 |           104.0 |         88620.0 |               88724.0 |
| TX    | HOUSTON       |      450674 |            88.0 |         67571.0 |               67659.0 |
| TX    | EL PASO       |      450877 |           230.0 |         45179.0 |               45409.0 |
| CO    | VAIL          |       60096 |           261.0 |         38651.0 |               38912.0 |
| TN    | MEMPHIS       |      440152 |            90.0 |         35698.0 |               35788.0 |
| LA    | BATON ROUGE   |      190128 |            87.0 |         34369.0 |               34456.0 |
| LA    | LEESVILLE     |      190297 |           560.0 |         32611.0 |               33172.0 |
| CA    | SAN FRANCISCO |       50454 |           157.0 |         32803.0 |               32959.0 |
| CA    | LOS ANGELES   |       50262 |           120.0 |         31795.0 |               31916.0 |
+-------+---------------+-------------+-----------------+-----------------+-----------------------+

Ranking of conditions and comparison of average payments by city

In this query we are trying to answer the following questions:

  • Which are the most common inpatient diagnostic conditions in the United States?

  • Which cities have the most number of cases for each diagnostic condition?

  • What are the average payments for these conditions in these cities and how do they compare to the national average?

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  drg_definition AS Diagnosis,
  provider_city AS City,
  provider_state AS State,
  cityrank AS City_Rank,
  CAST(ROUND(citywise_avg_total_payments) AS INT64) AS Citywise_Avg_Payments,
  CONCAT(CAST(ROUND(citywise_avg_total_payments /national_avg_total_payments * 100, 0) AS STRING), " %") AS Avg_Payments_City_vs_National
FROM (
  SELECT
    drg_definition,
    provider_city,
    provider_state,
    cityrank,
    national_num_cases,
    citywise_avg_total_payments,
    national_sum_total_payments,
    (national_sum_total_payments /national_num_cases) AS national_avg_total_payments
  FROM (
    SELECT
      drg_definition,
      provider_city,
      provider_state,
      citywise_avg_total_payments,
      RANK() OVER (PARTITION BY drg_definition ORDER BY citywise_num_cases DESC ) AS cityrank,
      SUM(citywise_num_cases) OVER (PARTITION BY drg_definition ) AS national_num_cases,
      SUM(citywise_sum_total_payments) OVER (PARTITION BY drg_definition ) AS national_sum_total_payments
    FROM (
      SELECT
        drg_definition,
        provider_city,
        provider_state,
        SUM(total_discharges) AS citywise_num_cases,
        SUM(average_total_payments * total_discharges)/ SUM(total_discharges) AS citywise_avg_total_payments,
        SUM(average_total_payments * total_discharges) AS citywise_sum_total_payments
      FROM
        `bigquery-public-data.medicare.inpatient_charges_2014`
      GROUP BY
        drg_definition,
        provider_city,
        provider_state))
  WHERE
    cityrank <=3)  # Limit to top 3 cities for each Diagnosis
ORDER BY
  national_num_cases DESC,
  cityrank
LIMIT
  9;  # Limit Results to the top 3 cities for the top 3 diagnosis

Command-line

bq query --use_legacy_sql=false '
SELECT
  drg_definition AS Diagnosis,
  provider_city AS City,
  provider_state AS State,
  cityrank AS City_Rank,
  CAST(ROUND(citywise_avg_total_payments) AS INT64) AS Citywise_Avg_Payments,
  CONCAT(CAST(ROUND(citywise_avg_total_payments /national_avg_total_payments * 100, 0) AS STRING), " %") AS Avg_Payments_City_vs_National
FROM (
  SELECT
    drg_definition,
    provider_city,
    provider_state,
    cityrank,
    national_num_cases,
    citywise_avg_total_payments,
    national_sum_total_payments,
    (national_sum_total_payments /national_num_cases) AS national_avg_total_payments
  FROM (
    SELECT
      drg_definition,
      provider_city,
      provider_state,
      citywise_avg_total_payments,
      RANK() OVER (PARTITION BY drg_definition ORDER BY citywise_num_cases DESC ) AS cityrank,
      SUM(citywise_num_cases) OVER (PARTITION BY drg_definition ) AS national_num_cases,
      SUM(citywise_sum_total_payments) OVER (PARTITION BY drg_definition ) AS national_sum_total_payments
    FROM (
      SELECT
        drg_definition,
        provider_city,
        provider_state,
        SUM(total_discharges) AS citywise_num_cases,
        SUM(average_total_payments * total_discharges)/ SUM(total_discharges) AS citywise_avg_total_payments,
        SUM(average_total_payments * total_discharges) AS citywise_sum_total_payments
      FROM
        `bigquery-public-data.medicare.inpatient_charges_2014`
      GROUP BY
        drg_definition,
        provider_city,
        provider_state))
  WHERE
    cityrank <=3)  # Limit to top 3 cities for each Diagnosis
ORDER BY
  national_num_cases DESC,
  cityrank
LIMIT # Limit Results to the top 3 cities for the top 3 diagnosis
  9;'

The results are shown here:

+--------------------------------------------------------------------------+-----------+-------+-----------+-----------------------+-------------------------------+
|                                Diagnosis                                 |   City    | State | City_Rank | Citywise_Avg_Payments | Avg_Payments_City_vs_National |
+--------------------------------------------------------------------------+-----------+-------+-----------+-----------------------+-------------------------------+
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | NEW YORK  | NY    |         1 |                 21656 | 144 %                         |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | BOSTON    | MA    |         2 |                 18324 | 122 %                         |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | BALTIMORE | MD    |         3 |                 23056 | 154 %                         |
| 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC                 | HOUSTON   | TX    |         1 |                 14592 | 106 %                         |
| 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC                 | CHICAGO   | IL    |         2 |                 16071 | 117 %                         |
| 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC                 | BROOKLYN  | NY    |         3 |                 23398 | 170 %                         |
| 291 - HEART FAILURE & SHOCK W MCC                                        | CHICAGO   | IL    |         1 |                 13287 | 122 %                         |
| 291 - HEART FAILURE & SHOCK W MCC                                        | HOUSTON   | TX    |         2 |                 12572 | 116 %                         |
| 291 - HEART FAILURE & SHOCK W MCC                                        | BALTIMORE | MD    |         3 |                 16992 | 157 %                         |
+--------------------------------------------------------------------------+-----------+-------+-----------+-----------------------+-------------------------------+

About the data

Dataset Source: Centers for Medicare and Medicaid Services

Category: Healthcare

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Update Frequency: Annual

View in BigQuery: Go to Medicare dataset

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.

Send feedback about...