Chicago Taxi Trips

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.

    Enable the API

Dataset overview

This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. See http://digital.cityofchicago.org/index.php/chicago-taxi-data-released for more information about this dataset and how it was created.

You can start exploring this data in the BigQuery console:

Go to Chicago Taxi Trips data

Sample queries

Here are some examples of SQL queries you can run on this data in BigQuery.

These samples use BigQuery’s support for standard SQL. Use the #standardSQL tag to let BigQuery know you want to use standard SQL. For more information about the #standardSQL prefix, see Setting a query prefix.

What are the maximum, minimum and average fares for rides lasting 10 minutes or more?

This query retrieves the maximum, minimum, and average fare for rides lasting 10 minutes or longer. The results are grouped by week day. In addition, the query calculates the standard deviation of fares and the total number of rides given on each day.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS day,
  FORMAT('%3.2f', MAX(fare)) AS maximum_fare,
  FORMAT('%3.2f', MIN(fare)) AS minimum_fare,
  FORMAT('%3.2f', AVG(fare)) AS avg_fare,
  FORMAT('%3.2f', STDDEV(fare)) AS std_dev_fare,
  COUNT(1) AS rides
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 600
GROUP BY
  day
ORDER BY
  day

Command-line

bq query --use_legacy_sql=false '
SELECT
  EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS day,
  FORMAT('%3.2f', MAX(fare)) AS maximum_fare,
  FORMAT('%3.2f', MIN(fare)) AS minimum_fare,
  FORMAT('%3.2f', AVG(fare)) AS avg_fare,
  FORMAT('%3.2f', STDDEV(fare)) AS std_dev_fare,
  COUNT(1) AS rides
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 600
GROUP BY
  day
ORDER BY
  day'

Sample results are shown here:

+------+---------------+---------------+-----------+---------------+------------+
| day  | maximum_fare  | minimum_fare  | avg_fare  | std_dev_fare  | rides      |
+------+---------------+---------------+-----------+---------------+------------+
| 1    | 9900.41       | 0.00          | 18.90     | 51.07         | 6216446    |
| 2    | 9900.42       | 0.00          | 20.29     | 56.45         | 5846465    |
| 3    | 9900.41       | 0.00          | 19.01     | 54.48         | 6183382    |
| 4    | 9825.04       | 0.00          | 18.99     | 56.53         | 6638638    |
| 5    | 9702.08       | 0.00          | 18.76     | 52.25         | 7240327    |
| 6    | 9900.45       | 0.00          | 17.67     | 56.25         | 8319078    |
| 7    | 9900.21       | 0.00          | 15.84     | 51.08         | 7847205    |
+------+---------------+---------------+-----------+---------------+------------+

An analysis of the data reveals some outliers — $9,900 maximum fares for taxi rides. This data needs to be filtered to do further analysis.

Which drop-off areas have the highest average tip?

This query retrieves the highest average tip and the maximum tip given to taxi drivers grouped by drop-off community areas.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  dropoff_community_area,
  FORMAT('%3.2f', AVG(tips)) AS average_tip,
  FORMAT('%3.2f', MAX(tips)) AS max_tip
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  dropoff_community_area IS NOT NULL
GROUP BY
  dropoff_community_area
ORDER BY
  average_tip DESC
LIMIT
  10

Command-line

bq query --use_legacy_sql=false '
SELECT
  dropoff_community_area,
  FORMAT('%3.2f', AVG(tips)) AS average_tip,
  FORMAT('%3.2f', MAX(tips)) AS max_tip
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  dropoff_community_area IS NOT NULL
GROUP BY
  dropoff_community_area
ORDER BY
  average_tip DESC
LIMIT
  10'

Sample results are shown here:

+------------------------+--------------+-----------+
| dropoff_community_area | average_tip  | max_tip   |
+------------------------+--------------+-----------+
| 76                     | 3.69         | 596.85    |
| 72                     | 3.43         | 80.00     |
| 56                     | 3.12         | 285.00    |
| 74                     | 2.58         | 150.00    |
| 75                     | 2.13         | 75.00     |
| 9                      | 1.83         | 74.75     |
| 41                     | 1.81         | 130.00    |
| 64                     | 1.69         | 100.00    |
| 55                     | 1.67         | 27.90     |
| 52                     | 1.49         | 38.85     |
+------------------------+--------------+-----------+

The query results reveal that community area 76 (O’Hare airport) yields the highest average tip across all rides and a maximum tip of nearly $600. That person really needed to make their flight!

How does trip duration affect fare rates for trips lasting less than 90 minutes?

This query converts trip durations to minutes, buckets them into deciles, and calculates the total number of trips and the average fare across each duration range.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  FORMAT('%02.0fm to %02.0fm', min_minutes, max_minutes) AS minutes_range,
  SUM(trips) AS total_trips,
  FORMAT('%3.2f', SUM(total_fare) / SUM(trips)) AS average_fare
FROM (
  SELECT
    MIN(duration_in_minutes) OVER (quantiles) AS min_minutes,
    MAX(duration_in_minutes) OVER (quantiles) AS max_minutes,
    SUM(trips) AS trips,
    SUM(total_fare) AS total_fare
  FROM (
    SELECT
      ROUND(trip_seconds / 60) AS duration_in_minutes,
      NTILE(10) OVER (ORDER BY trip_seconds / 60) AS quantile,
      COUNT(1) AS trips,
      SUM(fare) AS total_fare
    FROM
      `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE
      ROUND(trip_seconds / 60) BETWEEN 1 AND 90
    GROUP BY
      trip_seconds,
      duration_in_minutes )
  GROUP BY
    duration_in_minutes,
    quantile
  WINDOW quantiles AS (PARTITION BY quantile)
  )
GROUP BY
  minutes_range
ORDER BY
  Minutes_range

Command-line

bq query --use_legacy_sql=false '
SELECT
  FORMAT('%02.0fm to %02.0fm', min_minutes, max_minutes) AS minutes_range,
  SUM(trips) AS total_trips,
  FORMAT('%3.2f', SUM(total_fare) / SUM(trips)) AS average_fare
FROM (
  SELECT
    MIN(duration_in_minutes) OVER (quantiles) AS min_minutes,
    MAX(duration_in_minutes) OVER (quantiles) AS max_minutes,
    SUM(trips) AS trips,
    SUM(total_fare) AS total_fare
  FROM (
    SELECT
      ROUND(trip_seconds / 60) AS duration_in_minutes,
      NTILE(10) OVER (ORDER BY trip_seconds / 60) AS quantile,
      COUNT(1) AS trips,
      SUM(fare) AS total_fare
    FROM
      `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE
      ROUND(trip_seconds / 60) BETWEEN 1 AND 90
    GROUP BY
      trip_seconds,
      duration_in_minutes )
  GROUP BY
    duration_in_minutes,
    quantile
  WINDOW quantiles AS (PARTITION BY quantile)
  )
GROUP BY
  minutes_range
ORDER BY
  Minutes_range'

Sample results are shown here:

+----------------+-------------+---------------+
| minutes_range  | total_trips | average_fare  |
+----------------+-------------+---------------+
| 01m to 09m     | 40542103    | 6.40          |
| 09m to 17m     | 31987217    | 10.41         |
| 17m to 25m     | 10768268    | 19.27         |
| 25m to 33m     | 5221200     | 29.13         |
| 33m to 41m     | 2720021     | 34.63         |
| 41m to 49m     | 1571449     | 38.01         |
| 49m to 57m     | 912660      | 40.54         |
| 57m to 65m     | 498915      | 42.99         |
| 65m to 74m     | 302588      | 45.33         |
| 75m to 90m     | 174676      | 50.27         |
+----------------+-------------+---------------+

You can use this query to create a report in Google Data Studio. Graphing the query results reveals that fare rates mostly scale linearly according to the ride’s duration. As well, the duration of the vast majority of rides is less than 17 minutes.

Tips and fares by ride duration

For a tutorial on using Google Data Studio with BigQuery, see Visualizing BigQuery Data Using Google Data Studio.

About the data

Dataset Source: City of Chicago

Category: Chicago, Transportation

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofchicago.org - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Update Frequency: Monthly

View in BigQuery: Go to Chicago Taxi Trips data

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.

Send feedback about...