Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Analyzing NYC biking data with Google BigQuery

Wednesday, December 7, 2016
Posted by Sara Robinson, Developer Advocate

After moving to New York a few months ago I started using Citibike, New York’s bike share program. While biking around the city, I wondered exactly how many people were using these bikes to get around, and more specifically, the most popular times and stations and the demographics of riders.

You can imagine my excitement when I discovered that Citibike publishes data on every single bike trip in New York City since July 2013. Using Google BigQuery, I found some pretty interesting trends in the data. Here are some highlights:

  • The most trips taken by a single Citibike bike is 5,594. See its journey over 3 years mapped below.
  • Men and women take very different bike routes.
  • 87% of riders are Citibike subscribers. Their routes and pickup locations differ from those of non-subscribers, who tend to take circular routes around Central Park.

By analyzing the data with BigQuery, I was able to query 33 million rows of trip data in seconds without worrying about managing any of the infrastructure behind those queries. All of the queries in this post took less than 15 seconds to execute, and were within BigQuery’s free monthly terabyte of data processing.

Ready to start querying? Check out the public tables.

Getting started

With a simple COUNT(*) query we can see that 33,319,019 trips have been taken since Citibike started collecting data in 2013.

#standardSQL
SELECT COUNT(*) FROM `bigquery-public-data.new_york.citibike_trips`

Doing the same on the citibike stations table, there are 663 Citibike stations.

Using COUNT DISTINCT, we can find out how many Citibikes there are in New York and how this number has grown over time:

#standardSQL
SELECT 
 EXTRACT(YEAR from starttime) as year,  
  COUNT(DISTINCT(bikeid)) as num_bikes
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY year
ORDER BY year

Most popular stations

Of the 663 Citibike stations, I wanted to find out which were the most popular for starting a trip. First I queried the entire dataset using BigQuery’s standard SQL (note that the public tables include all trips in one table):
#standardSQL
SELECT
      start_station_name,
      start_station_latitude,
      start_station_longitude, COUNT(*) as num_trips
FROM
      `bigquery-public-data.new_york.citibike_trips`
GROUP BY
      1,2,3
ORDER BY num_trips DESC
LIMIT 10

Here are the results:

Since this is geo data, it would be much more fun to visualize as a map. Using Carto, I created a map of every station (stations with more pickups are darker):

Trips by member type

Each trip in the Citibike dataset includes a usertype column indicating whether the rider is a subscriber. Next, I ran a simple query to see how many trips have been taken by subscribers in 2016, and if there’s a difference in average in trip-duration for subscribers vs. non-subscribers (called “Customers” in the dataset):

#standardSQL
SELECT  
  usertype,  
  COUNT(*) as num_trips,  
  ROUND(AVG(cast(tripduration as int64) / 60),2) as duration
FROM  
  `bigquery-public-data.new_york.citibike_trips`
WHERE CAST(starttime as STRING) LIKE "2016%"
GROUP BY

Here are the results:

Interestingly, 87% of trips are taken by subscribers, but their average trip duration is significantly less: 14 minutes compared to 35 for non-subscribers. This is likely because subscribers use the bikes for commuting, whereas non-subscribers use them for sightseeing.

To confirm this I looked at the top routes broken down by member type, using CONCAT to get the routes by concatenating the from and to station columns:

#standardSQL
SELECT  
  usertype,  
  CONCAT(start_station_name, " to ", end_station_name) as route,  
  COUNT(*) as num_trips,  
  ROUND(AVG(cast(tripduration as int64) / 60),2) as duration
FROM  
  `bigquery-public-data.new_york.citibike_trips`
GROUP BY  
  start_station_name, end_station_name, usertype
ORDER BY  
  num_trips DESC
LIMIT 10

While non-subscribers (“Customer”) account for a small percentage of trips, they account for a much higher percentage of top routes. Additionally, all four of the non-subscriber routes in the top 10 have the same pickup and drop off station, three of which are near the entrance to Central Park. The non-subscriber trips average a much longer duration (close to an hour), and the subscriber routes are very short commuter distances.

Breaking it down by gender and age

The Citibike data has gender and age info for all subscriber trips. I ran this query to see how the number of trips breaks down by age and gender:

#standardSQL
SELECT 
      (2016 - cast(birth_year as INT64)) as age,
      COUNT(*) c, 
      COUNT(case WHEN gender = "male" then 1 end) as num_males,
      COUNT(case WHEN gender = "female" then 1 end) as num_females 
FROM 
      `bigquery-public-data.new_york.citibike_trips`
WHERE CAST(birth_year as string) != "null"
GROUP BY age
ORDER BY c DESC

Next, I used Exploratory to create a stacked bar graph of this data. Nothing too surprising here with respect to biker age:

Before analyzing the data, I expected bike routes to be roughly the same by gender, so I was surprised by the results. This query looks at the top 5 routes taken by women in 2016:

#standardSQL
SELECT  
  CONCAT(start_station_name, " to ", end_station_name) as route,  
  COUNT(*) as num_trips 
FROM  
  `bigquery-public-data.new_york.citibike_trips`
WHERE gender = "female"
AND CAST(starttime as string) LIKE '2016%'
GROUP BY start_station_name, end_station_name
ORDER BY num_trips DESC
LIMIT 5

Here are the results:

Comparing this with the top 5 routes for male riders, we can see there’s only one overlap:

Using Carto, I’ve created a map of the top 30 routes for both men and women (purple routes are women, blue routes are men). You can hover over a route to see the name of the route along with the number of trips taken in 2016.

The life of a single bike

For each trip, Citibike gives us a bike ID to indicate the particular bike that was used. I ran a query to find out which bike in Citibike’s system had the most trips since 2013:

#standardSQL
SELECT  
  bikeid,  
  COUNT(*) c 
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY 1
ORDER BY c DESC

The winner was bike number 17526 with a total of 5,594 trips over the years. Then I ran a query to get the latitude and longitude of the pickup station for each of those trips and mapped the results over time (note that the date format in the Citibike dataset changed from 2014 - 2015, hence the regex):

#standardSQL
SELECT  
  starttime,  
  start_station_latitude,  
  start_station_longitude 
FROM `bigquery-public-data.new_york.citibike_trips`
WHERE cast(bikeid as int64) = 17526
ORDER BY starttime

Check out the full animated map here.

(And if you were wondering, bike 17526 has visited 469 of Citibike’s 554 stations.)

Here’s a heatmap of the stations it has visited.

What’s next?

We’ve made the Citibike dataset public in BigQuery so you can run your own queries on the data (and maybe even find the row with your own route, if you’re a member). You can find even more BigQuery public datasets here.

Have questions or ideas for future posts? Leave a comment or find me on Twitter @SRobTweets. And, here are some good next steps:

  • Big Data Solutions

  • Product deep dives, technical comparisons, how-to's and tips and tricks for using the latest data processing and machine learning technologies.

  • Learn More

12 Months FREE TRIAL

Try BigQuery, Machine Learning and other cloud products and get $300 free credit to spend over 12 months.

TRY IT FREE

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.