Analyzing NYC biking data with Google BigQuery
After moving to New York a few months ago I started using Citibike, New York’s bike share program. While biking around the city, I wondered exactly how many people were using these bikes to get around, and more specifically, the most popular times and stations and the demographics of riders.
You can imagine my excitement when I discovered that Citibike publishes data on every single bike trip in New York City since July 2013. Using Google BigQuery, I found some pretty interesting trends in the data. Here are some highlights:
- The most trips taken by a single Citibike bike is 5,594. See its journey over 3 years mapped below.
- Men and women take very different bike routes.
- 87% of riders are Citibike subscribers. Their routes and pickup locations differ from those of non-subscribers, who tend to take circular routes around Central Park.
By analyzing the data with BigQuery, I was able to query 33 million rows of trip data in seconds without worrying about managing any of the infrastructure behind those queries. All of the queries in this post took less than 15 seconds to execute, and were within BigQuery’s free monthly terabyte of data processing.
Ready to start querying? Check out the public tables.
With a simple
COUNT(*) query we can see that 33,319,019 trips have been taken since Citibike started collecting data in 2013.
#standardSQL SELECT COUNT(*) FROM `bigquery-public-data.new_york.citibike_trips`
Doing the same on the citibike stations table, there are 663 Citibike stations.
COUNT DISTINCT, we can find out how many Citibikes there are in New York and how this number has grown over time:
#standardSQL SELECT EXTRACT(YEAR from starttime) as year, COUNT(DISTINCT(bikeid)) as num_bikes FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY year ORDER BY year
Most popular stationsOf the 663 Citibike stations, I wanted to find out which were the most popular for starting a trip. First I queried the entire dataset using BigQuery’s standard SQL (note that the public tables include all trips in one table):
#standardSQL SELECT start_station_name, start_station_latitude, start_station_longitude, COUNT(*) as num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY 1,2,3 ORDER BY num_trips DESC LIMIT 10
Here are the results:
Trips by member type
Each trip in the Citibike dataset includes a usertype column indicating whether the rider is a subscriber. Next, I ran a simple query to see how many trips have been taken by subscribers in 2016, and if there’s a difference in average in trip-duration for subscribers vs. non-subscribers (called “Customers” in the dataset):
#standardSQL SELECT usertype, COUNT(*) as num_trips, ROUND(AVG(cast(tripduration as int64) / 60),2) as duration FROM `bigquery-public-data.new_york.citibike_trips` WHERE CAST(starttime as STRING) LIKE "2016%" GROUP BY
Here are the results:
Interestingly, 87% of trips are taken by subscribers, but their average trip duration is significantly less: 14 minutes compared to 35 for non-subscribers. This is likely because subscribers use the bikes for commuting, whereas non-subscribers use them for sightseeing.
To confirm this I looked at the top routes broken down by member type, using CONCAT to get the routes by concatenating the from and to station columns:
#standardSQL SELECT usertype, CONCAT(start_station_name, " to ", end_station_name) as route, COUNT(*) as num_trips, ROUND(AVG(cast(tripduration as int64) / 60),2) as duration FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY start_station_name, end_station_name, usertype ORDER BY num_trips DESC LIMIT 10
While non-subscribers (“Customer”) account for a small percentage of trips, they account for a much higher percentage of top routes. Additionally, all four of the non-subscriber routes in the top 10 have the same pickup and drop off station, three of which are near the entrance to Central Park. The non-subscriber trips average a much longer duration (close to an hour), and the subscriber routes are very short commuter distances.
Breaking it down by gender and age
The Citibike data has gender and age info for all subscriber trips. I ran this query to see how the number of trips breaks down by age and gender:
#standardSQL SELECT (2016 - cast(birth_year as INT64)) as age, COUNT(*) c, COUNT(case WHEN gender = "male" then 1 end) as num_males, COUNT(case WHEN gender = "female" then 1 end) as num_females FROM `bigquery-public-data.new_york.citibike_trips` WHERE CAST(birth_year as string) != "null" GROUP BY age ORDER BY c DESC
Before analyzing the data, I expected bike routes to be roughly the same by gender, so I was surprised by the results. This query looks at the top 5 routes taken by women in 2016:
#standardSQL SELECT CONCAT(start_station_name, " to ", end_station_name) as route, COUNT(*) as num_trips FROM `bigquery-public-data.new_york.citibike_trips` WHERE gender = "female" AND CAST(starttime as string) LIKE '2016%' GROUP BY start_station_name, end_station_name ORDER BY num_trips DESC LIMIT 5
Here are the results:
Comparing this with the top 5 routes for male riders, we can see there’s only one overlap:
Using Carto, I’ve created a map of the top 30 routes for both men and women (purple routes are women, blue routes are men). You can hover over a route to see the name of the route along with the number of trips taken in 2016.
The life of a single bike
For each trip, Citibike gives us a bike ID to indicate the particular bike that was used. I ran a query to find out which bike in Citibike’s system had the most trips since 2013:
#standardSQL SELECT bikeid, COUNT(*) c FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY 1 ORDER BY c DESC
The winner was bike number 17526 with a total of 5,594 trips over the years. Then I ran a query to get the latitude and longitude of the pickup station for each of those trips and mapped the results over time (note that the date format in the Citibike dataset changed from 2014 - 2015, hence the regex):
#standardSQL SELECT starttime, start_station_latitude, start_station_longitude FROM `bigquery-public-data.new_york.citibike_trips` WHERE cast(bikeid as int64) = 17526 ORDER BY starttime
Check out the full animated map here.
(And if you were wondering, bike 17526 has visited 469 of Citibike’s 554 stations.)
Here’s a heatmap of the stations it has visited.
We’ve made the Citibike dataset public in BigQuery so you can run your own queries on the data (and maybe even find the row with your own route, if you’re a member). You can find even more BigQuery public datasets here.
Have questions or ideas for future posts? Leave a comment or find me on Twitter @SRobTweets. And, here are some good next steps: