New York City public datasets now available on Google BigQuery
By Reto Meier, Google Developer Advocate
This rich dataset makes it easy to learn how to explore and visualize data using BigQuery.
New York City is home to 8.5 million residents, and more than 50 million people visit this vibrant and dynamic city each year. With so many sights and sounds, it’s easy to get lost in the details, and lose sight of the big picture: How do New Yorkers actually survive in the “concrete jungle?”
Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:
- Over 8 million 311 service requests from 2012-2016 (updated daily)
- More than 1 million motor vehicle collisions 2012-present (updated regularly)
- Citi Bike stations and 30 million Citi Bike trips 2013-present (updated regularly)
- Over 1 billion Yellow and Green Taxi rides from 2009-present (updated regularly)
- Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015
There’s no cost for the first terabyte of data you process each month, and because BigQuery is serverless, there’s no infrastructure you need to manage or maintain. That means we can focus on querying, joining and visualizing this data to learn more about New York City and the people who make up this bustling metropolis.
As you’ll see below, all these new datasets can be used with existing data, such as NOAA GSOD, to discover trends based on changes in weather. We’ll also be continually adding new datasets from other cities, so soon you’ll be able to compare habits and trends between cities and countries around the globe, using BigQuery to better understand the world around us.
On which New York City streets are you most likely to find a loud party?
If there's something strange in your neighborhood, the right number to call is 311, which was created specifically for non-emergency municipal inquiries and non-urgent community concerns. What does that include?
The graph below shows the top five reasons New Yorkers called 311 over the past four years.
(To run this query yourself, you can copy/paste the SQL below into BigQuery, or follow this link to my shared query.)
SELECT
  EXTRACT(YEAR FROM created_date) AS year,
  REPLACE(UPPER(complaint_type), "HEATING", "HEAT/HOT WATER") AS complaint,
  COUNT(*) AS count
FROM `bigquery-public-data-staging.new_york.311_service_requests_all`
GROUP BY complaint, year
ORDER BY count DESC
LIMIT 1000
Call volume tells us that it gets noisy in New York, and it also gets very cold. By joining the 311 calls to the NOAA GSOD weather table, we confirm that most calls about faulty heat and hot water happen when the temperature drops — while noise remains a constant annoyance.
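A join like the one described above might look something like the sketch below, which lines up monthly heat and hot water complaints with the average monthly temperature for a single year. Only the 311 table, `created_date` and `complaint_type` come from the query shown earlier. The GSOD table name, its `stn`, `wban`, `mo` and `temp` columns, the Central Park station identifiers and the missing-value marker are all assumptions to check against the actual schema before running it.

```sql
-- Sketch: monthly heat/hot water complaints next to the mean monthly temperature.
-- ASSUMPTIONS (not from the article): the noaa_gsod.gsod2015 table, its columns
-- (stn, wban, mo, temp), and the station identifiers used for Central Park.
WITH heat_calls AS (
  SELECT
    EXTRACT(MONTH FROM created_date) AS mon,
    COUNT(*) AS calls
  FROM `bigquery-public-data-staging.new_york.311_service_requests_all`
  WHERE UPPER(complaint_type) IN ('HEATING', 'HEAT/HOT WATER')
    AND EXTRACT(YEAR FROM created_date) = 2015
  GROUP BY mon
),
weather AS (
  SELECT
    CAST(mo AS INT64) AS mon,
    ROUND(AVG(temp), 1) AS avg_temp_f
  FROM `bigquery-public-data.noaa_gsod.gsod2015`
  WHERE stn = '725053' AND wban = '94728'  -- assumed: Central Park station
    AND temp < 9999                        -- assumed: 9999.9 marks a missing reading
  GROUP BY mon
)
SELECT mon, calls, avg_temp_f
FROM heat_calls
JOIN weather USING (mon)
ORDER BY mon
```

If the assumptions hold, the coldest months should show the highest call counts. Swapping the complaint filter for a noise-related one would let you check whether noise really is constant across the year.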
There were also 267,887 calls about dead, damaged or dying trees, so you might wonder if there are any healthy trees left in NYC.
Can you find the Virginia Pines in New York City?
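One way to go looking for Virginia Pines is to count them by borough in the 2015 survey. This is only a sketch: the table name `tree_census_2015`, the `spc_common` and `boroname` columns, and the exact spelling of the species name are assumptions, not details given in this post.

```sql
-- Sketch: where the Virginia pines are, by borough.
-- ASSUMPTIONS (not from the article): table name, spc_common and boroname
-- columns, and the species label 'Virginia pine'.
SELECT
  boroname,
  COUNT(*) AS virginia_pines
FROM `bigquery-public-data.new_york.tree_census_2015`
WHERE LOWER(spc_common) = 'virginia pine'
GROUP BY boroname
ORDER BY virginia_pines DESC
```

Lower-casing the species name sidesteps any difference in capitalization between surveys.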
The decennial NYC tree surveys from 1995, 2005, and 2015 are all available in BigQuery, and preliminary data from 2015 shows that London Planetrees, Honeylocusts and Callery Pears represent almost a third of all trees outside of parks.
Where was the only collision caused by an animal that injured a cyclist?
There’s a lot of traffic in New York, and while the number of accidents has slowly increased each year, the number of injuries has remained fairly consistent. Fortunately, the number of deaths has dropped by an average of 9% each year.
As you can see below, “Driver Inattention/Distraction” is the most likely cause of accident and injury, but disregarding traffic control (such as running a red light) is the most common cause of death.
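A breakdown by cause like this one could be produced with a query along the following lines. Treat it as a sketch: the collisions table name and the `contributing_factor_vehicle_1`, `number_of_persons_injured` and `number_of_persons_killed` columns are assumptions about the schema rather than details taken from this post.

```sql
-- Sketch: collisions, injuries and deaths by primary contributing factor.
-- ASSUMPTIONS (not from the article): table name and all column names.
SELECT
  contributing_factor_vehicle_1 AS factor,
  COUNT(*) AS collisions,
  SUM(number_of_persons_injured) AS injured,
  SUM(number_of_persons_killed) AS killed
FROM `bigquery-public-data.new_york.nypd_mv_collisions`
WHERE contributing_factor_vehicle_1 NOT IN ('', 'Unspecified')
GROUP BY factor
ORDER BY collisions DESC
LIMIT 10
```

Changing the sort to `ORDER BY killed DESC` would show which factors account for the most deaths rather than the most collisions.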
The following graphs show that most traffic accidents happen in Brooklyn, but it’s Midtown and Downtown Manhattan that have the highest concentration of collisions, and Staten Island that has the highest proportion of deaths per accident.
Longest Distance in the Shortest Time (on a route with at least 100 rides)?
Comparing the average duration of five of the most popular Citi Bike routes with taxi journeys beginning and ending within approximately 50 meters of the corresponding Citi Bike stations, we see that for trips under 10 minutes there’s not much difference between taking a taxi or riding a bike.
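The Citi Bike half of that comparison can be sketched as below: the five most popular station-to-station routes with their average ride time, keeping only routes with at least 100 rides. The table name, the station columns and the idea that `tripduration` is recorded in seconds are assumptions, so check them against the schema first.

```sql
-- Sketch: the five most-ridden Citi Bike routes and their average duration.
-- ASSUMPTIONS (not from the article): table name, column names, and that
-- tripduration is stored in seconds.
SELECT
  start_station_name,
  end_station_name,
  COUNT(*) AS rides,
  ROUND(AVG(tripduration) / 60, 1) AS avg_minutes
FROM `bigquery-public-data.new_york.citibike_trips`
WHERE start_station_id != end_station_id  -- skip round trips to the same dock
GROUP BY start_station_name, end_station_name
HAVING COUNT(*) >= 100
ORDER BY rides DESC
LIMIT 5
```

The taxi side would need a distance filter on pickup and dropoff coordinates, which is left out here to keep the sketch short.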
There are countless ways to slice, dice, join and visualize this data, and we’re just getting started.
Share your own insights and visualizations with us using the hashtag #TILwBQ, and join us here every week for Today I Learned with BigQuery, as we dig into these tables, launch new public datasets, demonstrate BigQuery, share protips and offer interviews with Big Data industry experts.
If you’re new to BigQuery, here are some concepts to keep in mind while working with the New York City datasets:
- With BigQuery, everyone gets one terabyte at no charge every month to run queries. If you've never tried BigQuery before, follow these getting started instructions.