Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

New York City public datasets now available on Google BigQuery

Wednesday, January 18, 2017

By Reto Meier, Google Developer Advocate

This rich dataset makes it easy to learn how to explore and visualize data using BigQuery.

New York City is home to 8.5 million residents, and more than 50 million people visit this vibrant and dynamic city each year. With so many sights and sounds, it’s easy to get lost in the details, and lose sight of the big picture: How do New Yorkers actually survive in the “concrete jungle?”

Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets.

There’s no cost for the first terabyte of data you process each month, and because BigQuery is serverless, there’s no infrastructure you need to manage or maintain. That means we can focus on querying, joining and visualizing this data to learn more about New York City and the people who make up this bustling metropolis.

As you’ll see below, all these new datasets can be used with existing data, such as NOAA GSOD, to discover trends based on changes in weather. We’ll also be continually adding new datasets from other cities, so soon you’ll be able to compare habits and trends between cities and countries around the globe using BigQuery to better understand the world around us.

On which New York City streets are you most likely to find a loud party?

If there's something strange in your neighborhood, the right number to call is 311, created specifically for non-emergency municipal inquiries and non-urgent community concerns. What does that include?

The graph below shows the top five reasons New Yorkers have called 311 over the past four years.

SELECT
  -- Year in which the 311 request was created
  EXTRACT(YEAR FROM created_date) AS year,
  -- Count "HEATING" and "HEAT/HOT WATER" as a single category
  REPLACE(UPPER(complaint_type),
          "HEATING", "HEAT/HOT WATER") AS complaint,
  COUNT(*) AS num_complaints
FROM
  `bigquery-public-data-staging.new_york.311_service_requests_all`
GROUP BY complaint, year
ORDER BY num_complaints DESC
LIMIT 1000

(To run this query yourself, you can copy/paste the above SQL into BigQuery, or follow this link to my shared query.)
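The query above returns every complaint type for every year, so the "top five" still has to be picked out when charting. As a rough sketch, the ranking could be done in SQL instead. It reuses the table and columns from the query above; the `yearly` subquery name and the `num_calls` and `rank_in_year` aliases are made up for illustration.

```sql
-- Sketch: keep only the five most frequent complaint types in each year.
WITH yearly AS (
  SELECT
    EXTRACT(YEAR FROM created_date) AS year,
    REPLACE(UPPER(complaint_type), "HEATING", "HEAT/HOT WATER") AS complaint,
    COUNT(*) AS num_calls
  FROM
    `bigquery-public-data-staging.new_york.311_service_requests_all`
  GROUP BY year, complaint
)
SELECT year, complaint, num_calls
FROM (
  SELECT
    year,
    complaint,
    num_calls,
    -- Rank complaint types within each year, busiest first
    ROW_NUMBER() OVER (PARTITION BY year ORDER BY num_calls DESC) AS rank_in_year
  FROM yearly
)
WHERE rank_in_year <= 5
ORDER BY year, num_calls DESC
```

`ROW_NUMBER()` breaks ties arbitrarily; `RANK()` would keep tied complaint types together at the cost of sometimes returning more than five rows per year.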

Call volume tells us that it gets noisy in New York, and it also gets very cold. By joining the 311 calls to the NOAA GSOD weather table, we confirm that most calls about faulty heat and hot water happen when the temperature drops — while noise remains a constant annoyance.
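One way such a join might look is sketched below. Several details are assumptions rather than the exact query behind the analysis: the GSOD table name `bigquery-public-data.noaa_gsod.gsod2016`, its `stn`, `year`, `mo`, `da` and `temp` columns, the station ID `725030` (assumed here to be LaGuardia Airport), and the choice of 2016 as the year to inspect.

```sql
-- Sketch: daily mean temperature next to daily heat-related 311 calls.
WITH weather AS (
  SELECT
    -- GSOD stores the date as separate string columns (assumed schema)
    DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS obs_date,
    AVG(temp) AS mean_temp_f
  FROM `bigquery-public-data.noaa_gsod.gsod2016`
  WHERE stn = '725030'   -- assumed station ID for a New York City airport
    AND temp < 9999      -- GSOD marks missing temperatures as 9999.9
  GROUP BY obs_date
),
heat_calls AS (
  SELECT
    DATE(created_date) AS call_date,
    COUNT(*) AS num_calls
  FROM `bigquery-public-data-staging.new_york.311_service_requests_all`
  WHERE UPPER(complaint_type) IN ('HEATING', 'HEAT/HOT WATER')
  GROUP BY call_date
)
SELECT
  w.obs_date,
  w.mean_temp_f,
  IFNULL(h.num_calls, 0) AS num_calls  -- days with no heat complaints count as 0
FROM weather AS w
LEFT JOIN heat_calls AS h
  ON h.call_date = w.obs_date
ORDER BY w.obs_date
```

Swapping the complaint filter for a noise-related complaint type gives the comparison series for the "constant annoyance".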

There were also 267,887 calls about dead, damaged or dying trees, so you might wonder if there are any healthy trees left in NYC.

Can you find the Virginia Pines in New York City?

The decennial NYC tree surveys from 1995, 2005, and 2015 are all available in BigQuery, and preliminary data from 2015 shows that London planetrees, honeylocusts and Callery pears make up almost a third of all trees outside of parks.
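A species breakdown like this can be sketched with a single aggregate query. The table name `bigquery-public-data.new_york.tree_census_2015` and the `spc_common` column are assumptions about how the 2015 census is published, so check the schema before running it.

```sql
-- Sketch: most common street tree species and their share of all recorded trees.
SELECT
  spc_common AS species,
  COUNT(*) AS num_trees,
  -- Window over the grouped rows to get the grand total for the percentage
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
WHERE spc_common IS NOT NULL
  AND spc_common != ''   -- skip stumps and dead trees with no species recorded
GROUP BY species
ORDER BY num_trees DESC
LIMIT 10
```

To hunt for a specific species such as the Virginia pine, replace the `LIMIT` with a `HAVING LOWER(species) LIKE '%virginia pine%'` clause.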

Where was the only collision caused by an animal that injured a cyclist?

There’s a lot of traffic in New York, and while the number of accidents has slowly increased each year, the number of injuries has remained fairly consistent. Fortunately, the number of deaths has dropped by an average of 9% each year.

As you can see below, “Driver Inattention/Distraction” is the most likely cause of accident and injury, but disregarding traffic control (such as running a red light) is the most common cause of death.
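A comparison of causes along these lines can be sketched as follows. The table name `bigquery-public-data.new_york.nypd_mv_collisions` and the `contributing_factor_vehicle_1`, `number_of_persons_injured` and `number_of_persons_killed` columns are assumed, and only the first listed contributing factor per collision is considered.

```sql
-- Sketch: collisions, injuries and deaths by primary contributing factor.
SELECT
  contributing_factor_vehicle_1 AS factor,
  COUNT(*) AS num_collisions,
  SUM(number_of_persons_injured) AS num_injured,
  SUM(number_of_persons_killed) AS num_killed
FROM `bigquery-public-data.new_york.nypd_mv_collisions`
-- Rows with a NULL, blank or 'Unspecified' factor are excluded
WHERE contributing_factor_vehicle_1 NOT IN ('', 'Unspecified')
GROUP BY factor
ORDER BY num_collisions DESC
LIMIT 10
```

Ordering by `num_killed` instead of `num_collisions` surfaces the factors most often associated with deaths.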

The following graphs show that most traffic accidents happen in Brooklyn, but it’s Midtown and Downtown Manhattan that have the highest concentration of collisions — and Staten Island the highest proportion of deaths per accident.
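The per-borough comparison can be approximated with a short aggregate, again assuming the collisions table is `bigquery-public-data.new_york.nypd_mv_collisions` with `borough` and `number_of_persons_killed` columns. Collisions with no borough recorded are left out, so the totals will be lower than the full dataset.

```sql
-- Sketch: collision counts and death rate per borough.
SELECT
  borough,
  COUNT(*) AS num_collisions,
  SUM(number_of_persons_killed) AS num_killed,
  -- Deaths per 1,000 collisions; SAFE_DIVIDE avoids division-by-zero errors
  ROUND(1000 * SAFE_DIVIDE(SUM(number_of_persons_killed), COUNT(*)), 2)
    AS deaths_per_1000_collisions
FROM `bigquery-public-data.new_york.nypd_mv_collisions`
WHERE borough IS NOT NULL
  AND borough != ''
GROUP BY borough
ORDER BY num_collisions DESC
```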

With motor vehicle accidents resulting in 6 motorist deaths for each cyclist death (and no Citi Bike rider deaths), you might be safer taking a Citi Bike.

What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?

Comparing the average duration of five of the most popular Citi Bike routes with taxi journeys beginning and ending within an approximately 50-meter radius of the corresponding Citi Bike stations, we see that for trips under 10 minutes there’s not much difference between taking a taxi and riding a bike.
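The Citi Bike half of that comparison can be sketched as below. It assumes the trips live in `bigquery-public-data.new_york.citibike_trips` with `start_station_name`, `end_station_name` and a `tripduration` column measured in seconds. Matching taxi pickups and drop-offs to within 50 meters of each station is a separate step that is not shown here.

```sql
-- Sketch: the five most-ridden Citi Bike routes and their average duration.
SELECT
  start_station_name,
  end_station_name,
  COUNT(*) AS num_rides,
  ROUND(AVG(tripduration) / 60, 1) AS avg_minutes  -- tripduration assumed to be seconds
FROM `bigquery-public-data.new_york.citibike_trips`
WHERE start_station_name != end_station_name       -- ignore round trips to the same dock
GROUP BY start_station_name, end_station_name
ORDER BY num_rides DESC
LIMIT 5
```

Adding `HAVING num_rides >= 100` and dropping the `LIMIT` gives the set of routes with at least 100 rides, the starting point for the "longest distance in the shortest time" question.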

Next steps

There are countless ways to slice, dice, join and visualize this data, and we’re just getting started.

Share your own insights and visualizations with us using the hashtag #TILwBQ, and join us here every week for Today I Learned with BigQuery, as we dig into these tables, launch new public datasets, demonstrate BigQuery, share protips and offer interviews with Big Data industry experts.



If you’re new to BigQuery, here are some concepts to keep in mind while working with the New York City datasets:

Sign up or sign in to BigQuery today to create and share your own NYC analysis and visualizations.