
When art meets big data: Analyzing 200,000 items from The Met collection in BigQuery

August 7, 2017
Sara Robinson

Developer Advocate, Google Cloud Platform

This new public dataset is invaluable for anyone who wants to learn how to build a custom machine-learning model, create an app for sorting and visualizing the images, and more.

Today we’re adding a new public dataset to Google BigQuery: over 200,000 items from The Metropolitan Museum of Art (aka “The Met”), representing all of the museum’s public domain art out of a total collection of 1.5 million objects. The Met Museum Public Domain dataset includes metadata about each piece of art, along with one or more images of each artifact. Google and The Met have been close collaborators for years through Google Arts & Culture, and we’re incredibly excited to bring the museum’s public dataset to BigQuery.

Let’s dive right into the data by looking at the museum departments represented in the collection:

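Here’s a minimal sketch of that query; it assumes the dataset’s `objects` table exposes a `department` column, so check the schema in the BigQuery UI before running it:

```sql
#standardSQL
-- Count items per curatorial department
-- (assumes a `department` column on the objects table).
SELECT
  department,
  COUNT(*) AS num_items
FROM `bigquery-public-data.the_met.objects`
GROUP BY department
ORDER BY num_items DESC
LIMIT 10
```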

[Image: query results showing the top departments by item count] https://storage.googleapis.com/gweb-cloudblog-publish/images/met015yaa.max-400x400.PNG

Similarly, we can find the top media used for the items in our collection. `medium` is a comma-separated string in the table, so we’ll use SPLIT (and UNNEST the resulting array) to count each medium on its own:

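A sketch of that query, with SPLIT breaking the `medium` string into an array and UNNEST flattening it into rows (the exact column names are assumptions; verify them against the schema):

```sql
#standardSQL
-- Split the comma-separated medium string and count each medium separately.
SELECT
  TRIM(m) AS medium_name,
  COUNT(*) AS num_items
FROM `bigquery-public-data.the_met.objects`,
  UNNEST(SPLIT(medium, ',')) AS m
GROUP BY medium_name
ORDER BY num_items DESC
LIMIT 10
```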

[Image: query results showing the top media by item count] https://storage.googleapis.com/gweb-cloudblog-publish/images/met02o5f1.max-500x500.PNG

We can use Google Data Studio to visualize the results from both of these queries:

[Images: Data Studio visualizations of the department and medium query results]
https://storage.googleapis.com/gweb-cloudblog-publish/images/met03.max-1100x1100.png
https://storage.googleapis.com/gweb-cloudblog-publish/images/met04.max-1000x1000.png

Analyzing the image data

Because art is visual, our analysis can only go so far on metadata alone. We have images for all 200,000 pieces of art, and that's a lot of pixels! I took one image for each piece, sent it to the Cloud Vision API, and stored the API's JSON response in a BigQuery table. This gives us lots of information about each piece of art: what's in the image, where the image can be found on the web, the URLs of similar images, and more. For details on how I processed these images, check out this post.

Labels by time period

To start, let’s JOIN the time period for each piece of art from our metadata table with the labels returned by the Cloud Vision API’s label detection method. This query will give us the top 3 labels for each time period:

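Here's a sketch of that query. It assumes the `objects` metadata table and the `vision_api_data` table share an `object_id` key, that `objects` has a `period` column, and that the stored Vision API responses keep the API's repeated `labelAnnotations` field; all of these are worth verifying against the table schemas:

```sql
#standardSQL
-- Top 3 Vision API labels for each time period.
WITH period_labels AS (
  SELECT
    o.period,
    la.description AS label,
    COUNT(*) AS label_count
  FROM `bigquery-public-data.the_met.objects` AS o
  JOIN `bigquery-public-data.the_met.vision_api_data` AS v
    ON o.object_id = v.object_id
  CROSS JOIN UNNEST(v.labelAnnotations) AS la
  WHERE o.period IS NOT NULL AND o.period != ''
  GROUP BY o.period, label
)
SELECT period, label, label_count
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY period ORDER BY label_count DESC) AS label_rank
  FROM period_labels
) AS ranked
WHERE label_rank <= 3
ORDER BY period, label_rank
```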

[Image: query results showing the top 3 labels per time period] https://storage.googleapis.com/gweb-cloudblog-publish/images/met05.max-500x500.png

Cloud Vision API web detection

The web detection feature of the Cloud Vision API finds pages on the web where our image appears, along with the URLs of visually similar images. With web detection we also get a list of entities: labels that describe the image based on the context of the pages where it was found. Let's look at the web detection response we get back for this image:
[Image: artwork from The Met collection used for the web detection example] https://storage.googleapis.com/gweb-cloudblog-publish/images/met06copyk88b.max-700x700.JPEG

Creative Commons Zero http://www.metmuseum.org/art/collection/search/23939

Here’s a sample of the JSON response:

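The full response is long, so this is an abridged sketch of its shape: the field names match the Vision API's webDetection response, while the specific URLs, scores, and entity values shown are placeholders.

```json
{
  "webDetection": {
    "webEntities": [
      {"entityId": "/m/...", "score": 0.7, "description": "Metropolitan Museum of Art"}
    ],
    "fullMatchingImages": [
      {"url": "http://images.metmuseum.org/CRDImages/..."}
    ],
    "pagesWithMatchingImages": [
      {"url": "http://www.metmuseum.org/art/collection/search/23939"}
    ],
    "visuallySimilarImages": [
      {"url": "http://.../similar-image.jpg"}
    ]
  }
}
```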

Notice that we get the URLs for all exact image matches and the pages where those images were found. This can be used to see all the places a particular image has been shared. For our Met dataset, we’ll find the most common domains where our images are found on the web with this query:

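A sketch of that query, using BigQuery's NET.HOST function to pull the domain out of each page URL (the `webDetection.pagesWithMatchingImages` path assumes the responses were stored with the API's field names intact):

```sql
#standardSQL
-- Count the domains hosting pages where our images appear.
SELECT
  NET.HOST(page.url) AS domain,
  COUNT(*) AS num_pages
FROM `bigquery-public-data.the_met.vision_api_data`,
  UNNEST(webDetection.pagesWithMatchingImages) AS page
GROUP BY domain
ORDER BY num_pages DESC
LIMIT 10
```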

And here’s the result in BigQuery:

[Image: query results showing the most common domains hosting matching images] https://storage.googleapis.com/gweb-cloudblog-publish/images/met07qp0j.max-500x500.PNG

Sorting images by color

The Cloud Vision API can also find the dominant colors in an image, along with the RGB values for those colors, so we can sort our images by color. The following query will give us the URLs of all the images that are a particular shade of blue:

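Here's a sketch of such a query. The RGB thresholds standing in for "a particular shade of blue" are illustrative, and the `imagePropertiesAnnotation` path mirrors the Vision API's response fields; you can JOIN the resulting `object_id`s back to the metadata table to pull titles and collection links.

```sql
#standardSQL
-- Find images whose dominant color falls in a blue range.
SELECT
  v.object_id,
  c.color.red, c.color.green, c.color.blue,
  c.score
FROM `bigquery-public-data.the_met.vision_api_data` AS v,
  UNNEST(v.imagePropertiesAnnotation.dominantColors.colors) AS c
WHERE c.color.blue > 175
  AND c.color.red < 100
  AND c.color.green < 150
  AND c.score > 0.3
ORDER BY c.score DESC
```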

Here’s a sample:

[Image: blue-dominant artwork returned by the color query] https://storage.googleapis.com/gweb-cloudblog-publish/images/00met08copy0e7p.max-400x400.JPEG

Creative Commons Zero http://www.metmuseum.org/art/collection/search/544864

[Image: blue-dominant artwork returned by the color query] https://storage.googleapis.com/gweb-cloudblog-publish/images/00met09ficv.max-400x400.JPEG

Creative Commons Zero http://www.metmuseum.org/art/collection/search/5592

[Image: blue-dominant artwork returned by the color query] https://storage.googleapis.com/gweb-cloudblog-publish/images/00met010copycopyegee.max-300x300.JPEG

Creative Commons Zero http://www.metmuseum.org/art/collection/search/2941

Which images contain famous landmarks?

The Vision API's landmarkAnnotations feature identifies well-known landmarks in photos (more on that here), so I was interested to see whether it could extract landmarks from items in The Met collection.

The following query returns the images containing landmarks, sorted by the Vision API’s confidence score:

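A sketch of that query, again assuming the stored responses keep the API's repeated `landmarkAnnotations` field:

```sql
#standardSQL
-- Images with detected landmarks, highest-confidence first.
SELECT
  v.object_id,
  l.description AS landmark,
  l.score AS score
FROM `bigquery-public-data.the_met.vision_api_data` AS v,
  UNNEST(v.landmarkAnnotations) AS l
ORDER BY score DESC
LIMIT 20
```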

The API identified this photograph from the collection as Yosemite with 98% confidence:

[Image: photograph from the collection identified as Yosemite] https://storage.googleapis.com/gweb-cloudblog-publish/images/met011copycopycopycmnt.max-500x500.JPEG

Creative Commons Zero http://www.metmuseum.org/art/collection/search/689997

And it knew this piece of art was from the Basilica of San Vitale in Italy:

[Image: artwork identified as the Basilica of San Vitale] https://storage.googleapis.com/gweb-cloudblog-publish/images/met012copycopycopycopyffhe.max-400x400.JPEG

Creative Commons Zero http://www.metmuseum.org/art/collection/search/466586

Next steps

These examples just scratch the surface of what you can do with this dataset: you could compare the images with other art collections, use the data to build and train a custom machine learning model, build an app for sorting and visualizing the images, and more. I'd love to see what you do with the Met data: leave a comment or find me on Twitter @SRobTweets.

To learn more about the Cloud Vision API, try it out on your own images by uploading them directly in the browser to see the API response. Then start diving into code by going through the Cloud Vision API quickstart.

Stay tuned for more posts exploring this dataset, including training a model on the Met data using TensorFlow, and for many exciting collaborations between Google and The Met.
