Building a better GIPHY with Google Cloud’s machine learning tools
By Nick Hasty, GIPHY Director of Engineering
Editor’s note: If you’re a GIPHY user, you may have noticed that it’s gotten easier to find just the right GIF to liven up your emails, texts and chats. That’s because they used Google Cloud Machine Learning to analyze and tag their GIFs. In this post, GIPHY director of engineering Nick Hasty talks about the specific APIs they used, and how those translated into better search results for their users.
Here at GIPHY, we serve over three billion GIFs a day to over 300 million daily active users, and we're constantly looking for ways to improve the results of our users' GIF searches. Recently, we've integrated Google's Cloud Vision and Cloud NLP APIs into our GIF processing pipeline, which has helped us collect more metadata about our GIFs and enhance the core GIPHY search engine. Specifically, Web Entities data from Cloud Vision helped us boost our search performance, and syntax data from Cloud Natural Language together with the Knowledge Graph API let us create rich titles for our GIFs.
Like many young startups, our engineering team spends the majority of its time expanding infrastructure to meet demand and building products to grow our user base, so any data science work that isn’t focused on growth analytics is considered a nice-to-have. Meanwhile, our GIF library has been growing in tandem with our demand. Since our primary product is a search engine, we found ourselves needing new, scalable ways to collect metadata about our GIFs.
Entertainment and culture are at the heart of GIPHY’s content, and we need to be able to identify specific instances of celebrities, animals, movies, sports, video games, and more. The incredible pop-culture wizards on our in-house editorial team trawl through our catalog identifying and annotating our content, but at our current size, they need help. We wanted to leverage machine learning to supplement our editors, but our need for specificity presented a challenge: most computer-vision image labeling systems provide generic, high-level labels for the objects found in images, and aren’t necessarily trained to identify objects on a granular level.
The depth of this problem grew with every GIF we crawled or that was uploaded to our site. We knew it was possible to train custom ML models in-house, and we do have a large amount of labeled data, but we needed a solution fast, so we decided to research third-party services. There are lots of machine learning APIs on the market, but we believed Google offered the most mature service and the widest array of quality models. After a few exploratory tests with positive results, we decided to go big and processed over 10 million GIFs using the full array of Google Vision services. Here’s an overview of what we did.
In search of subtitles (and other text)
Our first integration of Cloud Vision metadata into our search engine used its optical character recognition (OCR) endpoint, which evaluates images for the presence of text, notably subtitles.
OCR detects text that is integrated into the actual pixel data of an image, and because this text can change over frames, like with subtitles, we wanted to make sure we captured as much text data as possible.
That said, we didn’t necessarily want to analyze each and every frame. Like videos, GIFs are composed of individual frames that play sequentially, but they can be configured to start over upon completion, playing on an endless loop. At scale, evaluating every frame of every GIF can be prohibitively expensive and time-consuming, so we usually only send a single frame at a time. This doesn’t give us total coverage, as the contents of a GIF’s frames are by no means guaranteed to have continuity, but the benefits still greatly outweigh the downsides.
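Deciding which frames to evaluate comes down to a simple index calculation. Here's a minimal sketch of one way to pick an evenly spaced subset of frames; the sampling fraction and cap are hypothetical parameters for illustration, not GIPHY's actual settings:

```python
def sample_frame_indices(n_frames, fraction=0.1, max_frames=10):
    """Pick a spread of frame indices to send for OCR.

    Always includes the first frame; samples roughly `fraction`
    of the frames, evenly spaced, capped at `max_frames`.
    """
    if n_frames <= 0:
        return []
    target = min(max(1, round(n_frames * fraction)), max_frames, n_frames)
    if target == 1:
        return [0]
    step = (n_frames - 1) / (target - 1)
    return sorted({round(i * step) for i in range(target)})

# A 100-frame GIF sampled at 10% yields 10 evenly spaced frames,
# always including the first and last.
```

Evenly spaced sampling gives reasonable coverage of changing subtitles without paying to OCR every frame.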
To begin evaluating a GIF for text, we sent off an initial frame to see if the API detected any text. If it did, we sent another frame and computed the difference in text values to see if the text found in the GIF was static or dynamic across frames. If the text differed enough across frames, like in the case of dialog subtitles, then we repeated this process across a percentage of total frames until we had sufficient coverage to perform textual analysis. You can read more about our findings on our engineering blog, or look at our sample Python code and try it for yourself—all you need is a captioned GIF to test with and your Google Cloud credentials.
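The static-vs-dynamic check can be sketched with a simple similarity measure over the OCR output of two frames. This is an illustrative reconstruction using Python's standard-library `difflib`, not our production code, and the threshold value is an assumption:

```python
import difflib

def text_is_dynamic(text_a, text_b, threshold=0.8):
    """Return True if OCR text differs enough across two frames to
    suggest changing subtitles rather than a static caption."""
    if not text_a and not text_b:
        return False  # no text in either frame: nothing dynamic
    similarity = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    return similarity < threshold

# A static caption repeated on both frames:
text_is_dynamic("HELLO THERE", "HELLO THERE")    # → False
# Dialog subtitles that change between frames:
text_is_dynamic("HELLO THERE", "GENERAL KENOBI")  # → True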
Contextual labels with Web Entities
After our success with OCR, we focused on integrating Google’s label data, specifically Web Entities, which proved great for discovering additional specifics about an image.
Cloud Vision provides two types of image labels: “Label Detection”, which provides labels for objects that Google’s machine learning models are trained to detect, and “Web Entities”, which yields labels derived from the context in which the image was crawled and indexed. Since Web Entity labels take into account the data embedded around an image, like the surrounding text or captions, they tend to be very specific and can even provide proper nouns. For example, while Cloud Vision doesn’t have a model specifically trained to identify a specific celebrity, if it discovers a person’s image across multiple websites and in each case that image is displayed with a caption containing a proper noun, then the Web Entities endpoint labels that image with that proper noun.
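Under the hood, a Web Entities lookup is just an `images:annotate` call with the `WEB_DETECTION` feature enabled. Here's a minimal sketch of building that request body for the Cloud Vision REST API; the image URI is a placeholder, and you'd POST the payload with your own credentials:

```python
import json

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_web_detection_request(image_uri, max_results=10):
    """Build the JSON body for a Cloud Vision WEB_DETECTION request."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_uri}},
            "features": [
                {"type": "WEB_DETECTION", "maxResults": max_results}
            ],
        }]
    }

body = build_web_detection_request("https://example.com/some.gif")
payload = json.dumps(body)  # POST this to VISION_ENDPOINT with an API key
```

The response's `webDetection.webEntities` list carries `description` and `score` fields, which is where proper-noun labels like celebrity names show up.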
The Web Entities endpoint seemed especially promising for us since a portion of our content is pulled from the web. We processed a large number of GIFs for Web Entities and tested how well this data performed in our own search. My colleague Sean Quiqley ran some A/B tests that included Web Entities as a signal in our search algorithm. He found that adding this data increased our click-through rate (CTR) by 0.33% in absolute terms and 0.86% in relative terms, across both desktop and mobile clients: a significant improvement! Medium-to-long-tail searches benefitted the most, becoming richer with relevant content as the Web Entity data surfaced previously un-annotated GIFs that would have otherwise been hidden. We also saw a 0.01% drop in search pages served with zero results, which means that we were now returning content for searches that we’d historically failed on.
Toward better GIF titles
A parallel project to using Web Entities was to improve the algorithm for generating titles for our GIFs. GIF titles are most noticeable on our GIF detail pages—right above the GIF itself. Initially, creating these titles was fairly simple and involved choosing the most popular tag for that GIF. This worked for a while, but with an ever-growing catalog we were creating many duplicate titles and lacking specificity.
For our new title schema, we wanted to be as descriptive as possible, with special emphasis on including the names of famous people, fictional characters, or TV shows, as well as actions, emotions, or reactions. My colleague Nick Santaniello and I came up with a series of schemas to create titles depending on the metadata we had for the GIF, but for these schemas to work, we needed better metadata about our tags themselves. Our schemas required metadata about, say, a tag’s part of speech and whether or not the tag was a proper noun. There are many solid open-source libraries for determining syntax, but we ended up using Google Cloud Natural Language. Not only does it provide very precise syntax detection, but its Entity Recognition also provides a great way to identify proper nouns. Entity Recognition identifies known objects in a block of text, and provides Google Knowledge Graph IDs when available. This tight coupling between APIs helped us identify tags that refer to specific people, places, and things, and gather highly specific metadata about those objects.
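As a toy illustration of how such a schema might combine tag metadata, here is one hypothetical rule: lead with a proper-noun entity if one exists, then append a verb-like action tag. The tag annotations mimic the part-of-speech and entity data Cloud Natural Language returns, but the schema and field names here are invented for illustration:

```python
def make_title(tags):
    """Build a GIF title from tags annotated with syntax/entity metadata.

    Each tag is a dict like:
      {"text": "will ferrell", "pos": "NOUN", "proper": True}
    mimicking Cloud Natural Language syntax + entity output.
    """
    proper = next((t for t in tags if t.get("proper")), None)
    action = next((t for t in tags if t["pos"] == "VERB"), None)
    parts = []
    if proper:
        parts.append(proper["text"].title())
    if action:
        parts.append(action["text"].title())
    parts.append("GIF")
    return " ".join(parts)

tags = [
    {"text": "dancing", "pos": "VERB", "proper": False},
    {"text": "will ferrell", "pos": "NOUN", "proper": True},
]
make_title(tags)  # → "Will Ferrell Dancing GIF"
```

A real schema would fall back through several such rules depending on which kinds of tags a GIF actually has.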
After processing our tag catalog, we began putting the data to use and generating new GIF titles. So far we’ve been thrilled with the results. Check out some of our before and after GIF titles:
Historically, taking advantage of the latest developments in machine learning and computer vision required in-house specialists and lots of time—two things that growth-stage startups like GIPHY don’t always have.
Although these technologies are becoming more accessible, easier to use, and faster to implement, their ultimate success depends heavily on both the quality and quantity of data you have available for training. Google Cloud’s Machine Learning APIs provide tremendously powerful, cutting-edge models that are just a request away. For GIPHY, taking advantage of these APIs was a more cost-efficient, scalable, and speedy approach than replicating this technology and training similar models on our own. We were able to gather and utilize the data we needed and quickly ship products that perform demonstrably better. It’s a big win for us, and we couldn’t be happier.
So what’s next for GIPHY? We’re now building custom Machine Learning models whose training labels are partially supplemented with the data provided from Cloud Vision. We hope to reveal more in the future. Until then, you can keep up with our progress on the GIPHY Engineering Blog.