How AI, and specifically BERT, helps the patent industry
Rob Srebrovic
Data Scientist, Global Patents at Google
Jay Yonamine
Head of Data Science, Global Patents at Google
In recent years, the patent industry has begun using machine learning (ML) algorithms to add efficiency and insight to business practices.
Any company, patent office, or academic institution that works with patents, whether by generating them through innovation, processing patent applications, or developing sophisticated ways to analyze them, can benefit from doing patent analytics and machine learning in Google Cloud.
Today, we are excited to release a white paper that outlines a methodology for training a BERT (Bidirectional Encoder Representations from Transformers) model on over 100 million patent publications from the U.S. and other countries using open-source tooling. The paper describes how to use the trained model for a number of use cases, including performing prior art searches more effectively to determine the novelty of a patent application, automatically generating classification codes to assist with patent categorization, and autocompleting text. The white paper is accompanied by a Colab notebook as well as the trained model, hosted on GitHub.
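To give a flavor of the autocomplete use case, here is a minimal sketch of masked-token prediction with a BERT checkpoint. It assumes the released checkpoint has been converted to the Hugging Face transformers format; the model path below is a placeholder, not an official identifier, and the claim text is invented for illustration.

```python
# Minimal sketch of masked-token "autocomplete" with a BERT checkpoint.
# ASSUMPTION: the released checkpoint has been converted to the Hugging Face
# `transformers` format; MODEL_PATH is a placeholder, not an official identifier.
import torch
from transformers import BertForMaskedLM, BertTokenizer

MODEL_PATH = "path/to/patent-bert"  # hypothetical local path to the converted model

tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model = BertForMaskedLM.from_pretrained(MODEL_PATH)
model.eval()

# Mask one term in a patent-style claim and ask the model to fill it in.
text = "A fastening [MASK] comprising a threaded shaft and a hexagonal head."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Print the top 5 candidate completions for the masked position.
top_ids = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```

The same masked-language-model interface underlies the other use cases: the encoder's contextual representations can be pooled into document embeddings for prior art similarity search or fed to a classification head for code prediction.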
Google’s release of the BERT model (paper, blog post, and open-source code) in 2018 was an important breakthrough: it leveraged transformers to outperform other leading state-of-the-art models across major NLP benchmarks, including GLUE, MultiNLI, and SQuAD. Shortly after its release, the BERT framework and many transformer-based extensions of it gained widespread industry adoption across domains like search, chatbots, and translation.
We believe that the patents domain is ripe for the application of algorithms like BERT because of the technical characteristics of patents as well as their business value. Technically, the patent corpus is large (millions of new patents are issued worldwide every year), complex (patent applications average roughly 10,000 words and are often meticulously wordsmithed by inventors, lawyers, and patent examiners), unique (patents are written in a highly specialized ‘legalese’ that can be unintelligible to a lay reader), and highly context dependent (the same term can mean completely different things in different patents).
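This context dependence is precisely what contextual models like BERT are designed to capture: the same surface term receives a different embedding depending on the surrounding language. The sketch below illustrates the idea with the generic bert-base-uncased checkpoint so it runs self-contained; a patent-trained model would be substituted in practice, and the example sentences are invented.

```python
# Illustration: contextual embeddings disambiguate the same surface term.
# Uses the generic `bert-base-uncased` checkpoint for a self-contained demo;
# a patent-trained BERT model would be substituted in practice.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_term(sentence: str, term: str) -> torch.Tensor:
    """Return the contextual embedding of `term` (a single wordpiece) in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    term_id = tokenizer.convert_tokens_to_ids(term)
    position = (inputs["input_ids"][0] == term_id).nonzero(as_tuple=True)[0][0]
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, hidden_size]
    return hidden[0, position]

# "cell" in an electrical context vs. a biological context.
a = embed_term("The battery cell stores electrical charge.", "cell")
b = embed_term("The biological cell divides during mitosis.", "cell")
c = embed_term("Each battery cell is connected in series.", "cell")

cos = torch.nn.functional.cosine_similarity
print(f"battery vs. biology: {cos(a, b, dim=0).item():.3f}")  # typically lower
print(f"battery vs. battery: {cos(a, c, dim=0).item():.3f}")  # typically higher
```

A static word embedding would assign “cell” a single vector in all three sentences; the contextual model separates the two senses, which is what makes it useful for disambiguating patent legalese.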
Patents also represent tremendous business value to a number of organizations: corporations spend tens of billions of dollars a year developing patentable technology and transacting the rights to use it, and patent offices around the world spend billions more each year reviewing patent applications.
We hope that our new white paper and its associated code and model will help the broader patent community in its application of ML, including:
Corporate patent departments looking to improve their internal models and tooling with more advanced ML techniques.
Patent offices interested in leveraging state-of-the-art ML approaches to assist with patent examination and prior art searching.
ML and NLP researchers and academics who might not have considered using the patents corpus to test and develop novel NLP algorithms.
Patent researchers and academics who might not have considered applying the BERT algorithm or other transformer-based approaches to their study of patents and innovation.
To learn more, you can download the full white paper, Colab notebook, and trained model. Additionally, see Google Patents Public Datasets: Connecting Public, Paid, and Private Patent Data, Expanding your patent set with ML and BigQuery, and Measuring patent claim breadth using Google Patents Public Datasets for more tutorials to help you get started with patent analytics in Google Cloud.