Generating Specialized Knowledge Graphs

Knowledge graphs are valuable artifacts, but constructing one for a particular domain can be prohibitively expensive. This project aims to provide an automated means of producing a knowledge graph of entities/topics/concepts simply by analyzing a corpus of documents uploaded by the user.

Intended use

Inputs and outputs:

  • Users provide: A set of documents rich in natural-language English text, uploaded to a bucket on Google Cloud Platform.
  • Users receive: Access to an API that allows the exploration of a graph of entities/topics/concepts, created automatically from the corpus of uploaded documents.

Business challenges:

  • Creating a new knowledge graph. If you do not already possess a curated knowledge graph for a domain, you can bootstrap the creation of one by running our automatic induction algorithm. You can then edit it and curate it until it is satisfactory.
  • Creating an alternative knowledge graph. If you already have a knowledge graph, but it may contain errors or be incomplete, you can use our algorithm to provide feedback on your existing resource.

Technical challenges:

This experiment will be most helpful for the following technical challenges:

  • Users have little to no structured data for a given domain, and/or
  • Users are interested in what the most salient topics/entities/concepts are within a corpus of documents, and their relationships to each other.
  • Users are interested in topics/entities/concepts that are particular to a domain, i.e., whose meaning appears to be different from those same terms when used in general English.

As part of the application to participate in this experiment, we will ask you about your use case, data types, and/or other relevant questions to ensure that the experiment is a good fit for you.

What data do I need?

Data and label types:

A set of text, HTML, or PDF documents that contain digital text (including documents that were converted from other formats). We currently do not support OCR on PDF files that contain images, and we may not be able to extract text from other kinds of files. The documents should contain large amounts of natural-language English text; if all you have are documents with tabular data, the algorithms here are unlikely to produce what you want. No special labels or annotations on the documents are required.
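Before uploading, it may help to check a local folder against the requirements above. The following is a minimal sketch, not part of the experiment itself: the extension list, the `prose_ratio` heuristic, the `preflight` function, and the 0.5 threshold are all assumptions made for illustration. It uses only the Python standard library, so PDF contents are not inspected.

```python
import re
from pathlib import Path

# Assumed mapping from the formats named above to file extensions.
SUPPORTED_SUFFIXES = {".txt", ".html", ".htm", ".pdf"}


def prose_ratio(text: str) -> float:
    """Fraction of non-empty lines that look like sentences rather than table rows."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    # Heuristic (an assumption): a prose line has at least 6 alphabetic words.
    prose = sum(1 for ln in lines if len(re.findall(r"[A-Za-z]{2,}", ln)) >= 6)
    return prose / len(lines)


def preflight(folder: str, min_ratio: float = 0.5) -> dict[str, list[str]]:
    """Sort file names into 'ok', 'unsupported', and 'mostly_tabular'."""
    report = {"ok": [], "unsupported": [], "mostly_tabular": []}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix not in SUPPORTED_SUFFIXES:
            report["unsupported"].append(path.name)
        elif suffix == ".pdf":
            # Reading PDF text needs a third-party library, so PDFs pass unchecked here.
            report["ok"].append(path.name)
        else:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if suffix in {".html", ".htm"}:
                # Crude tag stripping so markup is not counted as words.
                text = re.sub(r"<[^>]+>", " ", text)
            key = "ok" if prose_ratio(text) >= min_ratio else "mostly_tabular"
            report[key].append(path.name)
    return report
```

Calling `preflight("my_corpus")` on a hypothetical local folder returns a dictionary of file names per category. Files listed under `mostly_tabular` are the ones least likely to yield a useful graph.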

The graph induction algorithm’s job is to find terms relevant to a corpus, and it does this essentially by finding patterns in the data. Therefore, the algorithm is intended to work with a corpus of documents that have some relationship to one another, such as a set of documents that

  • are all on the same general topic and/or
  • are all owned/produced by the same or a related set of authors.
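To give a feel for what "finding patterns" in a related corpus can mean, here is a toy sketch. It is not the experiment's induction algorithm, which is not described here. It swaps in a simple frequency comparison: terms are ranked by how much more often they occur in the corpus than in a background sample of general text, and the top terms are then linked by how many documents they share. The function names, the example sentences, and the smoothed log-ratio score are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def salient_terms(corpus: list[str], background: list[str], top_n: int = 5) -> list[str]:
    """Rank corpus terms by smoothed log-ratio of corpus vs. background frequency."""
    corpus_counts = Counter(t for doc in corpus for t in tokenize(doc))
    background_counts = Counter(t for doc in background for t in tokenize(doc))
    corpus_total = sum(corpus_counts.values())
    background_total = sum(background_counts.values())
    vocab = len(set(corpus_counts) | set(background_counts))

    def score(term: str) -> float:
        # Add-one smoothing keeps terms absent from the background finite.
        p_corpus = (corpus_counts[term] + 1) / (corpus_total + vocab)
        p_background = (background_counts[term] + 1) / (background_total + vocab)
        return math.log(p_corpus / p_background)

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(corpus_counts, key=lambda t: (-score(t), t))
    return ranked[:top_n]


def cooccurrence_edges(corpus: list[str], terms: list[str]) -> Counter:
    """Count, for each pair of terms, the documents in which both appear."""
    wanted = set(terms)
    edges: Counter = Counter()
    for doc in corpus:
        present = sorted(wanted & set(tokenize(doc)))
        edges.update(combinations(present, 2))
    return edges


# Hypothetical miniature corpus on one topic, plus a general-English background.
corpus = [
    "the valve controls the pump pressure",
    "the pump feeds the valve",
    "the sensor reads the pump pressure",
]
background = [
    "the cat sat on the mat",
    "the dog reads the paper",
]

terms = salient_terms(corpus, background, top_n=3)
edges = cooccurrence_edges(corpus, terms)
```

With documents on unrelated topics, few terms would stand out from the background and few pairs would co-occur, which is one way to see why a corpus of related documents matters.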

What skills do I need?

As with all AI Workshop experiments, successful users are likely to be savvy with core AI concepts and skills, which they need both to deploy the experiment technology and to interact with our AI researchers and engineers.