Create a conversation dataset

A conversation dataset contains conversation transcript data and is used to train either a Smart Reply or a Summarization custom model. Smart Reply uses the conversation transcripts to recommend text responses to human agents conversing with an end-user. Summarization custom models are trained on conversation datasets that contain both transcripts and annotation data; they use the annotations to generate conversation summaries for human agents after a conversation has completed.

There are two ways to create a dataset: using the Console tutorial workflows, or creating a dataset manually in the Console from the Data > Datasets tab. We recommend the Console tutorials as a first option. To use them, navigate to the Agent Assist Console and click the Get started button under the feature you'd like to test.

This page demonstrates how to create a dataset manually.

Before you begin

  1. Follow the Dialogflow setup instructions to enable Dialogflow on a Google Cloud Platform project.

  2. We recommend that you read the Agent Assist basics page before starting this tutorial.

  3. If you are implementing Smart Reply using your own transcript data, make sure your transcripts are JSON files in the specified format and stored in a Google Cloud Storage bucket. A conversation dataset must contain at least 30,000 conversations; otherwise, model training will fail. As a general rule, the more conversations you have, the better your model quality will be. We suggest that you remove any conversations with fewer than 20 messages or fewer than 3 conversation turns (changes in which participant is making an utterance), as well as any bot messages or messages generated automatically by systems (for example, "Agent enters the chat room"); see the preprocessing sketch after this list. We recommend that you upload at least 3 months of conversations to ensure coverage of as many use cases as possible. The maximum number of conversations in a conversation dataset is 1,000,000.

  4. If you are implementing Summarization using your own transcript and annotation data, make sure your transcripts are in the specified format and stored in a Google Cloud Storage bucket. The recommended minimum number of training annotations is 1,000; the enforced minimum is 100.

  5. Navigate to the Agent Assist Console. Select your Google Cloud Platform project, then click on the Data menu option on the far left margin of the page. The Data menu displays all of your data. There are two tabs, one each for conversation datasets and knowledge bases.

  6. Click on the conversation datasets tab, then on the +Create new button at the top right of the conversation datasets page.
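
If you need to preprocess raw transcripts before uploading them (step 3 above), the following sketch shows one way to apply the size and content guidance. It is illustrative only: it assumes each transcript file holds a single conversation as a JSON object with an entries list whose items carry role and text fields, and the bot role label and system-message patterns are placeholders. Adjust all of these to match your actual transcript schema.

    import json
    from pathlib import Path

    MIN_MESSAGES = 20   # suggested minimum messages per conversation
    MIN_TURNS = 3       # suggested minimum changes of speaker
    BOT_ROLES = {"AUTOMATED_AGENT"}          # hypothetical label for bot messages
    SYSTEM_PATTERNS = ("enters the chat",)   # hypothetical system-message markers

    def count_turns(entries):
        """Count changes in which participant is making an utterance."""
        turns = 0
        previous = None
        for entry in entries:
            role = entry.get("role")
            if previous is not None and role != previous:
                turns += 1
            previous = role
        return turns

    def clean_entries(entries):
        """Drop bot messages and automatically generated system messages."""
        kept = []
        for entry in entries:
            if entry.get("role") in BOT_ROLES:
                continue
            text = entry.get("text", "").lower()
            if any(pattern in text for pattern in SYSTEM_PATTERNS):
                continue
            kept.append(entry)
        return kept

    def filter_transcripts(src_dir, dst_dir):
        """Write only conversations that meet the size guidance to dst_dir."""
        dst = Path(dst_dir)
        dst.mkdir(parents=True, exist_ok=True)
        for path in Path(src_dir).glob("*.json"):
            conversation = json.loads(path.read_text())
            entries = clean_entries(conversation.get("entries", []))
            if len(entries) < MIN_MESSAGES or count_turns(entries) < MIN_TURNS:
                continue
            conversation["entries"] = entries
            (dst / path.name).write_text(json.dumps(conversation))

    filter_transcripts("raw_transcripts", "filtered_transcripts")

After filtering, upload the remaining files to your Google Cloud Storage bucket (for example, with gsutil cp) before creating the dataset.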

Create a conversation dataset

  1. Enter a Name and an optional Description for your new dataset. In the Conversation data field, enter the URI of the storage bucket that contains your conversation transcripts. Agent Assist supports the * symbol for wildcard matching (see the verification sketch after these steps). The URI should have the following format:

    gs://<bucket name>/<object name>
    

    For example:

    gs://mydata/conversationjsons/conv0*.json
    gs://mydatabucket/test/conv.json
    
  2. Click Create. Your new dataset now appears in the dataset list on the Data menu page under the Conversation datasets tab.
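
If you want to confirm the wildcard URI before clicking Create in step 2, you can list the objects it matches and check that the count meets the minimums described above. The sketch below is illustrative only: it reuses the example bucket and pattern from step 1, assumes one conversation per JSON file, and uses the google-cloud-storage client together with Python's fnmatch to apply the * wildcard.

    import fnmatch
    from google.cloud import storage  # pip install google-cloud-storage

    BUCKET = "mydata"                          # example bucket from step 1
    PATTERN = "conversationjsons/conv0*.json"  # example object pattern with * wildcard

    client = storage.Client()
    # Listing with the prefix before the first * narrows the scan;
    # fnmatch then applies the full wildcard pattern to each object name.
    prefix = PATTERN.split("*", 1)[0]
    matches = [
        blob.name
        for blob in client.list_blobs(BUCKET, prefix=prefix)
        if fnmatch.fnmatch(blob.name, PATTERN)
    ]

    print(f"{len(matches)} transcript files match gs://{BUCKET}/{PATTERN}")
    for name in matches[:5]:
        print(f"  gs://{BUCKET}/{name}")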

What's next

Train a Smart Reply or Summarization model on one or more conversation datasets using the Agent Assist console.