Creating datasets and importing data

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset resource.
  2. Import training data into the dataset.
  3. Label the documents or identify the entities.

For classification and sentiment analysis, steps 2 and 3 are often combined: you can import documents with their labels already assigned.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. The newly created dataset doesn't contain any data until you import documents into it.

Web UI

To create a dataset:

  1. Open the AutoML Natural Language UI and select Get started in the box corresponding to the type of model you plan to train.

    The Datasets page appears, showing the status of previously created datasets for the current project.

    To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.

  2. Click the New Dataset button in the title bar.

  3. Enter a name for the dataset and specify which geographical Location to store the dataset in.

    See Locations for more information.

  4. Select your model objective, which specifies what type of analysis you'll perform with the model you train using this dataset.

    • Single label classification assigns a single label to each classified document
    • Multi-label classification allows a document to be assigned multiple labels
    • Entity extraction identifies entities in documents
    • Sentiment analysis analyzes attitudes within documents
  5. Click Create dataset.

    The Import page for the new dataset appears. See Importing data into a dataset.

Code samples

Classification

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets

Request JSON body:

{
  "displayName": "test_dataset",
  "textClassificationDatasetMetadata": {
    "classificationType": "MULTICLASS"
  }
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
  "displayName": "test_dataset",
  "createTime": "2018-04-26T18:02:59.825060Z",
  "textClassificationDatasetMetadata": {
    "classificationType": "MULTICLASS"
  }
}

Python

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = 'YOUR_PROJECT_ID'
# display_name = 'YOUR_DATASET_NAME'

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, 'us-central1')
# Specify the classification type
# Types:
# MultiLabel: Multiple labels are allowed for one example.
# MultiClass: At most one label is allowed per example.
metadata = automl.types.TextClassificationDatasetMetadata(
    classification_type=automl.enums.ClassificationType.MULTICLASS)
dataset = automl.types.Dataset(
    display_name=display_name,
    text_classification_dataset_metadata=metadata)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(project_location, dataset)

created_dataset = response.result()

# Display the dataset information
print(u'Dataset name: {}'.format(created_dataset.name))
print(u'Dataset id: {}'.format(created_dataset.name.split('/')[-1]))

Java

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.ClassificationType;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.cloud.automl.v1.TextClassificationDatasetMetadata;

import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LanguageTextClassificationCreateDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");

      // Specify the classification type
      // Types:
      // MultiLabel: Multiple labels are allowed for one example.
      // MultiClass: At most one label is allowed per example.
      ClassificationType classificationType = ClassificationType.MULTILABEL;

      // Specify the text classification type for the dataset.
      TextClassificationDatasetMetadata metadata =
          TextClassificationDatasetMetadata.newBuilder()
              .setClassificationType(classificationType)
              .build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTextClassificationDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // To get the dataset id, you have to parse it out of the `name` field. As dataset Ids are
      // required for other methods.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require(`@google-cloud/automl`).v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      textClassificationDatasetMetadata: {
        classificationType: 'MULTICLASS',
      },
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  console.log(`
    Dataset id: ${
      response.name
        .split('/')
        [response.name.split('/').length - 1].split('\n')[0]
    }`);
}

createDataset();

C#

/// <summary>
///  Demonstrates using the AutoML client to create a dataset.
/// </summary>
/// <param name="projectId">GCP Project ID.</param>
/// <param name="displayName">The name of the dataset to be created.</param>
public static object LanguageTextClassificationCreateDataset(string projectId = "YOUR-PROJECT-ID",
    string displayName = "YOUR-DATASET-NAME")
{
    // Initialize the client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    AutoMlClient client = AutoMlClient.Create();

    // A resource that represents Google Cloud Platform location.
    string projectLocation = LocationName.Format(projectId, "us-central1");

    // Specify the classification type
    // Types:
    // MultiLabel: Multiple labels are allowed for one example.
    // MultiClass: At most one label is allowed per example.
    ClassificationType classificationType = ClassificationType.Multilabel;

    // Specify the text classification type for the dataset.
    TextClassificationDatasetMetadata metadata = new TextClassificationDatasetMetadata
    {
        ClassificationType = classificationType
    };

    Dataset dataset = new Dataset
    {
        DisplayName = displayName,
        TextClassificationDatasetMetadata = metadata
    };

    var result = Task.Run(() => client.CreateDatasetAsync(projectLocation, dataset)).Result;
    Dataset createdDataset = result.PollUntilCompleted().Result;

    // Display the dataset information.
    Console.WriteLine($"Dataset name: {createdDataset.Name}");
    // To get the dataset id, you have to parse it out of the `name` field. As dataset Ids are
    // required for other methods.
    // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
    string[] names = createdDataset.Name.Split("/");
    string datasetId = names[names.Length - 1];
    Console.WriteLine($"Dataset id: {datasetId}");
    return 0;
}

Go

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	automlpb "google.golang.org/genproto/googleapis/cloud/automl/v1"
)

// languageTextClassificationCreateDataset creates a dataset for text classification.
func languageTextClassificationCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_TextClassificationDatasetMetadata{
				TextClassificationDatasetMetadata: &automlpb.TextClassificationDatasetMetadata{
					// Specify the classification type:
					// - MULTILABEL: Multiple labels are allowed for one example.
					// - MULTICLASS: At most one label is allowed per example.
					ClassificationType: automlpb.ClassificationType_MULTICLASS,
				},
			},
		},
	}

	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %v", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %v", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}

PHP

use Google\Cloud\AutoMl\V1\AutoMlClient;
use Google\Cloud\AutoMl\V1\ClassificationType;
use Google\Cloud\AutoMl\V1\Dataset;
use Google\Cloud\AutoMl\V1\TextClassificationDatasetMetadata;

/** Uncomment and populate these variables in your code */
// $projectId = '[Google Cloud Project ID]';
// $location = 'us-central1';
// $displayName = 'your_dataset_name';

$client = new AutoMlClient();

try {
    // resource that represents Google Cloud Platform location
    $formattedParent = $client->locationName(
        $projectId,
        $location
    );

    $metadata = (new TextClassificationDatasetMetadata())
        ->setClassificationType(ClassificationType::MULTICLASS);
    $dataset = (new Dataset())
        ->setDisplayName($displayName)
        ->setTextClassificationDatasetMetadata($metadata);

    // create dataset with the above location and metadata
    $operationResponse = $client->createDataset($formattedParent, $dataset);
    $operationResponse->pollUntilComplete();
    if ($operationResponse->operationSucceeded()) {
        $result = $operationResponse->getResult();

        // display dataset information
        $splitName = explode('/', $result->getName());
        printf('Dataset name: %s' . PHP_EOL, $result->getName());
        printf('Dataset id: %s' . PHP_EOL, end($splitName));
    } else {
        $error = $operationResponse->getError();
        // handleError($error)
    }
} finally {
    $client->close();
}

Ruby

require "google/cloud/automl"

project_id = "YOUR_PROJECT_ID"
display_name = "YOUR_DATASET_NAME"

client = Google::Cloud::AutoML::AutoML.new

# A resource that represents Google Cloud Platform location.
project_location = client.class.location_path project_id, "us-central1"
dataset = {
  display_name:                         display_name,
  text_classification_dataset_metadata: {
    # Specify the classification type
    # Types:
    # MultiLabel: Multiple labels are allowed for one example.
    # MultiClass: At most one label is allowed per example.
    classification_type: :MULTICLASS
  }
}

# Create a dataset with the dataset metadata in the region.
created_dataset = client.create_dataset project_location, dataset

# Display the dataset information
puts "Dataset name: #{created_dataset.name}"
puts "Dataset id: #{created_dataset.name.split('/').last}"

Entity extraction

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets

Request JSON body:

{
  "displayName": "test_dataset",
  "textExtractionDatasetMetadata": {
   }
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  name: "projects/000000000000/locations/us-central1/datasets/TEN5582774688079151104"
  display_name: "test_dataset"
  create_time {
     seconds: 1539886451
     nanos: 757650000
   }
   text_extraction_dataset_metadata {
   }
}

Python

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = 'YOUR_PROJECT_ID'
# display_name = 'YOUR_DATASET_NAME'

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, 'us-central1')
metadata = automl.types.TextExtractionDatasetMetadata()
dataset = automl.types.Dataset(
    display_name=display_name,
    text_extraction_dataset_metadata=metadata)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(project_location, dataset)

created_dataset = response.result()

# Display the dataset information
print(u'Dataset name: {}'.format(created_dataset.name))
print(u'Dataset id: {}'.format(created_dataset.name.split('/')[-1]))

Java

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.cloud.automl.v1.TextExtractionDatasetMetadata;

import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LanguageEntityExtractionCreateDataset {

  static void createDataset() throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");

      TextExtractionDatasetMetadata metadata = TextExtractionDatasetMetadata.newBuilder().build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTextExtractionDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // To get the dataset id, you have to parse it out of the `name` field. As dataset Ids are
      // required for other methods.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require(`@google-cloud/automl`).v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      textExtractionDatasetMetadata: {},
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  console.log(`
    Dataset id: ${
      response.name
        .split('/')
        [response.name.split('/').length - 1].split('\n')[0]
    }`);
}

createDataset();

C#

/// <summary>
///  Demonstrates using the AutoML client to create a dataset.
/// </summary>
/// <param name="projectId">GCP Project ID.</param>
/// <param name="displayName">The name of the dataset to be created.</param>
public static object LanguageEntityExtractionCreateDataset(string projectId = "YOUR-PROJECT-ID",
    string displayName = "YOUR-DATASET-NAME")
{
    // Initialize the client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    AutoMlClient client = AutoMlClient.Create();

    // A resource that represents Google Cloud Platform location.
    string projectLocation = LocationName.Format(projectId, "us-central1");

    TextExtractionDatasetMetadata metadata = new TextExtractionDatasetMetadata
    {
    };

    Dataset dataset = new Dataset
    {
        DisplayName = displayName,
        TextExtractionDatasetMetadata = metadata
    };

    var result = Task.Run(() => client.CreateDatasetAsync(projectLocation, dataset)).Result;
    Dataset createdDataset = result.PollUntilCompleted().Result;

    // Display the dataset information.
    Console.WriteLine($"Dataset name: {createdDataset.Name}");
    // To get the dataset id, you have to parse it out of the `name` field. As dataset Ids are
    // required for other methods.
    // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
    string[] names = createdDataset.Name.Split("/");
    string datasetId = names[names.Length - 1];
    Console.WriteLine($"Dataset id: {datasetId}");

    return 0;
}

Go

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	automlpb "google.golang.org/genproto/googleapis/cloud/automl/v1"
)

// languageEntityExtractionCreateDataset creates a dataset for text entity extraction.
func languageEntityExtractionCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_TextExtractionDatasetMetadata{
				TextExtractionDatasetMetadata: &automlpb.TextExtractionDatasetMetadata{},
			},
		},
	}

	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %v", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %v", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}

PHP

use Google\Cloud\AutoMl\V1\AutoMlClient;
use Google\Cloud\AutoMl\V1\Dataset;
use Google\Cloud\AutoMl\V1\TextExtractionDatasetMetadata;

/** Uncomment and populate these variables in your code */
// $projectId = '[Google Cloud Project ID]';
// $location = 'us-central1';
// $displayName = 'your_dataset_name';

$client = new AutoMlClient();

try {
    // resource that represents Google Cloud Platform location
    $formattedParent = $client->locationName(
        $projectId,
        $location
    );

    $metadata = new TextExtractionDatasetMetadata();
    $dataset = (new Dataset())
        ->setDisplayName($displayName)
        ->setTextExtractionDatasetMetadata($metadata);

    // create dataset with the above location and metadata
    $operationResponse = $client->createDataset($formattedParent, $dataset);
    $operationResponse->pollUntilComplete();
    if ($operationResponse->operationSucceeded()) {
        $result = $operationResponse->getResult();

        // display dataset information
        $splitName = explode('/', $result->getName());
        printf('Dataset name: %s' . PHP_EOL, $result->getName());
        printf('Dataset id: %s' . PHP_EOL, end($splitName));
    } else {
        $error = $operationResponse->getError();
        // handleError($error)
    }
} finally {
    $client->close();
}

Ruby

require "google/cloud/automl"

project_id = "YOUR_PROJECT_ID"
display_name = "YOUR_DATASET_NAME"

client = Google::Cloud::AutoML::AutoML.new

# A resource that represents Google Cloud Platform location.
project_location = client.class.location_path project_id, "us-central1"
dataset = {
  display_name:                     display_name,
  text_extraction_dataset_metadata: {}
}

# Create a dataset with the dataset metadata in the region.
created_dataset = client.create_dataset project_location, dataset

# Display the dataset information
puts "Dataset name: #{created_dataset.name}"
puts "Dataset id: #{created_dataset.name.split('/').last}"

Sentiment analysis

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets

Request JSON body:

{
  "displayName": "test_dataset",
  "textSentimentDatasetMetadata": {
    "sentimentMax": 4
  }
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
  name: "projects/000000000000/locations/us-central1/datasets/TST8962998974766436002"
  display_name: "test_dataset_name"
  create_time {
    seconds: 1538855662
    nanos: 51542000
  }
  text_sentiment_dataset_metadata {
    sentiment_max: 7
  }
}

Python

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = 'YOUR_PROJECT_ID'
# display_name = 'YOUR_DATASET_NAME'

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, 'us-central1')
metadata = automl.types.TextSentimentDatasetMetadata(
    sentiment_max=4)  # Possible max sentiment score: 1-10
dataset = automl.types.Dataset(
    display_name=display_name,
    text_sentiment_dataset_metadata=metadata)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(project_location, dataset)

created_dataset = response.result()

# Display the dataset information
print(u'Dataset name: {}'.format(created_dataset.name))
print(u'Dataset id: {}'.format(created_dataset.name.split('/')[-1]))

Java

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.cloud.automl.v1.TextSentimentDatasetMetadata;

import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LanguageSentimentAnalysisCreateDataset {

  static void createDataset() throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");
      // Specify the text classification type for the dataset.
      TextSentimentDatasetMetadata metadata =
          TextSentimentDatasetMetadata.newBuilder()
              .setSentimentMax(4) // Possible max sentiment score: 1-10
              .build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTextSentimentDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // To get the dataset id, you have to parse it out of the `name` field. As dataset Ids are
      // required for other methods.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require(`@google-cloud/automl`).v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      textSentimentDatasetMetadata: {
        sentimentMax: 4, // Possible max sentiment score: 1-10
      },
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  console.log(`
    Dataset id: ${
      response.name
        .split('/')
        [response.name.split('/').length - 1].split('\n')[0]
    }`);
}

createDataset();

C#

/// <summary>
///  Demonstrates using the AutoML client to create a dataset.
/// </summary>
/// <param name="projectId">GCP Project ID.</param>
/// <param name="displayName">The name of the dataset to be created.</param>
public static object LanguageSentimentAnalysisCreateDataset(string projectId = "YOUR-PROJECT-ID",
    string displayName = "YOUR-DATASET-NAME")
{
    // Initialize the client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    AutoMlClient client = AutoMlClient.Create();

    // A resource that represents Google Cloud Platform location.
    string projectLocation = LocationName.Format(projectId, "us-central1");

    TextSentimentDatasetMetadata metadata = new TextSentimentDatasetMetadata
    {
        SentimentMax = 4 // Possible max sentiment score: 1-10.
    };

    Dataset dataset = new Dataset
    {
        DisplayName = displayName,
        TextSentimentDatasetMetadata = metadata
    };

    var result = Task.Run(() => client.CreateDatasetAsync(projectLocation, dataset)).Result;
    Dataset createdDataset = result.PollUntilCompleted().Result;

    // Display the dataset information.
    Console.WriteLine($"Dataset name: {createdDataset.Name}");
    // To get the dataset id, you have to parse it out of the `name` field. As dataset Ids are
    // required for other methods.
    // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
    string[] names = createdDataset.Name.Split("/");
    string datasetId = names[names.Length - 1];
    Console.WriteLine($"Dataset id: {datasetId}");

    return 0;
}

Go

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	automlpb "google.golang.org/genproto/googleapis/cloud/automl/v1"
)

// languageSentimentAnalysisCreateDataset creates a dataset for text sentiment analysis.
func languageSentimentAnalysisCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_TextSentimentDatasetMetadata{
				TextSentimentDatasetMetadata: &automlpb.TextSentimentDatasetMetadata{
					SentimentMax: 4, // Possible max sentiment score: 1-10
				},
			},
		},
	}

	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %v", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %v", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}

PHP

use Google\Cloud\AutoMl\V1\AutoMlClient;
use Google\Cloud\AutoMl\V1\Dataset;
use Google\Cloud\AutoMl\V1\TextSentimentDatasetMetadata;

/** Uncomment and populate these variables in your code */
// $projectId = '[Google Cloud Project ID]';
// $location = 'us-central1';
// $displayName = 'your_dataset_name';

$client = new AutoMlClient();

try {
    // resource that represents Google Cloud Platform location
    $formattedParent = $client->locationName(
        $projectId,
        $location
    );

    $metadata = (new TextSentimentDatasetMetadata())
        ->setSentimentMax(4); // Possible max sentiment score: 1-10
    $dataset = (new Dataset())
        ->setDisplayName($displayName)
        ->setTextSentimentDatasetMetadata($metadata);

    // create dataset with the above location and metadata
    $operationResponse = $client->createDataset($formattedParent, $dataset);
    $operationResponse->pollUntilComplete();
    if ($operationResponse->operationSucceeded()) {
        $result = $operationResponse->getResult();

        // display dataset information
        $splitName = explode('/', $result->getName());
        printf('Dataset name: %s' . PHP_EOL, $result->getName());
        printf('Dataset id: %s' . PHP_EOL, end($splitName));
    } else {
        $error = $operationResponse->getError();
        // handleError($error)
    }
} finally {
    $client->close();
}

Ruby

require "google/cloud/automl"

project_id = "YOUR_PROJECT_ID"
display_name = "YOUR_DATASET_NAME"

client = Google::Cloud::AutoML::AutoML.new

# A resource that represents Google Cloud Platform location.
project_location = client.class.location_path project_id, "us-central1"
dataset = {
  display_name:                    display_name,
  text_sentiment_dataset_metadata: {
    # Possible max sentiment score: 1-10
    sentiment_max: 4
  }
}

# Create a dataset with the dataset metadata in the region.
created_dataset = client.create_dataset project_location, dataset
#
# # Wait until the long running operation is done
# operation.wait_until_done!
# created_dataset = operation.response

# Display the dataset information
puts "Dataset name: #{created_dataset.name}"
puts "Dataset id: #{created_dataset.name.split('/').last}"

Importing training data into a dataset

After you have created a dataset, you can import document URIs and labels for documents from a CSV file stored in a Google Cloud Storage bucket. For details on preparing your data and creating a CSV file for import, see Preparing your training data.

You can import documents into an empty dataset or import additional documents into an existing dataset.

Web UI

To import documents into a dataset:

  1. Select the dataset you want to import documents into from the Datasets page.

  2. On the Import tab, specify where to find the training documents.

    You can:

    • Upload a .csv file that contains the training documents and their associated category labels from your local computer or from Google Cloud Storage.

    • Upload a collection of .txt, .pdf, or .zip files that contain the training documents from your local computer.

  3. Select the file(s) to import and the Google Cloud Storage path for the imported documents.

  4. Click Import.

Code samples

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union
  • dataset-id: your dataset ID
  • bucket-name: your Google Cloud Storage bucket
  • csv-file-name: your CSV training data file

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets/dataset-id:importData

Request JSON body:

{
  "inputConfig": {
    "gcsSource": {
      "inputUris": ["gs://bucket-name/csv-file-name.csv"]
      }
  }
}

To send your request, expand one of these options:

You should see output similar to the following. You can use the operation ID to get the status of the task. For an example, see Getting the status of an operation.

{
  "name": "projects/434039606874/locations/us-central1/operations/1979469554520650937",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
    "createTime": "2018-04-27T01:28:36.128120Z",
    "updateTime": "2018-04-27T01:28:36.128150Z",
    "cancellable": true
  }
}

Python

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = 'YOUR_PROJECT_ID'
# dataset_id = 'YOUR_DATASET_ID'
# path = 'gs://BUCKET_ID/path_to_training_data.csv'

client = automl.AutoMlClient()
# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
    project_id, 'us-central1', dataset_id)
# Get the multiple Google Cloud Storage URIs
input_uris = path.split(',')
gcs_source = automl.types.GcsSource(
    input_uris=input_uris)
input_config = automl.types.InputConfig(
    gcs_source=gcs_source)
# Import data from the input URI
response = client.import_data(dataset_full_id, input_config)

print('Processing import...')
print(u'Data imported. {}'.format(response.result()))

Java

import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.DatasetName;
import com.google.cloud.automl.v1.GcsSource;
import com.google.cloud.automl.v1.InputConfig;
import com.google.protobuf.Empty;

import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;

class ImportDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    String path = "gs://BUCKET_ID/path_to_training_data.csv";
    importDataset(projectId, datasetId, path);
  }

  // Import a dataset
  static void importDataset(String projectId, String datasetId, String path)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // Get the complete path of the dataset.
      DatasetName datasetFullId = DatasetName.of(projectId, "us-central1", datasetId);

      // Get multiple Google Cloud Storage URIs to import data from
      GcsSource gcsSource =
          GcsSource.newBuilder().addAllInputUris(Arrays.asList(path.split(","))).build();

      // Import data from the input URI
      InputConfig inputConfig = InputConfig.newBuilder().setGcsSource(gcsSource).build();
      System.out.println("Processing import...");

      Empty response = client.importDataAsync(datasetFullId, inputConfig).get();
      System.out.format("Dataset imported. %s\n", response);
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const datasetId = 'YOUR_DISPLAY_ID';
// const path = 'gs://BUCKET_ID/path_to_training_data.csv';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require(`@google-cloud/automl`).v1;

// Instantiates a client
const client = new AutoMlClient();

async function importDataset() {
  // Construct request
  const request = {
    name: client.datasetPath(projectId, location, datasetId),
    inputConfig: {
      gcsSource: {
        inputUris: path.split(','),
      },
    },
  };

  // Import dataset
  console.log(`Proccessing import`);
  const [operation] = await client.importData(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();
  console.log(`Dataset imported: ${response}`);
}

importDataset();

C#

/// <summary>
/// Import labeled items.
/// </summary>
/// <param name="projectId">GCP Project ID.</param>
/// <param name="datasetId">the Id of the dataset.</param>
/// <param name="path">Google Cloud Storage URIs. 
/// Target files must be in AutoML CSV format.</param>
public static object ImportDataset(string projectId = "YOUR-PROJECT-ID",
    string datasetId = "YOUR-DATASET-ID",
    string path = "gs://BUCKET_ID/path_to_training_data.csv")
{
    // Initialize the client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    AutoMlClient client = AutoMlClient.Create();

    // Get the complete path of the dataset.
    string datasetFullId = DatasetName.Format(projectId, "us-central1", datasetId);

    // Get multiple Google Cloud Storage URIs to import data from
    GcsSource gcsSource = new GcsSource
    {
        InputUris = { path.Split(",") }
    };

    // Import data from the input URI
    InputConfig inputConfig = new InputConfig
    {
        GcsSource = gcsSource
    };

    var result = Task.Run(() => client.ImportDataAsync(datasetFullId, inputConfig)).Result;
    Console.WriteLine("Processing import...");
    result.PollUntilCompleted();
    Console.WriteLine($"Data imported.");
    return 0;
}

Go

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	automlpb "google.golang.org/genproto/googleapis/cloud/automl/v1"
)

// importDataIntoDataset imports data into a dataset.
func importDataIntoDataset(w io.Writer, projectID string, location string, datasetID string, inputURI string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetID := "TRL123456789..."
	// inputURI := "gs://BUCKET_ID/path_to_training_data.csv"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}
	defer client.Close()

	req := &automlpb.ImportDataRequest{
		Name: fmt.Sprintf("projects/%s/locations/%s/datasets/%s", projectID, location, datasetID),
		InputConfig: &automlpb.InputConfig{
			Source: &automlpb.InputConfig_GcsSource{
				GcsSource: &automlpb.GcsSource{
					InputUris: []string{inputURI},
				},
			},
		},
	}

	op, err := client.ImportData(ctx, req)
	if err != nil {
		return fmt.Errorf("ImportData: %v", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	if err := op.Wait(ctx); err != nil {
		return fmt.Errorf("Wait: %v", err)
	}

	fmt.Fprintf(w, "Data imported.\n")

	return nil
}

PHP

use Google\Cloud\AutoMl\V1\AutoMlClient;
use Google\Cloud\AutoMl\V1\GcsSource;
use Google\Cloud\AutoMl\V1\InputConfig;

/** Uncomment and populate these variables in your code */
// $projectId = '[Google Cloud Project ID]';
// $location = 'us-central1';
// $datasetId = 'my_dataset_id_123';
// $gcsUri = 'gs://BUCKET_ID/path_to_training_data/'

$client = new AutoMlClient();

try {
    // get full path of dataset
    $formattedName = $client->datasetName(
        $projectId,
        $location,
        $datasetId
    );

    // set GCS uri
    $gcsSource = (new GcsSource())
        ->setInputUri($gcsUri);
    $inputConfig = (new InputConfig())
        ->setGcsSource($gcsSource);

    // import data from input uri
    $operationResponse = $client->importData($formattedName, $inputConfig);
    $operationResponse->pollUntilComplete();
    if ($operationResponse->operationSucceeded()) {
        $result = $operationResponse->getResult();
        printf('Dataset imported.' . PHP_EOL);
    } else {
        $error = $operationResponse->getError();
        // handleError($error)
    }
} finally {
    $client->close();
}

Ruby

require "google/cloud/automl"

project_id = "YOUR_PROJECT_ID"
dataset_id = "YOUR_DATASET_ID"
path = "gs://BUCKET_ID/path_to_training_data.csv"

client = Google::Cloud::AutoML::AutoML.new

# Get the full path of the dataset.
dataset_full_id = client.class.dataset_path project_id, "us-central1", dataset_id
input_config = {
  gcs_source: {
    # Get the multiple Google Cloud Storage URIs
    input_uris: path.split(",")
  }
}

# Import data from the input URI
operation = client.import_data dataset_full_id, input_config

puts "Processing import..."

# Wait until the long running operation is done
operation.wait_until_done!

puts "Data imported."

Labeling training documents

To be useful for training a model, each document in a dataset needs to be labeled in the way you want AutoML Natural Language to label similar documents. The quality of your training data strongly impacts the effectiveness of the model you create, and by extension, the quality of the predictions returned from that model. AutoML Natural Language ignores non-labeled documents during training.

You can provide labels for your training documents in three ways:

  • Include labels in your .csv file (for classification and sentiment analysis only)
  • Label your documents in the AutoML Natural Language UI
  • Request labeling from human labelers using the AI Platform Data Labeling Service

The AutoML API does not include methods for labeling.

For details about labeling documents in your .csv file, see Preparing your training data.

Labeling for classification and sentiment analysis

To label documents in the AutoML Natural Language UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual documents in the dataset along with their current labels. The navigation bar along the left summarizes the number of labeled and unlabeled documents and enables you to filter the document list by label or sentiment value.

Text items page

Text items page

To assign labels or sentiment values to unlabeled documents or change document labels, select the documents you want to update and the label(s) or value you want to assign to them. There are two ways to update an document's label:

  • Click the check box next to the documents you want to update, then select the label(s) to apply from the Label drop-down list that appears at the top of the document list.

  • Click the row of the document you want to update, then select the label(s) or value to apply from the list that appears on the Text detail page.

Identifying entities for entity extraction

Before training your custom model, you need to annotate the training documents in the dataset. You can annotate training documents before importing them, or you can add annotations in the AutoML Natural Language UI.

To annotate in the AutoML Natural Language UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual documents in the dataset along with any annotations in them. The navigation bar along the left summarizes the labels and the number of times each label appears. You can also filter the document list by label.

Annotation list

To add or delete annotations within a document, double-click the document you want to update. The Edit page shows the complete text of the selected document, with all previous annotations highlighted.

Entity editor

For PDF training documents or documents imported with layout information, the Edit page has two tabs: Plain text and Structured text. The Plain text tab shows the raw contents of the training document without any formatting. The Structured text tab recreates the basic layout of the training document. (The Plain text tab also has a link to the original PDF file.)

Structured text editor

To add a new annotation, highlight the text that represents the entity, select the label from the Annotate dialog box, and click Save. When you add annotations on the Structured text tab, AutoML Natural Language captures the annotation's position on the page as a factor considered during training.

Add annotation

To remove an annotation, locate the text in the list of labels on the right and click the garbage can icon next to it.