Creating datasets and importing data

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset resource.
  2. Import training data into the dataset.
  3. Label the documents or identify the entities.

For classification and sentiment analysis, steps 2 and 3 are often combined: you can import documents with their labels already assigned.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. The newly created dataset doesn't contain any data until you import documents into it.

Web UI

To create a dataset:

  1. Open the AutoML Natural Language UI and select Get started in the box corresponding to the type of model you plan to train.

    The Datasets page appears, showing the status of previously created datasets for the current project.

    To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.

  2. Click the New Dataset button in the title bar.

  3. Enter a name for the dataset and select the geographical Location where the dataset will be stored.

    See Locations for more information.

  4. Select your model objective, which specifies what type of analysis you'll perform with the model you train using this dataset.

    • Single label classification assigns a single label to each classified document.
    • Multi-label classification allows a document to be assigned multiple labels.
    • Entity extraction identifies entities in documents.
    • Sentiment analysis analyzes attitudes within documents.

  5. Click Create dataset.

    The Import page for the new dataset appears. See Importing data into a dataset.
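Each model objective corresponds to one of the dataset metadata fields used in the REST request bodies later on this page. The following Python sketch shows that correspondence. The helper name `dataset_metadata_for` and the objective keys (`single_label`, `multi_label`, `entity_extraction`, `sentiment`) are hypothetical; the field names and enum values come from the request bodies on this page, and the default `sentiment_max` of 4 only mirrors those samples.

```python
def dataset_metadata_for(objective, sentiment_max=4):
    """Return the metadata part of a create-dataset request body.

    The objective keys are hypothetical labels for the Web UI choices;
    the returned field names match the REST samples on this page.
    """
    if objective == "single_label":
        # At most one label per document.
        return {"textClassificationDatasetMetadata": {"classificationType": "MULTICLASS"}}
    if objective == "multi_label":
        # Several labels may apply to one document.
        return {"textClassificationDatasetMetadata": {"classificationType": "MULTILABEL"}}
    if objective == "entity_extraction":
        # Entity extraction datasets take an empty metadata object.
        return {"textExtractionDatasetMetadata": {}}
    if objective == "sentiment":
        return {"textSentimentDatasetMetadata": {"sentimentMax": sentiment_max}}
    raise ValueError(f"unknown objective: {objective!r}")


# Combine the metadata with a display name to form a full request body.
body = {"displayName": "test_dataset", **dataset_metadata_for("multi_label")}
print(body)
```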

Code samples

Classification

REST

Before using any of the request data, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets

Request JSON body:

{
  "displayName": "test_dataset",
  "textClassificationDatasetMetadata": {
    "classificationType": "MULTICLASS"
  }
}

You should receive a JSON response similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
  "displayName": "test_dataset",
  "createTime": "2018-04-26T18:02:59.825060Z",
  "textClassificationDatasetMetadata": {
    "classificationType": "MULTICLASS"
  }
}
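The request above can also be assembled in code. The following Python sketch only builds the URL and JSON body and does not send anything. The helper name `build_create_dataset_request` is hypothetical, and the check that `location_id` is `us-central1` or `eu` is an assumption based on the two locations listed above.

```python
import json

BASE_URL = "https://automl.googleapis.com/v1"
# Locations named on this page; treating these as the only valid values
# is an assumption of this sketch.
KNOWN_LOCATIONS = {"us-central1", "eu"}
CLASSIFICATION_TYPES = {"MULTICLASS", "MULTILABEL"}


def build_create_dataset_request(project_id, location_id, display_name,
                                 classification_type="MULTICLASS"):
    """Return (url, json_body) for a classification dataset request."""
    if location_id not in KNOWN_LOCATIONS:
        raise ValueError(f"unexpected location: {location_id!r}")
    if classification_type not in CLASSIFICATION_TYPES:
        raise ValueError(f"unexpected classification type: {classification_type!r}")
    url = f"{BASE_URL}/projects/{project_id}/locations/{location_id}/datasets"
    body = {
        "displayName": display_name,
        "textClassificationDatasetMetadata": {
            "classificationType": classification_type,
        },
    }
    return url, json.dumps(body, indent=2)


url, payload = build_create_dataset_request("my-project", "us-central1", "test_dataset")
print("POST", url)
print(payload)
```

Sending the request also requires credentials, which this sketch leaves out.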

Python

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Python API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = "YOUR_PROJECT_ID"
# display_name = "YOUR_DATASET_NAME"

client = automl.AutoMlClient()

# A resource that represents a Google Cloud Platform location.
project_location = f"projects/{project_id}/locations/us-central1"
# Specify the classification type:
# - MULTILABEL: Multiple labels are allowed for one example.
# - MULTICLASS: At most one label is allowed per example.
metadata = automl.TextClassificationDatasetMetadata(
    classification_type=automl.ClassificationType.MULTICLASS
)
dataset = automl.Dataset(
    display_name=display_name,
    text_classification_dataset_metadata=metadata,
)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(parent=project_location, dataset=dataset)

created_dataset = response.result()

# Display the dataset information
print(f"Dataset name: {created_dataset.name}")
print("Dataset id: {}".format(created_dataset.name.split("/")[-1]))
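Other methods take the dataset ID rather than the full resource name, so it can help to parse the name more defensively than a bare `split`. The following sketch assumes the name form `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}` noted in the Java sample's comments; `parse_dataset_name` is a hypothetical helper, not part of the client library.

```python
import re

# Pattern for the resource name form documented in the samples on this page.
_DATASET_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/datasets/(?P<dataset_id>[^/]+)$"
)


def parse_dataset_name(name):
    """Split a dataset resource name into project, location, and dataset ID."""
    match = _DATASET_NAME.match(name)
    if match is None:
        raise ValueError(f"not a dataset resource name: {name!r}")
    return match.groupdict()


parts = parse_dataset_name(
    "projects/434039606874/locations/us-central1/datasets/356587829854924648"
)
print(parts["dataset_id"])
```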

Java

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Java API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.ClassificationType;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.cloud.automl.v1.TextClassificationDatasetMetadata;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LanguageTextClassificationCreateDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents a Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");

      // Specify the classification type:
      // - MULTILABEL: Multiple labels are allowed for one example.
      // - MULTICLASS: At most one label is allowed per example.
      ClassificationType classificationType = ClassificationType.MULTILABEL;

      // Specify the text classification type for the dataset.
      TextClassificationDatasetMetadata metadata =
          TextClassificationDatasetMetadata.newBuilder()
              .setClassificationType(classificationType)
              .build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTextClassificationDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // To get the dataset ID, parse it out of the `name` field. Dataset IDs are
      // required by other methods.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}

Node.js

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Node.js API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      textClassificationDatasetMetadata: {
        classificationType: 'MULTICLASS',
      },
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  // The dataset ID is the last segment of the resource name.
  console.log(`Dataset id: ${response.name.split('/').pop()}`);
}

createDataset();

Go

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Go API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// languageTextClassificationCreateDataset creates a dataset for text classification.
func languageTextClassificationCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_TextClassificationDatasetMetadata{
				TextClassificationDatasetMetadata: &automlpb.TextClassificationDatasetMetadata{
					// Specify the classification type:
					// - MULTILABEL: Multiple labels are allowed for one example.
					// - MULTICLASS: At most one label is allowed per example.
					ClassificationType: automlpb.ClassificationType_MULTICLASS,
				},
			},
		},
	}

	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %w", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %w", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for Ruby.

Entity extraction

REST

Before using any of the request data, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets

Request JSON body:

{
  "displayName": "test_dataset",
  "textExtractionDatasetMetadata": {}
}

You should receive a response similar to the following (shown here in protocol buffer text format):

{
  name: "projects/000000000000/locations/us-central1/datasets/TEN5582774688079151104"
  display_name: "test_dataset"
  create_time {
    seconds: 1539886451
    nanos: 757650000
  }
  text_extraction_dataset_metadata {
  }
}

Python

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Python API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = "YOUR_PROJECT_ID"
# display_name = "YOUR_DATASET_NAME"

client = automl.AutoMlClient()

# A resource that represents a Google Cloud Platform location.
project_location = f"projects/{project_id}/locations/us-central1"
metadata = automl.TextExtractionDatasetMetadata()
dataset = automl.Dataset(
    display_name=display_name, text_extraction_dataset_metadata=metadata
)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(parent=project_location, dataset=dataset)

created_dataset = response.result()

# Display the dataset information
print(f"Dataset name: {created_dataset.name}")
print("Dataset id: {}".format(created_dataset.name.split("/")[-1]))

Java

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Java API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.cloud.automl.v1.TextExtractionDatasetMetadata;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LanguageEntityExtractionCreateDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents a Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");

      TextExtractionDatasetMetadata metadata = TextExtractionDatasetMetadata.newBuilder().build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTextExtractionDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // To get the dataset ID, parse it out of the `name` field. Dataset IDs are
      // required by other methods.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}

Node.js

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Node.js API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      textExtractionDatasetMetadata: {},
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  // The dataset ID is the last segment of the resource name.
  console.log(`Dataset id: ${response.name.split('/').pop()}`);
}

createDataset();

Go

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Go API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// languageEntityExtractionCreateDataset creates a dataset for text entity extraction.
func languageEntityExtractionCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_TextExtractionDatasetMetadata{
				TextExtractionDatasetMetadata: &automlpb.TextExtractionDatasetMetadata{},
			},
		},
	}

	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %w", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %w", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for Ruby.

Sentiment analysis

REST

Before using any of the request data, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets

Request JSON body:

{
  "displayName": "test_dataset",
  "textSentimentDatasetMetadata": {
    "sentimentMax": 4
  }
}

You should receive a response similar to the following (shown here in protocol buffer text format):

{
  name: "projects/000000000000/locations/us-central1/datasets/TST8962998974766436002"
  display_name: "test_dataset"
  create_time {
    seconds: 1538855662
    nanos: 51542000
  }
  text_sentiment_dataset_metadata {
    sentiment_max: 4
  }
}

Python

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Python API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = "YOUR_PROJECT_ID"
# display_name = "YOUR_DATASET_NAME"

client = automl.AutoMlClient()

# A resource that represents a Google Cloud Platform location.
project_location = f"projects/{project_id}/locations/us-central1"

# Each dataset requires a sentiment score with a defined sentiment_max
# value, for more information on TextSentimentDatasetMetadata, see:
# https://cloud.google.com/natural-language/automl/docs/prepare#sentiment-analysis
# https://cloud.google.com/automl/docs/reference/rpc/google.cloud.automl.v1#textsentimentdatasetmetadata
metadata = automl.TextSentimentDatasetMetadata(
    sentiment_max=4
)  # Possible max sentiment score: 1-10

dataset = automl.Dataset(
    display_name=display_name, text_sentiment_dataset_metadata=metadata
)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(parent=project_location, dataset=dataset)

created_dataset = response.result()

# Display the dataset information
print(f"Dataset name: {created_dataset.name}")
print("Dataset id: {}".format(created_dataset.name.split("/")[-1]))
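The sample comments give 1-10 as the possible range for the maximum sentiment score. The following sketch validates that value before it is placed in a request body. `sentiment_metadata` is a hypothetical helper, and the inclusive 1-10 range is taken from the comments in the samples on this page rather than checked against the service.

```python
def sentiment_metadata(sentiment_max):
    """Build the sentiment metadata field after checking the score range."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sentiment_max, bool) or not isinstance(sentiment_max, int):
        raise TypeError("sentiment_max must be an integer")
    if not 1 <= sentiment_max <= 10:
        raise ValueError(f"sentiment_max must be between 1 and 10, got {sentiment_max}")
    return {"textSentimentDatasetMetadata": {"sentimentMax": sentiment_max}}


print(sentiment_metadata(4))
```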

Java

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Java API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.cloud.automl.v1.TextSentimentDatasetMetadata;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LanguageSentimentAnalysisCreateDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents a Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");
      // Specify the maximum sentiment score for the dataset.
      TextSentimentDatasetMetadata metadata =
          TextSentimentDatasetMetadata.newBuilder()
              .setSentimentMax(4) // Possible max sentiment score: 1-10
              .build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTextSentimentDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // To get the dataset ID, parse it out of the `name` field. Dataset IDs are
      // required by other methods.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}

Node.js

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Node.js API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      textSentimentDatasetMetadata: {
        sentimentMax: 4, // Possible max sentiment score: 1-10
      },
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  // The dataset ID is the last segment of the resource name.
  console.log(`Dataset id: ${response.name.split('/').pop()}`);
}

createDataset();

Go

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Go API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// languageSentimentAnalysisCreateDataset creates a dataset for text sentiment analysis.
func languageSentimentAnalysisCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_TextSentimentDatasetMetadata{
				TextSentimentDatasetMetadata: &automlpb.TextSentimentDatasetMetadata{
					SentimentMax: 4, // Possible max sentiment score: 1-10
				},
			},
		},
	}

	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %w", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %w", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for Ruby.

Importing training data into a dataset

After you have created a dataset, you can import document URIs and labels for documents from a CSV file stored in a Cloud Storage bucket. For details on preparing your data and creating a CSV file for import, see Preparing your training data.

You can import documents into an empty dataset or import additional documents into an existing dataset.

Web UI

To import documents into a dataset:

  1. Select the dataset you want to import documents into from the Datasets page.

  2. On the Import tab, specify where to find the training documents.

    You can:

    • Upload a .csv file that contains the training documents and their associated category labels from your local computer or from Cloud Storage.

    • Upload a collection of .txt, .pdf, .tif, or .zip files that contain the training documents from your local computer.

  3. Select the file(s) to import and the Cloud Storage path for the imported documents.

  4. Click Import.

Code samples

REST

Before using any of the request data, make the following replacements:

  • project-id: your project ID
  • location-id: the location for the resource, us-central1 for the Global location or eu for the European Union
  • dataset-id: your dataset ID
  • bucket-name: your Cloud Storage bucket
  • csv-file-name: your CSV training data file

HTTP method and URL:

POST https://automl.googleapis.com/v1/projects/project-id/locations/location-id/datasets/dataset-id:importData

Request JSON body:

{
  "inputConfig": {
    "gcsSource": {
      "inputUris": ["gs://bucket-name/csv-file-name.csv"]
    }
  }
}

You should see output similar to the following. You can use the operation ID to get the status of the task. For an example, see Getting the status of an operation.

{
  "name": "projects/434039606874/locations/us-central1/operations/1979469554520650937",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.automl.v1.OperationMetadata",
    "createTime": "2018-04-27T01:28:36.128120Z",
    "updateTime": "2018-04-27T01:28:36.128150Z",
    "cancellable": true
  }
}
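The operation ID is the last path segment of the `name` field in the response. The following sketch extracts it from a response shaped like the one above; `operation_id_from_response` is a hypothetical helper name.

```python
import json


def operation_id_from_response(response_text):
    """Return the operation ID from a long-running operation response."""
    response = json.loads(response_text)
    name = response["name"]
    # Split on the last "/operations/" so the ID is everything after it.
    _, separator, operation_id = name.rpartition("/operations/")
    if not separator or not operation_id:
        raise ValueError(f"not an operation name: {name!r}")
    return operation_id


sample = (
    '{"name": "projects/434039606874/locations/us-central1'
    '/operations/1979469554520650937"}'
)
print(operation_id_from_response(sample))
```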

Python

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Python API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = "YOUR_PROJECT_ID"
# dataset_id = "YOUR_DATASET_ID"
# path = "gs://YOUR_BUCKET_ID/path/to/data.csv"

client = automl.AutoMlClient()
# Get the full path of the dataset.
dataset_full_id = client.dataset_path(project_id, "us-central1", dataset_id)
# Split the comma-separated path into individual Cloud Storage URIs.
input_uris = path.split(",")
gcs_source = automl.GcsSource(input_uris=input_uris)
input_config = automl.InputConfig(gcs_source=gcs_source)
# Import data from the input URI
response = client.import_data(name=dataset_full_id, input_config=input_config)

print("Processing import...")
print(f"Data imported. {response.result()}")
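The sample accepts several Cloud Storage URIs in one comma-separated string. The following sketch splits such a string a little more defensively and builds the `importData` request body shown in the REST sample. `build_import_body` is a hypothetical helper, and rejecting anything that does not start with `gs://` is an assumption about what you want to allow.

```python
def build_import_body(path):
    """Turn a comma-separated list of gs:// URIs into an importData body."""
    # Drop surrounding whitespace and skip empty entries such as a trailing comma.
    uris = [part.strip() for part in path.split(",") if part.strip()]
    if not uris:
        raise ValueError("no input URIs given")
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    return {"inputConfig": {"gcsSource": {"inputUris": uris}}}


print(build_import_body("gs://bucket-name/train.csv, gs://bucket-name/extra.csv"))
```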

Java

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Java API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.DatasetName;
import com.google.cloud.automl.v1.GcsSource;
import com.google.cloud.automl.v1.InputConfig;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.protobuf.Empty;
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ImportDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    String path = "gs://BUCKET_ID/path_to_training_data.csv";
    importDataset(projectId, datasetId, path);
  }

  // Import a dataset
  static void importDataset(String projectId, String datasetId, String path)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // Get the complete path of the dataset.
      DatasetName datasetFullId = DatasetName.of(projectId, "us-central1", datasetId);

      // Get multiple Google Cloud Storage URIs to import data from
      GcsSource gcsSource =
          GcsSource.newBuilder().addAllInputUris(Arrays.asList(path.split(","))).build();

      // Import data from the input URI
      InputConfig inputConfig = InputConfig.newBuilder().setGcsSource(gcsSource).build();
      System.out.println("Processing import...");

      // Start the import job
      OperationFuture<Empty, OperationMetadata> operation =
          client.importDataAsync(datasetFullId, inputConfig);

      System.out.format("Operation name: %s%n", operation.getName());

      // If you want to wait for the operation to finish, adjust the timeout appropriately. The
      // operation will still run if you choose not to wait for it to complete. You can check the
      // status of your operation using the operation's name.
      Empty response = operation.get(45, TimeUnit.MINUTES);
      System.out.format("Dataset imported. %s%n", response);
    } catch (TimeoutException e) {
      System.out.println("The operation's polling period was not long enough.");
      System.out.println("You can use the Operation's name to get the current status.");
      System.out.println("The import job is still running and will complete as expected.");
      throw e;
    }
  }
}

Node.js

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Node.js API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const datasetId = 'YOUR_DATASET_ID';
// const path = 'gs://BUCKET_ID/path_to_training_data.csv';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function importDataset() {
  // Construct request
  const request = {
    name: client.datasetPath(projectId, location, datasetId),
    inputConfig: {
      gcsSource: {
        inputUris: path.split(','),
      },
    },
  };

  // Import dataset
  console.log('Processing import');
  const [operation] = await client.importData(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();
  console.log(`Dataset imported: ${response}`);
}

importDataset();

Go

To learn how to install and use the client library for AutoML Natural Language, see AutoML Natural Language client libraries. For more information, see the AutoML Natural Language Go API reference documentation.

To authenticate to AutoML Natural Language, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// importDataIntoDataset imports data into a dataset.
func importDataIntoDataset(w io.Writer, projectID string, location string, datasetID string, inputURI string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetID := "TRL123456789..."
	// inputURI := "gs://BUCKET_ID/path_to_training_data.csv"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.ImportDataRequest{
		Name: fmt.Sprintf("projects/%s/locations/%s/datasets/%s", projectID, location, datasetID),
		InputConfig: &automlpb.InputConfig{
			Source: &automlpb.InputConfig_GcsSource{
				GcsSource: &automlpb.GcsSource{
					InputUris: []string{inputURI},
				},
			},
		},
	}

	op, err := client.ImportData(ctx, req)
	if err != nil {
		return fmt.Errorf("ImportData: %w", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	if err := op.Wait(ctx); err != nil {
		return fmt.Errorf("Wait: %w", err)
	}

	fmt.Fprintf(w, "Data imported.\n")

	return nil
}

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Natural Language reference documentation for Ruby.

Labeling training documents

To be useful for training a model, each document in a dataset needs to be labeled in the way you want AutoML Natural Language to label similar documents. The quality of your training data strongly affects the effectiveness of the model you create and, by extension, the quality of the predictions that model returns. AutoML Natural Language ignores unlabeled documents during training.

You can provide labels for your training documents in three ways:

  • Include labels in your .csv file (for classification and sentiment analysis only)
  • Label your documents in the AutoML Natural Language UI
  • Request labeling from human labelers using the AI Platform Data Labeling Service

The AutoML API does not include methods for labeling.

For details about labeling documents in your .csv file, see Preparing your training data.
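
As a rough illustration of including labels in a .csv file, the sketch below writes one row per document with Python's csv module. The column layout (document text or Cloud Storage URI first, followed by one or more labels) is an assumption made for this example; Preparing your training data is the authoritative description of the format. The function name build_labeled_csv and the sample rows are hypothetical.

```python
import csv
import io


def build_labeled_csv(rows):
    """Serialize (content, labels) pairs as CSV text, one document per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for content, labels in rows:
        # Assumed layout: content first, then each label in its own column.
        writer.writerow([content, *labels])
    return buffer.getvalue()


# Hypothetical sample rows: one file reference and one inline text item.
rows = [
    ("gs://YOUR_BUCKET_ID/docs/invoice1.txt", ["billing"]),
    ("The delivery arrived late, and the box was damaged.", ["shipping", "complaint"]),
]
print(build_labeled_csv(rows))
```

The csv module quotes the inline text automatically because it contains a comma, which keeps the text and its labels in separate columns.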

Labeling for classification and sentiment analysis

To label documents in the AutoML Natural Language UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual documents in the dataset along with their current labels. The navigation bar along the left summarizes the number of labeled and unlabeled documents and enables you to filter the document list by label or sentiment value.

Text items page

To assign labels or sentiment values to unlabeled documents, or to change existing labels, select the documents you want to update and the label(s) or value you want to assign to them. There are two ways to update a document's label:

  • Click the check box next to the documents you want to update, then select the label(s) to apply from the Label drop-down list that appears at the top of the document list.

  • Click the row of the document you want to update, then select the label(s) or value to apply from the list that appears on the Text detail page.

Identifying entities for entity extraction

Before training your custom model, you need to annotate the training documents in the dataset. You can annotate training documents before importing them, or you can add annotations in the AutoML Natural Language UI.

To annotate in the AutoML Natural Language UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual documents in the dataset along with any annotations in them. The navigation bar along the left summarizes the labels and the number of times each label appears. You can also filter the document list by label.

Annotation list

To add or delete annotations within a document, double-click the document you want to update. The Edit page shows the complete text of the selected document, with all previous annotations highlighted.

Entity editor

For PDF training documents or documents imported with layout information, the Edit page has two tabs: Plain text and Structured text. The Plain text tab shows the raw contents of the training document without any formatting. The Structured text tab recreates the basic layout of the training document. (The Plain text tab also has a link to the original PDF file.)

Structured text editor

To add a new annotation, highlight the text that represents the entity, select the label from the Annotate dialog box, and click Save. When you add annotations on the Structured text tab, AutoML Natural Language captures the annotation's position on the page as a factor considered during training.

Add annotation

To remove an annotation, locate the text in the list of labels on the right and click the garbage can icon next to it.