A dataset contains representative samples of the type of content you want to translate, as matching sentence pairs in the source and target languages. The dataset serves as the input for training a model.
The main steps for building a dataset are:
- Create a dataset and identify the source and target languages.
- Import sentence pairs into the dataset.
A project can have multiple datasets, each used to train a separate model. You can list the available datasets and delete datasets that you no longer need.
Creating a dataset
The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. When you create a dataset, you identify the source and target languages for the model. For more information about the supported languages and variants, see Language support for custom models.
Web UI
The AutoML Translation UI enables you to create a new dataset and import items into it from the same page.
Visit the AutoML Translation UI.
Select the project for which you enabled AutoML Translation from the drop-down list in the upper right of the title bar.
On the Datasets tab, click Create Dataset.
In the Create dataset dialog, do the following:
- Enter a name for the dataset.
- Select the source and target languages from the drop-down lists. When you select a Translate from language, the available Translate to languages appear.
Click Create. The Import tab opens up.
REST
Send the create dataset request
The following shows how to send a POST request to the projects.locations.datasets.create method. The example uses the access token for a service account set up for the project using the Google Cloud CLI.
Before using any of the request data, make the following replacements:
- project-id: your Google Cloud Platform project ID
- dataset-name: the name of your new dataset
- source-language-code: the language you want to translate from, as an ISO 639-1 code such as 'en'
- target-language-code: the language you want to translate to, as an ISO 639-1 code such as 'es'
HTTP method and URL:
POST https://automl.googleapis.com/v1/projects/project-id/locations/us-central1/datasets
Request JSON body:
{ "displayName": "dataset-name", "translationDatasetMetadata": { "sourceLanguageCode": "source-language-code", "targetLanguageCode": "target-language-code" } }
You should receive a JSON response similar to the following:
{ "name": "projects/project-number/locations/us-central1/operations/operation-id", "metadata": { "@type": "type.googleapis.com/google.cloud.automl.v1.OperationMetadata", "createTime": "2019-10-01T22:13:48.155710Z", "updateTime": "2019-10-01T22:13:48.155710Z", "createDatasetDetails": {} } }
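As a minimal sketch of the request above, the following Python builds the POST request with the standard library and leaves sending it to the caller. The function name, the placeholder project ID and token, and the use of a Bearer Authorization header are assumptions made for illustration; they are not taken from this page.

```python
import json
import urllib.request

AUTOML_ENDPOINT = "https://automl.googleapis.com/v1"


def build_create_dataset_request(project_id, display_name, source_lang, target_lang, access_token):
    """Build (but do not send) the POST request that creates an empty dataset."""
    url = f"{AUTOML_ENDPOINT}/projects/{project_id}/locations/us-central1/datasets"
    body = {
        "displayName": display_name,
        "translationDatasetMetadata": {
            "sourceLanguageCode": source_lang,
            "targetLanguageCode": target_lang,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the access token is passed as a Bearer token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Placeholder values; replace them with your own project, name, and token.
request = build_create_dataset_request("my-project", "my_dataset", "en", "es", "placeholder-token")
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request; the sketch stops short of that so it can be inspected without credentials.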
Get the results
To get the results of your request, you must send a GET request to the operations resource. The following shows how to send such a request.
Before using any of the request data, make the following replacements:
- operation-name: the name of the operation as returned in the response to the original call to the API
- project-id: your Google Cloud Platform project ID
HTTP method and URL:
GET https://automl.googleapis.com/v1/operation-name
You should receive a JSON response similar to the following:
{ "metadata": { "@type": "type.googleapis.com/google.cloud.automl.v1.OperationMetadata", "createTime": "2019-10-01T22:13:48.155710Z", "updateTime": "2019-10-01T22:13:52.321072Z", ... }, "done": true, "response": { "@type": "resource-type", "name": "resource-name" } }
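Because the operation may not be finished on the first GET, a client typically repeats the request until the response contains "done": true. The sketch below shows one way to do that. The fetch_operation callable is hypothetical (it stands for whatever code sends the GET request and decodes the JSON), and the check for an "error" field assumes the usual long-running operation layout rather than anything shown on this page.

```python
import time


def wait_for_operation(fetch_operation, operation_name, poll_seconds=5.0, max_polls=60):
    """Poll an operation until it reports "done": true, then return its JSON.

    fetch_operation is a caller-supplied (hypothetical) function that sends
    GET https://automl.googleapis.com/v1/<operation_name> and returns the
    decoded JSON response as a dict.
    """
    for _ in range(max_polls):
        operation = fetch_operation(operation_name)
        if operation.get("done"):
            # Assumption: a failed operation carries an "error" field.
            if "error" in operation:
                raise RuntimeError(f"Operation failed: {operation['error']}")
            return operation
        time.sleep(poll_seconds)
    raise TimeoutError(f"{operation_name} did not finish after {max_polls} polls")


# Stand-in for a real HTTP call: reports "not done" once, then done.
responses = iter([
    {"name": "operations/op-1"},
    {"name": "operations/op-1", "done": True, "response": {"name": "resource-name"}},
])
result = wait_for_operation(lambda name: next(responses), "operations/op-1", poll_seconds=0)
print(result["response"]["name"])
```

Injecting the fetch function keeps the polling logic independent of how the request is authenticated and sent.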
Go
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Go API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Java API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Node.js API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Python API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for Ruby.
Importing items into a dataset
After you have created a dataset, you can import training sentence pairs into it. For details on preparing your training data, see Preparing training data.
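For illustration, the sketch below writes sentence pairs to a TSV stream. It assumes the common layout of one pair per line, with the source sentence and the target sentence separated by a single tab; check Preparing training data for the authoritative format. The function name and sample sentences are hypothetical.

```python
import io


def write_tsv(pairs, stream):
    """Write (source, target) sentence pairs as tab-separated lines.

    Assumed layout: one pair per line, source and target separated by a tab.
    Returns the number of pairs written.
    """
    count = 0
    for source, target in pairs:
        for text in (source, target):
            # A tab or newline inside a sentence would corrupt the layout.
            if "\t" in text or "\n" in text:
                raise ValueError(f"Sentence contains a tab or newline: {text!r}")
        stream.write(f"{source}\t{target}\n")
        count += 1
    return count


buffer = io.StringIO()
written = write_tsv([("Hello", "Hola"), ("Thank you", "Gracias")], buffer)
print(written)
```

In practice the stream would be a file opened with encoding="utf-8" rather than an in-memory buffer.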
Web UI
The AutoML Translation UI enables you to create a new dataset and import items into it from the same page (see: Creating a dataset). The steps below import items into an existing dataset.
After you create the dataset, upload the sentence pairs to use for training the model.
On the Import tab, you can upload TSV or TMX files from your local computer or from Cloud Storage. For locally imported files, after selecting your file, click Browse. A list of folders appears. Select the folder to upload your file to. This Cloud Storage directory is required to guarantee data residency.
Select the checkbox for Use separate files for training, validation, and testing (advanced), if you want to upload separate files containing the sentence pairs. This option is recommended if your dataset has more than 100,000 sentence pairs. You must allocate 10,000 sentence pairs at most for validation and test sets; otherwise, AutoML Translation returns an error.
Click Continue.
You're returned to the Datasets page. Your dataset shows an in-progress animation while your documents are being imported. When your dataset is successfully uploaded, you receive a message at the email address that you used to sign up for the program.
Review the dataset.
After your data has been successfully imported, select the dataset from the Datasets tab to see the dataset details. The Sentence tab is enabled, and shows the name of the dataset. The sentence pairs are listed. Each pair is assigned "training," "validation" or "testing," indicating at which stage of processing the pair will be used.
REST
Use the
projects.locations.datasets.importData
method to import items into a dataset.
Before using any of the request data, make the following replacements:
- dataset-name: the name of your dataset, as returned by the API when you created the dataset
- bucket-name: the Cloud Storage bucket that contains the input CSV that describes your dataset
- csv-file-name: the name of the input CSV file that describes your dataset
- project-id: your Google Cloud Platform project ID
HTTP method and URL:
POST https://automl.googleapis.com/v1/dataset-name:importData
Request JSON body:
{ "inputConfig": { "gcsSource": { "inputUris": "gs://bucket-name/csv-file-name" } } }
You should receive a JSON response similar to the following:
{ "name": "projects/project-number/locations/us-central1/operations/operation-id", "metadata": { "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata", "createTime": "2018-04-27T01:28:36.128120Z", "updateTime": "2018-04-27T01:28:36.128150Z", "cancellable": true } }
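The import request above could be assembled as in the following sketch, which mirrors the URL and JSON body shown on this page. The function name, the dataset ID TRL123, the bucket and file names, and the Bearer Authorization header are illustrative assumptions.

```python
import json
import urllib.request


def build_import_request(dataset_name, bucket_name, csv_file_name, access_token):
    """Build (but do not send) the importData POST request for a dataset."""
    body = {
        "inputConfig": {
            # Same shape as the request JSON body shown above.
            "gcsSource": {"inputUris": f"gs://{bucket_name}/{csv_file_name}"}
        }
    }
    return urllib.request.Request(
        f"https://automl.googleapis.com/v1/{dataset_name}:importData",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the access token is passed as a Bearer token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Hypothetical dataset name, bucket, and file; replace them with your own.
import_request = build_import_request(
    "projects/my-project/locations/us-central1/datasets/TRL123",
    "my-bucket",
    "pairs.csv",
    "placeholder-token",
)
print(import_request.full_url)
```

The response is an operation, so its name can be polled in the same way as for the create dataset request.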
Go
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Go API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Java API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Node.js API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Python API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for Ruby.
Once you have created and populated the dataset, you are ready to train the model (see: Creating and managing models).
Managing datasets
Listing datasets
A project can include numerous datasets. This section describes how to retrieve a list of the available datasets for a project.
Web UI
To see a list of the available datasets using the AutoML Translation UI, click the Datasets link at the top of the left navigation menu.
To see the datasets for a different project, select the project from the drop-down list in the upper right of the title bar.
REST
Before using any of the request data, make the following replacements:
- project-id: your Google Cloud Platform project ID
HTTP method and URL:
GET https://automl.googleapis.com/v1/projects/project-id/locations/us-central1/datasets
You should receive a JSON response similar to the following:
{ "datasets": [ { "name": "projects/project-number/locations/us-central1/datasets/dataset-id", "displayName": "dataset-display-name", "createTime": "2019-10-01T22:47:38.347689Z", "etag": "AB3BwFpPWn6klFqJ867nz98aXr_JHcfYFQBMYTf7rcO-JMi8Ez4iDSNrRW4Vv501i488", "translationDatasetMetadata": { "sourceLanguageCode": "source-language", "targetLanguageCode": "target-language" } }, ... ] }
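A response of this shape is straightforward to summarize once it has been decoded. The sketch below, with a hypothetical function name and made-up sample values, extracts the dataset ID, display name, and language pair from each entry.

```python
import json


def summarize_datasets(list_response):
    """Return (dataset_id, display_name, source, target) for each dataset."""
    rows = []
    for dataset in list_response.get("datasets", []):
        metadata = dataset.get("translationDatasetMetadata", {})
        rows.append((
            # The dataset ID is the last segment of the full resource name.
            dataset["name"].rsplit("/", 1)[-1],
            dataset.get("displayName", ""),
            metadata.get("sourceLanguageCode", ""),
            metadata.get("targetLanguageCode", ""),
        ))
    return rows


# Made-up sample in the same shape as the response above.
sample = json.loads("""
{
  "datasets": [
    {
      "name": "projects/123/locations/us-central1/datasets/TRL123",
      "displayName": "my_dataset",
      "translationDatasetMetadata": {
        "sourceLanguageCode": "en",
        "targetLanguageCode": "es"
      }
    }
  ]
}
""")
summary = summarize_datasets(sample)
print(summary)
```

An empty project yields an empty list, because a missing "datasets" key is treated as no datasets.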
Go
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Go API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Java API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Node.js API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Python API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for Ruby.
Deleting a dataset
Web UI
In the AutoML Translation UI, click the Datasets link at the top of the left navigation menu to display the list of available datasets.
Click the three-dot menu at the far right of the row you want to delete and select Delete.
Click Confirm in the confirmation dialog box.
REST
Before using any of the request data, make the following replacements:
- dataset-name: the full name of the dataset that you want to delete, from the response when you created the dataset. The full name has the format projects/project-id/locations/us-central1/datasets/dataset-id
HTTP method and URL:
DELETE https://automl.googleapis.com/v1/dataset-name
You should receive a JSON response similar to the following:
{ "name": "projects/project-number/locations/us-central1/operations/operation-id", "metadata": { "@type": "type.googleapis.com/google.cloud.automl.v1.OperationMetadata", "createTime": "2019-10-02T16:43:03.923442Z", "updateTime": "2019-10-02T16:43:03.923442Z", "deleteDetails": {} }, "done": true, "response": { "@type": "type.googleapis.com/google.protobuf.Empty" } }
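Since deletion depends on passing the full resource name, it can help to validate the name before building the request. The sketch below checks the name against the full-name format and then builds the DELETE request without sending it. The regular expression, function name, dataset ID, and Bearer Authorization header are assumptions for illustration.

```python
import re
import urllib.request

# Full resource name: projects/<project-id>/locations/us-central1/datasets/<dataset-id>
DATASET_NAME_PATTERN = re.compile(r"^projects/[^/]+/locations/us-central1/datasets/[^/]+$")


def build_delete_request(dataset_name, access_token):
    """Validate the dataset name, then build (but do not send) the DELETE request."""
    if not DATASET_NAME_PATTERN.match(dataset_name):
        raise ValueError(f"Not a full dataset name: {dataset_name!r}")
    return urllib.request.Request(
        f"https://automl.googleapis.com/v1/{dataset_name}",
        # Assumption: the access token is passed as a Bearer token.
        headers={"Authorization": f"Bearer {access_token}"},
        method="DELETE",
    )


# Hypothetical dataset name; replace it with your own.
delete_request = build_delete_request(
    "projects/my-project/locations/us-central1/datasets/TRL123", "placeholder-token"
)
print(delete_request.get_method(), delete_request.full_url)
```

A name that starts with project/ instead of projects/, or that omits the dataset ID, raises ValueError before any request is built.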
Go
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Go API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Java API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Node.js API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for AutoML Translation, see AutoML Translation client libraries. For more information, see the AutoML Translation Python API reference documentation.
To authenticate to AutoML Translation, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the AutoML Translation reference documentation for Ruby.
Import issues
When you create a dataset, AutoML Translation might drop sentence pairs if they are too long or if the pairs are exactly the same in the source and target languages.
For sentence pairs that are too long, we recommend that you break up sentences to roughly 200 words or fewer, and then recreate the dataset to include the dropped pairs. While processing your data, AutoML Translation uses an internal process to tokenize your input data, which can increase the size of your sentences. This tokenized data is what AutoML Translation uses to measure data size. Therefore, the 200-word limit is an estimate for the maximum length.
For sentence pairs that are the same in the source and target languages, you can remove them from your dataset. If you want to keep these sentences untranslated, use a glossary resource to build a custom dictionary that defines how AutoML Translation handles specific terms.
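Both conditions can be checked before import. The following sketch separates pairs that are likely to be dropped from those that are not. It counts whitespace-separated words, which only approximates the internal tokenization described above, so treat the 200-word threshold as a rough guide. The function name and sample pairs are hypothetical.

```python
def split_importable_pairs(pairs, max_words=200):
    """Separate pairs likely to import from pairs likely to be dropped.

    A pair is flagged when either side exceeds max_words whitespace-separated
    words, or when the source and target text are identical.
    """
    kept, flagged = [], []
    for source, target in pairs:
        too_long = max(len(source.split()), len(target.split())) > max_words
        identical = source.strip() == target.strip()
        if too_long or identical:
            flagged.append((source, target))
        else:
            kept.append((source, target))
    return kept, flagged


pairs = [
    ("Hello, world.", "Hola, mundo."),
    ("Google Cloud", "Google Cloud"),    # identical on both sides
    ("word " * 250, "palabra " * 250),   # 250 words per side
]
kept, flagged = split_importable_pairs(pairs)
print(len(kept), len(flagged))  # prints "1 2"
```

Flagged pairs that are too long can be split into shorter sentences and added back, while identical pairs are candidates for removal or for a glossary entry.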