Create and manage datasets
A dataset contains representative samples of the type of content that you want to translate, as matching segment pairs in the source and target languages. The dataset serves as the input for training a model.
A project can have multiple datasets; each one can be used to train a separate model.
Create a dataset
Create a dataset to contain the training data for your model. When you create a dataset, you specify the source and target languages of your training data. For more information about the supported languages and variants, see Language support for custom models.
Web UI
The AutoML Translation console lets you create a new dataset and import items into it.
Go to the AutoML Translation console.
In the navigation pane, click Datasets.
On the Datasets page, click Create dataset.
In the Create dataset dialog, specify details about the dataset:
- Enter a name for the dataset.
- Select the source and target languages from the drop-down lists.
- Click Create.
REST
The following example shows how to send a POST request to the projects.locations.datasets.create method.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region where the dataset will be located, such as us-central1.
- DATASET_NAME: A name for the dataset.
- SOURCE_LANG_CODE: The language code that specifies the dataset's source language.
- TARGET_LANG_CODE: The language code that specifies the dataset's target language.
HTTP method and URL:
POST https://translation.googleapis.com/v3/projects/PROJECT_ID/locations/LOCATION/datasets
Request JSON body:
{
  "display_name": "DATASET_NAME",
  "source_language_code": "SOURCE_LANG_CODE",
  "target_language_code": "TARGET_LANG_CODE"
}
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID" }
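As a rough illustration, the request above could be assembled in Python with only the standard library. The helper name build_create_dataset_request and the access_token parameter are assumptions made for this sketch (the token might come from a command such as gcloud auth print-access-token); the URL and body fields are the ones shown above. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

API_ROOT = "https://translation.googleapis.com/v3"


def build_create_dataset_request(project_id, location, dataset_name,
                                 source_lang, target_lang, access_token):
    """Build (but do not send) the POST request that creates a dataset.

    Hypothetical helper: the name and signature are illustrative only.
    """
    url = f"{API_ROOT}/projects/{project_id}/locations/{location}/datasets"
    body = {
        "display_name": dataset_name,
        "source_language_code": source_lang,
        "target_language_code": target_lang,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # Assumption: the caller supplies an OAuth 2.0 access token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


# Placeholder values; replace them with your own project and languages.
request = build_create_dataset_request(
    "my-project", "us-central1", "my_dataset", "en", "es", "ACCESS_TOKEN")
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before any call is made.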
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the Cloud Translation reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Cloud Translation reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Cloud Translation reference documentation for Ruby.
Import segments into a dataset
After you have created a dataset, you can import segment pairs into the dataset. For details on preparing your source data, see Preparing training data.
For each file, the Google Cloud console lets you tag imported segment pairs with one or more key-value pairs. Tagging makes it easier to find and filter segments by source. For example, a key-value pair could be Domain:cosmetics or Year:2020.
You can add tags when you import segments through the Google Cloud console; tagging isn't supported by the API. Also, you can't modify tags or add tags to segments that have already been imported.
Web UI
The following steps import items into an existing dataset.
Go to the AutoML Translation console.
In the navigation pane, click Datasets.
From the dataset list, click the name of the dataset that you want to add training data to.
Go to the Import tab.
Add files to import segment pairs for model training.
Upload files from your local computer to a Cloud Storage bucket or select existing files from Cloud Storage.
By default, Cloud Translation automatically splits your data into training, validation, and test sets. If you want to upload separate files for each split, select Use separate files for training, validation, and testing (advanced). Use this option if your dataset has more than 100,000 segment pairs to avoid exceeding the maximum 10,000 segment pair limit for the validation and test sets.
To add tags to segment pairs, expand Tags (optional).
From the list of files, click Edit to add one or more tags to all segment pairs for a given file.
In the Tags pane, click Add tag.
Enter a key and value. You'll be able to filter segments by this key-value pair.
To add more tags, click Add tag.
Click Continue when you're done adding tags.
Click Continue to import segment pairs.
After the import is complete, you can view the imported segment pairs in the Sentences tab of your dataset. You can filter segments by split (training, validation, or testing) and by one or more tags.
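If you prepare separate files yourself, the split might be computed along these lines. This is only a sketch: the function name split_pairs, the shuffling, and the 10% share per held-out set are assumptions for illustration. The only figure taken from the steps above is the 10,000 segment pair cap on the validation and test sets.

```python
import random

MAX_HELD_OUT = 10_000  # cap on the validation and test sets, per the steps above


def split_pairs(pairs, held_out_fraction=0.1, seed=0):
    """Split (source, target) pairs into TRAIN, VALIDATION, and TEST lists.

    Hypothetical helper; held_out_fraction is an illustrative choice, not a
    value documented for AutoML Translation.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    held_out = min(int(len(shuffled) * held_out_fraction), MAX_HELD_OUT)
    return {
        "VALIDATION": shuffled[:held_out],
        "TEST": shuffled[held_out:2 * held_out],
        "TRAIN": shuffled[2 * held_out:],
    }


# Synthetic data, large enough that the cap applies.
pairs = [(f"source {i}", f"target {i}") for i in range(150_000)]
splits = split_pairs(pairs)
print({name: len(items) for name, items in splits.items()})
```

Each list can then be written to its own file and imported with the matching split.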
REST
Use the projects.locations.datasets.importData method to import items into a dataset.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region where the dataset is located, such as us-central1.
- DATASET_ID: The ID of the dataset to add data to.
- FILE_DISPLAY_NAME: The name of the file that contains data to import.
- USAGE: Specifies the data split for these segment pairs (TRAIN, VALIDATION, or TEST).
- FILE_PATH: The path to the source data file in Cloud Storage.
HTTP method and URL:
POST https://translation.googleapis.com/v3/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:importData
Request JSON body:
{
  "input_config": {
    "input_files": [
      {
        "display_name": "FILE_DISPLAY_NAME",
        "usage": "USAGE",
        "gcs_source": {
          "input_uris": "gs://FILE_PATH"
        }
      },
      ...
    ]
  }
}
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID" }
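When several files are imported at once, the request body could be generated rather than written by hand. The following is a minimal sketch, assuming a hypothetical helper named build_import_body and placeholder bucket paths; the field names mirror the request body shown above, and the usage values are the three listed there.

```python
import json

VALID_USAGES = {"TRAIN", "VALIDATION", "TEST"}


def build_import_body(files):
    """Build an importData request body from (display_name, usage, path) tuples.

    Hypothetical helper; path is the Cloud Storage path without the gs:// prefix.
    """
    input_files = []
    for display_name, usage, path in files:
        if usage not in VALID_USAGES:
            raise ValueError(
                f"usage must be one of {sorted(VALID_USAGES)}, got {usage!r}")
        input_files.append({
            "display_name": display_name,
            "usage": usage,
            "gcs_source": {"input_uris": f"gs://{path}"},
        })
    return {"input_config": {"input_files": input_files}}


# Placeholder file names and bucket paths.
body = build_import_body([
    ("train.tsv", "TRAIN", "my-bucket/data/train.tsv"),
    ("validation.tsv", "VALIDATION", "my-bucket/data/validation.tsv"),
])
print(json.dumps(body, indent=2))
```

Validating the usage value locally catches a mistyped split name before the request is sent.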
After you have created and populated the dataset, you can train a model. For more information, see Creating and managing models.
Import issues
When you create a dataset, AutoML Translation might drop segment pairs if they are too long, if segments in the source and target languages are identical (untranslated), or if there are duplicates (multiple segments with the same source language text).
For segment pairs that are too long, we recommend that you break up segments to roughly 200 words or less, and then recreate the dataset. The 200 word limit is an estimate for the maximum length. While processing your data, AutoML Translation uses an internal process to tokenize your input data, which can increase the size of your segments. This tokenized data is what AutoML Translation uses to measure data size.
For segment pairs that are identical, remove them from your dataset. If you want to prevent some segments from being translated, use a glossary resource to build a custom dictionary instead.
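Some of these issues can be caught before import. The sketch below filters a list of pairs using the three conditions described above. It is an approximation under stated assumptions: the function name clean_pairs is hypothetical, words are counted by splitting on whitespace (AutoML Translation measures size with its own tokenization), and identical or duplicate segments are detected by exact string comparison.

```python
MAX_WORDS = 200  # rough guideline from the text above, not an exact limit


def clean_pairs(pairs, max_words=MAX_WORDS):
    """Drop pairs that are too long, untranslated, or duplicated by source text."""
    seen_sources = set()
    kept = []
    for source, target in pairs:
        if len(source.split()) > max_words or len(target.split()) > max_words:
            continue  # too long; consider breaking it into shorter segments
        if source == target:
            continue  # identical source and target (untranslated)
        if source in seen_sources:
            continue  # another pair already uses this source text
        seen_sources.add(source)
        kept.append((source, target))
    return kept


pairs = [
    ("Hello", "Hola"),
    ("Hello", "Buenos días"),    # duplicate source text
    ("AutoML", "AutoML"),        # identical pair
    ("word " * 201, "palabra"),  # source longer than 200 words
    ("Goodbye", "Adiós"),
]
print(clean_pairs(pairs))
```

Because the real limit is measured on tokenized data, a lower max_words value leaves more headroom.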
Export data
You can export segment pairs from existing datasets to a Cloud Storage bucket.
Web UI
Go to the AutoML Translation console.
In the navigation pane, click Datasets to view a list of your datasets.
Click the name of the dataset for which you want to export data.
On the dataset details page, click Export data.
Select a Cloud Storage destination where the exported TSV files are saved.
Click Export.
AutoML Translation outputs TSV files that are named according to their data split (train, validation, and test).
REST
Use the projects.locations.datasets.exportData method to export data to Cloud Storage as TSV files.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region where the dataset to export is located, such as us-central1.
- DATASET_ID: The ID of the dataset to export.
- DESTINATION_DIRECTORY: The Cloud Storage path where the output is sent.
HTTP method and URL:
POST https://translation.googleapis.com/v3/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:exportData
Request JSON body:
{
  "output_config": {
    "gcs_destination": {
      "output_uri_prefix": "gs://DESTINATION_DIRECTORY"
    }
  }
}
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID" }
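For scripted exports, the URL and body above might be produced by a small helper such as the one sketched here. The name export_request and the handling of a missing gs:// prefix are assumptions for illustration; the endpoint and field names are those shown above.

```python
API_ROOT = "https://translation.googleapis.com/v3"


def export_request(project_id, location, dataset_id, destination):
    """Return the (url, body) pair for an exportData call.

    Hypothetical helper; destination may be given with or without gs://.
    """
    if not destination.startswith("gs://"):
        destination = f"gs://{destination}"
    url = (f"{API_ROOT}/projects/{project_id}/locations/{location}"
           f"/datasets/{dataset_id}:exportData")
    body = {
        "output_config": {
            "gcs_destination": {"output_uri_prefix": destination}
        }
    }
    return url, body


# Placeholder project, dataset, and bucket values.
url, body = export_request(
    "my-project", "us-central1", "DATASET_ID", "my-bucket/exports/")
print(url)
print(body)
```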
List datasets
List the available datasets in your project.
Web UI
To see a list of the available datasets by using the AutoML Translation console, click Datasets from the navigation pane.
To see the datasets for a different project, select the project from the drop-down list in the upper right of the title bar.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region where the datasets to list are located, such as us-central1.
HTTP method and URL:
GET https://translation.googleapis.com/v3/projects/PROJECT_ID/locations/LOCATION/datasets
You should receive a JSON response similar to the following:
{
  "datasets": [
    {
      "name": "projects/PROJECT_NUMBER/locations/us-central1/datasets/DATASET_ID",
      "displayName": "DATASET_NAME",
      "sourceLanguageCode": "SOURCE_LANG_CODE",
      "targetLanguageCode": "TARGET_LANG_CODE",
      "exampleCount": 8720,
      "createTime": "2022-10-19T23:24:34.734549Z",
      "updateTime": "2022-10-19T23:24:35.357525Z"
    },
    ...
  ]
}
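A response of this shape can be reduced to a short summary per dataset. The sketch below assumes a hypothetical helper named summarize_datasets and uses made-up sample values; only the field names are taken from the response above.

```python
import json

# Made-up sample values in the same shape as the response above.
response_text = """
{
  "datasets": [
    {
      "name": "projects/123/locations/us-central1/datasets/abc",
      "displayName": "my_dataset",
      "sourceLanguageCode": "en",
      "targetLanguageCode": "es",
      "exampleCount": 8720
    }
  ]
}
"""


def summarize_datasets(response):
    """Return (dataset_id, display_name, language_pair, example_count) tuples."""
    rows = []
    for dataset in response.get("datasets", []):
        dataset_id = dataset["name"].rsplit("/", 1)[-1]  # last path component
        language_pair = (f'{dataset["sourceLanguageCode"]}-'
                         f'{dataset["targetLanguageCode"]}')
        rows.append((dataset_id, dataset["displayName"], language_pair,
                     dataset.get("exampleCount", 0)))
    return rows


rows = summarize_datasets(json.loads(response_text))
print(rows)
```

The dataset ID extracted this way is the value that the import, export, and delete requests expect as DATASET_ID.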
Delete a dataset
Web UI
In the AutoML Translation console, click Datasets from the navigation pane to display the list of available datasets.
For the dataset to delete, select More > Delete.
Click Confirm in the confirmation dialog box.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region where the dataset to delete is located, such as us-central1.
- DATASET_ID: The ID of the dataset to delete.
HTTP method and URL:
DELETE https://translation.googleapis.com/v3/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.DeleteDatasetMetadata"
  },
  "done": true
}
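A script that deletes datasets will usually want to confirm that the returned operation finished. The following sketch builds the DELETE URL and inspects a response like the one above. Both function names are hypothetical, the sample operation name is made up, and the check for an error field assumes the standard long-running operation shape.

```python
import json

API_ROOT = "https://translation.googleapis.com/v3"


def delete_dataset_url(project_id, location, dataset_id):
    """Return the URL for the DELETE request shown above."""
    return (f"{API_ROOT}/projects/{project_id}/locations/{location}"
            f"/datasets/{dataset_id}")


def operation_succeeded(operation):
    """True when the operation reports done and carries no error field."""
    return operation.get("done") is True and "error" not in operation


# Made-up sample response in the same shape as the one above.
sample = json.loads(
    '{"name": "projects/123/locations/us-central1/operations/op-1",'
    ' "done": true}')
print(delete_dataset_url("my-project", "us-central1", "DATASET_ID"))
print(operation_succeeded(sample))
```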