Method: projects.locations.datasets.importData

Imports data into a dataset.

For more information, see Importing items into a dataset.

HTTP request

POST https://automl.googleapis.com/v1beta1/{name}:importData

Path parameters

Parameters
name

string

Required. Dataset name. Dataset must already exist. All imported annotations and examples will be added.

Authorization requires the following Google IAM permission on the specified resource name:

  • automl.datasets.import

Request body

The request body contains data with the following structure:

JSON representation
{
  "inputConfig": {
    object(InputConfig)
  }
}
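As a sketch, the request URL and body can be assembled as plain data before sending; the project, dataset ID, and bucket below are hypothetical placeholders, not values from this page.

```python
import json

# Hypothetical resource name; substitute your own project, location,
# and dataset ID.
dataset_name = "projects/my-project/locations/us-central1/datasets/TCN1234567890"
url = f"https://automl.googleapis.com/v1beta1/{dataset_name}:importData"

# Request body matching the JSON representation above, using a
# Google Cloud Storage source.
body = {
    "inputConfig": {
        "gcsSource": {"inputUris": ["gs://my-bucket/train.csv"]}
    }
}

payload = json.dumps(body)
print(url)
```

The serialized `payload` would be POSTed to `url` with an OAuth bearer token carrying the cloud-platform scope listed below.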
Fields
inputConfig

object(InputConfig)

Required. The desired input location and its domain specific semantics, if any.

Response body

If successful, the response body contains an instance of Operation.

Authorization Scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

InputConfig

Input configuration for datasets.importData Action.

The input format depends on the dataset_metadata of the Dataset into which the import is happening. Unless specified otherwise, the input source is expected to be gcsSource, and any single input .CSV file must be 100MB or smaller. If an "example" file (that is, an image, video, etc.) with identical content is mentioned multiple times (even under different GCS_FILE_PATHs), its labels, bounding boxes, etc. are appended. The same file should always be provided with the same ML_USE and GCS_FILE_PATH; if it is not, these values are selected nondeterministically from the ones given.

The formats are represented in EBNF with commas being literal and with non-terminal symbols defined near the end of this section. The formats are:

AutoML Natural Language

Entity Extraction

See Preparing your training data for more information.

CSV file(s) with each line in format:

ML_USE,GCS_FILE_PATH

  • ML_USE - Identifies the data set that the current row (file) applies to. This value can be one of the following:

    • TRAIN - Rows in this file are used to train the model.
    • TEST - Rows in this file are used to test the model during training.
    • VALIDATE - Rows in this file are used to validate the model during training.
    • UNASSIGNED - Rows in this file are not categorized. They are automatically divided into train and test data: 80% for training and 20% for testing.
  • GCS_FILE_PATH - Identifies a JSON Lines (.JSONL) file stored in Google Cloud Storage that contains the in-line text or document references for model training.

After the training data set has been determined from the TRAIN and UNASSIGNED CSV files, the training data is divided into train and validation data sets. 70% for training and 30% for validation.

For example:

TRAIN,gs://folder/file1.jsonl
VALIDATE,gs://folder/file2.jsonl
TEST,gs://folder/file3.jsonl
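The CSV above can be assembled and sanity-checked programmatically; this is a minimal sketch using the example placeholder paths, not a client-library call.

```python
# Allowed ML_USE values per the field definitions on this page.
VALID_ML_USE = {"TRAIN", "VALIDATE", "TEST", "UNASSIGNED"}

# The example rows from above.
rows = [
    ("TRAIN", "gs://folder/file1.jsonl"),
    ("VALIDATE", "gs://folder/file2.jsonl"),
    ("TEST", "gs://folder/file3.jsonl"),
]

# Each row must name a valid ML_USE and a Cloud Storage path.
for ml_use, path in rows:
    assert ml_use in VALID_ML_USE
    assert path.startswith("gs://")

csv_content = "\n".join(f"{ml_use},{path}" for ml_use, path in rows)
print(csv_content)
```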

In-line JSONL files

In-line .JSONL files contain, per line, a JSON document that wraps a textSnippet field followed by one or more annotations fields, which have displayName and textExtraction fields to describe the entity from the text snippet. Multiple JSON documents can be separated using line breaks (\n).

The supplied text must be annotated exhaustively. For example, if you include the text "horse", but do not label it as "animal", then "horse" is assumed to not be an "animal".

Any given text snippet content must have 30,000 characters or less, and also be UTF-8 NFC encoded. ASCII is accepted as it is UTF-8 NFC encoded.
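Both limits can be checked before import; a minimal sketch (the snippet is the example text from below, and `unicodedata.is_normalized` requires Python 3.8+):

```python
import unicodedata

# Check a candidate snippet against the stated limits: at most 30,000
# characters, and in Unicode NFC normal form.
snippet = "dog car cat"

assert len(snippet) <= 30000
assert unicodedata.is_normalized("NFC", snippet)
print("snippet ok")
```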

For example:

{
  "textSnippet": {
    "content": "dog car cat"
  },
  "annotations": [
    {
      "displayName": "animal",
      "textExtraction": {
        "textSegment": {"startOffset": 0, "endOffset": 2}
      }
    },
    {
      "displayName": "vehicle",
      "textExtraction": {
        "textSegment": {"startOffset": 4, "endOffset": 6}
      }
    },
    {
      "displayName": "animal",
      "textExtraction": {
        "textSegment": {"startOffset": 8, "endOffset": 10}
      }
    }
  ]
}\n
{
  "textSnippet": {
    "content": "This dog is good."
  },
  "annotations": [
    {
      "displayName": "animal",
      "textExtraction": {
        "textSegment": {"startOffset": 5, "endOffset": 7}
      }
    }
  ]
}
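One way to see the wrapping structure is to build the first example document from plain dictionaries and round-trip it through JSON; a sketch:

```python
import json

# The first in-line JSONL document from the example above, built as a
# plain dict: a textSnippet plus a list of annotations.
doc = {
    "textSnippet": {"content": "dog car cat"},
    "annotations": [
        {"displayName": "animal",
         "textExtraction": {"textSegment": {"startOffset": 0, "endOffset": 2}}},
        {"displayName": "vehicle",
         "textExtraction": {"textSegment": {"startOffset": 4, "endOffset": 6}}},
        {"displayName": "animal",
         "textExtraction": {"textSegment": {"startOffset": 8, "endOffset": 10}}},
    ],
}

line = json.dumps(doc)  # one JSON document per JSONL line
parsed = json.loads(line)
labels = [a["displayName"] for a in parsed["annotations"]]
print(labels)
```

Serializing each document with `json.dumps` guarantees it occupies a single line, so documents can simply be joined with line breaks.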

JSONL files that reference documents

.JSONL files contain, per line, a JSON document that wraps an inputConfig field containing the path to a source PDF document. Multiple JSON documents can be separated using line breaks (\n).

For example:

{
  "document": {
    "inputConfig": {
      "gcsSource": {
        "inputUris": [ "gs://folder/document1.pdf" ]
      }
    }
  }
}\n
{
  "document": {
    "inputConfig": {
      "gcsSource": {
        "inputUris": [ "gs://folder/document2.pdf" ]
      }
    }
  }
}
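A file of such lines can be generated from a list of PDF paths; a sketch using the example placeholder URIs:

```python
import json

# One JSONL line per source PDF, each wrapping a document/inputConfig
# structure as shown above.
pdf_uris = ["gs://folder/document1.pdf", "gs://folder/document2.pdf"]

lines = [
    json.dumps({"document": {"inputConfig": {"gcsSource": {"inputUris": [uri]}}}})
    for uri in pdf_uris
]
jsonl = "\n".join(lines)
print(jsonl)
```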

In-line JSONL files with layout information

Note: You can only annotate PDF files using the UI. The format described below applies to annotated PDF files exported using the UI or exportData.

In-line .JSONL files for PDF documents contain, per line, a JSON document that wraps a document field that provides the textual content of the document and the layout information.

For example:

{
  "document": {
    "documentText": {
      "content": "dog car cat"
    },
    "layout": [
      {
        "textSegment": {
          "startOffset": 0,
          "endOffset": 11
        },
        "pageNumber": 1,
        "boundingPoly": {
          "normalizedVertices": [
            {"x": 0.1, "y": 0.1},
            {"x": 0.1, "y": 0.3},
            {"x": 0.3, "y": 0.3},
            {"x": 0.3, "y": 0.1}
          ]
        },
        "textSegmentType": "TOKEN"
      }
    ],
    "documentDimensions": {
      "width": 8.27,
      "height": 11.69,
      "unit": "INCH"
    },
    "pageCount": 3
  },
  "annotations": [
    {
      "displayName": "animal",
      "textExtraction": {
        "textSegment": {"startOffset": 0, "endOffset": 3}
      }
    },
    {
      "displayName": "vehicle",
      "textExtraction": {
        "textSegment": {"startOffset": 4, "endOffset": 7}
      }
    },
    {
      "displayName": "animal",
      "textExtraction": {
        "textSegment": {"startOffset": 8, "endOffset": 11}
      }
    }
  ]
}

Classification

See Preparing your training data for more information.

CSV file(s) with each line in format:

ML_USE,(TEXT_SNIPPET | GCS_FILE_PATH),LABEL,LABEL,...

TEXT_SNIPPET and GCS_FILE_PATH are distinguished by a pattern. If the column content is a valid Google Cloud Storage file path, that is, prefixed by "gs://", it is treated as a GCS_FILE_PATH; if the content is enclosed in double quotes (""), it is treated as a TEXT_SNIPPET. In the GCS_FILE_PATH case, the path must lead to a .txt file with UTF-8 encoding, for example, "gs://folder/content.txt", and its content is extracted as a text snippet. In the TEXT_SNIPPET case, the column content, excluding the quotes, is treated as the text snippet to import. In both cases, the text snippet/file size must be within 128kB. A maximum of 100 unique labels is allowed per CSV row.

Sample rows:

TRAIN,"They have bad food and very rude",RudeService,BadFood
TRAIN,gs://folder/content.txt,SlowService
TEST,"Typically always bad service there.",RudeService
VALIDATE,"Stomach ache to go.",BadFood
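The "gs://" disambiguation rule above can be sketched with the standard csv module, which also strips the enclosing quotes the way the importer does; the sample rows are the ones shown above:

```python
import csv
import io

# The sample rows from above, as raw CSV text.
sample = '''TRAIN,"They have bad food and very rude",RudeService,BadFood
TRAIN,gs://folder/content.txt,SlowService
TEST,"Typically always bad service there.",RudeService
VALIDATE,"Stomach ache to go.",BadFood'''

kinds = []
for row in csv.reader(io.StringIO(sample)):
    ml_use, content, labels = row[0], row[1], row[2:]
    # Column content prefixed by "gs://" is a GCS_FILE_PATH;
    # anything else is a TEXT_SNIPPET.
    kind = "GCS_FILE_PATH" if content.startswith("gs://") else "TEXT_SNIPPET"
    kinds.append(kind)
    assert len(labels) <= 100  # at most 100 unique labels per row
print(kinds)
```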

Sentiment Analysis

See Preparing your training data for more information.

CSV file(s) with each line in format:

ML_USE,(TEXT_SNIPPET | GCS_FILE_PATH),SENTIMENT

TEXT_SNIPPET and GCS_FILE_PATH are distinguished by a pattern. If the column content is a valid Google Cloud Storage file path, that is, prefixed by "gs://", it is treated as a GCS_FILE_PATH; otherwise it is treated as a TEXT_SNIPPET. In the GCS_FILE_PATH case, the path must lead to a .txt file with UTF-8 encoding, for example, "gs://folder/content.txt", and its content is extracted as a text snippet. In the TEXT_SNIPPET case, the column content itself is treated as the text snippet to import. In both cases, the text snippet must be at most 500 characters long.

Sample rows:

TRAIN,"@freewrytin this is way too good for your product",2
TRAIN,"I need this product so bad",3
TEST,"Thank you for this product.",4
VALIDATE,gs://folder/content.txt,2
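The sentiment rows can be validated against the constraints above; a sketch assuming a hypothetical sentiment_max of 4:

```python
import csv
import io

sentiment_max = 4  # hypothetical; comes from the dataset's metadata

# The sample rows from above, as raw CSV text.
sample = '''TRAIN,"@freewrytin this is way too good for your product",2
TRAIN,"I need this product so bad",3
TEST,"Thank you for this product.",4
VALIDATE,gs://folder/content.txt,2'''

sentiments = []
for ml_use, content, sentiment in csv.reader(io.StringIO(sample)):
    value = int(sentiment)
    assert 0 <= value <= sentiment_max  # SENTIMENT must be in range
    if not content.startswith("gs://"):
        assert len(content) <= 500      # snippet length limit
    sentiments.append(value)
print(sentiments)
```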

Input field definitions:
ML_USE
("TRAIN" | "VALIDATE" | "TEST" | "UNASSIGNED") Describes how the given example (file) should be used for model training. "UNASSIGNED" can be used when the user has no preference.
GCS_FILE_PATH
A path to a file on Google Cloud Storage, for example, "gs://folder/image1.png".
LABEL
A display name of an object in an image, video, etc., for example, "dog". Must be up to 32 characters long and can consist only of ASCII Latin letters A-Z and a-z, underscores (_), and ASCII digits 0-9. For each label, an AnnotationSpec is created whose displayName becomes the label; AnnotationSpecs are returned in predictions.
TEXT_SNIPPET
The content of a text snippet, UTF-8 encoded, enclosed within double quotes ("").
DOCUMENT
A field that provides the textual content of the document and its layout information.
SENTIMENT
An integer between 0 and Dataset.text_sentiment_dataset_metadata.sentiment_max (inclusive). Describes the ordinal of the sentiment: a higher value means a more positive sentiment. All the values are completely relative, that is, 0 need not mean a negative or neutral sentiment and sentimentMax need not mean a positive one; it is only required that 0 is the least positive sentiment in the data and sentimentMax the most positive one. SENTIMENT should not be confused with "score" or "magnitude" from the previous Natural Language Sentiment Analysis API. All SENTIMENT values between 0 and sentimentMax must be represented in the imported data. On prediction, the same 0 to sentimentMax range is used. The difference between neighboring sentiment values need not be uniform; for example, 1 and 2 may be similar, whereas the difference between 2 and 3 may be huge.
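The coverage requirement (every value from 0 to sentimentMax must appear in the imported data) can be checked with a small sketch; sentiment_max and the observed values here are hypothetical:

```python
sentiment_max = 4  # hypothetical; comes from the dataset's metadata
observed = {0, 1, 2, 3, 4}  # sentiment values collected from the CSV rows

# Every value in 0..sentiment_max must appear at least once.
missing = set(range(sentiment_max + 1)) - observed
assert not missing, f"unrepresented sentiment values: {sorted(missing)}"
print("all sentiment values represented")
```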

Errors: If any of the provided CSV files can't be parsed, or if more than a certain percentage of CSV rows cannot be processed, the operation fails and nothing is imported. Regardless of overall success or failure, the per-row failures, up to a certain count cap, are listed in Operation.metadata.partial_failures.

JSON representation
{
  "params": {
    string: string,
    ...
  },

  // Union field source can be only one of the following:
  "gcsSource": {
    object (GcsSource)
  },
  "bigquerySource": {
    object (BigQuerySource)
  }
  // End of list of possible types for union field source.
}
Fields
params

map (key: string, value: string)

Additional domain-specific parameters describing the semantics of the imported data; any string must be up to 25000 characters long.

  • For Tables: schema_inference_version - (integer) Required. The version of the algorithm that should be used for the initial inference of the schema (columns' DataTypes) of the table the data is being imported into. Allowed values: "1".

Union field source. The source of the input. source can be only one of the following:
gcsSource

object (GcsSource)

The Google Cloud Storage location for the input content. In datasets.importData, the gcsSource points to a CSV file with the structure described above.

bigquerySource

object (BigQuerySource)

The BigQuery location for the input content.
