Method: projects.locations.datasets.importData

Imports data into a dataset.

For more information, see Importing items into a dataset.

HTTP request

POST https://automl.googleapis.com/v1beta1/{name}:importData

Path parameters

Parameters
name

string

Required. Dataset name. Dataset must already exist. All imported annotations and examples will be added.

Authorization requires the following Google IAM permission on the specified resource name:

  • automl.datasets.import

Request body

The request body contains data with the following structure:

JSON representation
{
  "inputConfig": {
    object(InputConfig)
  }
}
Fields
inputConfig

object(InputConfig)

Required. The desired input location and its domain-specific semantics, if any.

Response body

If successful, the response body contains an instance of Operation.
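
As a minimal sketch of calling this method over REST with the Python requests library (the project, location, dataset ID, bucket path, and access token below are all placeholders, not values from this page):

import requests

# Placeholder resource name; the dataset must already exist.
name = "projects/my-project/locations/us-central1/datasets/TCN1234567890"
# Placeholder token; one way to obtain one is `gcloud auth application-default print-access-token`.
access_token = "ACCESS_TOKEN"

body = {
    "inputConfig": {
        "gcsSource": {"inputUris": ["gs://my-bucket/training-data/train.csv"]}
    }
}

response = requests.post(
    f"https://automl.googleapis.com/v1beta1/{name}:importData",
    headers={"Authorization": f"Bearer {access_token}"},
    json=body,
)
# On success the response is a long-running Operation that can be polled until done.
print(response.json())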

Authorization Scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

InputConfig

Input configuration for the datasets.importData action.

The format of the input depends on the dataset_metadata of the Dataset into which the import is happening. The gcsSource is expected as the input source unless specified otherwise. If a file with identical content (even if it has a different GCS_FILE_PATH) is mentioned multiple times, then its labels, bounding boxes, etc. are appended. The same file should always be provided with the same ML_USE and GCS_FILE_PATH; if it is not, then these values are selected nondeterministically from the given ones.

The formats are represented in EBNF with commas being literal and with non-terminal symbols defined near the end of this section. The formats are:

AutoML Natural Language Text Classification

See Preparing your training data for more information.

One or more CSV files, with each line in the following format:

ML_USE,GCS_FILE_PATH or TEXT_SNIPPET,LABEL(S)
  • ML_USE - Identifies the data set that the current row (file) applies to. This value can be one of the following:

    • TRAIN - Rows in this file are used to train the model.
    • TEST - Rows in this file are used to test the model during training.
    • UNASSIGNED - Rows in this file are not categorized. They are automatically divided into training, validation, and test data: 80% for training, 10% for validation, and 10% for testing.
  • GCS_FILE_PATH - Identifies a text file stored in Google Cloud Storage that contains content used to train the model. The shortest document is one sentence. An individual document cannot be larger than 128kB.

  • TEXT_SNIPPET - Content used to train the model. The shortest document is one sentence. An individual document cannot be larger than 128kB.

  • LABEL(S) - A comma-separated list of labels that identify how the content is categorized. Labels must start with a letter and only contain letters, numbers, and underscores.

For example:

gs://my-project-lcm/training-data/file1.txt,Sports,Basketball
gs://my-project-lcm/training-data/ubuntu.zip,Computers,Software,Operating_Systems,Linux,Ubuntu
file://news/documents/file2.txt,Sports,Baseball
"Miles Davis was an American jazz trumpeter, bandleader, and composer.",Arts_Entertainment,Music,Jazz
TRAIN,gs://my-project-lcm/training-data/astros.txt,Sports,Baseball
VALIDATE,gs://my-project-lcm/training-data/mariners.txt,Sports,Baseball
TEST,gs://my-project-lcm/training-data/cubs.txt,Sports,Baseball
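
As an illustrative sketch (not part of the AutoML API), the following Python snippet writes rows in this CSV format and checks the label constraint described above; the file name, paths, and labels are placeholders:

import csv
import re

# Labels must start with a letter and contain only letters, numbers, and underscores.
LABEL_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

rows = [
    ("TRAIN", "gs://my-project-lcm/training-data/astros.txt", ["Sports", "Baseball"]),
    ("TEST", "gs://my-project-lcm/training-data/cubs.txt", ["Sports", "Baseball"]),
    ("UNASSIGNED", "Miles Davis was an American jazz trumpeter.", ["Arts_Entertainment", "Music", "Jazz"]),
]

with open("import.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for ml_use, content, labels in rows:
        assert all(LABEL_RE.match(label) for label in labels)
        writer.writerow([ml_use, content, *labels])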

AutoML Natural Language Entity Extraction

See Preparing your training data for more information.

One or more CSV files, with each line in the following format:

ML_USE,GCS_FILE_PATH
  • ML_USE - Identifies the data set that the current row (file) applies to. This value can be one of the following:

    • TRAIN - Rows in this file are used to train the model.
    • TEST - Rows in this file are used to test the model during training.
    • UNASSIGNED - Rows in this file are not categorized. They are automatically divided into training and test data: 80% for training and 20% for testing.
  • GCS_FILE_PATH - Identifies a JSON Lines (.JSONL) file stored in Google Cloud Storage that contains in-line text as documents for model training.

After the training data set has been determined from the TRAIN and UNASSIGNED CSV files, the training data is divided into training and validation sets: 70% for training and 30% for validation.

For example:

TRAIN,gs://folder/file1.jsonl
VALIDATE,gs://folder/file2.jsonl
TEST,gs://folder/file3.jsonl
In-line JSONL files

In-line .JSONL files contain, per line, a JSON document that wraps a textSnippet field followed by an annotations field, which holds one or more annotations with displayName and textExtraction fields that describe an entity in the text snippet. Multiple JSON documents are separated by line breaks (\n).

The supplied text must be annotated exhaustively. For example, if you include the text "horse", but do not label it as "animal", then "horse" is assumed to not be an "animal".

Any given text snippet content must have 30,000 characters or less, and also be UTF-8 NFC encoded. ASCII is accepted as it is UTF-8 NFC encoded.

For example:

{
  "textSnippet": {
    "content": "dog car cat"
  },
  "annotations": [
     {
       "displayName": "animal",
       "textExtraction": {
         "textSegment": {"startOffset": 0, "endOffset": 2}
       }
     },
     {
       "displayName": "vehicle",
       "textExtraction": {
         "textSegment": {"startOffset": 4, "endOffset": 6}
       }
     },
     {
       "displayName": "animal",
       "textExtraction": {
         "textSegment": {"startOffset": 8, "endOffset": 10}
       }
     }
  ]
}\n
{
   "textSnippet": {
     "content": "This dog is good."
   },
   "annotations": [
      {
        "displayName": "animal",
        "textExtraction": {
          "textSegment": {"startOffset": 5, "endOffset": 7}
        }
      }
   ]
}
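
As a minimal sketch, the same two documents could be generated with the Python snippet below; it follows the offset convention shown in the example above (endOffset points at the last character of the span) and annotates only the first occurrence of each surface form:

import json

def jsonl_line(text, entities):
    # entities: list of (display_name, surface_form) pairs assumed to occur in text
    annotations = []
    for display_name, surface in entities:
        start = text.index(surface)  # first occurrence only, for illustration
        annotations.append({
            "displayName": display_name,
            "textExtraction": {
                "textSegment": {"startOffset": start, "endOffset": start + len(surface) - 1}
            },
        })
    return json.dumps({"textSnippet": {"content": text}, "annotations": annotations})

with open("entities.jsonl", "w", encoding="utf-8") as f:
    f.write(jsonl_line("dog car cat", [("animal", "dog"), ("vehicle", "car"), ("animal", "cat")]) + "\n")
    f.write(jsonl_line("This dog is good.", [("animal", "dog")]) + "\n")
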
JSONL files that reference documents

.JSONL files contain, per line, a JSON document that wraps a document field whose inputConfig contains the path to a source document. Multiple JSON documents are separated by line breaks (\n).

For example:

{
  "document": {
    "inputConfig": {
      "gcsSource": { "inputUris": [ "gs://folder/document1.pdf" ]
      }
    }
  }
}\n
{
  "document": {
    "inputConfig": {
      "gcsSource": { "inputUris": [ "gs://folder/document2.pdf" ]
      }
    }
  }
}
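
As an illustrative sketch, the same file could be generated from a list of Cloud Storage URIs (the bucket and file names are placeholders taken from the example above):

import json

uris = ["gs://folder/document1.pdf", "gs://folder/document2.pdf"]

with open("documents.jsonl", "w", encoding="utf-8") as f:
    for uri in uris:
        line = {"document": {"inputConfig": {"gcsSource": {"inputUris": [uri]}}}}
        f.write(json.dumps(line) + "\n")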

AutoML Natural Language Text Sentiment

See Preparing your training data for more information.

One or more CSV files, with each line in the following format:

ML_USE,GCS_FILE_PATH or TEXT_SNIPPET,SENTIMENT
  • ML_USE - Identifies the data set that the current row (file) applies to. This value can be one of the following:

    • TRAIN - Rows in this file are used to train the model.
    • TEST - Rows in this file are used to test the model during training.
    • UNASSIGNED - Rows in this file are not categorized. They are automatically divided into training and test data: 80% for training and 20% for testing.
  • GCS_FILE_PATH - Identifies a text file stored in Google Cloud Storage that contains content used to train the model. The content must be no longer than 500 characters.

  • TEXT_SNIPPET - Content used to train the model. The content must be no longer than 500 characters.

  • SENTIMENT - An integer indicating the sentiment value for the content. The sentiment value ranges from 0 (strongly negative) to a maximum value of 10 (strongly positive).

After the training data set has been determined from the TRAIN and UNASSIGNED CSV files, the training data is divided into training and validation sets: 70% for training and 30% for validation.

For example:

TRAIN,"@sampleid I haven't seen good results from Claritin",2
TRAIN,"I need Claritin so bad",3
TEST,"Thankful for Claritin.",4
VALIDATE,gs://folder/content.txt,2
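
As an illustrative sketch (not part of the AutoML API), the following Python snippet writes sentiment rows in this format and enforces the 500-character and 0-10 sentiment constraints described above; the file name and rows are placeholders:

import csv

rows = [
    ("TRAIN", "@sampleid I haven't seen good results from Claritin", 2),
    ("TRAIN", "I need Claritin so bad", 3),
    ("TEST", "Thankful for Claritin.", 4),
]

with open("sentiment.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for ml_use, snippet, sentiment in rows:
        # Each TEXT_SNIPPET must be at most 500 characters; SENTIMENT is an integer from 0 to 10.
        assert len(snippet) <= 500 and 0 <= sentiment <= 10
        writer.writerow([ml_use, snippet, sentiment])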

Errors

If any of the provided CSV files can't be parsed, or if more than a certain percentage of CSV rows cannot be processed, then the operation fails and nothing is imported. Regardless of overall success or failure, the per-row failures (up to a certain count cap) are listed in Operation.metadata.partial_failures.

JSON representation
{
  "params": {
    string: string,
    ...
  },
  "gcsSource": {
    object(GcsSource)
  }
}
Fields
params

map (key: string, value: string)

Additional domain-specific parameters describing the semantics of the imported data. Any string must be no longer than 25,000 characters.

gcsSource

object(GcsSource)

The Google Cloud Storage location for the input content.
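
A brief sketch of assembling this structure as a request body in Python; the bucket path is a placeholder, and params is left unset because this page does not enumerate its allowed keys:

input_config = {
    # The Google Cloud Storage location for the input content.
    "gcsSource": {"inputUris": ["gs://my-bucket/import/train.csv"]},
    # "params" would carry additional domain-specific string key/value pairs,
    # but no keys are documented here, so none are set in this sketch.
}

request_body = {"inputConfig": input_config}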
