Creating datasets and importing data

This page describes how to create a dataset and import your tabular data into it. You can then use AutoML Tables to train a model on that dataset.

Introduction

A dataset is a Google Cloud object that contains your source table data, along with the schema information that determines model training parameters. The dataset serves as the input for training a model.

A project can have multiple datasets. You can get a list of the available datasets, and you can delete datasets you no longer need, as shown in the sketch below.
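
For example, with the Python client library you might list the datasets in a project and delete one you no longer need. This is a minimal sketch, not part of the official samples; the project ID, region, and display name are illustrative:

from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project-id", region="us-central1")

# List the datasets available in this project and region.
for dataset in client.list_datasets():
    print(dataset.display_name, dataset.name)

# Delete a dataset that is no longer needed (display name is hypothetical).
client.delete_dataset(dataset_display_name="obsolete_dataset")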

Updating a dataset or its schema information affects any model that uses that dataset in the future. Models that have already started training are not affected.

Before you begin

Before you can use AutoML Tables, you must set up your project as described in Before you begin. Before you can create a dataset, you must create your training data as described in Preparing your training data.

Creating a dataset

Console

  1. Visit the AutoML Tables page in the Google Cloud console to begin the process of creating your dataset.

    Go to the AutoML Tables page

  2. Select Datasets, and then select New dataset.

  3. Enter the name of your dataset and specify the Region where the dataset will be created.

    For more information, see Locations.

  4. Click Create dataset.

    The Import tab is displayed. You can now import your data.

REST

To create a dataset, use the datasets.create method.

Before using any of the request data, make the following replacements:

  • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
  • project-id: your Google Cloud project ID.
  • location: the location for the resource: us-central1 for the global location, or eu for the European Union.
  • dataset-display-name: the display name of your dataset.

HTTP method and URL:

POST https://endpoint/v1beta1/projects/project-id/locations/location/datasets
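
For example, for the global location the fully substituted URL might look like this (the project ID is illustrative):

POST https://automl.googleapis.com/v1beta1/projects/my-project-id/locations/us-central1/datasets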

Request JSON body:

{
  "displayName": "dataset-display-name",
  "tablesDatasetMetadata": { }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: project-id" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://endpoint/v1beta1/projects/project-id/locations/location/datasets"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "project-id" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://endpoint/v1beta1/projects/project-id/locations/location/datasets" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/1234/locations/us-central1/datasets/TBL6543",
  "displayName": "sample_dataset",
  "createTime": "2019-12-23T23:03:34.139313Z",
  "updateTime": "2019-12-23T23:03:34.139313Z",
  "etag": "AB3BwFq6VkX64fx7z2Y4T4z-0jUQLKgFvvtD1RcZ2oikA=",
  "tablesDatasetMetadata": {
    "areStatsFresh": true
    "statsUpdateTime": "1970-01-01T00:00:00Z",
    "tablesDatasetType": "BASIC"
  }
}

Save the name of the new dataset (from the response) for use with other operations, such as importing items into your dataset and training a model.

You can now import your data.

Java

If your resources are located in the EU region, you must set the endpoint explicitly. Learn more.

import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.Dataset;
import com.google.cloud.automl.v1beta1.LocationName;
import com.google.cloud.automl.v1beta1.TablesDatasetMetadata;
import java.io.IOException;

class TablesCreateDataset {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");
      TablesDatasetMetadata metadata = TablesDatasetMetadata.newBuilder().build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setTablesDatasetMetadata(metadata)
              .build();

      Dataset createdDataset = client.createDataset(projectLocation, dataset);

      // Display the dataset information.
      System.out.format("Dataset name: %s%n", createdDataset.getName());
      // To get the dataset id, you have to parse it out of the `name` field,
      // since dataset IDs are required for other methods.
      // Name format: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s%n", datasetId);
    }
  }
}

Node.js

If your resources are located in the EU region, you must set the endpoint explicitly. Learn more.

const automl = require('@google-cloud/automl');
const util = require('util');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a dataset
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetName = '[DATASET_NAME]' e.g., "myDataset";

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Set dataset name and metadata.
const myDataset = {
  displayName: datasetName,
  tablesDatasetMetadata: {},
};

// Create a dataset with the dataset metadata in the region.
client
  .createDataset({parent: projectLocation, dataset: myDataset})
  .then(responses => {
    const dataset = responses[0];
    // Display the dataset information.
    console.log(`Dataset name: ${dataset.name}`);
    console.log(`Dataset Id: ${dataset.name.split('/').pop()}`);
    console.log(`Dataset display name: ${dataset.displayName}`);
    console.log(`Dataset example count: ${dataset.exampleCount}`);
    console.log(
      `Tables dataset metadata: ${util.inspect(
        dataset.tablesDatasetMetadata,
        false,
        null
      )}`
    );
  })
  .catch(err => {
    console.error(err);
  });

Python

The client library for AutoML Tables includes additional Python methods that simplify using the AutoML Tables API. These methods refer to datasets and models by name instead of ID. Your dataset and model names must be unique. For more information, see the client reference.

If your resources are located in the EU region, you must set the endpoint explicitly. Learn more.

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_display_name = 'DATASET_DISPLAY_NAME_HERE'

from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project=project_id, region=compute_region)

# Create a dataset with the given display name
dataset = client.create_dataset(dataset_display_name)

# Display the dataset information.
print(f"Dataset name: {dataset.name}")
print("Dataset id: {}".format(dataset.name.split("/")[-1]))
print(f"Dataset display name: {dataset.display_name}")
print("Dataset metadata:")
print(f"\t{dataset.tables_dataset_metadata}")
print(f"Dataset example count: {dataset.example_count}")
print(f"Dataset create time: {dataset.create_time}")

Importing data into a dataset

You cannot import data into a dataset that already contains data. You must first create a new dataset.

Console

  1. If needed, select your dataset from the list on the Datasets page to open its Import tab.

  2. Choose the import source for your data: BigQuery, Cloud Storage, or your local computer. Provide the required information.

    If you load your CSV files from your local computer, you must provide a Cloud Storage bucket. Your files are loaded to that bucket before they are imported into AutoML Tables. The files remain there after the data import unless you remove them.

    The bucket must be in the same location as your dataset. Learn more.

  3. Click Import to start the import process.

    When the import process completes, the Train tab is displayed, and you are ready to train your model.

REST

Use the datasets.importData method to import your data.

Make sure your import source meets the requirements described in Preparing your import source.

Before using any of the request data, make the following replacements:

  • endpoint: automl.googleapis.com for the global location, and eu-automl.googleapis.com for the EU region.
  • project-id: your Google Cloud project ID.
  • location: the location for the resource: us-central1 for the global location, or eu for the European Union.
  • dataset-id: the ID of your dataset. For example, TBL6543.
  • input-config: your data source location information:
    • For BigQuery: { "bigquerySource": { "inputUri": "bq://projectId.bqDatasetId.bqTableId" } }
    • For Cloud Storage: { "gcsSource": { "inputUris": ["gs://bucket-name/csv-file-name.csv"] } }

HTTP method and URL:

POST https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id:importData

Request JSON body:

{
  "inputConfig": input-config
}
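
For example, with a Cloud Storage source the substituted request body might look like this (the bucket and file names are illustrative):

{
  "inputConfig": {
    "gcsSource": {
      "inputUris": ["gs://my-bucket/training-data.csv"]
    }
  }
}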

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: project-id" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id:importData"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "project-id" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://endpoint/v1beta1/projects/project-id/locations/location/datasets/dataset-id:importData" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/292381/locations/us-central1/operations/TBL6543",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
    "createTime": "2019-12-26T20:42:06.092180Z",
    "updateTime": "2019-12-26T20:42:06.092180Z",
    "cancellable": true,
    "worksOn": [
      "projects/292381/locations/us-central1/datasets/TBL6543"
    ],
    "importDataDetails": {},
    "state": "RUNNING"
  }
}

Importing data into a dataset is a long-running operation. You can poll for the operation status, or wait for the operation to return. Learn more.
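
For example, you can check the status with a GET request against the operation name returned in the response above. This is a sketch; the operation ID varies per request:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: project-id" \
"https://endpoint/v1beta1/projects/project-id/locations/location/operations/operation-id"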

After the import process completes, you are ready to train your model.

Java

If your resources are located in the EU region, you must set the endpoint explicitly. Learn more.

import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.BigQuerySource;
import com.google.cloud.automl.v1beta1.DatasetName;
import com.google.cloud.automl.v1beta1.GcsSource;
import com.google.cloud.automl.v1beta1.InputConfig;
import com.google.protobuf.Empty;
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;

class TablesImportDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    String path = "gs://BUCKET_ID/path/to//data.csv or bq://project_id.dataset_id.table_id";
    importDataset(projectId, datasetId, path);
  }

  // Import a dataset via BigQuery or Google Cloud Storage
  static void importDataset(String projectId, String datasetId, String path)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // Get the complete path of the dataset.
      DatasetName datasetFullId = DatasetName.of(projectId, "us-central1", datasetId);

      InputConfig.Builder inputConfigBuilder = InputConfig.newBuilder();

      // Determine which source type was used for the input path (BigQuery or GCS)
      if (path.startsWith("bq")) {
        // Get training data file to be imported from a BigQuery source.
        BigQuerySource.Builder bigQuerySource = BigQuerySource.newBuilder();
        bigQuerySource.setInputUri(path);
        inputConfigBuilder.setBigquerySource(bigQuerySource);
      } else {
        // Get multiple Google Cloud Storage URIs to import data from
        GcsSource gcsSource =
            GcsSource.newBuilder().addAllInputUris(Arrays.asList(path.split(","))).build();
        inputConfigBuilder.setGcsSource(gcsSource);
      }

      // Import data from the input URI
      System.out.println("Processing import...");

      Empty response = client.importDataAsync(datasetFullId, inputConfigBuilder.build()).get();
      System.out.format("Dataset imported. %s%n", response);
    }
  }
}

Node.js

If your resources are located in the EU region, you must set the endpoint explicitly. Learn more.

const automl = require('@google-cloud/automl');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to import data.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g., "TBL2246891593778855936";
// const path = '[GCS_PATH]' | '[BIGQUERY_PATH]'
// e.g., "gs://<bucket-name>/<csv file>" or
// "bq://<project_id>.<dataset_id>.<table_id>",
// `string or array of paths in AutoML Tables format`;

// Get the full path of the dataset.
const datasetFullId = client.datasetPath(projectId, computeRegion, datasetId);

let inputConfig = {};
if (path.startsWith('bq')) {
  // Get Bigquery URI.
  inputConfig = {
    bigquerySource: {
      inputUri: path,
    },
  };
} else {
  // Get the multiple Google Cloud Storage URIs.
  const inputUris = path.split(',');
  inputConfig = {
    gcsSource: {
      inputUris: inputUris,
    },
  };
}

// Import the dataset from the input URI.
client
  .importData({name: datasetFullId, inputConfig: inputConfig})
  .then(responses => {
    const operation = responses[0];
    console.log('Processing import...');
    return operation.promise();
  })
  .then(responses => {
    // The final result of the operation.
    const operationDetails = responses[2];

    // Get the data import details.
    console.log('Data import details:');
    console.log('\tOperation details:');
    console.log(`\t\tName: ${operationDetails.name}`);
    console.log(`\t\tDone: ${operationDetails.done}`);
  })
  .catch(err => {
    console.error(err);
  });

Python

The client library for AutoML Tables includes additional Python methods that simplify using the AutoML Tables API. These methods refer to datasets and models by name instead of ID. Your dataset and model names must be unique. For more information, see the client reference.

If your resources are located in the EU region, you must set the endpoint explicitly. Learn more.

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_display_name = 'DATASET_DISPLAY_NAME'
# path = 'gs://path/to/file.csv' or 'bq://project_id.dataset.table_id'

from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project=project_id, region=compute_region)

response = None
if path.startswith("bq"):
    response = client.import_data(
        dataset_display_name=dataset_display_name,
        bigquery_input_uri=path,
    )
else:
    # Get the multiple Google Cloud Storage URIs.
    input_uris = path.split(",")
    response = client.import_data(
        dataset_display_name=dataset_display_name,
        gcs_input_uris=input_uris,
    )

print("Processing import...")
# synchronous check of operation status.
print(f"Data imported. {response.result()}")

What's next