Importing and exporting data in CSV format

This page describes how to export data from Cloud Spanner to CSV files or import data from CSV files into a Cloud Spanner database.

The process uses Dataflow. You can export data from Cloud Spanner to a Cloud Storage bucket, or you can import data into Cloud Spanner from a Cloud Storage bucket that contains a JSON manifest file and a set of CSV files.

Before you begin

To import or export a Cloud Spanner database, you first need to enable the Cloud Spanner, Cloud Storage, Compute Engine, and Dataflow APIs.

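If you prefer the command line, you can enable these services with the gcloud tool. This is a minimal sketch that assumes gcloud is already authenticated and configured for your project:

# Enable the APIs used by Cloud Spanner import and export jobs.
gcloud services enable \
    spanner.googleapis.com \
    storage.googleapis.com \
    compute.googleapis.com \
    dataflow.googleapis.com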

You also need enough quota and the required Cloud IAM permissions.

Quota requirements

The quota requirements for import or export jobs, by Google Cloud service, are as follows:

  • Cloud Spanner: You must have enough nodes to support the amount of data that you are importing. No additional nodes are required to import or export a database, though you might need to add more nodes so that your job finishes in a reasonable amount of time. See Optimizing jobs for more details.
  • Cloud Storage: To import, you must have a bucket containing your previously exported files. To export, you must create a bucket for your exported files if you do not already have one. You can do this in the Cloud Console, either through the Cloud Storage page or while creating your export through the Cloud Spanner page. You do not need to set a size for your bucket.
  • Dataflow: Import or export jobs are subject to the same CPU, disk usage, and IP address Compute Engine quotas as other Dataflow jobs.
  • Compute Engine: Before running your import or export job, you must set up initial quotas for Compute Engine, which Dataflow uses. These quotas represent the maximum number of resources that you allow Dataflow to use for your job. Recommended starting values are:

    • CPUs: 200
    • In-use IP addresses: 200
    • Standard persistent disk: 50 TB

    Generally, you do not have to make any other adjustments. Dataflow provides autoscaling so that you only pay for the actual resources used during the import or export. If your job can make use of more resources, the Dataflow UI displays a warning icon. The job should finish even if there is a warning icon.
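
As a quick check before launching a job, you can inspect the current Compute Engine quotas and usage for the region where the job will run. The region below is only an example:

# Show the region's Compute Engine quotas (CPUs, in-use IP addresses,
# disks, and so on) together with current usage; us-central1 is an example.
gcloud compute regions describe us-central1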

Cloud IAM requirements

To import or export a database, you also need to have Cloud IAM roles with sufficient permissions to use all of the services involved in an import or export job. For information on granting roles and permissions, see Applying IAM roles.

To import or export a database, you need the following roles:

  • At the Google Cloud project level:
    • Cloud Spanner Viewer
    • Dataflow Admin
    • Storage Admin
  • At the Cloud Spanner database or instance level, or at the Google Cloud project level:
    • Cloud Spanner Reader
    • Cloud Spanner Database Admin (required only for import jobs)
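
As an illustration, the following commands grant these roles to a user with gcloud. The project ID and email address are placeholders, and the Cloud Spanner roles are granted here at the project level, although they can also be granted at the instance or database level:

# Project-level roles for running import or export jobs
# (project ID and user email are placeholders).
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" --role="roles/spanner.viewer"
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" --role="roles/dataflow.admin"
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" --role="roles/storage.admin"

# Cloud Spanner data roles; Database Admin is required only for import jobs.
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" --role="roles/spanner.databaseReader"
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" --role="roles/spanner.databaseAdmin"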

Exporting Cloud Spanner data to CSV files

To export data from Cloud Spanner to CSV files in Cloud Storage, follow the instructions for using the gcloud command-line tool to run a job with the Cloud Spanner to Cloud Storage Text template.
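
The export command has roughly the following shape. Verify the parameter names against the template documentation; the job name, region, instance, database, table, and bucket values below are placeholders:

# Export one table to CSV files in Cloud Storage using the
# Cloud Spanner to Cloud Storage Text template (values are placeholders).
gcloud dataflow jobs run export-singers-to-csv \
    --gcs-location=gs://dataflow-templates/latest/Spanner_to_GCS_Text \
    --region=us-central1 \
    --parameters="spannerProjectId=my-project,spannerInstanceId=test-instance,spannerDatabaseId=example-db,spannerTable=Singers,textWritePrefix=gs://my-bucket/singers-export/"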

You can also refer to the information in this document about viewing or troubleshooting jobs, optimizing slow jobs, and factors affecting job performance.

Importing data from CSV files into Cloud Spanner

The process to import data from CSV files includes the following steps:

  • Export your data to CSV files and store those files in Cloud Storage.
  • Create a JSON manifest file and store the file along with your CSV files.
  • Create empty target tables in your Cloud Spanner database or ensure that the data types for columns in your CSV files match any corresponding columns in your existing tables.
  • Run your import job.

Exporting data from a non-Cloud Spanner database to CSV files

The import process brings data in from CSV files located in a Cloud Storage bucket. You can export data in CSV format from any source.

Keep the following things in mind when exporting your data:

  • Text files to be imported must be in CSV format.
  • Data must match one of the following types:

    • INT64
    • FLOAT64
    • BOOL
    • STRING
    • DATE
    • TIMESTAMP
  • You do not have to include or generate any metadata when you export the CSV files.

  • You do not have to follow any particular naming convention for your files.

If you do not export your files directly to Cloud Storage, you must upload the CSV files to a Cloud Storage bucket.
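
For example, you can copy locally exported CSV files into a bucket with the gsutil tool. The file names and bucket are placeholders:

# Upload locally exported CSV files to the Cloud Storage bucket
# that the import job will read from (file and bucket names are placeholders).
gsutil cp Albums_*.csv Singers_*.csv gs://bucket1/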

Creating a JSON manifest file

You must also create a manifest file with a JSON description of files to import and place it in the same Cloud Storage bucket where you stored your CSV files. This manifest file contains a tables array that lists the name and data file locations for each table.

The format of the manifest file corresponds to the following message type, shown here in protocol buffer format:

message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in the
      // destination table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column. Supports the
      // following data types: BOOL, INT64, FLOAT64, STRING, DATE, and TIMESTAMP.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;
}

The following example shows a manifest file for importing tables called Albums and Singers. The Albums table uses the column schema that the job retrieves from the database, and the Singers table uses the schema that the manifest file specifies:

{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}
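
After you write the manifest (saved here as spanner-import-manifest.json, a hypothetical file name), copy it into the same bucket as the CSV files:

# The manifest must live in the same Cloud Storage bucket as the CSV files.
gsutil cp spanner-import-manifest.json gs://bucket1/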

Creating the table for your Cloud Spanner database

Before you run your import, you must create the target tables in your Cloud Spanner database. If the target Cloud Spanner table already has a schema, any columns specified in the manifest file must have the same data types as the corresponding columns in the target table's schema.
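
For instance, a Singers table whose column types match the manifest above could be created with a DDL statement through gcloud. The instance and database names are placeholders:

# Create an empty target table whose column types match the manifest
# (instance and database names are placeholders).
gcloud spanner databases ddl update example-db \
    --instance=test-instance \
    --ddl='CREATE TABLE Singers (
             SingerId  INT64 NOT NULL,
             FirstName STRING(1024),
             LastName  STRING(1024)
           ) PRIMARY KEY (SingerId)'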

Running a Dataflow import job using gcloud

To start your import job, follow the instructions for using the gcloud command-line tool to run a job with the CSV to Cloud Spanner template.
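
The import command looks roughly like the following. Check the template documentation for the current parameter names; the job name, region, instance, database, and manifest path are placeholders:

# Import the CSV files described by the manifest using the
# CSV to Cloud Spanner template (values are placeholders).
gcloud dataflow jobs run import-csv-to-spanner \
    --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner \
    --region=us-central1 \
    --parameters="instanceId=test-instance,databaseId=example-db,importManifest=gs://bucket1/spanner-import-manifest.json"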

After you have started an import job, you can see details about the job in the Cloud Console.

After the import job is finished, add any necessary secondary indexes.
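
For example, a secondary index on the Singers table could be added with another DDL update. The index, table, instance, and database names are placeholders:

# Add any needed secondary indexes after the import job finishes
# (names are placeholders).
gcloud spanner databases ddl update example-db \
    --instance=test-instance \
    --ddl='CREATE INDEX SingersByLastName ON Singers(LastName)'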

Choosing a region for your import job

Choose the region for your import job based on whether your Cloud Storage bucket uses a regional or multi-regional configuration. To avoid network egress charges, choose a region that overlaps with your Cloud Storage bucket's location.

Regional bucket locations

If your Cloud Storage bucket location is regional and that region is available for your import job, choose the same region to take advantage of free network usage.

If the same region is not available, egress charges will apply. Refer to the Cloud Storage network egress pricing to choose a region that will incur the lowest network egress charges.

Multi-regional bucket locations

If your Cloud Storage bucket location is multi-regional, choose one of the regions that make up the multi-regional location to take advantage of free network usage.

If an overlapping region is not available, egress charges will apply. Refer to the Cloud Storage network egress pricing to choose a region that will incur the lowest network egress charges.
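
One way to confirm a bucket's location before choosing the job region is to inspect the bucket with gsutil and then pass a matching value to the --region flag of gcloud dataflow jobs run. The bucket name below is a placeholder:

# Show the bucket's location constraint (regional or multi-regional),
# then run the Dataflow job with a matching --region value.
gsutil ls -L -b gs://bucket1/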

Viewing or troubleshooting jobs in the Dataflow UI

After you start an import or export job, you can view details of the job, including logs, in the Dataflow section of the Cloud Console.

Viewing Dataflow job details

To see details for a currently running job:

  1. Navigate to the Database details page for the database.
  2. Click View job details in Dataflow in the job status message.

    The Cloud Console displays details of the Dataflow job.

To view a job that you ran recently:

  1. Navigate to the Database details page for the database.
  2. Click the Import/Export tab.
  3. Click your job's name in the list.

    The Cloud Console displays details of the Dataflow job.

To view a job that you ran more than one week ago:

  1. Go to the Dataflow jobs page in the Cloud Console.


  2. Find your job in the list, then click its name.

    The Cloud Console displays details of the Dataflow job.
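
You can also list jobs from the command line, which can be handy for finding older jobs. This sketch assumes the jobs ran in us-central1:

# List Dataflow jobs in the region where the import or export ran
# (us-central1 is an example).
gcloud dataflow jobs list --region=us-central1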

Viewing Dataflow logs for your job

To view a Dataflow job's logs, navigate to the job's details page as described above, then click Logs to the right of the job's name.

If a job fails, look for errors in the logs. If there are errors, the error count displays next to Logs.

To view job errors:

  1. Click on the error count next to Logs.

    The Cloud Console displays the job's logs. You may need to scroll to see the errors.

  2. Locate entries that are marked with the error icon.

  3. Click on an individual log entry to expand its contents.

For more information about troubleshooting Dataflow jobs, see Troubleshooting your pipeline.

Optimizing slow running import or export jobs

If you have followed the recommended initial quota settings, you generally should not have to make any other adjustments. If your job is running slowly, there are a few other optimizations you can try:

  • Optimize the job and data location: Run your Dataflow job in the same region where your Cloud Spanner instance and Cloud Storage bucket are located.

  • Ensure sufficient Dataflow resources: If the relevant Compute Engine quotas limit your Dataflow job's resources, the job's Dataflow page in the Google Cloud Console displays a warning icon and related log messages.

    In this situation, increasing the quotas for CPUs, in-use IP addresses, and standard persistent disk might shorten the run time of the job, but you might incur more Compute Engine charges.

  • Check the Cloud Spanner CPU utilization: If you see that the CPU utilization for the instance is over 65%, you can increase the number of nodes in that instance. The extra nodes add more Cloud Spanner resources and the job should speed up, but you incur more Cloud Spanner charges.
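
For the last point, you can increase the node count of an instance with gcloud. The instance name and node count below are placeholders:

# Add nodes to the Cloud Spanner instance to give the job more CPU headroom
# (instance name and node count are placeholders).
gcloud spanner instances update test-instance --nodes=5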

Factors affecting import or export job performance

Several factors influence the time it takes to complete an import or export job.

  • Cloud Spanner database size: Processing more data takes more time and resources.

  • Cloud Spanner database schema (including indexes): The number of tables, the size of the rows, and the number of secondary indexes influence the time it takes to run an import or export job.

  • Data location: Data is transferred between Cloud Spanner and Cloud Storage using Dataflow. Ideally all three components are located in the same region. If the components are not in the same region, moving the data across regions slows the job down.

  • Number of Dataflow workers: By using autoscaling, Dataflow chooses the number of workers for the job depending on the amount of work that needs to be done. The number of workers will, however, be capped by the quotas for CPUs, in-use IP addresses, and standard persistent disk. The Dataflow UI displays a warning icon if it encounters quota caps. In this situation, progress is slower, but the job should still complete.

  • Existing load on Cloud Spanner: An import job adds significant CPU load on a Cloud Spanner instance. An export job typically adds a light load on a Cloud Spanner instance. If the instance already has a substantial existing load, then the job runs more slowly.

  • Number of Cloud Spanner nodes: If the CPU utilization for the instance is over 65%, then the job runs more slowly.
