Testing Dataflow pipelines with the Cloud Spanner emulator

The Cloud SDK includes a local, in-memory Cloud Spanner Emulator, which you can use to develop and test your applications for free without creating a GCP Project or a billing account. The emulator offers the same APIs as the Cloud Spanner production service and is intended for local development and testing, not for production deployments.

We are excited to announce support for Dataflow Import & Export of your database with the Cloud Spanner emulator.

Existing Dataflow pipelines that are configured to work with Cloud Spanner can be tested with the emulator. This allows for easy offline testing of existing pipelines without connecting to the actual Spanner backend.

The following guide demonstrates how to test your Dataflow import/export pipelines with the emulator. The sample program we will be using can be found at https://github.com/cloudspannerecosystem/emulator-samples/tree/master/dataflow


To start the test pipeline, configure the emulator endpoint within the pipeline. The host should be the gRPC endpoint exposed by the emulator (by default: http://localhost:9010).


  • The Java client library's SpannerOptions should be set to use the emulator host:

  import com.google.cloud.spanner.SpannerOptions;
  ...
  ...
  SpannerOptions.Builder builder = SpannerOptions.newBuilder();
  builder.setEmulatorHost(...);

  • The Apache Beam SpannerConfig should likewise be set to use the emulator host:

  import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
  ...
  ...
  SpannerConfig config = SpannerConfig.create();
  config = config.withEmulatorHost(...);

The sample program 'SpannerEmulatorPopulator' creates a Dataflow pipeline with the Apache Beam library. The program demonstrates both import and export, which can be toggled with the following command-line arguments:

runExport=<true/false>

runImport=<true/false>
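
For illustration, here is a minimal sketch of how boolean toggles of this form could be read in plain Java. The class and method names (`ToggleFlags`, `parse`, `isEnabled`) are hypothetical and are not taken from the sample, which may well rely on Apache Beam's own pipeline options instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the sample program.
public class ToggleFlags {
  // Collects arguments of the form --name=value into a map; anything else is ignored.
  static Map<String, String> parse(String[] args) {
    Map<String, String> flags = new HashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("--") && eq > 2) {
        flags.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
    return flags;
  }

  // Returns the flag as a boolean, falling back to defaultValue when the flag is absent.
  static boolean isEnabled(Map<String, String> flags, String name, boolean defaultValue) {
    String value = flags.get(name);
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> flags = parse(args);
    System.out.println("runImport=" + isEnabled(flags, "runImport", false));
    System.out.println("runExport=" + isEnabled(flags, "runExport", false));
  }
}
```

Defaulting both toggles to false when they are absent is a choice made for this sketch; the sample's actual defaults are not stated here.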


To get the sample running, first configure gcloud:

  gcloud config configurations create emulator-config
  gcloud config set auth/disable_credentials true
  gcloud config set project test-project
  gcloud config set api_endpoint_overrides/spanner http://localhost:9020/
  gcloud config configurations activate emulator-config

Next, start the emulator. The default gRPC port, 9010, is used in this case:

  gcloud emulators spanner start

Optionally, request logging can be turned on with "-log_requests", and the port can be changed with "-grpc_port=<port>".

After the emulator has started, create the instance that the sample will use:

  gcloud spanner instances create test-instance --config=emulator-config --description="Test Instance" --nodes=1

The sample program can now be run with a command like the following:

  mvn compile exec:java -Dexec.mainClass=com.google.spanner.SpannerEmulatorPopulator -Dexec.args="--projectId=test-project --endpoint=http://localhost:9010 --instanceId=test-instance --createDatabase=true --runImport=true --runExport=false --csvImportFile=test.csv --table=users --numRecords=10000 --numWorkers=1 --maxNumWorkers=1 --runner=DirectRunner"

Since the emulator only runs locally, the runner must be set to 'DirectRunner'.


Data can be loaded from a CSV file using code like the following:

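  import java.io.IOException;
  import java.io.Reader;
  import java.nio.file.FileSystems;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.commons.csv.CSVFormat;
  import org.apache.commons.csv.CSVParser;
  import org.apache.commons.csv.CSVRecord;

  // Reads the first numColumns fields of every record in a CSV file in the current directory.
  private static List<List<String>> readCsvFile(String file, int numColumns) throws IOException {
    Path path = FileSystems.getDefault().getPath("./", file);
    // try-with-resources closes the reader and the parser, even if parsing fails.
    try (Reader reader = Files.newBufferedReader(path);
        CSVParser parser = new CSVParser(reader, CSVFormat.DEFAULT)) {
      List<List<String>> entries = new ArrayList<>();
      for (CSVRecord record : parser.getRecords()) {
        List<String> row = new ArrayList<>();
        for (int i = 0; i < numColumns; ++i) {
          row.add(record.get(i));
        }
        entries.add(row);
      }
      return entries;
    }
  }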

Once the data from the CSV file has been read, we can create a list of mutations that will add those rows to the database.
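
As a rough sketch of that step, the rows returned by readCsvFile could be turned into mutations with the Cloud Spanner Java client library's Mutation builder. This assumes the library is on the classpath. The class name `RowsToMutations`, the `Name` column, and the assumption that the first column is an INT64 key while the remaining columns are STRING are illustrative only; the sample's real schema may differ.

```java
import com.google.cloud.spanner.Mutation;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the sample program.
public class RowsToMutations {
  // Builds one insert-or-update mutation per CSV row.
  // Assumption: columns.get(0) is an INT64 key and every other column is a STRING.
  static List<Mutation> toMutations(
      String table, List<String> columns, List<List<String>> rows) {
    List<Mutation> mutations = new ArrayList<>();
    for (List<String> row : rows) {
      Mutation.WriteBuilder builder = Mutation.newInsertOrUpdateBuilder(table);
      builder.set(columns.get(0)).to(Long.parseLong(row.get(0).trim()));
      for (int i = 1; i < columns.size(); ++i) {
        builder.set(columns.get(i)).to(row.get(i));
      }
      mutations.add(builder.build());
    }
    return mutations;
  }

  public static void main(String[] args) {
    // "Name" is a made-up column used only for this illustration.
    List<String> columns = Arrays.asList("Key", "Name");
    List<List<String>> rows =
        Arrays.asList(Arrays.asList("1", "alice"), Arrays.asList("2", "bob"));
    List<Mutation> mutations = toMutations("users", columns, rows);
    System.out.println(mutations.size() + " mutations for table " + mutations.get(0).getTable());
  }
}
```

Insert-or-update is used here so that rerunning an import against the same emulator database does not fail on rows that already exist; a plain insert would be the stricter alternative. In a Beam pipeline, a list like this would typically be wrapped in a PCollection and written with SpannerIO.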


We can now verify that the data has been successfully entered into the database using gcloud:

  gcloud spanner databases execute-sql db-inbound --instance=test-instance --sql="SELECT * FROM users WHERE Key=1"

One caveat is that the emulator does not have the same concurrency model as production Spanner. The emulator serializes all transactions and may return transaction abort errors when transactions try to execute in parallel. To account for this, we set the Dataflow worker count to 1 when using the emulator. In addition, some sets of PCollections may need to be flattened into a single PCollection.

The emulator can be used to quickly test Cloud Spanner Dataflow pipelines offline. This sample demonstrates how to set up a basic pipeline with the emulator.

The emulator works with the client libraries in all supported languages. You can also use the emulator with the gcloud command-line tool and REST APIs. The emulator is also available as an open source project on GitHub.