Testing Dataflow pipelines with the Cloud Spanner emulator
Jim King
Software Engineer, Cloud Spanner
The Cloud SDK includes a local, in-memory Cloud Spanner Emulator, which you can use to develop and test your applications for free without creating a GCP Project or a billing account. The emulator offers the same APIs as the Cloud Spanner production service and is intended for local development and testing, not for production deployments.
We are excited to announce support for Dataflow import and export of your database with the Cloud Spanner emulator.
Existing Dataflow pipelines that are configured to work with Cloud Spanner can be tested with the emulator. This allows for easy offline testing of existing pipelines without connecting to the actual Spanner backend.
The following guide demonstrates how to test your Dataflow import/export pipelines with the emulator. The sample program we will be using can be found at https://github.com/cloudspannerecosystem/emulator-samples/tree/master/dataflow
To start the test pipeline, configure the emulator endpoint within the pipeline. The host used should be the gRPC endpoint exposed by the emulator (by default: http://localhost:9010).
The Java client library's SpannerOptions should be set to use the emulator host, for example through the builder's setEmulatorHost method.
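The snippet that accompanied this step is not reproduced here, so the following is only a minimal sketch of pointing the Java client at the emulator. It needs the google-cloud-spanner library on the classpath. The project ID "test-project" and the class name are assumptions chosen for illustration.

```java
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;

public class EmulatorClientExample {
  public static void main(String[] args) {
    // Hypothetical project ID, used only for this illustration.
    String projectId = "test-project";
    // Default gRPC endpoint of the emulator, as given in the text above.
    String emulatorHost = "http://localhost:9010";

    SpannerOptions options =
        SpannerOptions.newBuilder()
            .setProjectId(projectId)
            .setEmulatorHost(emulatorHost)
            .build();

    // Spanner is AutoCloseable, so try-with-resources releases its channels.
    try (Spanner spanner = options.getService()) {
      System.out.println("Client configured for host: " + options.getHost());
    }
  }
}
```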
The Beam SpannerConfig should also be set to use the emulator host, through its withEmulatorHost method.
The sample program 'SpannerEmulatorPopulator' creates a Dataflow pipeline with the Apache Beam library. This program demonstrates both import and export. They can be toggled by using the command line arguments:
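For the Beam side, a configuration along these lines should work, assuming the Beam Google Cloud Platform IO module is on the classpath. The project, instance, and database IDs are placeholders invented for this sketch, not values taken from the sample.

```java
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;

public class EmulatorBeamConfigExample {
  // Builds a SpannerConfig that targets the emulator instead of production.
  static SpannerConfig emulatorConfig() {
    return SpannerConfig.create()
        .withProjectId("test-project")   // hypothetical project ID
        .withInstanceId("test-instance") // hypothetical instance ID
        .withDatabaseId("test-db")       // hypothetical database ID
        .withEmulatorHost(StaticValueProvider.of("http://localhost:9010"));
  }
}
```

The resulting object can then be passed to the SpannerIO read and write transforms through their withSpannerConfig method.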
runExport=<true/false>
runImport=<true/false>
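To make the flag format concrete, here is a plain-Java illustration of reading toggles of this shape. It is not the sample's own code, which more likely declares these flags through Beam pipeline options. The class name and the choice to default both flags to false are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class ToggleFlags {
  // Parses arguments such as "--runExport=true" or "runImport=false".
  // Unknown arguments are ignored; both flags default to false (an assumption).
  static Map<String, Boolean> parse(String[] args) {
    Map<String, Boolean> flags = new HashMap<>();
    flags.put("runExport", false);
    flags.put("runImport", false);
    for (String arg : args) {
      String stripped = arg.startsWith("--") ? arg.substring(2) : arg;
      int eq = stripped.indexOf('=');
      if (eq < 0) {
        continue; // not a name=value pair
      }
      String name = stripped.substring(0, eq);
      if (flags.containsKey(name)) {
        flags.put(name, Boolean.parseBoolean(stripped.substring(eq + 1)));
      }
    }
    return flags;
  }
}
```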
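To get the sample running, first configure gcloud to talk to the emulator: create a dedicated configuration, disable credentials, set a project, and override the Spanner API endpoint with the emulator's REST address (by default http://localhost:9020/).
Next, start the emulator. The default gRPC port, 9010, is used in this case.
Optionally, request logging can be turned on with "-log_requests", and the port can be changed with "-grpc_port=<port>".
After the emulator has started, create the Spanner instance that the pipeline will use, for example with "gcloud spanner instances create" and the "emulator-config" instance configuration.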
The sample program can now be run. Since the emulator only runs locally, the runner must be set to 'DirectRunner'.
Data can be loaded from a CSV file using something like the following code:
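The sample's own loading code is not reproduced here. As a stand-in, this minimal sketch reads comma-separated rows with the Java standard library only. It assumes a simple file with no quoted fields or embedded commas, and the class name is hypothetical. A Beam pipeline would more typically read the lines with a text source and split them in a transform.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class CsvRows {
  // Reads every non-blank line and splits it on commas, trimming each field.
  static List<String[]> read(Reader source) {
    List<String[]> rows = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) {
          continue; // skip blank lines
        }
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = line.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
          fields[i] = fields[i].trim();
        }
        rows.add(fields);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return rows;
  }
}
```

Passing a java.io.FileReader for the CSV path yields one String array per row.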
Once the data from the CSV file has been read, we can create a list of mutations that will add those rows to the database.
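As a sketch of that step, the parsed rows can be turned into Mutation objects from the google-cloud-spanner library. The table name "Singers" and its three columns are hypothetical and stand in for whatever schema the database actually uses.

```java
import com.google.cloud.spanner.Mutation;
import java.util.ArrayList;
import java.util.List;

public class RowsToMutations {
  // Converts rows of the form [id, firstName, lastName] into
  // insert-or-update mutations against a hypothetical "Singers" table.
  static List<Mutation> toMutations(List<String[]> rows) {
    List<Mutation> mutations = new ArrayList<>();
    for (String[] row : rows) {
      mutations.add(
          Mutation.newInsertOrUpdateBuilder("Singers")
              .set("SingerId").to(Long.parseLong(row[0].trim()))
              .set("FirstName").to(row[1].trim())
              .set("LastName").to(row[2].trim())
              .build());
    }
    return mutations;
  }
}
```

In a Beam pipeline, a collection of such mutations can be written with SpannerIO.write(), configured with a SpannerConfig that points at the emulator.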
We can now verify that the data has been successfully entered into the database using gcloud, for example by running a SELECT query with "gcloud spanner databases execute-sql".
One caveat to note is that the emulator does not have the same concurrency model as production Spanner. The emulator serializes all transactions and may return transaction abort errors when transactions try to execute in parallel. To account for this, we set the Dataflow worker count to 1 when using the emulator. In addition, some sets of PCollections may need to be flattened into a single PCollection.
The emulator can be used for quickly testing Cloud Spanner Dataflow pipelines offline. This sample demonstrates how to set up a basic pipeline with the emulator.
The emulator works with the client libraries in all supported languages. You can also use the emulator with the gcloud command-line tool and REST APIs. The emulator is also available as an open source project on GitHub.
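One possible way to apply both workarounds with DirectRunner is sketched below. It assumes the Beam direct runner module is on the classpath. Whether the sample limits workers through targetParallelism or through another option is not stated in this guide, so treat that choice as an assumption.

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class EmulatorFriendlyPipeline {
  // Restricts the local runner to one worker thread so that transactions
  // reach the emulator one at a time.
  static DirectOptions singleWorkerOptions(String[] args) {
    DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
    options.setRunner(DirectRunner.class);
    options.setTargetParallelism(1);
    return options;
  }

  // Merges two collections into one so that a single write step handles both.
  static <T> PCollection<T> merge(PCollection<T> first, PCollection<T> second) {
    return PCollectionList.of(first).and(second).apply(Flatten.pCollections());
  }
}
```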