This tutorial assumes that you are familiar with database schemas, data types, the fundamentals of NoSQL, and relational database systems. The tutorial relies on running predefined tasks to perform an example migration. After the tutorial, you can modify the provided code and steps to match your environment.
The following architectural diagram outlines the components used in the tutorial to migrate data:
Objectives
- Migrate data from Amazon DynamoDB to Spanner.
- Create a Spanner database and migration table.
- Map a NoSQL schema to a relational schema.
- Create and export a sample dataset that uses Amazon DynamoDB.
- Transfer data between Amazon S3 and Cloud Storage.
- Use Dataflow to load data into Spanner.
Costs
This tutorial uses the following billable components of Google Cloud:
- GKE
- Pub/Sub
- Cloud Storage
- Dataflow
Spanner charges are based on the number of node-hours and the amount of data stored during the monthly billing cycle. During the tutorial, you use a minimal configuration of these resources, which are cleaned up at the end. For real-world scenarios, estimate your throughput and storage requirements, and then use the Spanner instances documentation to determine the number of nodes that you need.
In addition to Google Cloud resources, this tutorial uses the following Amazon Web Services (AWS) resources:
- Amazon EMR
- AWS Lambda
- Amazon S3
- Amazon DynamoDB
These services are only needed during the migration process. At the end of the tutorial, follow the instructions to clean up all resources to prevent unnecessary charges. Use the AWS pricing calculator to estimate these costs.
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
Before you begin
-
Sign in to your Google Account.
If you don't already have one, sign up for a new account.
-
In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Spanner, Pub/Sub, Compute Engine, and Dataflow APIs.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.
Preparing your environment
In this tutorial, you run commands in Cloud Shell. Cloud Shell gives you access to the command line in Google Cloud, and includes the Cloud SDK and other tools that you need for Google Cloud development. Cloud Shell can take several minutes to initialize.
Activate Cloud Shell.
Set the default Compute Engine zone. For example,
us-central1-b
.gcloud config set compute/zone us-central1-b
Clone the GitHub repository containing the sample code.
git clone https://github.com/GoogleCloudPlatform/dynamodb-spanner-migration.git
Go to the cloned directory.
cd dynamodb-spanner-migration
Create a Python virtual environment.
virtualenv --python python2 env
Activate the virtual environment.
source env/bin/activate
Install the required Python modules.
pip install -r requirements.txt
Configuring AWS access
In this tutorial, you create and delete Amazon DynamoDB tables, Amazon S3 buckets, and other resources. To access these resources, you first need to create the required AWS Identity and Access Management (IAM) permissions. You can use a test or sandbox AWS account to avoid affecting production resources in the same account.
Create an AWS IAM role for AWS Lambda
In this section, you create an AWS IAM role that AWS Lambda uses at a later step in the tutorial.
- In the AWS console, go to the IAM section, click Roles, and then select Create role.
- Under Choose the service that will use this role, click Lambda, and then select Next:Permissions.
- In the Policy Type box, enter
AWSLambdaDynamoDBExecutionRole
. - Select the AWSLambdaDynamoDBExecutionRole checkbox, and then click Next:Review.
- In the Role name box, enter
dynamodb-spanner-lambda-role
, and then click Create role.
Create an AWS IAM user
Follow these steps to create an AWS IAM user with programmatic access to AWS resources, which are used throughout the tutorial.
- While you are still in the IAM section of the AWS console, click Users, and then select Add User.
- In the User name box, enter
dynamodb-spanner-migration
. Under Access type, click Programmatic access.
Click Next: Permissions.
Click Attach existing policies directly and select the following two policies:
AmazonDynamoDBFullAccesswithDataPipeline
AmazonS3FullAccess
Click Next: Review, and then click Create user.
Click Show to view the credentials. The access key ID and secret access key are displayed for the newly created user. Leave this window open for now because the credentials are needed in the following section. Safely store these credentials because with them, you can make changes to your account and affect your environment. At the end of this tutorial, you can delete the IAM user.
Configure AWS command-line interface
In Cloud Shell, configure the AWS Command Line Interface (CLI).
aws configure
The following output appears:
$ aws configure AWS Access Key ID [None]: PASTE_YOUR_ACCESS_KEY_ID AWS Secret Access Key [None]: PASTE_YOUR_SECRET_ACCESS_KEY Default region name [None]: us-west-2 Default output format [None]: user@project:~/dynamodb-spanner$
- Enter the
ACCESS KEY ID
andSECRET ACCESS KEY
from the AWS IAM account that you created. - In the Default region name field, enter
us-west-2
. Leave other fields at their default values.
- Enter the
Close the AWS IAM console window.
Understanding the data model
The following section outlines the similarities and differences between data types, keys, and indexes for Amazon DynamoDB and Spanner.
Data types
Spanner uses standard SQL data types. The following table describes how Amazon DynamoDB data types map to Spanner data types.
Amazon DynamoDB | Spanner |
---|---|
Number | Depending on precision or intended usage, might be mapped to INT64, FLOAT64, TIMESTAMP, or DATE. |
String | String |
Boolean | BOOL |
Null | No explicit type. Columns can contain null values. |
Binary | Bytes |
Sets | Array |
Map and List | Struct if the structure is consistent and can be described by using table DDL syntax. |
Primary key
An Amazon DynamoDB primary key establishes uniqueness and can be either a hash key or a combination of a hash key plus a range key. This tutorial starts by modeling the migration of a Amazon DynamoDB table whose primary key is a hash key. This hash key becomes the primary key of your Spanner table. Later, in the section on interleaved tables, you model a situation where an Amazon DynamoDB table uses a primary key composed of a hash key and a range key.
Secondary indexes
Both Amazon DynamoDB and Spanner support creating an index on a non-primary key attribute. Take note of any secondary indexes in your Amazon DynamoDB table so that you can create them on your Spanner table, which is covered in a later section of this tutorial.
Sample table
To facilitate this tutorial, you migrate the following sample table from Amazon DynamoDB to Spanner:
Amazon DynamoDB | Spanner | |
---|---|---|
Table name |
Migration
|
Migration
|
Primary key |
"Username" : String
|
"Username" : STRING(1024)
|
Key type | Hash | n/a |
Other fields |
Zipcode: Number
Subscribed: Boolean
ReminderDate: String
PointsEarned: Number
|
Zipcode: INT64
Subscribed: BOOL
ReminderDate: DATE
PointsEarned: INT64
|
Preparing the Amazon DynamoDB table
In the following section, you create an Amazon DynamoDB source table and populate it with data.
In Cloud Shell, create an Amazon DynamoDB table that uses the sample table attributes.
aws dynamodb create-table --table-name Migration \ --attribute-definitions AttributeName=Username,AttributeType=S \ --key-schema AttributeName=Username,KeyType=HASH \ --provisioned-throughput ReadCapacityUnits=75,WriteCapacityUnits=75
Verify that the table status is
ACTIVE
.aws dynamodb describe-table --table-name Migration \ --query 'Table.TableStatus'
Populate the table with sample data.
python make-fake-data.py --table Migration --items 25000
Creating a Spanner database
You create a single-node instance, which is appropriate for testing and the scope of this tutorial. For a production deployment, refer to the documentation for Spanner instances to determine the appropriate node count to meet your database performance requirements.
In this example, you create a table schema at the same time as the database. It is also possible, and common, to carry out schema updates after you create the database.
Create a Spanner instance in the same region where you set the default Compute Engine zone. For example,
us-central1
.gcloud spanner instances create spanner-migration \ --config=regional-us-central1 --nodes=1 \ --description="Migration Demo"
Create a database in the Spanner instance along with the sample table.
gcloud spanner databases create migrationdb \ --instance=spanner-migration \ --ddl "CREATE TABLE Migration ( \ Username STRING(1024) NOT NULL, \ PointsEarned INT64, \ ReminderDate DATE, \ Subscribed BOOL, \ Zipcode INT64, \ ) PRIMARY KEY (Username)"
Pausing the database
The next sections show you how to export the Amazon DynamoDB source table and set Pub/Sub replication to capture any changes to the database that occur while you export it. If changes to your database aren't idempotent and it isn't safe to write the same data more than once, it is best to carry out the following steps during a maintenance period when you can pause app changes to the table.
Stream changes to Pub/Sub
You use an AWS Lambda function to stream database changes to Pub/Sub.
In Cloud Shell, enable Amazon DynamoDB streams on your source table.
aws dynamodb update-table --table-name Migration \ --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
Set up a Pub/Sub topic to receive the changes.
gcloud pubsub topics create spanner-migration
The following output appears:
$ gcloud pubsub topics create spanner-migration Created topic [projects/your-project/topics/spanner-migration].
Create an IAM service account to push table updates to the Pub/Sub topic.
gcloud iam service-accounts create spanner-migration \ --display-name="Spanner Migration"
The following output appears:
$ gcloud iam service-accounts create spanner-migration --display-name="Spanner Migration" Created service account [spanner-migration].
Create an IAM policy binding so that the service account has permission to publish to Pub/Sub. Replace
GOOGLE_CLOUD_PROJECT
with the name of your Google Cloud project.gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \ --role roles/pubsub.publisher \ --member serviceAccount:spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
The following output appears:
$ gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \ --role roles/pubsub.publisher \ --member serviceAccount:spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com bindings: (...truncated...) - members: - serviceAccount:spanner-migration@solution-z.iam.gserviceaccount.com role: roles/pubsub.publisher
Create credentials for the service account.
gcloud iam service-accounts keys create credentials.json \ --iam-account spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
The following output appears:
$ gcloud iam service-accounts keys create credentials.json --iam-account spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com created key [5e559d9f6bd8293da31b472d85a233a3fd9b381c] of type [json] as [credentials.json] for [spanner-migration@your-project.iam.gserviceaccount.com]
Prepare and package the AWS Lambda function to push Amazon DynamoDB table changes to the Pub/Sub topic.
pip install --ignore-installed --target=lambda-deps google-cloud-pubsub==0.35 cd lambda-deps; zip -r9 ../pubsub-lambda.zip *; cd - zip -g pubsub-lambda.zip ddbpubsub.py
Create a variable to capture the Amazon Resource Name (ARN) of the Lambda execution role that you created earlier.
LAMBDA_ROLE=$(aws iam list-roles \ --query 'Roles[?RoleName==`dynamodb-spanner-lambda-role`].[Arn]' \ --output text)
Use the
pubsub-lambda.zip
package to create the AWS Lambda function.aws lambda create-function --function-name dynamodb-spanner-lambda \ --runtime python2.7 --role $LAMBDA_ROLE \ --handler ddbpubsub.lambda_handler --zip fileb://pubsub-lambda.zip \ --environment Variables="{SVCACCT=$(base64 -w 0 credentials.json),PROJECT=$GOOGLE_CLOUD_PROJECT,TOPIC=spanner-migration}"
The following output appears:
$ aws lambda create-function --function-name dynamodb-spanner-lambda \ > --runtime python2.7 --role $LAMBDA_ROLE \ > --handler ddbpubsub.lambda_handler --zip fileb://pubsub-lambda.zip \ > --environment Variables="{SVCACCT=$(base64 -w 0 credentials.json),PROJECT=$GOOGLE_CLOUD_PROJECT,TOPIC=spanner-migration}" { "FunctionName": "dynamodb-spanner-lambda", "LastModified": "2018-07-07T12:53:58.670+0000", "RevisionId": "e58e8408-cd3a-4155-a184-4efc0da80bfb", "MemorySize": 128, ... truncated output...
Create a variable to capture the ARN of the Amazon DynamoDB stream for your table.
STREAMARN=$(aws dynamodb describe-table \ --table-name Migration \ --query "Table.LatestStreamArn" \ --output text)
Attach the Lambda function to the Amazon DynamoDB table.
aws lambda create-event-source-mapping --event-source $STREAMARN \ --function-name dynamodb-spanner-lambda --enabled \ --starting-position TRIM_HORIZON
To optimize responsiveness during testing, add
--batch-size 1
to the end of the previous command, which triggers the function every time you create, update, or delete an item.The following output appears:
$ aws lambda create-event-source-mapping --event-source $STREAMARN \ > --function-name dynamodb-spanner-lambda --enabled --starting-position TRIM_HORIZON { "UUID": "44e4c2bf-493a-4ba2-9859-cde0ae5c5e92", "StateTransitionReason": "User action", "LastModified": 1530662205.549, "BatchSize": 100, "EventSourceArn": "arn:aws:dynamodb:us-west-2:accountid:table/Migration/stream/2018-07-03T15:09:57.725", "FunctionArn": "arn:aws:lambda:us-west-2:accountid:function:dynamodb-spanner-lambda", "State": "Creating", "LastProcessingResult": "No records processed" }
Export the Amazon DynamoDB table to Amazon S3
In Cloud Shell, create a variable for a bucket name that you use in several of the following sections.
BUCKET=$DEVSHELL_PROJECT_ID-dynamodb-spanner-export
Create an Amazon S3 bucket to receive the DynamoDB export.
aws s3 mb s3://$BUCKET
In the AWS Management Console, click Data Pipeline.
Click Create new pipeline to define the export job.
In the Name field, enter Export to Amazon S3.
For the Source, select the following:
- Build using a template.
- Export DynamoDB table to Amazon S3.
In the Parameters section, define the following:
- In the Source DynamoDB table name field, enter
Migration
. - In the Output S3 folder field, click the Folder icon and select
the
[Your-Project-ID]-dynamodb-spanner-export
Amazon S3 bucket that you just created where[YOUR-PROJECT-ID]
represents your Google Cloud project ID. - To consume all available read-capacity during the export, in the
DynamoDB read throughput ratio field, enter
1
. In a production environment, you adjust this value so that it doesn't hinder live operations. - In the Region of your DynamoDB table field, enter the name of the
region, for example,
us-west-2
.
- In the Source DynamoDB table name field, enter
To start the backup jobs immediately, in the Schedule section for Run, click On pipeline activation.
Under Pipeline Configuration, in the Logging field, enter
Disabled
. If you are following this guide to migrate a production table, leave this option enabled and pointed at a separate Amazon S3 bucket for logs to help you troubleshoot errors. Leave other default parameters.To begin the backup process, click Activate.
If you are prompted to address validation warnings, click Activate. In a production situation, you set a maximum duration for the job and enable logging.
Click Refresh to update the status of the backup process. The job takes several minutes to create the resources and finish exporting. In a production environment, you can speed this process up by modifying the Data Pipeline jobs to use more EMR resources.
When the process finishes, look at the output bucket.
aws s3 ls --recursive s3://$BUCKET
The export job is done when there is a file named
_SUCCESS
.$ aws s3 ls --recursive s3://$BUCKET 2018-06-30 13:08:11 3736518 2018-06-30-20-01-21/76b53eea-46d1-4293-ba51-11759f5c65fa 2018-06-30 13:08:20 0 2018-06-30-20-01-21/_SUCCESS 2018-06-30 13:08:20 178 2018-06-30-20-01-21/manifest
Opening the database
If you paused writes to the database before exporting, it's time to reactivate writes to the database. Now that the Pub/Sub delivery is in place, you can push forward any table changes that occurred after the export.
Copy the exported table to Cloud Storage
In Cloud Shell, create a Cloud Storage bucket to receive the exported files from Amazon S3.
gsutil mb gs://$BUCKET
Sync the files from Amazon S3 into Cloud Storage. For most copy operations, the
rsync
command is effective. If your export files are large (several GBs or more), use the Cloud Storage transfer service to manage the transfer in the background.gsutil rsync -d -r s3://$BUCKET gs://$BUCKET
The following output appears:
$ gsutil rsync -d -r s3://$BUCKET gs://$BUCKET Building synchronization state... Starting synchronization... Copying s3://project-dynamodb-spanner-export/2018-06-30-20-01-21/76b53eea-46d1-4293-ba51-11759f5c65fa [Content-Type=binary/octet-stream]... Copying s3://project-dynamodb-spanner-export/2018-06-30-20-01-21/_SUCCESS [Content-Type=binary/octet-stream]... Copying s3://project-dynamodb-spanner-export/2018-06-30-20-01-21/manifest [Content-Type=binary/octet-stream]... / [3 files][ 3.6 MiB/ 3.6 MiB] Operation completed over 3 objects/3.6 MiB.
Batch import the data
To write the data from the exported files into the Spanner table, run a Dataflow job with sample Apache Beam code.
cd dataflow mvn compile mvn exec:java \ -Dexec.mainClass=com.example.spanner_migration.SpannerBulkWrite \ -Dexec.args="--project=$GOOGLE_CLOUD_PROJECT \ --instanceId=spanner-migration \ --databaseId=migrationdb \ --table=Migration \ --importBucket=$BUCKET \ --runner=dataflow"
To watch the progress of the import job, in the Cloud Console, go to Dataflow.
While the job is running, you can watch the execution graph to examine the logs. Click the job that shows the Status of Running.
Click each stage to see how many elements have been processed. The import is complete when all stages say Succeeded. The same number of elements that were created in your Amazon DynamoDB table display as processed at each stage.
Verify that the number of records in the destination Spanner table matches the number of items in the Amazon DynamoDB table.
aws dynamodb describe-table --table-name Migration --query Table.ItemCount gcloud spanner databases execute-sql migrationdb \ --instance=spanner-migration --sql="select count(*) from Migration"
The following output appears:
$ aws dynamodb describe-table --table-name Migration --query Table.ItemCount 25000 $ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="select count(*) from Migration" 25000
Sample random entries in each table to make sure the data is consistent.
gcloud spanner databases execute-sql migrationdb \ --instance=spanner-migration \ --sql="select * from Migration limit 1"
The following output appears:
$ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="select * from Migration limit 1" Username PointsEarned ReminderDate Subscribed Zipcode aallen2538 1606 2018-06-18 False 17303
Query the Amazon DynamoDB table with the same
Username
that was returned from the Spanner query in the previous step. For example,aallen2538
. Your value is specific to your database.aws dynamodb get-item --table-name Migration \ --key '{"Username": {"S": "aallen2538"}}'
The values of the other fields should match those from the Spanner output. The following output appears:
$ aws dynamodb get-item --table-name Migration --key '{"Username": {"S": "aallen2538"}}' { "Item": { "Username": { "S": "aallen2538" }, "ReminderDate": { "S": "2018-06-18" }, "PointsEarned": { "N": "1606" }, "Zipcode": { "N": "17303" }, "Subscribed": { "BOOL": false } } }
Replicate new changes
When the batch import job is complete, you set up a streaming job to write ongoing updates from the source table into Spanner. You subscribe to the events from Pub/Sub and write them to Spanner
The Lambda function you created is configured to capture changes to the source Amazon DynamoDB table and publish them to Pub/Sub.
Create a subscription to the Pub/Sub topic that AWS Lambda sends events to.
gcloud pubsub subscriptions create spanner-migration \ --topic spanner-migration
The following output appears:
$ gcloud pubsub subscriptions create spanner-migration --topic spanner-migration Created subscription [projects/your-project/subscriptions/spanner-migration].
To stream the changes coming into Pub/Sub to write to the Spanner table, run the Dataflow job from Cloud Shell.
cd ~/dynamodb-spanner-migration/dataflow mvn exec:java \ -Dexec.mainClass=com.example.spanner_migration.SpannerStreamingWrite \ -Dexec.args="--project=$GOOGLE_CLOUD_PROJECT \ --instanceId=spanner-migration \ --databaseId=migrationdb \ --table=Migration \ --experiments=allow_non_updatable_job \ --subscription=projects/$GOOGLE_CLOUD_PROJECT/subscriptions/spanner-migration"
Similar to the batch load step, to watch the progress of the job, in the Cloud Console, go to Dataflow.
Click the job that has the Status of Running.
The processing graph shows a similar output as before, but each processed item is counted in the status window. The system lag time is a rough estimate of how much delay to expect before changes appear in the Spanner table.
The Dataflow job that you ran in the batch loading phase was a finite set of input, also known as a bounded dataset. This Dataflow job uses Pub/Sub as a streaming source and is considered unbounded. For more information about these two types of sources, review the section on PCollections in the Apache Beam programming guide. The Dataflow job in this step is meant to stay active, so it doesn't terminate when finished. The streaming Dataflow job remains in the Running status, instead of the Succeeded status.
Verify replication
You make some changes to the source table to verify that the changes are replicated to the Spanner table.
Query a nonexistent row in Spanner.
gcloud spanner databases execute-sql migrationdb \ --instance=spanner-migration --sql=\ "SELECT * FROM Migration WHERE Username='my-test-username'"
Create a record in Amazon DynamoDB with the same key that you used in the Spanner query. If the command runs successfully, there is no output.
aws dynamodb put-item \ --table-name Migration \ --item '{"Username" : {"S" : "my-test-username"}, "Subscribed" : {"BOOL" : false}}'
Run the original query again to verify that the row is now in Spanner.
gcloud spanner databases execute-sql migrationdb \ --instance=spanner-migration \ --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
The output shows the inserted row:
$ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="SELECT * FROM Migration WHERE Username='my-test-username'" Username PointsEarned ReminderDate Subscribed Zipcode my-test-username None None False
Change some attributes in the original item and update the Amazon DynamoDB table.
aws dynamodb update-item \ --table-name Migration \ --key '{"Username": {"S":"my-test-username"}}' \ --update-expression "SET PointsEarned = :pts, Subscribed = :sub" \ --expression-attribute-values '{":pts": {"N":"4500"}, ":sub": {"BOOL":true}}'\ --return-values ALL_NEW
The output appears as follows:
$ aws dynamodb update-item \ > --table-name Migration \ > --key '{"Username": {"S":"my-test-username"}}' \ > --update-expression "SET PointsEarned = :pts, Subscribed = :sub" \ > --expression-attribute-values '{":pts": {"N":"4500"}, ":sub": {"BOOL":true}}'\ > --return-values ALL_NEW { "Attributes": { "Username": { "S": "my-test-username" }, "PointsEarned": { "N": "4500" }, "Subscribed": { "BOOL": true } } }
Verify that the changes are propagated to the Spanner table.
gcloud spanner databases execute-sql migrationdb \ --instance=spanner-migration \ --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
The output appears as follows:
$ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="SELECT * FROM Migration WHERE Username='my-test-username'" Username PointsEarned ReminderDate Subscribed Zipcode my-test-username 4500 None True
Delete the test item from the Amazon DynamoDB source table.
aws dynamodb delete-item \ --table-name Migration \ --key '{"Username": {"S":"my-test-username"}}'
Verify that the corresponding row is deleted from the Spanner table. When the change is propagated, the following command returns zero rows:
gcloud spanner databases execute-sql migrationdb \ --instance=spanner-migration \ --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
Using interleaved tables
Spanner supports the concept of interleaving tables. This is a design model where a top-level item has several nested items that relate to that top-level item, such as a customer and their orders, or a player and their game scores. If your Amazon DynamoDB source table uses a primary key composed of a hash key and a range key, you can model an interleaved table schema as shown in the following diagram. This structure lets you efficiently query the interleaved table while joining fields in the parent table.
Apply secondary indexes
It is a best practice to apply secondary indexes to Spanner tables after you load the data. Now that replication is working, you set up a secondary index to speed up queries. Like Spanner tables, Spanner secondary indexes are fully consistent. They are not eventually consistent, which is common in many NoSQL databases. This feature can help simplify your app design
Run a query that doesn't use any indexes. You are looking for the top N occurrences, given a particular column value. This is a common query in Amazon DynamoDB for database efficiency.
Go to Spanner.
Click QUERY.
In the Query field, enter the following query, and then click Run query.
SELECT Username,PointsEarned FROM Migration WHERE Subscribed=true AND ReminderDate > DATE_SUB(DATE(current_timestamp()), INTERVAL 3 DAY)
After the query runs, click Explanation and take note of the Rows scanned versus Rows returned. Without an index, Spanner scans the entire table to return a small subset of data that matches the query.
If this represents a commonly occurring query, create a composite index on the Subscribed and ReminderDate columns. In the Spanner console, click Create Index.
Click to turn on Edit as Text.
In the DDL statements, enter the index definition.
CREATE INDEX SubscribedDate ON Migration ( Subscribed, ReminderDate )
To begin building the database in the background, click Create.
After the index is created, run the query again and add the index.
SELECT Username,PointsEarned FROM Migration@{FORCE_INDEX=SubscribedDate} WHERE Subscribed=true AND ReminderDate > DATE_SUB(DATE(current_timestamp()), INTERVAL 3 DAY)
Examine the query explanation again. Notice that the number of Rows scanned has decreased. The Rows returned at each step matches the number returned by the query.
Interleaved indexes
You can set up interleaved indexes in Spanner. The secondary indexes discussed in the previous section are at the root of the database hierarchy, and they use indexes the same way as a conventional database. An interleaved index is within the context of its interleaved row. See index options for more details about where to apply interleaved indexes.
Adjusting for your data model
In order to adapt the migration portion of this tutorial to your own situation, modify your Apache Beam source files. It is important that you don't change the source schema during the actual migration window, otherwise you can lose data.
To parse incoming JSON and build mutations, use GSON. Adjust the JSON definition to match your data.
Adjust the corresponding JSON mapping.
In the previous steps, you modified the Apache Beam source code for bulk import. Modify the source code for the streaming part of the pipeline in a similar manner. Finally, adjust the table creation scripts, schemas, and indexes of your Spanner target database.
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete AWS resources
If your AWS account is used outside of this tutorial, use caution when you delete the following resources:
- Delete the DynamoDB table called Migration.
- Delete the Amazon S3 bucket and Lambda function that you created during the migration steps.
- Finally, delete the AWS IAM user that you created during this tutorial.
What's next
- Read about how to optimize your Spanner schema.
- Learn how to use Dataflow for more complex situations.
- Review other Spanner how-to guides.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.