Migrating from DynamoDB to Cloud Spanner

This tutorial describes how to migrate from Amazon DynamoDB to Cloud Spanner. It is primarily intended for app owners who want to move from a NoSQL system to Cloud Spanner, a fully relational, fault-tolerant, highly scalable SQL database system that supports transactions. If you have consistent Amazon DynamoDB table usage, in terms of types and layout, mapping to Cloud Spanner is straightforward. If your Amazon DynamoDB tables contain arbitrary data types and values, it might be simpler to move to other NoSQL services, such as Cloud Datastore or Firebase.

This tutorial assumes that you are familiar with database schemas, data types, the fundamentals of NoSQL, and relational database systems. The tutorial relies on running predefined tasks to perform an example migration. After the tutorial, you can modify the provided code and steps to match your environment.

The following architectural diagram outlines the components used in the tutorial to migrate data:

Architectural diagram of migration components

Objectives

  • Migrate data from Amazon DynamoDB to Cloud Spanner.
  • Create a Cloud Spanner database and migration table.
  • Map a NoSQL schema to a relational schema.
  • Create and export a sample dataset that uses Amazon DynamoDB.
  • Transfer data between Amazon S3 and Cloud Storage.
  • Use Cloud Dataflow to load data into Cloud Spanner.

Costs

This tutorial uses the following billable components of Google Cloud Platform:

  • Cloud Spanner
  • Cloud Pub/Sub
  • Cloud Storage
  • Cloud Dataflow

Cloud Spanner charges are based on the number of node-hours and the amount of data stored during the monthly billing cycle. During the tutorial, you use a minimal configuration of these resources, which are cleaned up at the end. For real-world scenarios, estimate your throughput and storage requirements, and then use the Cloud Spanner instances documentation to determine the number of nodes that you need.

In addition to GCP resources, this tutorial uses the following Amazon Web Services (AWS) resources:

  • Amazon EMR
  • AWS Lambda
  • Amazon S3
  • Amazon DynamoDB

These services are only needed during the migration process. At the end of the tutorial, follow the instructions to clean up all resources to prevent unnecessary charges. Use the AWS pricing calculator to estimate these costs.

You can use the pricing calculator to generate a cost estimate based on your projected usage. New GCP users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Verify that billing is enabled for your GCP project.

    Learn how to enable billing

  4. Enable the Cloud Spanner, Cloud Pub/Sub, Compute Engine, and Cloud Dataflow APIs.

    Enable the APIs

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Preparing your environment

In this tutorial, you run commands in Cloud Shell. Cloud Shell gives you access to the command line in GCP, and includes the Cloud SDK and other tools that you need for GCP development. Cloud Shell can take several minutes to initialize.

  1. Activate Cloud Shell.

    ACTIVATE Cloud Shell

  2. Set the default Compute Engine zone. For example, us-central1-b.

    gcloud config set compute/zone us-central1-b
    
  3. Clone the GitHub repository containing the sample code.

    git clone https://github.com/GoogleCloudPlatform/dynamodb-spanner-migration.git
    
  4. Go to the cloned directory.

    cd dynamodb-spanner-migration
    
  5. Create a Python virtual environment.

    virtualenv --python python2 env
    
  6. Activate the virtual environment.

    source env/bin/activate
    
  7. Install the required Python modules.

    pip install -r requirements.txt
    

Configuring AWS access

In this tutorial, you create and delete Amazon DynamoDB tables, Amazon S3 buckets, and other resources. To access these resources, you first need to create the required AWS Identity and Access Management (IAM) permissions. You can use a test or sandbox AWS account to avoid affecting production resources in the same account.

Create an AWS IAM role for AWS Lambda

In this section, you create an AWS IAM role that AWS Lambda uses at a later step in the tutorial.

  1. In the AWS console, go to the IAM section, click Roles, and then select Create role.
  2. Under Choose the service that will use this role, click Lambda, and then select Next: Permissions.
  3. In the Policy Type box, enter AWSLambdaDynamoDBExecutionRole.
  4. Select the AWSLambdaDynamoDBExecutionRole checkbox, and then click Next: Review.
  5. In the Role name box, enter dynamodb-spanner-lambda-role, and then click Create role.

Create an AWS IAM user

Follow these steps to create an AWS IAM user with programmatic access to AWS resources, which are used throughout the tutorial.

  1. While you are still in the IAM section of the AWS console, click Users, and then select Add User.
  2. In the User name box, enter dynamodb-spanner-migration.
  3. Under Access type, click Programmatic access.

  4. Click Next: Permissions.

  5. Click Attach existing policies directly and select the following two policies:

    • AmazonDynamoDBFullAccesswithDataPipeline
    • AmazonS3FullAccess
  6. Click Next: Review, and then click Create user.

  7. Click Show to view the credentials. The access key ID and secret access key are displayed for the newly created user. Leave this window open for now because the credentials are needed in the following section. Safely store these credentials because with them, you can make changes to your account and affect your environment. At the end of this tutorial, you can delete the IAM user.

Configure AWS command-line interface

  1. In Cloud Shell, configure the AWS Command Line Interface (CLI).

    aws configure
    

    The following output appears:

    $ aws configure
    AWS Access Key ID [None]: PASTE_YOUR_ACCESS_KEY_ID
    AWS Secret Access Key [None]: PASTE_YOUR_SECRET_ACCESS_KEY
    Default region name [None]: us-west-2
    Default output format [None]:
    user@project:~/dynamodb-spanner$
    
    • Enter the ACCESS KEY ID and SECRET ACCESS KEY from the AWS IAM account that you created.
    • In the Default region name field, enter us-west-2. Leave other fields at their default values.
  2. Close the AWS IAM console window.
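
To confirm that the CLI is configured correctly before you move on, you can optionally run the following checks. They are not part of the original setup and use standard AWS CLI commands:

    # Optional check: confirm which AWS identity the CLI is using.
    aws sts get-caller-identity

    # Optional check: list Amazon DynamoDB tables. The list is empty until you
    # create the Migration table in a later section.
    aws dynamodb list-tables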

Understanding the data model

The following section outlines the similarities and differences between data types, keys, and indexes for Amazon DynamoDB and Cloud Spanner.

Data types

Cloud Spanner uses standard SQL data types. The following table describes how Amazon DynamoDB data types map to Cloud Spanner data types.

Amazon DynamoDB    Cloud Spanner
Number             Depending on precision or intended usage, might be mapped to INT64, FLOAT64, TIMESTAMP, or DATE.
String             STRING
Boolean            BOOL
Null               No explicit type. Columns can contain null values.
Binary             BYTES
Sets               ARRAY
Map and List       STRUCT, if the structure is consistent and can be described by using table DDL syntax.

Primary key

An Amazon DynamoDB primary key establishes uniqueness and can be either a hash key or a combination of a hash key plus a range key. This tutorial starts by modeling the migration of an Amazon DynamoDB table whose primary key is a hash key. This hash key becomes the primary key of your Cloud Spanner table. Later, in the section on interleaved tables, you model a situation where an Amazon DynamoDB table uses a primary key composed of a hash key and a range key.

Secondary indexes

Both Amazon DynamoDB and Cloud Spanner support creating an index on a non-primary key attribute. Take note of any secondary indexes in your Amazon DynamoDB table so that you can create them on your Cloud Spanner table, which is covered in a later section of this tutorial.

Sample table

To facilitate this tutorial, you migrate the following sample table from Amazon DynamoDB to Cloud Spanner:

                  Amazon DynamoDB         Cloud Spanner
Table name        Migration               Migration
Primary key       "Username" : String     "Username" : STRING(1024)
Key type          Hash                    n/a
Other fields      Zipcode: Number         Zipcode: INT64
                  Subscribed: Boolean     Subscribed: BOOL
                  ReminderDate: String    ReminderDate: DATE
                  PointsEarned: Number    PointsEarned: INT64

Preparing the Amazon DynamoDB table

In the following section, you create an Amazon DynamoDB source table and populate it with data.

  1. In Cloud Shell, create an Amazon DynamoDB table that uses the sample table attributes.

    aws dynamodb create-table --table-name Migration \
        --attribute-definitions AttributeName=Username,AttributeType=S \
        --key-schema AttributeName=Username,KeyType=HASH \
        --provisioned-throughput ReadCapacityUnits=75,WriteCapacityUnits=75
    
  2. Verify that the table status is ACTIVE.

    aws dynamodb describe-table --table-name Migration \
        --query 'Table.TableStatus'
    
  3. Populate the table with sample data.

    python make-fake-data.py --table Migration --items 25000
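
    To spot-check the load, you can optionally count the items in the table. This check is not part of the original steps; it uses a standard Amazon DynamoDB scan, which reads the whole table, so use it only on a small test dataset like this one.

    # Optional check: count the items written by make-fake-data.py.
    aws dynamodb scan --table-name Migration --select COUNT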
    

Creating a Cloud Spanner database

You create a single-node instance, which is appropriate for testing and the scope of this tutorial. For a production deployment, refer to the documentation for Cloud Spanner instances to determine the appropriate node count to meet your database performance requirements.

In this example, you create a table schema at the same time as the database. It is also possible, and common, to carry out schema updates after you create the database.

  1. Create a Cloud Spanner instance in the same region where you set the default Compute Engine zone. For example, us-central1.

    gcloud spanner instances create spanner-migration \
        --config=regional-us-central1 --nodes=1 \
        --description="Migration Demo"
    
  2. Create a database in the Cloud Spanner instance along with the sample table.

    gcloud spanner databases create migrationdb \
        --instance=spanner-migration \
        --ddl "CREATE TABLE Migration ( \
                Username STRING(1024) NOT NULL, \
                PointsEarned INT64, \
                ReminderDate DATE, \
                Subscribed BOOL, \
                Zipcode INT64, \
             ) PRIMARY KEY (Username)"
    

Pausing the database

The next sections show you how to export the Amazon DynamoDB source table and set Cloud Pub/Sub replication to capture any changes to the database that occur while you export it. If changes to your database aren't idempotent and it isn't safe to write the same data more than once, it is best to carry out the following steps during a maintenance period when you can pause app changes to the table.

Stream changes to Cloud Pub/Sub

You use an AWS Lambda function to stream database changes to Cloud Pub/Sub.

  1. In Cloud Shell, enable Amazon DynamoDB streams on your source table.

    aws dynamodb update-table --table-name Migration \
        --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
    
  2. Set up a Cloud Pub/Sub topic to receive the changes.

    gcloud pubsub topics create spanner-migration
    

    The following output appears:

    $ gcloud pubsub topics create spanner-migration
    Created topic [projects/your-project/topics/spanner-migration].
    
  3. Create a Cloud IAM service account to push table updates to the Cloud Pub/Sub topic.

    gcloud iam service-accounts create spanner-migration \
        --display-name="Spanner Migration"
    

    The following output appears:

    $ gcloud iam service-accounts create spanner-migration --display-name="Spanner Migration"
    Created service account [spanner-migration].
    
  4. Create a Cloud IAM policy binding so that the service account has permission to publish to Cloud Pub/Sub. Replace GOOGLE_CLOUD_PROJECT with the name of your GCP project.

    gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
        --role roles/pubsub.publisher \
        --member serviceAccount:spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
    

    The following output appears:

    $ gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
      --role roles/pubsub.publisher \
      --member serviceAccount:spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
    bindings:
    (...truncated...)
    - members:
      - serviceAccount:spanner-migration@solution-z.iam.gserviceaccount.com
      role: roles/pubsub.publisher
  5. Create credentials for the service account.

    gcloud iam service-accounts keys create credentials.json \
        --iam-account spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
    

    The following output appears:

    $ gcloud iam service-accounts keys create credentials.json --iam-account spanner-migration@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
    created key [5e559d9f6bd8293da31b472d85a233a3fd9b381c] of type [json] as [credentials.json] for [spanner-migration@your-project.iam.gserviceaccount.com]
  6. Prepare and package the AWS Lambda function to push Amazon DynamoDB table changes to the Cloud Pub/Sub topic.

    pip install --ignore-installed --target=lambda-deps google-cloud-pubsub==0.35
    cd lambda-deps; zip -r9 ../pubsub-lambda.zip *; cd -
    zip -g pubsub-lambda.zip ddbpubsub.py
  7. Create a variable to capture the Amazon Resource Name (ARN) of the Lambda execution role that you created earlier.

    LAMBDA_ROLE=$(aws iam list-roles \
        --query 'Roles[?RoleName==`dynamodb-spanner-lambda-role`].[Arn]' \
        --output text)
    
  8. Use the pubsub-lambda.zip package to create the AWS Lambda function.

    aws lambda create-function --function-name dynamodb-spanner-lambda \
        --runtime python2.7 --role $LAMBDA_ROLE \
        --handler ddbpubsub.lambda_handler --zip fileb://pubsub-lambda.zip \
        --environment Variables="{SVCACCT=$(base64 -w 0 credentials.json),PROJECT=$GOOGLE_CLOUD_PROJECT,TOPIC=spanner-migration}"
    

    The following output appears:

    $ aws lambda create-function --function-name dynamodb-spanner-lambda \
    >   --runtime python2.7 --role $LAMBDA_ROLE \
    >   --handler ddbpubsub.lambda_handler --zip fileb://pubsub-lambda.zip \
    >   --environment Variables="{SVCACCT=$(base64 -w 0 credentials.json),PROJECT=$GOOGLE_CLOUD_PROJECT,TOPIC=spanner-migration}"
    {
        "FunctionName": "dynamodb-spanner-lambda",
        "LastModified": "2018-07-07T12:53:58.670+0000",
        "RevisionId": "e58e8408-cd3a-4155-a184-4efc0da80bfb",
        "MemorySize": 128,
    ... truncated output...
  9. Create a variable to capture the ARN of the Amazon DynamoDB stream for your table.

    STREAMARN=$(aws dynamodb describe-table \
        --table-name Migration \
        --query "Table.LatestStreamArn" \
        --output text)
    
  10. Attach the Lambda function to the Amazon DynamoDB table.

    aws lambda create-event-source-mapping --event-source $STREAMARN \
        --function-name dynamodb-spanner-lambda --enabled \
        --starting-position TRIM_HORIZON
    
  11. To optimize responsiveness during testing, add --batch-size 1 to the end of the previous command, which triggers the function every time you create, update, or delete an item.

    The following output appears:

    $ aws lambda create-event-source-mapping --event-source $STREAMARN \
    >     --function-name dynamodb-spanner-lambda --enabled --starting-position TRIM_HORIZON
    {
        "UUID": "44e4c2bf-493a-4ba2-9859-cde0ae5c5e92",
        "StateTransitionReason": "User action",
        "LastModified": 1530662205.549,
        "BatchSize": 100,
        "EventSourceArn": "arn:aws:dynamodb:us-west-2:accountid:table/Migration/stream/2018-07-03T15:09:57.725",
        "FunctionArn": "arn:aws:lambda:us-west-2:accountid:function:dynamodb-spanner-lambda",
        "State": "Creating",
        "LastProcessingResult": "No records processed"
    }
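
    For reference, a sketch of the previous command with --batch-size 1 appended. Use it in place of, not in addition to, the command in the previous step, because an event source mapping for the same stream and function can only be created once.

    aws lambda create-event-source-mapping --event-source $STREAMARN \
        --function-name dynamodb-spanner-lambda --enabled \
        --starting-position TRIM_HORIZON --batch-size 1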
    

Export the Amazon DynamoDB table to Amazon S3

  1. In Cloud Shell, create a variable for a bucket name that you use in several of the following sections.

    BUCKET=$DEVSHELL_PROJECT_ID-dynamodb-spanner-export
    
  2. Create an Amazon S3 bucket to receive the DynamoDB export.

    aws s3 mb s3://$BUCKET
    
  3. In the AWS Management Console, click Data Pipeline.

  4. Click Create new pipeline to define the export job.

  5. In the Name field, enter Export to Amazon S3.

  6. For the Source, select the following:

    • Build using a template.
    • Export DynamoDB table to Amazon S3.
  7. In the Parameters section, define the following:

    1. In the Source DynamoDB table name field, enter Migration.
    2. In the Output S3 folder field, click the Folder icon and select the [YOUR-PROJECT-ID]-dynamodb-spanner-export Amazon S3 bucket that you just created, where [YOUR-PROJECT-ID] represents your GCP project ID.
    3. To consume all available read-capacity during the export, in the DynamoDB read throughput ratio field, enter 1. In a production environment, you adjust this value so that it doesn't hinder live operations.
    4. In the Region of your DynamoDB table field, enter the name of the region, for example, us-west-2.
  8. To start the backup jobs immediately, in the Schedule section for Run, click On pipeline activation.

  9. Under Pipeline Configuration, in the Logging field, select Disabled. If you are migrating a production table, leave logging enabled and pointed at a separate Amazon S3 bucket so that the logs can help you troubleshoot errors. Leave the other parameters at their default values.

  10. To begin the backup process, click Activate.

  11. If you are prompted to address validation warnings, click Activate. In a production situation, you set a maximum duration for the job and enable logging.

  12. Click Refresh to update the status of the backup process. The job takes several minutes to create the resources and finish exporting. In a production environment, you can speed this process up by modifying the Data Pipeline jobs to use more EMR resources.

    When the process finishes, look at the output bucket.

    aws s3 ls --recursive s3://$BUCKET
    

    The export job is done when there is a file named _SUCCESS.

    $ aws s3 ls --recursive s3://$BUCKET
    
    2018-06-30 13:08:11 3736518 2018-06-30-20-01-21/76b53eea-46d1-4293-ba51-11759f5c65fa
    2018-06-30 13:08:20       0 2018-06-30-20-01-21/_SUCCESS
    2018-06-30 13:08:20     178 2018-06-30-20-01-21/manifest
    

Opening the database

If you paused writes to the database before exporting, it's time to reactivate writes to the database. Now that the Cloud Pub/Sub delivery is in place, you can push forward any table changes that occurred after the export.

Copy the exported table to Cloud Storage

  1. In Cloud Shell, create a Cloud Storage bucket to receive the exported files from Amazon S3.

    gsutil mb gs://$BUCKET
    
  2. Sync the files from Amazon S3 into Cloud Storage. For most copy operations, the rsync command is effective. If your export files are large (several GBs or more), use the Cloud Storage transfer service to manage the transfer in the background.

    gsutil rsync -d -r s3://$BUCKET gs://$BUCKET
    

    The following output appears:

    $ gsutil rsync -d -r s3://$BUCKET gs://$BUCKET
    Building synchronization state...
    Starting synchronization...
    Copying s3://project-dynamodb-spanner-export/2018-06-30-20-01-21/76b53eea-46d1-4293-ba51-11759f5c65fa [Content-Type=binary/octet-stream]...
    Copying s3://project-dynamodb-spanner-export/2018-06-30-20-01-21/_SUCCESS [Content-Type=binary/octet-stream]...
    Copying s3://project-dynamodb-spanner-export/2018-06-30-20-01-21/manifest [Content-Type=binary/octet-stream]...
    / [3 files][  3.6 MiB/  3.6 MiB]
    Operation completed over 3 objects/3.6 MiB.
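
    For larger exports, you can also parallelize the copy with gsutil's top-level -m option before resorting to the Cloud Storage transfer service. A sketch:

    # -m runs the rsync operations in parallel, which helps when the export
    # consists of many files.
    gsutil -m rsync -d -r s3://$BUCKET gs://$BUCKET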
    

Batch import the data

  1. To write the data from the exported files into the Cloud Spanner table, run a Cloud Dataflow job with sample Apache Beam code.

    cd dataflow
    mvn compile
    mvn exec:java \
        -Dexec.mainClass=com.example.spanner_migration.SpannerBulkWrite \
        -Dexec.args="--project=$GOOGLE_CLOUD_PROJECT \
                     --instanceId=spanner-migration \
                     --databaseId=migrationdb \
                     --table=Migration \
                     --importBucket=$BUCKET \
                     --runner=dataflow"
    
    1. To watch the progress of the import job, in the GCP Console, go to Cloud Dataflow.

      GO TO Cloud Dataflow

    2. While the job is running, you can watch the execution graph to examine the logs. Click the job that shows the Status of Running.

      Running import job

  2. Click each stage to see how many elements have been processed. The import is complete when all stages say Succeeded. The number of elements processed at each stage matches the number of items that you created in your Amazon DynamoDB table.

    Stages of success of import job

  3. Verify that the number of records in the destination Cloud Spanner table matches the number of items in the Amazon DynamoDB table.

    aws dynamodb describe-table --table-name Migration --query Table.ItemCount
    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration --sql="select count(*) from Migration"

    The following output appears:

    $ aws dynamodb describe-table --table-name Migration --query Table.ItemCount
    25000
    $ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="select count(*) from Migration"
    25000
    
  4. Sample random entries in each table to make sure the data is consistent.

    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration \
        --sql="select * from Migration limit 1"
    

    The following output appears:

    $ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="select * from Migration limit 1"
    Username    PointsEarned  ReminderDate  Subscribed  Zipcode
    aallen2538  1606          2018-06-18    False       17303
    
  5. Query the Amazon DynamoDB table with the same Username that was returned from the Cloud Spanner query in the previous step. For example, aallen2538. Your value is specific to your database.

    aws dynamodb get-item --table-name Migration \
        --key '{"Username": {"S": "aallen2538"}}'
    

    The values of the other fields should match those from the Cloud Spanner output. The following output appears:

    $ aws dynamodb get-item --table-name Migration --key '{"Username": {"S": "aallen2538"}}'
    {
        "Item": {
            "Username": {
                "S": "aallen2538"
            },
            "ReminderDate": {
                "S": "2018-06-18"
            },
            "PointsEarned": {
                "N": "1606"
            },
            "Zipcode": {
                "N": "17303"
            },
            "Subscribed": {
                "BOOL": false
            }
        }
    }
    

Replicate new changes

When the batch import job is complete, you set up a streaming job to write ongoing updates from the source table into Cloud Spanner. You subscribe to the events from Cloud Pub/Sub and write them to Cloud Spanner.

The Lambda function you created is configured to capture changes to the source Amazon DynamoDB table and publish them to Cloud Pub/Sub.

  1. Create a subscription to the Cloud Pub/Sub topic that AWS Lambda sends events to.

    gcloud pubsub subscriptions create spanner-migration \
        --topic spanner-migration
    

    The following output appears:

    $ gcloud pubsub subscriptions create spanner-migration --topic spanner-migration
    Created subscription [projects/your-project/subscriptions/spanner-migration].
    
  2. To stream the changes coming into Cloud Pub/Sub to write to the Cloud Spanner table, run the Cloud Dataflow job from Cloud Shell.

    cd ~/dynamodb-spanner-migration/dataflow
    mvn exec:java \
        -Dexec.mainClass=com.example.spanner_migration.SpannerStreamingWrite \
        -Dexec.args="--project=$GOOGLE_CLOUD_PROJECT \
                     --instanceId=spanner-migration \
                     --databaseId=migrationdb \
                     --table=Migration \
                     --experiments=allow_non_updatable_job \
        --subscription=projects/$GOOGLE_CLOUD_PROJECT/subscriptions/spanner-migration"
    
    1. Similar to the batch load step, to watch the progress of the job, in the GCP Console, go to Cloud Dataflow.

      GO TO Cloud Dataflow

    2. Click the job that has the Status of Running.

      Running job

      The processing graph shows a similar output as before, but each processed item is counted in the status window. The system lag time is a rough estimate of how much delay to expect before changes appear in the Cloud Spanner table.

      Running processes due to lag time

The Cloud Dataflow job that you ran in the batch loading phase was a finite set of input, also known as a bounded dataset. This Cloud Dataflow job uses Cloud Pub/Sub as a streaming source and is considered unbounded. For more information about these two types of sources, review the section on PCollections in the Apache Beam programming guide. The Cloud Dataflow job in this step is meant to stay active, so it doesn't terminate when finished. The streaming Cloud Dataflow job remains in the Running status, instead of the Succeeded status.
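
When you finish the migration and cut over to Cloud Spanner, you eventually stop the streaming job yourself. A minimal sketch of how you might do that from Cloud Shell; the job ID comes from the list command or the Cloud Dataflow page:

    # List active Cloud Dataflow jobs to find the streaming job's ID.
    gcloud dataflow jobs list --status=active

    # Drain the job so it finishes processing buffered Cloud Pub/Sub messages
    # before stopping. Replace JOB_ID with the ID from the previous command,
    # and add --region if the job doesn't run in the default region.
    gcloud dataflow jobs drain JOB_ID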

Verify replication

You make some changes to the source table to verify that the changes are replicated to the Cloud Spanner table.

  1. Query a nonexistent row in Cloud Spanner.

    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration --sql=\
        "SELECT * FROM Migration WHERE Username='my-test-username'"
    
  2. Create a record in Amazon DynamoDB with the same key that you used in the Cloud Spanner query. If the command runs successfully, there is no output.

    aws dynamodb put-item \
        --table-name Migration \
        --item '{"Username" : {"S" : "my-test-username"}, "Subscribed" : {"BOOL" : false}}'
    
  3. Run the original query again to verify that the row is now in Cloud Spanner.

    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration \
        --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
    

    The output shows the inserted row:

    $ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
    Username          PointsEarned  ReminderDate  Subscribed  Zipcode
    my-test-username  None          None          False       None
  4. Change some attributes in the original item and update the Amazon DynamoDB table.

    aws dynamodb update-item \
        --table-name Migration \
        --key '{"Username": {"S":"my-test-username"}}' \
        --update-expression "SET PointsEarned = :pts, Subscribed = :sub" \
        --expression-attribute-values '{":pts": {"N":"4500"}, ":sub": {"BOOL":true}}'\
        --return-values ALL_NEW
    

    The output appears as follows:

    $ aws dynamodb update-item \
    >   --table-name Migration \
    >   --key '{"Username": {"S":"my-test-username"}}' \
    >   --update-expression "SET PointsEarned = :pts, Subscribed = :sub" \
    >   --expression-attribute-values '{":pts": {"N":"4500"}, ":sub": {"BOOL":true}}'\
    >   --return-values ALL_NEW
    {
        "Attributes": {
            "Username": {
                "S": "my-test-username"
            },
            "PointsEarned": {
                "N": "4500"
            },
            "Subscribed": {
                "BOOL": true
            }
        }
    }
    
  5. Verify that the changes are propagated to the Cloud Spanner table.

    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration \
        --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
    

    The output appears as follows:

    $ gcloud spanner databases execute-sql migrationdb --instance=spanner-migration --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
    Username          PointsEarned  ReminderDate  Subscribed  Zipcode
    my-test-username  4500          None          True        None
  6. Delete the test item from the Amazon DynamoDB source table.

    aws dynamodb delete-item \
        --table-name Migration \
        --key '{"Username": {"S":"my-test-username"}}'
    
  7. Verify that the corresponding row is deleted from the Cloud Spanner table. When the change is propagated, the following command returns zero rows:

    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration \
        --sql="SELECT * FROM Migration WHERE Username='my-test-username'"
    

Using interleaved tables

Cloud Spanner supports the concept of interleaving tables. This is a design model where a top-level item has several nested items that relate to that top-level item, such as a customer and their orders, or a player and their game scores. If your Amazon DynamoDB source table uses a primary key composed of a hash key and a range key, you can model an interleaved table schema as shown in the following diagram. This structure lets you efficiently query the interleaved table while joining fields in the parent table.

User table compared to orders table
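
For example, if your source table also had a range key such as an order ID, a hypothetical child table interleaved in Migration might be declared as follows. This is a sketch only; the Orders table and its columns are not part of this tutorial's dataset.

    gcloud spanner databases ddl update migrationdb \
        --instance=spanner-migration \
        --ddl="CREATE TABLE Orders ( \
                Username STRING(1024) NOT NULL, \
                OrderId INT64 NOT NULL, \
                OrderTotal FLOAT64 \
              ) PRIMARY KEY (Username, OrderId), \
                INTERLEAVE IN PARENT Migration ON DELETE CASCADE"

The hash key of the source table becomes the leading primary-key column of the child table, and the range key becomes the second column.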

Apply secondary indexes

It is a best practice to apply secondary indexes to Cloud Spanner tables after you load the data. Now that replication is working, you set up a secondary index to speed up queries. Like Cloud Spanner tables, Cloud Spanner secondary indexes are fully consistent. They are not eventually consistent, which is common in many NoSQL databases. This feature can help simplify your app design.

Run a query that doesn't use any indexes. You are looking for rows that match particular column values, a common query pattern that Amazon DynamoDB serves with a secondary index for efficiency.

  1. Go to Cloud Spanner.

    GO TO Cloud Spanner

  2. Click QUERY.

    Query button

  3. In the Query field, enter the following query, and then click Run query.

    SELECT Username,PointsEarned FROM Migration
      WHERE Subscribed=true AND
      ReminderDate > DATE_SUB(DATE(current_timestamp()), INTERVAL 3 DAY)
    

    After the query runs, click Explanation and take note of the Rows scanned versus Rows returned. Without an index, Cloud Spanner scans the entire table to return a small subset of data that matches the query.

    Rows scanned compared to rows returned

  4. If this represents a commonly occurring query, create a composite index on the Subscribed and ReminderDate columns. In the Cloud Spanner console, click Create Index.

  5. Click to turn on Edit as Text.

  6. In the DDL statements, enter the index definition.

    CREATE INDEX SubscribedDate
    ON Migration (
      Subscribed,
      ReminderDate
    )
    
  7. To begin building the database in the background, click Create.

    Schema update in progress

  8. After the index is created, run the query again and add the index.

    SELECT Username,PointsEarned FROM
    Migration@{FORCE_INDEX=SubscribedDate}
      WHERE Subscribed=true AND
      ReminderDate > DATE_SUB(DATE(current_timestamp()), INTERVAL 3 DAY)
    

    Examine the query explanation again. Notice that the number of Rows scanned has decreased. The Rows returned at each step matches the number returned by the query.

    Explanation of query
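
If you prefer to work from Cloud Shell instead of the console, the following sketch applies the same index DDL as in the steps above with gcloud and then reruns the query with the index directive:

    # Create the composite index (same DDL as in the console steps above).
    gcloud spanner databases ddl update migrationdb \
        --instance=spanner-migration \
        --ddl="CREATE INDEX SubscribedDate ON Migration (Subscribed, ReminderDate)"

    # Rerun the query, forcing the new index.
    gcloud spanner databases execute-sql migrationdb \
        --instance=spanner-migration \
        --sql="SELECT Username,PointsEarned FROM Migration@{FORCE_INDEX=SubscribedDate} WHERE Subscribed=true AND ReminderDate > DATE_SUB(DATE(current_timestamp()), INTERVAL 3 DAY)"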

Interleaved indexes

You can set up interleaved indexes in Cloud Spanner. The secondary indexes discussed in the previous section are at the root of the database hierarchy, and they use indexes the same way as a conventional database. An interleaved index is within the context of its interleaved row. See index options for more details about where to apply interleaved indexes.
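
As an illustration only, building on the hypothetical Orders table sketched earlier, an interleaved index could look like the following. It is stored with each parent row rather than at the root of the hierarchy, and its first key column must be the parent key:

    gcloud spanner databases ddl update migrationdb \
        --instance=spanner-migration \
        --ddl="CREATE INDEX OrdersByTotal ON Orders (Username, OrderTotal), INTERLEAVE IN Migration"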

Adjusting for your data model

To adapt the migration portion of this tutorial to your own situation, modify your Apache Beam source files. It is important that you don't change the source schema during the actual migration window; otherwise, you can lose data.

  1. To parse incoming JSON and build mutations, use GSON. Adjust the JSON definition to match your data.

    // Top-level record in the exported JSON. Each attribute is wrapped in a
    // class that mirrors the Amazon DynamoDB type descriptor ("S", "N", "BOOL").
    public static class Item implements Serializable {
        private Username Username;
        private PointsEarned PointsEarned;
        private Subscribed Subscribed;
        private ReminderDate ReminderDate;
        private Zipcode Zipcode;
    
    }
    
    public static class Username implements Serializable {
        private String s;
    
    }
    
    public static class PointsEarned implements Serializable {
        private String n;
    
    }
    
    public static class Subscribed implements Serializable {
        private String bOOL;
    
    }
    
    public static class ReminderDate implements Serializable {
        private String s;
    
    }
    
    public static class Zipcode implements Serializable {
        private String n;
    
    }
  2. Adjust the corresponding JSON mapping.

    mutation.set("Username").to(item.Username.s);
    
    Optional.ofNullable(item.Zipcode).ifPresent(x->{
        mutation.set("Zipcode").to(Integer.parseInt(x.n));
    });
    
    Optional.ofNullable(item.Subscribed).ifPresent(x->{
        mutation.set("Subscribed").to(Boolean.parseBoolean(x.bOOL));
    });
    
    Optional.ofNullable(item.ReminderDate).ifPresent(x->{
        mutation.set("ReminderDate").to(Date.parseDate(x.s));
    });
    
    Optional.ofNullable(item.PointsEarned).ifPresent(x->{
        mutation.set("PointsEarned").to(Integer.parseInt(x.n));
    });

In the previous steps, you modified the Apache Beam source code for bulk import. Modify the source code for the streaming part of the pipeline in a similar manner. Finally, adjust the table creation scripts, schemas, and indexes of your Cloud Spanner target database.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

Delete the project

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
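
Alternatively, you can delete the project from Cloud Shell. This is a standard gcloud command, shown here as an optional shortcut; it prompts for confirmation before shutting the project down.

    gcloud projects delete $GOOGLE_CLOUD_PROJECT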

Delete AWS resources

If your AWS account is used outside of this tutorial, use caution when you delete the following resources:

  1. Delete the DynamoDB table called Migration.
  2. Delete the Amazon S3 bucket and Lambda function that you created during the migration steps.
  3. Finally, delete the AWS IAM user that you created during this tutorial.
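
If you prefer the AWS CLI, the following sketch covers most of the AWS cleanup. It assumes the same Cloud Shell session and variables used earlier. Deleting the IAM user additionally requires removing its access keys and detaching its policies first, which you can do in the AWS console, and the Data Pipeline export job can also be deleted there.

    # Delete the source table; this also removes its stream.
    aws dynamodb delete-table --table-name Migration

    # Remove the export bucket and its contents.
    aws s3 rb s3://$BUCKET --force

    # Delete the Lambda function that published changes to Cloud Pub/Sub.
    aws lambda delete-function --function-name dynamodb-spanner-lambda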

What's next
