Introducing Django ORM support for Cloud Spanner
Chris Kleinknecht
Software Engineer, Cloud Spanner
Today we're happy to announce beta support for Google Cloud Spanner in the Django ORM. The django-google-spanner package is a third-party database backend for Cloud Spanner, powered by the Cloud Spanner Python client library. The Django ORM is a powerful standalone component of the Django web framework that maps Python objects to relational data. It provides a nice Pythonic interface to the underlying database, and includes tools for automatically generating schema changes and managing schema version history. With this integration, Django applications can now take advantage of Cloud Spanner's high availability and consistency at scale.
We'll follow the Django tutorial below to create a new project and start writing data to Cloud Spanner. If you're already using Django with another database backend you can skip down to "Migrating an existing Django project to Cloud Spanner" for instructions on switching to Cloud Spanner.
Installation
To use `django-google-spanner`, you'll need a working Python installation and Django project. The library requires `Django~=2.2` and `Python>=3.6`. If you're new to Django, see the Django 2.2 getting started guide.
You'll also need an active Google Cloud project with the Cloud Spanner API enabled. For more details on getting started with Cloud Spanner see the Cloud Spanner getting started guide.
Django applications are typically configured to use a single database. If you're an existing Cloud Spanner customer, you should already have a database suitable for use with your Django application. If you don't already have a Cloud Spanner database, or want to start from scratch for a new Django application, you can create a new instance and database using the Google Cloud SDK:
To install the Cloud Spanner database backend package:
`django-google-spanner` provides a Django application named `django_spanner`. To use the Cloud Spanner database backend, this application needs to be the first entry in `INSTALLED_APPS` in your application's settings.py file:
The `django_spanner` application changes the default behavior of Django's AutoField so that it generates random (instead of automatically incrementing sequential) values. We do this to avoid a common anti-pattern in Cloud Spanner usage.
Configure the database engine by setting the project, instance, and database name:
To run your code locally during development and testing, you'll need to authenticate with Application Default Credentials, or set the GOOGLE_APPLICATION_CREDENTIALS environment variable to authenticate using a service account. This library delegates authentication to the Cloud Spanner Python client library. If you're already using this or another client library successfully, you shouldn't have to do anything new to authenticate from your Django application. For more information, see the client libraries documentation on setting up authentication.
Under the hood
Under the hood, `django-google-spanner` uses the Cloud Spanner Python client library, which communicates with Cloud Spanner via its gRPC API. The client library also manages Cloud Spanner session lifetimes, and provides sane request timeout and retry defaults.
To support the Django ORM, we added an implementation of the Python Database API Specification (or DB-API) to the client library in the `google.cloud.spanner_dbapi` package. This package handles Cloud Spanner database connections, provides a standard cursor for iterating over streaming results, and seamlessly retries queries and DML statements in aborted transactions. In the future we hope to use this package to support other libraries and ORMs that are compatible with the DB-API, including SQLAlchemy.
Django ships with a powerful schema version control system known as migrations. Each migration describes a change to a Django model that results in a schema change. Django tracks migrations in an internal `django_migrations` table, and includes tools for migrating data between schema versions and generating migrations automatically from an app's models. `django-google-spanner` provides backend support for Cloud Spanner by converting Django migrations into DDL statements – namely `CREATE TABLE` and `ALTER TABLE` – to be run at migration time.
Following the Django tutorial, let's see how the client library interacts with the Cloud Spanner API. The example that follows starts from the "Database setup" step of Tutorial 2, and assumes you've already created the `mysite` and `polls` apps from the first part of the tutorial.
After configuring database backend as described above, we can run the initial migrations for the project:
After running the migrations, we can see the tables and indexes Django created in the GCP Cloud Console:
Alternatively, we can inspect `information_schema.tables` to display the tables Django created using the Google Cloud SDK:
Note that this will display Spanner-internal tables too, including `SPANNER_SYS` and `INFORMATION_SCHEMA` tables. These are omitted in the example above.
We can confirm that the table schema in Cloud Spanner matches the DDL generated by the library with `sqlmigrate`. In this example we'll check the `polls` app, but the same steps apply for tables created by other the other apps in `INSTALLED_APPS` (e.g. `admin`, `auth`, `contenttypes`, and `sessions`):
Check the table schema by clicking the "SHOW EQUIVALENT DDL" link on the table detail page in the Cloud Console:
Now, following the Playing with the API section of the tutorial, let's create and modify some objects in the `polls` app and see how the changes are persisted in Cloud Spanner. In the example below, each code segment from the tutorial is followed by any SQL statements executed by Django, and a partial list of resulting Cloud Spanner API requests, including their arguments.
To see the generated SQL statements yourself, you can enable the `django.db.backends` logger.
Query the empty Questions table
This code results in the SQL statement:
And makes these Cloud Spanner API calls:
`google.spanner.v1.Spanner/CreateSession`. We create a new session on the first request to Cloud Spanner. Later requests will re-use this session by calling `google.spanner.v1.Spanner/GetSession` with the name of the session returned here. In the interest of space, we'll omit future calls to `GetSession` here.
Note that the request includes a few extra fields: `metadata`, which Cloud Spanner uses to track the database and session in use; and `retry` and `timeout` which – as you might expect – describe the request's retry and timeout behavior. Since these arguments aren't generally user-configurable, we'll also omit them below.
2. `google.spanner.v1.Spanner/ExecuteStreamingSql`. Since we don't need to modify data here, the operation happens in a non-blocking read-only transaction. We perform a strong read by default, which guarantees that we see the results of any transactions that were committed before this one started.
Create and save a new Question object
This code results in the SQL statement:
And makes these Cloud Spanner API calls:
`google.spanner.v1.Spanner/BeginTransaction`. On `model.save`, we start a blocking read-write transaction (note that the request payload includes `options: {read_write: {}}`). The request includes the session ID from above.
2. `google.spanner.v1.Spanner/ExecuteSql`. Run "INSERT INTO polls_question ..." inside the transaction. The request includes the transaction ID from above
3. `google.spanner.v1.Spanner/Commit`. Commit the "INSERT" transaction.
Modify an existing Question
This code results in the SQL statement:
And makes these Cloud Spanner API calls:
`google.spanner.v1.Spanner/BeginTransaction`. As in the last call to `model.save`, start another read-write transaction.
2. `google.spanner.v1.Spanner/ExecuteSql`. Run "UPDATE polls_question ...". Note that we're only executing the "UPDATE" statement this time, not "INSERT" since the row already exists.
3. `google.spanner.v1.Spanner/Commit`. Commit the "UPDATE" transaction.
4. `google.spanner.v1.Spanner/DeleteSession`. Finally, we can delete the session on exiting the shell.
Notice how much work goes on behind the scenes! Besides converting Django model operations into Cloud-Spanner-compliant SQL, the database driver is also responsible for managing sessions and transactions. As we saw in the example above, one call to `model.save` can result in many Cloud Spanner API calls.
Migrating an existing Django project to Cloud Spanner
To migrate a Django project from another database to Cloud Spanner, we can use Django's built-in support for multiple database connections. This feature allows us to connect to two databases at once; to read from one and write to another.
Suppose you want to move your application's data from SQLite to Cloud Spanner. Assuming the existing database connection is already configured as "default", we can add a second database connection to Cloud Spanner. We'll call this connection "spanner":
As in the tutorial, running `python manage.py migrate` will create tables and indexes for all models in the project. By default, `migrate` will run on all configured database connections, and generate DDL specific to each database backend. After running `migrate`, both databases should have equivalent schemas, but the new Cloud Spanner database will still be empty.
Since Django automatically generates the schema from the project's models, it's a good idea to check that the generated DDL follows Cloud Spanner best practices. You can adjust the project's models accordingly in a separate migration after copying data into Cloud Spanner.
There are several options for copying data into Cloud Spanner, including using HarbourBridge to import data from a PostgreSQL or MySQL database or Dataflow to import Avro files. Any option will work as long as the imported data matches the new schema, but the easiest (if not the fastest) way to copy data between databases is by using Django itself.
Consider the models we created in the tutorial. In this code snippet, we read all Questions and Choices from the SQLite database and then write them to Cloud Spanner:
For each row in each table in the existing database, we:
Read the row and store it in memory as a Django model object
Unset the primary key, and
Write it back to the new database, at which point it gets assigned a new primary key.
Note that we need to update foreign keys to use the newly-generated primary keys too. Also note that we call `question.choice_set.all()` before we change `question`'s primary key – otherwise the QuerySet would be evaluated lazily using the wrong key!
This is a naive example; meant to be easy to understand, but not necessarily fast. It makes a separate "SELECT … FROM polls_choice" query for each Question. Since we know ahead of time that we're going to read all Choices in the database, we can reduce this to a single query with `Choice.objects.all().select_related('question')`. In general, it should be possible to write your migration logic in a way that takes advantage of your project's schema, e.g. by using bulk_update instead of a separate request to write each row. This logic can take the form of a code snippet to be run in the Django shell (as above), a separate script, or a Django data migration.
After migrating from the old database to Cloud Spanner, you can remove the configuration for the old database connection and rename the Cloud Spanner connection to "default":
Limitations
This is a beta release, and not yet recommended for production use. There are no SLAs or technical support guarantees, and the feature set in the current release may not match the generally available version later. That said, we believe the library is stable and functional enough to be useful now, and we hope you'll consider using it and providing feedback.
Some Cloud Spanner features – like the NUMERIC data type – are not yet available, either in this library or the Cloud Spanner Python client library. Several Django database features are disabled, some because they're incompatible with the Cloud Spanner API. Django ships with a comprehensive test suite, for a detailed list of Django features that aren't yet supported by `python-spanner-django` see the list of Django tests that we skip.
Customers using the Cloud Spanner Emulator may see different behavior than the Cloud Spanner service, not least because the emulator doesn't support concurrent transactions. See the Cloud Spanner Emulator documentation for a list of limitations and differences from the Cloud Spanner service.
For now the library supports Django version 2.2.x, though we're planning to support 3.x in the future. The minimum required Python version is 3.6.
Getting involved
We'd love to hear from you, especially if you're using the Cloud Spanner Python client library with a Django application now, or if you're an existing Cloud Spanner customer who is considering using Django for new projects. The project is open source, you can comment, report bugs, and open pull requests on Github.
Thanks to Emmanuel Odeke and Tim Graham for their early work on the project, Maxim Factourovich, Ilya Gurov, and Hemang Chothani for their work on the DB-API implementation and beta readiness, and everyone else who's contributed so far or tested the library and provided early feedback.
See also
Cloud Spanner product documentation, including how-to guides
Django 2.2 tutorial, followed in detail above