About differences between legacy Dataform and Dataform in Google Cloud
Dataform is a serverless service for data analysts to develop and deploy tables, incremental tables, and views in BigQuery. Dataform provides a web environment for SQL workflow development; connections to GitHub, GitLab, Bitbucket, and Azure DevOps Services; continuous integration and continuous deployment; and workflow execution.
Dataform in Google Cloud is different from legacy Dataform in the following ways:
- Dataform in Google Cloud supports connection of Dataform repositories to Bitbucket repositories.
- Access control is based on IAM.
- Configuration of a query concurrency limit (`concurrentQueryLimit`) in the workflow settings file is removed. In legacy Dataform, concurrency limits prevented Dataform from sending too many concurrent queries to BigQuery. To manage concurrency in Dataform in Google Cloud, we recommend enabling BigQuery query queues.
- Legacy environments are replaced by release configurations.
- Legacy schedules are replaced by workflow configurations.
- Workflow failure alerts are configured in Cloud Logging.
- Dataform in Google Cloud and legacy Dataform use different NPM versions and different formats of `package-lock.json`. To develop a SQL workflow in both legacy Dataform and Dataform in Google Cloud, use the legacy `package-lock.json` format for package installation. Don't install packages in Dataform in Google Cloud until you fully migrate to Dataform in Google Cloud.
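For illustration, a legacy workflow settings file (`dataform.json`) that sets a concurrency limit might look like the following sketch; the field values are hypothetical, and the `concurrentQueryLimit` field has no effect in Dataform in Google Cloud and should be removed:

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "my-gcp-project",
  "defaultSchema": "dataform",
  "concurrentQueryLimit": 10
}
```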
For more information about features of Dataform in Google Cloud, see Overview of Dataform features.
Legacy Dataform features not supported in Google Cloud at this time
The following features of legacy Dataform are not supported in Dataform in Google Cloud at this time:
- Manually running unit tests.
- Searching for file content in development workspaces.
This list will be continuously updated as new features of Dataform in Google Cloud are released.
Known limitations
Dataform in Google Cloud has the following known limitations:
- Dataform in Google Cloud runs on a plain V8 runtime and does not support additional capabilities and modules provided by Node.js. If your existing codebase requires any Node.js modules, you need to remove these dependencies.
- Projects without a `name` field in `package.json` generate diffs on `package-lock.json` every time packages are installed. To avoid this, add a `name` property in `package.json`.
- `git+https://` URLs for dependencies in `package.json` are not supported. Convert such URLs to plain `https://` archive URLs. For example, convert `git+https://github.com/dataform-co/dataform-segment.git#1.5` to `https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz`.
- As of Dataform core 3.0.0, Dataform doesn't distribute a Docker image. You can build your own Docker image of Dataform, which you can use to run the equivalent of Dataform CLI commands. To build your own Docker image, see Containerize an application in the Docker documentation.
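As a sketch, a minimal Dockerfile along these lines could package the Dataform CLI. The base image and Node version are assumptions, not an official image definition:

```dockerfile
# Minimal sketch of a custom Dataform CLI image (not an official image).
FROM node:18-slim

# Install the Dataform CLI globally.
RUN npm install -g @dataform/cli

# Run CLI commands against a project mounted at /workspace.
WORKDIR /workspace
ENTRYPOINT ["dataform"]
```

You could then run, for example, `docker run --rm -v "$(pwd)":/workspace IMAGE compile` against a local Dataform project.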
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the BigQuery and Dataform APIs.
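If you prefer the command line, the two APIs can also be enabled with the gcloud CLI; PROJECT_ID is a placeholder for your project ID:

```shell
# Enable the BigQuery and Dataform APIs for the given project.
gcloud services enable bigquery.googleapis.com dataform.googleapis.com \
    --project=PROJECT_ID
```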
Required roles
To get the permissions that you need to import a legacy project, ask your administrator to grant you the Dataform Admin (`roles/dataform.admin`) IAM role on repositories.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
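As a sketch, an administrator could grant this role at the project level with the gcloud CLI; PROJECT_ID and USER_EMAIL are placeholders:

```shell
# Grant the Dataform Admin role to a user on the project.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/dataform.admin"
```

Granting the role on individual repositories instead of the whole project follows the principle of least privilege.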
Import a legacy project
To import a legacy project in Dataform in Google Cloud, follow these steps in the Google Cloud console:
- Ensure that your Dataform project in `app.dataform.co` is connected to GitHub or GitLab.
- In the Google Cloud console, go to the Dataform page.
- Connect the repository to the remote Git repository that houses your legacy project.
Configure your imported Dataform project
To adjust your legacy project to Dataform in Google Cloud, follow these steps:
- In the Google Cloud console, go to the Dataform page.
- Select your repository.
- Go to the development workspace.
- In your workflow settings file, specify a default location.

  In `workflow_settings.yaml`, add the `defaultLocation` parameter in the following format:

      defaultLocation: DATASET_LOCATION

  In `dataform.json`, add the `defaultLocation` parameter in the following format:

      "defaultLocation": "DATASET_LOCATION",

  Replace DATASET_LOCATION with the default location of your BigQuery dataset, for example, `US`, `EU`, or `us-east1`. The `defaultLocation` parameter is ignored by `app.dataform.co`.
- Delete `package-lock.json`.
- In `package.json`, do the following:
  - Upgrade `@dataform/core` to `3.0.0-beta.2` or later. Add a package name in the following format:

        {
          "name": "PACKAGE_NAME",
          "dependencies": {
            "@dataform/core": "^3.0.0-beta.2"
          }
        }

    Replace PACKAGE_NAME with a name for your Dataform package, for example, your project name.
  - Convert `git+https://` URLs in `package.json` dependencies to plain `https://` archive URLs. For example, convert `git+https://github.com/dataform-co/dataform-segment.git#1.5` to `https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz`. If you are using `git+https://` URLs in prebuilt Dataform packages, check the updated installation instructions for these packages on their release pages, for example, the dataform-segment release page.
- Configure BigQuery permissions and user permissions.
- Migrate environments from `environments.json` to release configurations.
- Migrate schedules from `environments.json` to workflow configurations.
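After these steps, a minimal migrated workflow settings file might look like the following sketch; the project and dataset names are hypothetical placeholders:

```yaml
# Hypothetical example of a migrated workflow_settings.yaml.
dataformCoreVersion: 3.0.0
defaultProject: my-gcp-project           # placeholder Google Cloud project ID
defaultLocation: US                      # default BigQuery dataset location
defaultDataset: dataform                 # placeholder default dataset
defaultAssertionDataset: dataform_assertions
```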
What's next
- To learn how to migrate legacy environments and schedules to Dataform in Google Cloud, see Migrate legacy environments and schedules.
- To learn more about Dataform in Google Cloud, see Dataform overview.
- To learn more about features of Dataform in Google Cloud, see Overview of Dataform features.
- To learn how to create a repository, see Create a Dataform repository.
- To learn about code lifecycle in Dataform and ways to configure it, see Introduction to code lifecycle in Dataform.