Migrate from legacy Dataform

Legacy Dataform will be deprecated on February 26, 2024, after which you won't be able to access legacy projects. This document describes differences between legacy Dataform and Dataform in Google Cloud, and shows you how to import a legacy Dataform project into Dataform in Google Cloud.

About differences between legacy Dataform and Dataform in Google Cloud

Dataform is a serverless service for data analysts to develop and deploy tables, incremental tables, and views in BigQuery. Dataform offers a web environment for SQL workflow development; connection with GitHub, GitLab, Bitbucket, and Azure DevOps Services; continuous integration and continuous deployment; and workflow execution.

Dataform in Google Cloud is different from legacy Dataform in the following ways:

  • Dataform in Google Cloud supports connection of Dataform repositories to Bitbucket repositories.
  • Access control is based on IAM.
  • Configuration of a query concurrency limit (concurrentQueryLimit) in dataform.json is removed.

    In legacy Dataform, concurrency limits prevented Dataform from sending too many concurrent queries to BigQuery. To manage concurrency in Dataform in Google Cloud, we recommend enabling BigQuery query queues.

  • Legacy environments are replaced by release configurations.

  • Legacy schedules are replaced by workflow configurations.

  • Workflow failure alerts are configured in Cloud Logging.

  • Dataform in Google Cloud and legacy Dataform use different NPM versions and different formats of package-lock.json.

    To develop a SQL workflow in both legacy Dataform and Dataform in Google Cloud, use the legacy package-lock.json format for package installation. Don't install packages in Dataform in Google Cloud until you fully migrate to Dataform in Google Cloud.
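The query-queue recommendation above can be put into practice by setting a target job concurrency on a BigQuery reservation, which query queues then enforce. The following bq CLI sketch uses a hypothetical administration project and reservation name; adjust the values to your own setup:

```
# Hypothetical project, location, and reservation name.
# The target job concurrency caps concurrent queries, replacing
# legacy Dataform's concurrentQueryLimit.
bq update \
    --project_id=my-admin-project \
    --location=US \
    --reservation \
    --target_job_concurrency=10 \
    my-reservation
```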

For more information about features of Dataform in Google Cloud, see Overview of Dataform features.

Legacy Dataform features not supported in Google Cloud at this time

The following features of legacy Dataform are not supported in Dataform in Google Cloud at this time:

  • Manually running unit tests.
  • Searching for file content in development workspaces.

This list will be continuously updated as new features of Dataform in Google Cloud are released.

Known limitations

Dataform in Google Cloud has the following known limitations:

  • Dataform in Google Cloud runs on a plain V8 runtime and does not support additional capabilities and modules provided by Node.js. If your existing codebase requires any Node.js modules, you need to remove these dependencies.

    Projects without a name field in package.json generate diffs on package-lock.json every time packages are installed. To avoid this, you need to add a name property in package.json.

  • git+https:// URLs for dependencies in package.json are not supported.

    Convert such URLs to plain https:// archive URLs. For example, convert git+https://github.com/dataform-co/dataform-segment.git#1.5 to https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz.
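Because only plain V8 is available, JavaScript in your project (for example, files in includes/) must avoid Node.js built-ins such as fs, path, or crypto. A minimal sketch of a self-contained includes file; the file name and function are hypothetical:

```javascript
// includes/date_utils.js -- hypothetical helper written in plain
// JavaScript so it runs in Dataform's V8 sandbox. It uses no
// require() calls and no Node.js built-in modules.
function monthStart(dateString) {
  // Returns the first day of the month for a "YYYY-MM-DD" string,
  // e.g. "2024-02-26" -> "2024-02-01".
  return dateString.slice(0, 8) + "01";
}

module.exports = { monthStart };
```

Helpers like this stay portable between legacy Dataform and Dataform in Google Cloud because they rely only on the JavaScript language itself.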

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery and Dataform APIs.

    Enable the APIs


Required roles

To get the permissions that you need to import a legacy project, ask your administrator to grant you the Dataform Admin (roles/dataform.admin) IAM role on repositories. For more information about granting roles, see Manage access.
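A project-level grant can be made with the gcloud CLI, as in the following sketch (the project ID and user are placeholders); repository-level grants are made in the Google Cloud console or through the Dataform API:

```
# Grants the Dataform Admin role at the project level.
# Replace my-project and the member with your own values.
gcloud projects add-iam-policy-binding my-project \
    --member="user:analyst@example.com" \
    --role="roles/dataform.admin"
```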

You might also be able to get the required permissions through custom roles or other predefined roles.

Import a legacy project

To import a legacy project in Dataform in Google Cloud, follow these steps in the Google Cloud console:

  1. Ensure that your Dataform project in app.dataform.co is connected to GitHub or GitLab.
  2. In the Google Cloud console, go to the Dataform page.

    Go to the Dataform page

  3. Create a new repository.

  4. Connect the repository to the remote Git repository that houses your legacy project.

Configure your imported Dataform project

To adjust your legacy project to Dataform in Google Cloud, follow these steps:

  1. In the Google Cloud console, go to the Dataform page.

    Go to the Dataform page

  2. Select your repository.

  3. Create a development workspace.

  4. Go to the development workspace.

  5. In dataform.json, add the defaultLocation parameter. This parameter is ignored by app.dataform.co.

    "defaultLocation": "DATASET_LOCATION",
    

    Replace DATASET_LOCATION with the default location of your BigQuery dataset, for example, US, EU, or us-east1.

  6. Delete package-lock.json.

  7. In package.json, do the following:

    1. Upgrade @dataform/core to 3.0.0-beta.2 or later.
    2. Add a package name in the following format:

      {
          "name": "PACKAGE_NAME",
          "dependencies": {
              "@dataform/core": "^3.0.0-beta.2"
          }
      }
      

      Replace PACKAGE_NAME with a name for your Dataform package, for example, your project name.

    3. Convert git+https:// URLs in package.json dependencies to plain https:// archive URLs.

      For example, convert git+https://github.com/dataform-co/dataform-segment.git#1.5 to https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz.

      If you use git+https:// URLs for prebuilt Dataform packages, check the updated installation instructions on each package's release page, for example, the dataform-segment release page.

  8. Configure BigQuery permissions and user permissions.

  9. Migrate environments from environments.json to release configurations.

  10. Migrate schedules from environments.json to workflow configurations.

  11. Configure alerts using Cloud Logging.
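Taken together, steps 5 through 7 leave your configuration files looking roughly like the following dataform.json sketch. The project and dataset names are placeholders:

```
{
  "warehouse": "bigquery",
  "defaultDatabase": "my-gcp-project",
  "defaultSchema": "dataform",
  "assertionSchema": "dataform_assertions",
  "defaultLocation": "US"
}
```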
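For the alerting step, you can drive a log-based alerting policy from a Logging query. The following filter is a sketch that assumes Dataform writes workflow-invocation entries under the dataform.googleapis.com/Repository resource type; verify the exact resource type and fields against the log entries in your project before creating the policy:

```
resource.type="dataform.googleapis.com/Repository"
severity>=ERROR
```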

What's next