This page describes how to manage pipelines using source control in Cloud Data Fusion through Git repositories.
About Source Control Management
Cloud Data Fusion provides the capability to visually design pipelines for ETL and ELT integrations. For better management of pipelines between development and production, Cloud Data Fusion allows Source Control Management of the pipelines using GitHub and other version control systems.
Source Control Management in Cloud Data Fusion lets you do the following:
- Integrate each Cloud Data Fusion namespace with a version control system.
- Manage your pipelines in a central Git repository.
- Review and audit pipeline changes.
- Revert pipeline changes.
- Effectively collaborate with the team while ensuring central control.
Before you begin
- Source Control Management supports integration with GitHub, Bitbucket Server, Bitbucket Cloud, and GitLab repositories.
- GitHub OAuth isn't supported.
- Source Control Management only supports batch pipelines.
- Source Control Management only supports pipeline design JSONs for push and pull operations. Execution configurations are not supported.
- The size limit of the linked repository is 5 GB.
Required roles and permissions
Source Control Management in Cloud Data Fusion consists of two key operations:
- Configuring source control repositories
- Syncing pipelines with Git repositories using push and pull operations
To get the permissions that you need to use the Source Control Management feature, ask your administrator to grant you any of the following predefined roles on your project:
Configure source control repository:
- Cloud Data Fusion Operator (roles/datafusion.operator)
- Cloud Data Fusion Editor (roles/datafusion.editor)
- Cloud Data Fusion Admin (roles/datafusion.admin)
Sync pipelines using push or pull operation from a namespace:
- Cloud Data Fusion Operator (roles/datafusion.operator)
- Cloud Data Fusion Developer (roles/datafusion.developer)
- Cloud Data Fusion Editor (roles/datafusion.editor)
- Cloud Data Fusion Admin (roles/datafusion.admin)
For more information about granting roles, see Manage access.
You might also be able to get the required permissions through other predefined roles.
Set up a Git repository
To create a Git repository in GitHub, follow the instructions described in Create a repository.
For more information about personal access tokens in GitHub and other version control systems, see the documentation for your Git provider.
Connect a Git repository with Cloud Data Fusion
Cloud Data Fusion lets you configure and connect your Git repository in the Source Control Management tab for each namespace. To link a namespace with your Git repository, follow these steps:
Console
- In the Cloud Data Fusion Studio, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- Click Link repository.
Enter the following details:
- Provider: Choose a Git service provider, such as GitHub or GitLab.
- Repository URL: Enter the URL where your repository can be accessed. For GitHub, the repository URL is https://github.com/HOST/REPO.
- Default branch (optional): Enter the initial branch of the Git repository. This branch can be different from the default branch configured on GitHub. This branch is used to sync pipelines, regardless of the default branch on GitHub.
- Path prefix (optional): Enter a prefix for your pipeline name that's saved in the Git repository. For example, if your pipeline name is DataFusionQuickStart and you specify the prefix as namespaceName, then the pipeline is saved as namespaceName/DataFusionQuickStart in the Git repository.
- Authentication type: Cloud Data Fusion lets you use a personal access token as the authentication type. This is auto-selected.
- Token name: Enter a name that can be associated with the token.
- Token: Enter the token provided by the GitHub repository.
- Optional: User name: Enter a username or an owner for the token.
Click Validate. Wait for the connection to be verified.
When the configuration is complete, click Save and close to confirm the configuration.
REST API
Create a secret key in Cloud Data Fusion containing the personal access token.
Run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/securekeys/PASSWORD_SECRET_KEY -X PUT -d '{ "description": "Example Secure Key","data": "PERSONAL_ACCESS_TOKEN"}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace.
- PASSWORD_SECRET_KEY: the name of the secret key containing the personal access token.
- PERSONAL_ACCESS_TOKEN: the personal access token for GitHub.
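The same request can be assembled from a script. The following Python sketch only builds the request object and doesn't send it. `build_secure_key_request` is a hypothetical helper name, and the endpoint, key name, and both tokens are placeholders that you supply (the access token could come from `gcloud auth print-access-token`).

```python
import json
import urllib.request


def build_secure_key_request(endpoint, namespace_id, key_name,
                             personal_access_token, access_token):
    """Build (but do not send) the PUT request that stores a token as a secure key."""
    url = f"{endpoint}/v3/namespaces/{namespace_id}/securekeys/{key_name}"
    # Same body shape as the curl command: a description plus the token itself.
    body = {"description": "Example Secure Key", "data": personal_access_token}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`.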
To link the repository to the namespace, run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository -X PUT -d '{"test": "TEST_ONLY", "config": {"provider": "PROVIDER_TYPE", "link": "REPO_URL", "defaultBranch": "DEFAULT_BRANCH", "pathPrefix": "PATH_TO_DIRECTORY", "auth": {"type": "AUTH_TYPE", "patConfig": {"passwordName": "PASSWORD_SECRET_KEY", "username": "USER_NAME"}}}}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace.
- TEST_ONLY: set to true if you only want to validate the configuration without adding it.
- PROVIDER_TYPE: the Git provider name, for example, GITHUB.
- REPO_URL: the URL of the repository to link. Use an https URL, for example, https://github.com/user/repo.git.
- DEFAULT_BRANCH: the branch used for push and pull operations. If omitted, the default branch configured in the repository is used, for example, the main branch.
- PATH_TO_DIRECTORY: the path to the directory in the repository where configuration files are stored.
- AUTH_TYPE: the authentication type. Only PAT is supported. See Fine-grained personal access token in GitHub.
- PASSWORD_SECRET_KEY: the name of the secret key containing the personal access token for authentication type PAT.
- USER_NAME: you can omit this value for authentication type PAT.
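Because several of these fields are optional, it can help to generate the request body instead of editing the JSON by hand. The following is a minimal sketch, assuming that optional fields can be left out of the body when they aren't set. `build_repository_config` is a hypothetical helper, not part of any Cloud Data Fusion client library.

```python
import json


def build_repository_config(provider, link, password_name, default_branch=None,
                            path_prefix=None, username=None, test_only=False):
    """Return the JSON body for PUT /v3/namespaces/NAMESPACE_ID/repository."""
    pat_config = {"passwordName": password_name}
    if username is not None:
        pat_config["username"] = username

    # Only PAT authentication is supported, so the type is fixed here.
    config = {
        "provider": provider,
        "link": link,
        "auth": {"type": "PAT", "patConfig": pat_config},
    }
    if default_branch is not None:
        config["defaultBranch"] = default_branch
    if path_prefix is not None:
        config["pathPrefix"] = path_prefix

    # The curl template quotes the TEST_ONLY value, so it is sent as a string here too.
    return json.dumps({"test": "true" if test_only else "false", "config": config})
```

Calling the helper with `test_only=True` produces a body that only validates the configuration, which mirrors the Validate button in the console.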
Sync Cloud Data Fusion pipelines with a remote repository
After you configure a Git repository with a namespace, you can sync pipelines with the Git repository by using push and pull operations.
Push pipelines from Cloud Data Fusion to Git repository
To sync multiple deployed pipelines from a namespace to a Git repository, follow these steps:
Console
- In the Cloud Data Fusion Studio, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- Find the Git repository that you want to sync with, and click Sync pipelines.
- Click the Namespace pipelines tab.
Search for and select the pipelines that you want to push to the Git repository.
If the latest version of the pipeline is pushed to or pulled from the Git repository, the Connected to Git status shows Connected. If the pipeline has never been pushed to GitHub, the Connected to Git status shows blank (-).
If you deploy a newer version of a pipeline that is already synced with the Git repository, the Connected to Git status changes from Connected to blank (-).
Click Push to repository.
Enter a Commit message, and click OK.
The push operation starts and a message is displayed indicating that the selected pipelines are being pushed to the remote repository.
When the push operation is completed successfully, a success message is displayed indicating the number of pipelines that were pushed to the remote repository.
If the push operation fails, check the pipeline in GitHub to see if it's the latest version. For every failed push operation, an error message is displayed. To view the details of the error, expand the error message.
You can also push individual pipelines to a Git repository from the pipeline design studio:
- In the Cloud Data Fusion Studio, click Menu.
- Click List.
- Click the pipeline you want to push to the Git repository.
- On the pipeline page, click Actions > Push to repository.
- Enter a Commit message and click OK.
REST API
Push a set of pipelines from Cloud Data Fusion to the Git repository:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository/apps/push -X POST -d '{"apps": ["PIPELINE_NAME_1", "PIPELINE_NAME_2"], "commitMessage": "COMMIT_MESSAGE"}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace.
- PIPELINE_NAME_1, PIPELINE_NAME_2: the names of the pipelines to be pushed.
- COMMIT_MESSAGE: the commit message for the Git commit.
The response contains the ID of the push operation. For example:
RESPONSE { "id": OPERATION_ID }
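When scripting the push, the request body and the returned operation ID can be handled with a couple of small helpers. The following sketch assumes that the response is a JSON object with an `id` field, as in the example response. Both function names are hypothetical.

```python
import json


def build_push_body(pipeline_names, commit_message):
    """Return the JSON body for POST .../repository/apps/push."""
    # Both keys belong to the same top-level JSON object.
    return json.dumps({"apps": list(pipeline_names), "commitMessage": commit_message})


def extract_operation_id(response_text):
    """Return the operation ID from a push response such as {"id": "..."}."""
    return json.loads(response_text)["id"]
```

The extracted ID is the value to substitute for OPERATION_ID when you check the status of the operation.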
To poll the status of the push operation, run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/operations/OPERATION_ID
Replace the following:
- NAMESPACE_ID: the ID of the namespace.
- OPERATION_ID: the operation ID received from the push operation.
The response contains the status of the push operation. For example:
RESPONSE
{
  "id": OPERATION_ID,
  "done": True/False,
  "status": STARTING/RUNNING/SUCCEEDED/FAILED,
  "error": {"message": ERROR_MESSAGE, "details": [{"resourceUri": RESOURCE, "message": ERROR_MESSAGE}]}
}
To verify if the push operation is completed, check the done property in the response. If the operation failed, check the error property for more details.
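Checking the done property repeatedly is a plain polling loop. The following is a minimal sketch, assuming that the status response decodes to a dictionary in which `done` is a boolean and a failed operation reports the status `FAILED`. `fetch_status` is a hypothetical caller-supplied function that performs the GET request shown above and returns the decoded JSON. The interval and poll limit are arbitrary defaults.

```python
import time


def wait_for_operation(fetch_status, poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Poll until the operation reports done, then return the final response.

    fetch_status performs the status request and returns the response as a dict.
    """
    for _ in range(max_polls):
        response = fetch_status()
        if response.get("done"):
            if response.get("status") == "FAILED":
                # Surface the top-level error message from the response.
                error = response.get("error") or {}
                raise RuntimeError(error.get("message", "operation failed"))
            return response
        sleep(poll_seconds)
    raise TimeoutError("operation did not finish within the polling budget")
```

Passing the HTTP call in as a function keeps the loop independent of the HTTP client and lets you exercise it with canned responses.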
Pull pipelines from Git repository into Cloud Data Fusion
To sync multiple pipelines from a Git repository to your namespace, follow these steps:
Console
- In the Cloud Data Fusion Studio, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- Find the Git repository that you want to sync with, and click Sync pipelines.
- Click the Repository pipelines tab. All of the pipelines stored in the Git repository are displayed.
- Search for and select the pipelines that you want to pull from the Git repository into your Cloud Data Fusion namespace.
Click Pull from repository.
The pull operation starts and a message is displayed indicating that the selected pipelines are being pulled from the remote repository. Cloud Data Fusion looks for JSON files under the configured path, and pulls and deploys them as pipelines to Cloud Data Fusion.
When the pull operation is completed successfully, a success message is displayed indicating the number of pipelines that were pulled from the remote repository.
If the pull operation fails, an error message is displayed. To view the details of the error, expand the error message.
You can also pull individual pipelines from a Git repository to a namespace from the pipeline design studio:
- In the Cloud Data Fusion Studio, click Menu.
- Click List.
- Click the pipeline that you want to pull from the Git repository.
- On the pipeline page, click Actions > Pull from repository.
REST API
Pull a set of pipelines from the Git repository into Cloud Data Fusion:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository/apps/pull -X POST -d '{"apps": ["PIPELINE_NAME_1", "PIPELINE_NAME_2"]}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace.
- PIPELINE_NAME_1, PIPELINE_NAME_2: the names of the pipelines to be pulled.
The response contains the ID of the pull operation. For example:
RESPONSE { "id": OPERATION_ID }
To poll the status of the pull operation, run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/operations/OPERATION_ID
Replace the following:
- NAMESPACE_ID: the ID of the namespace.
- OPERATION_ID: the operation ID received from the pull operation.
The response contains the status of the pull operation. For example:
RESPONSE
{
  "id": OPERATION_ID,
  "done": True/False,
  "status": STARTING/RUNNING/SUCCEEDED/FAILED,
  "error": {"message": ERROR_MESSAGE, "details": [{"resourceUri": RESOURCE, "message": ERROR_MESSAGE}]}
}
To verify if the pull operation is completed, check the done property in the response. If the operation failed, check the error property for more details.
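When several pipelines are pulled at once, the error property can carry one entry per failed resource. The following sketch flattens it into readable lines. It assumes the response has already been decoded into a dictionary with the fields shown in the example response, and `summarize_operation_error` is a hypothetical helper name.

```python
def summarize_operation_error(response):
    """Return one line for the overall error, then one line per failed resource."""
    error = response.get("error")
    if not error:
        return []
    lines = [error.get("message", "unknown error")]
    for detail in error.get("details") or []:
        # Each detail pairs a resource with its own error message.
        lines.append(f"{detail.get('resourceUri', '?')}: {detail.get('message', '')}")
    return lines
```

An empty list means that the response carried no error, so the result can be used directly in a condition.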
Delete the Git repository configuration
To delete the Git repository configuration from a namespace, follow these steps:
Console
- In the Cloud Data Fusion Studio, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- For the Git repository configuration that you want to delete, open its menu and click Delete.
REST API
Delete the Git repository configuration:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository -X DELETE
Replace NAMESPACE_ID with the ID of the namespace.
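For scripted cleanup, the same call can be prepared in Python. This sketch builds the DELETE request without sending it. `build_unlink_request` is a hypothetical helper name, and the endpoint and access token are placeholders that you supply.

```python
import urllib.request


def build_unlink_request(endpoint, namespace_id, access_token):
    """Build (but do not send) the DELETE request that removes the repository configuration."""
    url = f"{endpoint}/v3/namespaces/{namespace_id}/repository"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {access_token}"},
    )
```

Building the request separately from sending it lets you inspect the method and URL before anything is deleted.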
What's next
- Read more about Using a GitHub repository to manage pipelines.