This page describes how to manage pipelines using source control in Cloud Data Fusion through Git repositories.
About Source Control Management
Cloud Data Fusion provides the capability to visually design pipelines for ETL and ELT integrations. For better management of pipelines between development and production, Cloud Data Fusion allows Source Control Management of the pipelines using GitHub.
The Source Control Management in Cloud Data Fusion lets you do the following:
- Integrate each Cloud Data Fusion namespace with GitHub.
- Manage your pipelines in a central Git repository.
- Review and audit pipeline changes.
- Revert pipeline changes.
- Effectively collaborate with the team while ensuring central control.
Before you begin
- Source Control Management only supports integration with GitHub repositories.
- GitHub OAuth isn't supported.
- Source Control Management only supports batch pipelines.
- Source Control Management only supports pipeline design JSONs for push and pull operations. Execution configurations are not supported.
- The size limit of the linked repository is 5 GB.
Required roles and permissions
Source Control Management in Cloud Data Fusion consists of two key operations:
- Configuring source control repositories
- Syncing pipelines with Git repositories using push and pull operations
To get the permissions that you need to use the Source Control Management feature, ask your administrator to grant you any of the following predefined roles on your project:
Configure source control repository:
- Cloud Data Fusion Operator (roles/datafusion.operator)
- Cloud Data Fusion Editor (roles/datafusion.editor)
- Cloud Data Fusion Admin (roles/datafusion.admin)
Sync pipelines using push or pull operation from a namespace:
- Cloud Data Fusion Operator (roles/datafusion.operator)
- Cloud Data Fusion Developer (roles/datafusion.developer)
- Cloud Data Fusion Editor (roles/datafusion.editor)
- Cloud Data Fusion Admin (roles/datafusion.admin)
For more information about granting roles, see Manage access.
You might also be able to get the required permissions through other predefined roles.
Set up a Git repository
To create a Git repository in GitHub, follow the instructions described in Create a repository.
For more information about personal access tokens in GitHub, see the following documents:
Connect a Git repository with Cloud Data Fusion
Cloud Data Fusion lets you configure and connect your Git repository in the Source Control Management tab for each namespace. To link a namespace with your Git repository, follow these steps:
Console
- In the Cloud Data Fusion web interface, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- Click Link repository.
Enter the following details:
- Provider: Choose a Git service provider. Select GitHub, because Source Control Management only supports integration with GitHub repositories.
- Repository URL: Enter the URL where your repository can be accessed. For GitHub, the repository URL is https://github.com/HOST/REPO.
- Default branch (optional): Enter the initial branch of the Git repository. This branch can be different from the default branch configured on GitHub. It is used to sync pipelines, regardless of the default branch on GitHub.
- Path prefix (optional): Enter a prefix for the pipeline name under which the pipeline is saved in the Git repository. For example, if your pipeline name is DataFusionQuickStart and you specify the prefix as namespaceName, then the pipeline is saved as namespaceName/DataFusionQuickStart in the Git repository.
- Authentication type: Cloud Data Fusion uses a personal access token as the authentication type. This is auto-selected.
- Token name: Enter a name that can be associated with the token.
- Token: Enter the token provided by the GitHub repository.
- User name (optional): Enter a username or an owner for the token.
Click Validate. Wait for the connection to be verified.
When the configuration is complete, click Save and close to confirm the configuration.
REST API
Create a secret key in Cloud Data Fusion containing the personal access token.
Run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/securekeys/PASSWORD_SECRET_KEY -X PUT -d '{ "description": "Example Secure Key","data": "PERSONAL_ACCESS_TOKEN"}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace
- PASSWORD_SECRET_KEY: the name of the secret key containing the personal access token
- PERSONAL_ACCESS_TOKEN: the GitHub personal access token
Run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository -X PUT -d '{"test": "TEST_ONLY", "config": {"provider": "PROVIDER_TYPE", "link": "REPO_URL", "defaultBranch": "DEFAULT_BRANCH", "pathPrefix": "PATH_TO_DIRECTORY", "auth": {"type": "AUTH_TYPE", "patConfig": {"passwordName": "PASSWORD_SECRET_KEY", "username": "USER_NAME"}}}}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace
- TEST_ONLY: set to true if you want to only validate the configuration and not add it
- PROVIDER_TYPE: the Git provider name, that is, GITHUB
- REPO_URL: the repository URL to be linked. Use an https URL, for example, https://github.com/user/repo.git
- DEFAULT_BRANCH: the branch used for push and pull operations. If omitted, the default branch configured in the repository is used, for example, the main branch
- PATH_TO_DIRECTORY: the path to the directory in the repository where configuration files should be stored
- AUTH_TYPE: the authentication type. Only PAT is supported. See Fine-grained personal access token in GitHub
- PASSWORD_SECRET_KEY: the name of the secret key containing the personal access token for authentication type PAT
- USER_NAME: you can omit this value for authentication type PAT
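The request body for linking a repository can be assembled with a small shell helper instead of being typed by hand. The following is a minimal sketch, not part of the Cloud Data Fusion tooling: the function name build_repo_config and its argument order are hypothetical, the provider and authentication type are fixed to the only supported values (GITHUB and PAT), the optional username is omitted, and the values are assumed not to contain characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the JSON body for the repository PUT request.
# Arguments: TEST_ONLY REPO_URL DEFAULT_BRANCH PATH_TO_DIRECTORY PASSWORD_SECRET_KEY
# Assumption: no argument contains quotes or backslashes (no JSON escaping is done).
build_repo_config() {
  local test_only="$1" repo_url="$2" branch="$3" prefix="$4" secret="$5"
  printf '{"test": "%s", "config": {"provider": "GITHUB", "link": "%s", "defaultBranch": "%s", "pathPrefix": "%s", "auth": {"type": "PAT", "patConfig": {"passwordName": "%s"}}}}\n' \
    "$test_only" "$repo_url" "$branch" "$prefix" "$secret"
}

# Example: validate the configuration only (test set to true), then pass the
# output to the curl command shown earlier with -d "$(build_repo_config ...)".
build_repo_config true https://github.com/user/repo.git main pipelines my-pat-key
```

Running the helper first with TEST_ONLY set to true and then with false mirrors the Validate and Save steps in the console.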
Sync Cloud Data Fusion pipelines with a remote repository
After you configure a Git repository with a namespace, you can sync pipelines with the Git repository by using push and pull operations.
Push pipelines from Cloud Data Fusion to Git repository
To sync multiple deployed pipelines from a namespace to a Git repository, follow these steps:
Console
- In the Cloud Data Fusion web interface, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- Find the Git repository that you want to sync with, and click Sync pipelines.
- Click the Namespace pipelines tab.
Search for and select the pipelines that you want to push to the Git repository.
If the latest version of the pipeline is pushed to or pulled from the Git repository, the Connected to Git status shows Connected. If the pipeline has never been pushed to GitHub, the Connected to Git status shows blank (-).
If you deploy a newer version of a pipeline that is already synced with the Git repository, the Connected to Git status changes from Connected to blank (-).
Click Push to repository.
Enter a Commit message, and click OK.
The push operation starts and a message is displayed indicating that the selected pipelines are being pushed to the remote repository.
When the push operation is completed successfully, a success message is displayed indicating the number of pipelines that were pushed to the remote repository.
If the push operation fails, check the pipeline in GitHub to see if it's the latest version. For every failed push operation, an error message is displayed. To view the details of the error, expand the error message.
You can also push individual pipelines to a Git repository from the pipeline design studio:
- In the Cloud Data Fusion web interface, click Menu.
- Click List.
- Click the pipeline you want to push to the Git repository.
- On the pipeline page, click Actions > Push to repository.
- Enter a Commit message and click OK.
REST API
Push a set of pipelines from Cloud Data Fusion to the Git repository:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository/apps/push -X POST -d '{"apps": ["PIPELINE_NAME_1", "PIPELINE_NAME_2"], "commitMessage": "COMMIT_MESSAGE"}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace
- PIPELINE_NAME_1, PIPELINE_NAME_2: names of the pipelines to be pushed
- COMMIT_MESSAGE: commit message for the Git commit
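When the list of pipelines varies, the push request body can be generated rather than edited by hand. The following sketch assumes the body is a single JSON object with an apps array and a commitMessage field. The function name build_push_body is hypothetical, and pipeline names and the commit message are assumed not to contain characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the JSON body for the push request.
# Usage: build_push_body COMMIT_MESSAGE PIPELINE_NAME...
# Assumption: arguments contain no quotes or backslashes (no JSON escaping is done).
build_push_body() {
  local message="$1"
  shift
  local apps="" name
  for name in "$@"; do
    # Append each name as a quoted JSON string, comma-separated after the first.
    apps="${apps:+$apps, }\"$name\""
  done
  printf '{"apps": [%s], "commitMessage": "%s"}\n' "$apps" "$message"
}

# Example: body for pushing two pipelines with one commit message.
build_push_body "Update pipelines" DataFusionQuickStart SalesPipeline
```

The output can be passed to the push command with -d "$(build_push_body ...)".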
The response contains the ID of the push operation. For example:
RESPONSE { "id": OPERATION_ID }
To poll the status of the push operation, run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/operations/OPERATION_ID
Replace the following:
- NAMESPACE_ID: the ID of the namespace
- OPERATION_ID: the operation ID received from the push operation
The response contains the status of the push operation. For example:
RESPONSE
{
  "id": OPERATION_ID,
  "done": True/False,
  "status": STARTING/RUNNING/SUCCEEDED/FAILED,
  "error": {"message": ERROR_MESSAGE, "details": [{"resourceUri": RESOURCE, "message": ERROR_MESSAGE}]}
}
To verify if the push operation is completed, check the done property in the response. If the operation failed, check the error property for more details.
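Polling can be wrapped in a small loop that repeats the status request until the done property is true. The following is a sketch under stated assumptions: the function names operation_done and wait_for_operation are hypothetical, the status request is supplied as a caller-defined function (for example, one that runs the curl command above), and the done property is assumed to be serialized as a JSON boolean. The match is case-insensitive to tolerate either spelling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the response text contains "done": true.
operation_done() {
  printf '%s\n' "$1" | grep -Eqi '"done"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical helper: calls FETCH_CMD until the operation is done.
# Usage: wait_for_operation FETCH_CMD [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 and prints the final response when done, 1 when attempts run out,
# 2 when FETCH_CMD itself fails.
wait_for_operation() {
  local fetch_cmd="$1" max_attempts="${2:-30}" delay="${3:-5}"
  local attempt=1 response
  while [ "$attempt" -le "$max_attempts" ]; do
    response="$("$fetch_cmd")" || return 2
    if operation_done "$response"; then
      printf '%s\n' "$response"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example wiring (fetch_status is a name chosen here for illustration):
# fetch_status() {
#   curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     "${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/operations/OPERATION_ID"
# }
# wait_for_operation fetch_status 30 5
```

Passing the request as a function keeps the loop independent of the endpoint, so the same loop can be reused for any operation ID.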
Pull pipelines from Git repository into Cloud Data Fusion
To sync multiple pipelines from a Git repository to your namespace, follow these steps:
Console
- In the Cloud Data Fusion web interface, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- Find the Git repository that you want to sync with, and click Sync pipelines.
- Click the Repository pipelines tab. All of the pipelines stored in the Git repository are displayed.
- Search for and select the pipelines that you want to pull from the Git repository into your Cloud Data Fusion namespace.
Click Pull from repository.
The pull operation starts and a message is displayed indicating that the selected pipelines are being pulled from the remote repository. Cloud Data Fusion looks for JSON files under the configured path, and pulls and deploys them as pipelines to Cloud Data Fusion.
When the pull operation is completed successfully, a success message is displayed indicating the number of pipelines that were pulled from the remote repository.
If the pull operation fails, an error message is displayed. To view the details of the error, expand the error message.
You can also pull individual pipelines from a Git repository to a namespace from the pipeline design studio:
- In the Cloud Data Fusion web interface, click Menu.
- Click List.
- Click the pipeline that you want to pull from the Git repository.
- On the pipeline page, click Actions > Pull from repository.
REST API
Pull a set of pipelines from the Git repository into Cloud Data Fusion:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository/apps/pull -X POST -d '{"apps": ["PIPELINE_NAME_1", "PIPELINE_NAME_2"]}'
Replace the following:
- NAMESPACE_ID: the ID of the namespace
- PIPELINE_NAME_1, PIPELINE_NAME_2: names of the pipelines to be pulled
The response contains the ID of the pull operation. For example:
RESPONSE { "id": OPERATION_ID }
To poll the status of the pull operation, run the following command:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/operations/OPERATION_ID
Replace the following:
- NAMESPACE_ID: the ID of the namespace
- OPERATION_ID: the operation ID received from the pull operation
The response contains the status of the pull operation. For example:
RESPONSE
{
  "id": OPERATION_ID,
  "done": True/False,
  "status": STARTING/RUNNING/SUCCEEDED/FAILED,
  "error": {"message": ERROR_MESSAGE, "details": [{"resourceUri": RESOURCE, "message": ERROR_MESSAGE}]}
}
To verify if the pull operation is completed, check the done property in the response. If the operation failed, check the error property for more details.
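In a script, the final response can be turned into an exit code so that a failed pull stops later steps. The following sketch assumes the status property is serialized as a quoted uppercase string such as "SUCCEEDED". The function names operation_status and check_operation are hypothetical, and the text matching is a lightweight stand-in for a full JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extracts the value of the "status" property.
# Assumption: the status is a quoted uppercase JSON string.
operation_status() {
  printf '%s\n' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([A-Z]*\)".*/\1/p'
}

# Hypothetical helper: maps the status to an exit code.
# 0 = SUCCEEDED, 1 = FAILED (response echoed to stderr so the error
# property can be inspected), 2 = still in progress or unrecognized.
check_operation() {
  case "$(operation_status "$1")" in
    SUCCEEDED) return 0 ;;
    FAILED)
      printf 'Operation failed: %s\n' "$1" >&2
      return 1
      ;;
    *) return 2 ;;
  esac
}

# Example: a finished, successful operation yields exit code 0.
check_operation '{"id": "abc", "done": true, "status": "SUCCEEDED"}'
```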
Delete the Git repository configuration
To delete the Git repository configuration from a namespace, follow these steps:
Console
- In the Cloud Data Fusion web interface, click Menu.
- Click Namespace admin.
- On the Namespace admin page, click the Source Control Management tab.
- For the Git repository configuration you want to delete, click > Delete.
REST API
Delete the Git repository configuration:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" ${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/repository -X DELETE
Replace NAMESPACE_ID with the ID of the namespace.
What's next
- Read more about Using a GitHub repository to manage pipelines.