Use the edit functionality for easy management of CDF pipelines
Sanjana Sandeep
Software Engineer
As data pipelines become more complex and involve multiple team members, it can be challenging to keep track of changes, collaborate effectively, and deploy pipelines to different environments in a controlled manner.
The new pipeline edit feature that we introduced for Cloud Data Fusion (CDF) batch pipelines can help.
Better pipelines with versioning
A typical pipeline development process is iterative. You make small changes to a pipeline, test them on a small data set, then on production data, and then add further features step by step. Iterative design also matters for the Cloud Data Fusion user experience because it reduces the overhead of developing and testing pipelines. An ETL developer can improve a pipeline incrementally while keeping a full history of changes.
You can edit pipelines starting in Cloud Data Fusion version 6.9. When you edit a pipeline you've already deployed, you don't have to duplicate the pipeline and implement a versioning strategy across multiple pipelines. Instead, you edit a single pipeline and the versions are tracked for you. This improves productivity and removes the need to track how the various clones of a pipeline relate to one another.
Benefits of pipeline editing
The pipeline edit feature lets you do the following:
- Incrementally make changes to any part of the deployed pipeline, such as the pipeline structure, configuration, metadata, preferences, and comments.
- Export an edited JSON file for a deployed pipeline.
How is it different from the CDF duplicate pipeline feature?
- Duplicating a pipeline creates a new pipeline with a different name, while editing a pipeline creates a new version of the same pipeline. Editing prevents the proliferation of pipelines (as seen in the figure below) and allows for better organization.
- Editing maintains a history of all edited versions of the pipeline. Similar to the Google Docs experience, you can view older versions of the pipeline.
Before you begin
- You need a Cloud Data Fusion instance with version 6.9.1 or above.
Upgrading to 6.9.1 or above also unlocks Source Control Management with GitHub. You can refer to the blog here.
NOTE: The pipeline edit feature is supported only for CDF batch pipelines.
How to use this feature?
When you edit a pipeline, CDF creates a new draft. Once deployed, the draft becomes the latest version of the pipeline. (On upgraded instances, existing pipelines are upgraded to become the latest version of the pipeline.) The latest version retains the triggers, pipeline configurations, runtime arguments, metadata, comments, and schedules from the previous version. The latest version is the active version of the pipeline; that is, it can be run or scheduled to run.
To edit a deployed pipeline, follow these steps:
- Go to the pipeline that you want to edit and click Edit. You can access this option in the UI from both the pipeline studio page and the pipeline list page:
Edit through the pipeline studio page
Edit through the pipeline list page
- A new draft of the pipeline is created. Edit your pipeline and make the necessary changes. Optional: To finish editing the pipeline later, click Save. Draft statuses are displayed to mitigate concurrency issues (discussed below).
Edit Draft opens for changes
“In-Progress” editing status for the edit draft that is yet to be deployed.
- After you finish editing the pipeline, click Deploy. This opens the Enter Change Summary dialog box. Enter a description of the changes you made to the pipeline and click Deploy. A best practice is to enter a descriptive change summary, because it identifies the edit version.
Note: You must make changes to your pipeline draft in order to deploy it; otherwise, an error message is displayed.
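The draft and deploy cycle above can be pictured with a small conceptual model. This is only a sketch of the behavior described in this post, not how Cloud Data Fusion implements it: the `Pipeline` and `Version` classes, their fields, and the sample pipeline name and stages are all hypothetical.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Version:
    config: dict
    change_summary: str
    created_at: datetime


@dataclass
class Pipeline:
    name: str
    versions: list = field(default_factory=list)  # oldest first; last one is active

    @property
    def latest(self) -> Version:
        return self.versions[-1]

    def new_draft(self) -> dict:
        # An edit draft starts as a copy of the latest deployed configuration.
        return copy.deepcopy(self.latest.config)

    def deploy(self, draft: dict, change_summary: str) -> Version:
        # Mirrors the note above: a draft without changes cannot be deployed.
        if draft == self.latest.config:
            raise ValueError("no changes to deploy")
        self.versions.append(
            Version(copy.deepcopy(draft), change_summary, datetime.now(timezone.utc))
        )
        return self.latest


# Hypothetical pipeline with one deployed version.
pipeline = Pipeline(
    "orders_etl",
    [Version({"stages": ["source", "sink"]}, "Initial deploy", datetime.now(timezone.utc))],
)
draft = pipeline.new_draft()
draft["stages"].insert(1, "wrangler")
pipeline.deploy(draft, "Add Wrangler stage")

for number, version in enumerate(pipeline.versions, start=1):
    print(number, version.created_at.isoformat(), version.change_summary)
```

Each entry in the modeled history carries a creation date and a change summary, which is why a descriptive summary makes older versions easier to tell apart.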
View version history
The pipeline studio page now has a History button, which displays a list of edit versions and gives you access to previous edit versions of the pipeline. The only actions you can perform on an older edit version are view and restore. Older versions are identified by their creation date and change summary.
- When you click view, you are redirected to the older pipeline version on the studio page. Keep in mind that this is an inactive version that cannot be run or scheduled for runs.
You can go back to the latest version through the return to latest version link.
- When you click restore on an older version, that version becomes the latest active version, so you can run it or schedule runs on it. Note: You cannot restore the latest version.
Export older edit version
When you want to view or manipulate the JSON of an older pipeline version, you can export it locally. The edited JSON can be imported back into the pipeline edit draft.
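Once a version is exported, a short script can help you inspect it. The sketch below assumes the exported JSON keeps its stages under `config.stages`, each with a `name` and a `plugin` object; check this against your own export, since the layout is an assumption here. The helper names and the sample pipeline are hypothetical.

```python
import json


def load_export(path: str) -> dict:
    """Read an exported pipeline JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_stages(export: dict) -> list:
    """Return (stage name, plugin type, plugin name) for each stage.

    Assumes stages live under export["config"]["stages"]; missing keys
    yield None instead of raising, so partial exports can still be read.
    """
    stages = export.get("config", {}).get("stages", [])
    return [
        (
            stage.get("name"),
            stage.get("plugin", {}).get("type"),
            stage.get("plugin", {}).get("name"),
        )
        for stage in stages
    ]


# Hypothetical export content, inlined so the sketch runs without a file.
sample_export = {
    "name": "orders_etl",
    "config": {
        "stages": [
            {"name": "Read orders", "plugin": {"type": "batchsource", "name": "GCSFile"}},
            {"name": "Write orders", "plugin": {"type": "batchsink", "name": "BigQueryTable"}},
        ]
    },
}

for stage_name, plugin_type, plugin_name in summarize_stages(sample_export):
    print(f"{stage_name}: {plugin_type}/{plugin_name}")
```

For a real export, replace `sample_export` with `load_export("path/to/your-export.json")`.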
An orphaned edit draft
When a pipeline is deleted, all deployed versions of the pipeline are removed, but any edit drafts that are still open remain. Such a draft enters an orphaned status, because the associated pipeline has been removed and the draft no longer belongs to an existing pipeline. Deploying the draft creates a brand new pipeline and resolves the orphaned status.
An obsolete edit draft
When a newer version of the pipeline that you are currently editing becomes available, your changes are out of date. This happens when another user deploys the pipeline before you finish editing. The draft then enters the out of date/obsolete status.
Deployment is blocked, and you see an error message prompting you to manually reconcile your changes.
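One common way to get this behavior is an optimistic concurrency check: each draft remembers which version it started from, and deployment is refused if that is no longer the latest version. The sketch below illustrates the idea only. It is not taken from the Cloud Data Fusion code base, and every class and name in it is hypothetical.

```python
import copy
from dataclasses import dataclass


class OutOfDateDraftError(Exception):
    """Raised when a draft was created from a version that is no longer the latest."""


@dataclass
class Draft:
    based_on: int  # index of the version the draft was created from
    config: dict


class VersionedPipeline:
    def __init__(self, config: dict):
        self.versions = [copy.deepcopy(config)]

    @property
    def latest_version(self) -> int:
        return len(self.versions) - 1

    def edit(self) -> Draft:
        # A draft is a private copy of the latest version plus its version index.
        return Draft(self.latest_version, copy.deepcopy(self.versions[-1]))

    def deploy(self, draft: Draft) -> int:
        # Someone else deployed in the meantime: block instead of overwriting.
        if draft.based_on != self.latest_version:
            raise OutOfDateDraftError(
                f"draft is based on version {draft.based_on}, "
                f"latest is {self.latest_version}"
            )
        self.versions.append(copy.deepcopy(draft.config))
        return self.latest_version


pipeline = VersionedPipeline({"stages": ["source", "sink"]})
first_draft = pipeline.edit()   # user A starts editing
second_draft = pipeline.edit()  # user B starts editing the same version

first_draft.config["stages"].insert(1, "wrangler")
pipeline.deploy(first_draft)    # user A deploys first

second_draft.config["stages"].append("error_collector")
try:
    pipeline.deploy(second_draft)
except OutOfDateDraftError as err:
    print(f"Deployment blocked: {err}")
```

Blocking the second deployment rather than silently overwriting keeps the first user's changes intact and leaves the merge decision to a person.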
To manually reconcile your pipeline, click Export and Rebase in the prompt. This exports your current JSON draft locally and rebases the studio to the latest version, which resolves the out-of-date/obsolete status. The recommended approach is then to resolve the conflicts manually and import your changes back into the draft.
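When reconciling by hand, it can help to see which stages differ between your exported draft and the latest version. The following sketch compares two exports by stage name. It assumes stages are listed under `config.stages` with a unique `name` each, which you should verify against your own exports; the function names and sample data are hypothetical.

```python
def stages_by_name(export: dict) -> dict:
    """Index the stages of an exported pipeline by stage name."""
    return {stage["name"]: stage for stage in export.get("config", {}).get("stages", [])}


def diff_stages(draft_export: dict, latest_export: dict) -> dict:
    """Report stages that exist on one side only or differ between the two."""
    draft = stages_by_name(draft_export)
    latest = stages_by_name(latest_export)
    return {
        "only_in_draft": sorted(draft.keys() - latest.keys()),
        "only_in_latest": sorted(latest.keys() - draft.keys()),
        "changed": sorted(
            name for name in draft.keys() & latest.keys() if draft[name] != latest[name]
        ),
    }


# Hypothetical exports: your out-of-date draft and the newly deployed version.
my_draft = {
    "config": {
        "stages": [
            {"name": "Read orders", "plugin": {"properties": {"path": "gs://bucket/in"}}},
            {"name": "Clean orders", "plugin": {"properties": {}}},
        ]
    }
}
latest_version = {
    "config": {
        "stages": [
            {"name": "Read orders", "plugin": {"properties": {"path": "gs://bucket/in-v2"}}},
            {"name": "Write orders", "plugin": {"properties": {}}},
        ]
    }
}

for category, names in diff_stages(my_draft, latest_version).items():
    print(f"{category}: {names}")
```

The report only points at where the two versions diverge. Deciding which side to keep, and checking connections and other settings, remains a manual step.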
Learn more