Data lineage is a Dataflow feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Each pipeline that you run by using Dataflow has several associated data assets. The lineage of a data asset includes its origin, what happens to it, and where it moves over time. With data lineage, you can track the end-to-end movement of your data assets, from origin to eventual destination.
When you enable data lineage for your Dataflow jobs, Dataflow captures lineage events and publishes them to the Dataplex Universal Catalog Data Lineage API.
Before you begin
Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
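Enable the Dataplex, BigQuery, and Data lineage APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Caution: Data lineage is enabled on a per-project basis, not a per-service basis. After you enable the Data Lineage API, lineage information is automatically reported for multiple Google Cloud Platform services in the project, depending on their product-level lineage control. In Dataflow, you also need to enable lineage at the job level.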
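Required roles
To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:
Dataplex Catalog viewer (roles/dataplex.catalogViewer) on the Dataplex Universal Catalog resource project
Data Lineage Viewer (roles/datalineage.viewer) on the project where you use Dataflow
Dataflow viewer (roles/dataflow.viewer) on the project where you use Dataflow
You might also be able to get the required permissions through custom roles or other predefined roles.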
Support and limitations
Data lineage in Dataflow has the following limitations:
Data lineage is supported in Apache Beam SDK versions 2.63.0 and later.
You must enable data lineage on a per-job basis.
Data capture isn't instantaneous. It can take a few minutes for Dataflow job lineage data to appear in Dataplex Universal Catalog.
The following sources and sinks are supported:
Apache Kafka
BigQuery
Bigtable
Cloud Storage
JDBC (Java Database Connectivity)
Pub/Sub
Spanner
Dataflow templates that use these sources and sinks also automatically capture and publish lineage events.
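The limitations above can be turned into a small pre-launch check. The following is a minimal sketch, not part of Dataflow or Apache Beam: the function names `parse_version` and `lineage_preflight` are hypothetical, and the connector names are simply the labels from the list above. It reports when the SDK version is below 2.63.0 or when a connector is outside the supported list, in which case lineage events should not be expected for it.

```python
from itertools import takewhile

# Values copied from the limitations list; adjust if the documentation changes.
MIN_BEAM_VERSION = (2, 63, 0)
SUPPORTED_CONNECTORS = {
    "Apache Kafka", "BigQuery", "Bigtable", "Cloud Storage",
    "JDBC", "Pub/Sub", "Spanner",
}


def parse_version(version: str) -> tuple:
    """Turn a string such as '2.63.0' into (2, 63, 0), padding with zeros."""
    numbers = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits so '0rc1' is read as 0.
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def lineage_preflight(beam_version: str, connectors: list) -> list:
    """Return a list of warnings; an empty list means no known limitation applies."""
    warnings = []
    if parse_version(beam_version) < MIN_BEAM_VERSION:
        warnings.append(f"Apache Beam SDK {beam_version} is older than 2.63.0")
    for name in connectors:
        if name not in SUPPORTED_CONNECTORS:
            warnings.append(f"{name} is not in the supported sources and sinks list")
    return warnings
```

For example, `lineage_preflight("2.62.0", ["BigQuery", "MongoDB"])` returns two warnings, one for the SDK version and one for the connector.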
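Enable data lineage in Dataflow
You need to enable lineage at the job level. To enable data lineage, use the enable_lineage Dataflow service option as follows:
Java
    --dataflowServiceOptions=enable_lineage=true
Python
    --dataflow_service_options=enable_lineage=true
Go
    --dataflow_service_options=enable_lineage=true
gcloud
Use the gcloud dataflow jobs run command with the additional-experiments option. If you're using Flex Templates, use the gcloud dataflow flex-template run command.
    --additional-experiments=enable_lineage=true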
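Because the flag is spelled differently for each launch method, a launcher script can keep the spellings in one place. The sketch below is only an illustration: `with_lineage` is a hypothetical helper, and the flag strings are the ones documented for the Java, Python, and Go SDKs and for gcloud. It appends the flag to an argument list unless it is already present.

```python
# One documented flag spelling per launch method.
ENABLE_LINEAGE_FLAGS = {
    "java": "--dataflowServiceOptions=enable_lineage=true",
    "python": "--dataflow_service_options=enable_lineage=true",
    "go": "--dataflow_service_options=enable_lineage=true",
    "gcloud": "--additional-experiments=enable_lineage=true",
}


def with_lineage(args: list, launcher: str) -> list:
    """Return a copy of args with the lineage flag for this launcher appended."""
    flag = ENABLE_LINEAGE_FLAGS[launcher.lower()]
    # Avoid adding the same flag twice.
    return list(args) if flag in args else [*args, flag]


# "my-project" is a placeholder project ID.
pipeline_args = with_lineage(
    ["--runner=DataflowRunner", "--project=my-project"], "python"
)
```

The resulting list can be passed to the pipeline the same way as any other command-line arguments.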
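Optionally, you can specify one or both of the following parameters with the service option:
process_id: A unique identifier that Dataplex Universal Catalog uses to group job runs. If not specified, the job name is used.
process_name: A human-readable name for the data lineage process. If not specified, the job name prefixed with "Dataflow " is used.
Specify these options as follows:
Java
    --dataflowServiceOptions=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
Python
    --dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
Go
    --dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
gcloud
    --additional-experiments=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME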
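The value of the service option follows a simple pattern: `enable_lineage=true` when no parameters are given, otherwise `enable_lineage=` followed by the parameters joined with a semicolon. The sketch below builds that value. `lineage_service_option` is a hypothetical helper, and the check that rejects semicolons inside a value is a precaution of this sketch, not a documented rule.

```python
from typing import Optional


def lineage_service_option(process_id: Optional[str] = None,
                           process_name: Optional[str] = None) -> str:
    """Build the value passed to the service-option flag."""
    params = []
    for key, value in (("process_id", process_id), ("process_name", process_name)):
        if value is None:
            continue
        if ";" in value:
            # Assumption of this sketch: a semicolon would be read as a separator.
            raise ValueError(f"{key} must not contain ';'")
        params.append(f"{key}={value}")
    if not params:
        return "enable_lineage=true"
    return "enable_lineage=" + ";".join(params)


# "etl-daily" and "Daily ETL" are placeholder values.
flag = "--dataflow_service_options=" + lineage_service_option("etl-daily", "Daily ETL")
```

When you type the flag in a shell rather than building it in code, quote the whole value, because an unquoted semicolon ends the command in most shells.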
View lineage in Dataplex Universal Catalog
Data lineage provides information about the relations between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console in the form of a graph or a single table. You can also retrieve data lineage information from the Data Lineage API in the form of JSON data.
For more information, see Use data lineage with Google Cloud Platform systems.
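As a rough illustration of the JSON route, the sketch below prepares a request for the Data Lineage API `searchLinks` method and reads the `links` array from a response. It assumes the v1 REST shape, in which a link has `source` and `target` entries that each carry a `fullyQualifiedName`; check the API reference before relying on it. The helper names, the project, the location, and the sample response are all hypothetical, and no network call or authentication is shown.

```python
import json

API_ROOT = "https://datalineage.googleapis.com/v1"


def search_links_request(project: str, location: str, target_fqn: str):
    """Return the URL and JSON body for a searchLinks call on one target asset."""
    url = f"{API_ROOT}/projects/{project}/locations/{location}:searchLinks"
    body = {"target": {"fullyQualifiedName": target_fqn}}
    return url, json.dumps(body)


def upstream_sources(response_json: str) -> list:
    """List the source names found in a searchLinks-style JSON response."""
    payload = json.loads(response_json)
    return [link["source"]["fullyQualifiedName"] for link in payload.get("links", [])]


# Hypothetical response, written by hand to show the assumed shape.
sample_response = json.dumps({
    "links": [
        {
            "source": {"fullyQualifiedName": "pubsub:topic:my-project.orders"},
            "target": {"fullyQualifiedName": "bigquery:my-project.sales.orders"},
        }
    ]
})

url, body = search_links_request("my-project", "us", "bigquery:my-project.sales.orders")
sources = upstream_sources(sample_response)
```

Here `sources` holds the single upstream name from the hand-written sample; a real response would come from sending `body` to `url` with valid credentials.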
Disable data lineage in Dataflow
If data lineage is enabled for a specific job and you want to disable it, cancel the existing job and run a new version of the job without the enable_lineage service option.
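If the job is launched from a script, relaunching without lineage comes down to dropping the flag from the argument list. This is a minimal sketch with a hypothetical helper, `without_lineage`. It assumes each flag carries only the lineage option; a flag that combines several service options in one value would need to be edited by hand.

```python
# Flag prefixes for the Python and Go SDKs, the Java SDK, and gcloud.
LINEAGE_PREFIXES = (
    "--dataflow_service_options=enable_lineage",
    "--dataflowServiceOptions=enable_lineage",
    "--additional-experiments=enable_lineage",
)


def without_lineage(args: list) -> list:
    """Return a copy of args with any enable_lineage flag removed."""
    return [arg for arg in args if not arg.startswith(LINEAGE_PREFIXES)]


old_args = [
    "--runner=DataflowRunner",
    "--dataflow_service_options=enable_lineage=process_id=etl-daily",
]
new_args = without_lineage(old_args)
```

After cancelling the existing job, the new run would use `new_args`, which keeps every other flag unchanged.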
Billing
Using data lineage in Dataflow doesn't impact your Dataflow bill, but it might incur additional charges on your Dataplex Universal Catalog bill. For more information, see Data lineage considerations (/dataplex/docs/lineage-considerations) and Dataplex Universal Catalog pricing (/dataplex/pricing).
What's next
Learn more about data lineage (/dataplex/docs/about-data-lineage).
Learn how to use data lineage (/dataplex/docs/use-lineage).
Last updated: 2025-09-10 (UTC).