Change Data Capture (CDC) processing
This page guides you through Change Data Capture (CDC) within Google Cloud Cortex Framework in BigQuery. BigQuery is designed for efficiently storing and analyzing new data.
CDC process
When data changes in your source data system (such as SAP), BigQuery doesn't modify existing records. Instead, the updated information is added as a new record. To avoid duplicates, a merge operation needs to be applied afterwards. This process is called Change Data Capture (CDC) processing.
The Data Foundation for SAP includes the option to create Cloud Composer or Apache Airflow scripts that merge or upsert the new records resulting from updates, keeping only the latest version in a new dataset. For these scripts to work, the tables must contain the following specific fields (a schema sketch follows the list):
- operation_flag: This flag tells the scripts whether a record was inserted, updated, or deleted. It takes one of the following values:
  - Inserted (I)
  - Updated (U)
  - Deleted (D)
- recordstamp: This timestamp identifies the most recent version of a record.
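To make the field requirement concrete, here is a minimal sketch that creates a replicated table carrying both control fields, using the google-cloud-bigquery Python client. The project, dataset, table, and business columns (an SAP-style sales document header) are hypothetical placeholders, not names the framework prescribes.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project ID

    # Replicated table schema: business columns plus the two CDC control fields.
    schema = [
        bigquery.SchemaField("mandt", "STRING"),           # example key column
        bigquery.SchemaField("vbeln", "STRING"),           # example key column
        bigquery.SchemaField("netwr", "NUMERIC"),          # example payload column
        bigquery.SchemaField("operation_flag", "STRING"),  # 'I', 'U', or 'D'
        bigquery.SchemaField("recordstamp", "TIMESTAMP"),  # replication timestamp
    ]

    table = bigquery.Table("my-project.source_dataset.vbak", schema=schema)
    client.create_table(table, exists_ok=True)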
By using CDC processing, you can ensure that your BigQuery data accurately reflects the latest state of your source system. This eliminates duplicate entries and provides a reliable foundation for your data analysis.
Dataset structure
For all supported data sources, data from upstream systems is first replicated into a BigQuery dataset (the source or replicated dataset), and the updated or merged results are inserted into another dataset (the CDC dataset). The reporting views select data from the CDC dataset, so that reporting tools and applications always work with the latest version of a table.
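As a minimal sketch of that layout, a reporting view can simply select from the table in the CDC dataset, so reporting tools never read the raw replicated data directly. The project, dataset, and view names below are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project ID

    # The view always exposes the merged, latest version held in the CDC dataset.
    view_sql = """
    CREATE OR REPLACE VIEW `my-project.reporting.sales_orders` AS
    SELECT *
    FROM `my-project.cdc_dataset.vbak`
    """
    client.query(view_sql).result()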
The following flow shows the CDC processing for SAP, which depends on the operation_flag and recordstamp fields.

Figure 1. CDC processing example for SAP.
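The merge that the generated scripts perform can be approximated with a statement along the following lines. This is only a sketch under assumed names (project, datasets, table, and key columns are hypothetical); it is not the exact code the Data Foundation generates.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project ID

    merge_sql = """
    MERGE `my-project.cdc_dataset.vbak` AS cdc
    USING (
      -- Keep only the newest replicated record per business key.
      SELECT * EXCEPT(rn)
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                 PARTITION BY mandt, vbeln ORDER BY recordstamp DESC) AS rn
        FROM `my-project.source_dataset.vbak`
      )
      WHERE rn = 1
    ) AS src
    ON cdc.mandt = src.mandt AND cdc.vbeln = src.vbeln
    WHEN MATCHED AND src.operation_flag = 'D' THEN
      DELETE
    WHEN MATCHED AND src.recordstamp > cdc.recordstamp THEN
      UPDATE SET netwr = src.netwr, recordstamp = src.recordstamp
    WHEN NOT MATCHED AND src.operation_flag != 'D' THEN
      INSERT (mandt, vbeln, netwr, operation_flag, recordstamp)
      VALUES (src.mandt, src.vbeln, src.netwr, src.operation_flag, src.recordstamp)
    """
    client.query(merge_sql).result()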
The following flow depicts the integration from APIs into raw data and the CDC processing for Salesforce, which depends on the Id and SystemModStamp fields produced by the Salesforce APIs.

Figure 2. Integration from APIs into raw data and CDC processing for Salesforce.
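Under the same assumptions as the SAP sketch above (hypothetical project, dataset, and table names), the Salesforce merge follows the same pattern but keys on Id and keeps the row with the newest SystemModStamp. It can be run with the same client.query(merge_sql).result() call shown earlier.

    # Hypothetical names; Name stands in for whichever Salesforce columns are replicated.
    merge_sql = """
    MERGE `my-project.cdc_sfdc.accounts` AS cdc
    USING (
      SELECT * EXCEPT(rn)
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Id ORDER BY SystemModStamp DESC) AS rn
        FROM `my-project.raw_sfdc.accounts`
      )
      WHERE rn = 1
    ) AS src
    ON cdc.Id = src.Id
    WHEN MATCHED AND src.SystemModStamp > cdc.SystemModStamp THEN
      UPDATE SET Name = src.Name, SystemModStamp = src.SystemModStamp
    WHEN NOT MATCHED THEN
      INSERT (Id, Name, SystemModStamp)
      VALUES (src.Id, src.Name, src.SystemModStamp)
    """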
Some replication tools can merge or upsert records as they insert them into BigQuery, so generating these scripts is optional. In that case, the setup has only a single dataset, and the reporting dataset fetches the updated records for reporting directly from it.
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[[["\u003cp\u003eChange Data Capture (CDC) in Google Cloud Cortex Framework for BigQuery adds updated information as new records instead of modifying existing ones.\u003c/p\u003e\n"],["\u003cp\u003eA merge or upsert operation is required after CDC to avoid duplicates and keep only the latest version of each record in a new dataset.\u003c/p\u003e\n"],["\u003cp\u003eThe process relies on \u003ccode\u003eoperation_flag\u003c/code\u003e and \u003ccode\u003erecordstamp\u003c/code\u003e fields to identify whether a record was inserted, updated, or deleted, and to track the most recent version.\u003c/p\u003e\n"],["\u003cp\u003eData is replicated into a \u003ccode\u003esource\u003c/code\u003e dataset, and the merged results are inserted into a separate CDC dataset, ensuring reporting tools always use the latest data version.\u003c/p\u003e\n"],["\u003cp\u003eSome replication tools can merge or upsert records during insertion into BigQuery, making the creation of CDC scripts optional, and allowing a single dataset approach.\u003c/p\u003e\n"]]],[],null,["# Change Data Capture (CDC) processing\n====================================\n\nThis page guides you through Change Data Capture (CDC) within Google Cloud Cortex Framework\nin BigQuery. BigQuery is designed for efficiently\nstoring and analyzing new data.\n\nCDC process\n-----------\n\nWhen data changes in your source data system\n(like SAP), BigQuery doesn't modify existing records. Instead,\nthe updated information is added as a new record. To avoid duplicates, a\nmerge operation needs to be applied afterwards. This process is\ncalled [Change Data Capture (CDC) processing](/bigquery/docs/migration/database-replication-to-bigquery-using-change-data-capture).\n\nThe Data Foundation for SAP includes the option to create scripts for\nCloud Composer or Apache Airflow to [merge](/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement)\nor `upsert` the new records resulting from updates and only keep the\nlatest version in a new dataset. For these scripts to work the tables\nneed to have some specific fields:\n\n- `operation_flag`: This flag tells the script whether a record was inserted, updated, or deleted.\n- `recordstamp`: This timestamp helps identify the most recent version of a record. This flag indicates whether the record is:\n - Inserted (I)\n - Updated (U)\n - Deleted (D)\n\nBy utilizing CDC processing, you can ensure that your BigQuery\ndata accurately reflects the latest state of your source system.\nThis eliminates duplicate entries and provides a reliable foundation for\nyour data analysis.\n\nDataset structure\n-----------------\n\nFor all supported data sources, data from upstream systems are first replicated\ninto a BigQuery dataset (`source` or `replicated dataset`),\nand the updated or merged results are inserted into another dataset\n(CDC dataset). The reporting views select data from the CDC dataset,\nto ensure the reporting tools and applications always have the latest version\nof a table.\n\nThe following flow shows how the CDC processing for SAP, dependent on\nthe `operational_flag` and `recordstamp`.\n\n**Figure 1**. 
CDC processing example for SAP.\n\nThe following flow depicts the integration from APIs into Raw data and\nCDC processing for Salesforce, dependent on the `Id` and `SystemModStamp`\nfields produced by Salesforce APIs.\n\n**Figure 2**. Integration from APIs into Raw data and CDC processing for Salesforce.\n\nSome replication tools can merge or upsert the records when\ninserting them into BigQuery, so the generation of these\nscripts is optional. In this case, the setup only has a single\ndataset. The reporting dataset fetches updated records for reporting\nfrom that dataset."]]