High number of duplicates in Dataflow pipeline streaming inserts to BigQuery

Problem

A high number of duplicates (on the order of thousands) is observed in a Dataflow pipeline reading from Pub/Sub and writing to BigQuery using streaming inserts. How do we improve deduplication and manually remove duplicates from BigQuery?

Environment

  • Dataflow streaming pipeline
  • Pub/Sub to BigQuery pipeline
  • Streaming inserts to BigQuery

Solution

Troubleshoot at the sink (BigQuery):
  1. Check whether the write-to-BigQuery step calls .ignoreInsertIds().
  2. If it does, remove it to re-enable BigQuery's best-effort deduplication mechanism (see the sketch after this list).
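
For illustration, here is a minimal sketch of a Beam Java pipeline whose write step leaves best-effort deduplication enabled (no ignoreInsertIds() call). The subscription, the table name, and the single-column payload schema are hypothetical placeholders for your own:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class PubSubToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubSub",
                PubsubIO.readStrings()
                    .fromSubscription("projects/my-project/subscriptions/my-sub"))
            .apply("ToTableRow",
                MapElements.into(TypeDescriptor.of(TableRow.class))
                    .via((String payload) -> new TableRow().set("payload", payload)))
            .setCoder(TableRowJsonCoder.of())
            // No .ignoreInsertIds() on the write step: streaming inserts then
            // attach auto-generated insertIds, which keeps BigQuery's
            // best-effort deduplication enabled.
            .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.my_table")
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withCreateDisposition(
                        BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(
                        BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }
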
Troubleshoot at the source (Pub/Sub):
  1. Set withIdAttribute() on the PubsubIO.Read step (see the sketch after this list).
  2. With this attribute set, Beam deduplicates messages that carry the same identifier; without it, Beam cannot guarantee that no duplicate data will be delivered.
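
A sketch of the read step with withIdAttribute() set. The attribute name "uniqueMessageId" and the subscription path are hypothetical; the attribute must be one your publishers actually set on every message:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ReadWithIdAttribute {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Beam deduplicates on the value of this attribute, so publishers must
        // set it on every message and reuse the same value for logical
        // duplicates (for example, publish retries of the same event).
        p.apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub")
                .withIdAttribute("uniqueMessageId"));

        p.run();
      }
    }

Note that deduplication on the id attribute is effective only within a bounded time window, so duplicates that arrive far apart in time may still get through.
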

Remove duplicates from BigQuery:

  1. Manually set a unique identifier for each row when writing to BigQuery.
  2. Use the deduplication method described in the documentation with your own identifier in place of insertId (see the sketch after this list).
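
As a sketch of that approach, the query below (issued here through the BigQuery client library for Java) keeps one row per identifier. The table name and the event_id column are hypothetical stand-ins for your own unique identifier:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.QueryJobConfiguration;

    public class RemoveDuplicates {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Overwrite the table with one arbitrary row kept per event_id.
        // Caveats: CREATE OR REPLACE TABLE does not preserve partitioning or
        // clustering unless respecified, and recently streamed rows may still
        // sit in the streaming buffer, so run this after the buffer is flushed.
        String sql =
            "CREATE OR REPLACE TABLE `my-project.my_dataset.my_table` AS\n"
                + "SELECT * EXCEPT (row_num) FROM (\n"
                + "  SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id) AS row_num\n"
                + "  FROM `my-project.my_dataset.my_table`)\n"
                + "WHERE row_num = 1";

        bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
      }
    }
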

Cause

There are multiple reasons why duplicates can appear:

  • Streaming inserts enable BigQuery's best-effort deduplication mechanism by default, but this mechanism should not be relied upon to guarantee the absence of duplicates in your data.
  • BigQuery might degrade the quality of best-effort deduplication at any time in order to guarantee higher reliability and availability for your data. Batch load jobs typically deduplicate better than streaming write jobs.
  • Additionally, this best-effort deduplication is disabled entirely when you use Apache Beam's BigQuery I/O connector for Java and call ignoreInsertIds().

However, the number of duplicates attributable to BigQuery alone should not be high. If a high number of duplicates is observed, the source data itself most likely contains duplicates.

  • Pub/Sub provides at-least-once delivery, so to achieve deduplication when consuming messages, withIdAttribute() needs to be set on the PubsubIO.Read step.
    • When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub message attributes, this specifies the name of the attribute containing the unique identifier. If idAttribute is not provided, Beam cannot guarantee that no duplicate data will be delivered.

When manually removing duplicates from BigQuery: the BigQuery client library gives you the flexibility to choose the insertId used for deduplication, whereas Dataflow auto-generates a random value as the insertId, and there is currently no known way to supply your own insertId when inserting into BigQuery. As an alternative to insertId, manually set a unique identifier on each row, as sketched below.
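
A sketch of that alternative, reusing the hypothetical "uniqueMessageId" attribute from the Solution section: the transform copies the publisher-supplied identifier into a hypothetical event_id column, so that duplicates can later be removed with a cleanup query like the one shown above:

    import com.google.api.services.bigquery.model.TableRow;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class StampRowIds {
      // The input is read with PubsubIO.readMessagesWithAttributes(), so each
      // element still carries the publisher-set "uniqueMessageId" attribute.
      static PCollection<TableRow> stamp(PCollection<PubsubMessage> messages) {
        return messages
            .apply("StampRowId",
                MapElements.into(TypeDescriptor.of(TableRow.class))
                    .via((PubsubMessage msg) -> new TableRow()
                        // Persist the identifier in its own column so a later
                        // cleanup query can deduplicate on it.
                        .set("event_id", msg.getAttribute("uniqueMessageId"))
                        .set("payload",
                            new String(msg.getPayload(), StandardCharsets.UTF_8))))
            .setCoder(TableRowJsonCoder.of());
      }
    }
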