Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Debugging data transformations using Cloud Dataflow and Stackdriver Debugger

Thursday, April 21, 2016
Posted by Ben Chambers, Software Engineer

Debugging distributed data processing can be difficult. Google Stackdriver Debugger (previously Google Cloud Debugger) makes it easier by allowing you to request a snapshot of the execution state at a specific point. This captures the call stack that led to the given line, along with the variables in scope at each frame and their values. In this post, I’d like to show you how easy it is to use Stackdriver Debugger with a Google Cloud Dataflow job using the new 1.5.1 release of the Dataflow SDK.

As an example of a data transformation pipeline, we’ll be using DebuggingWordCount. This example counts only those words that match a specific filter — by default the words “Flourish” or “stomach.” It demonstrates a variety of debugging features including logging the words that don’t match. We won’t be using those logs, since we’re trying to simulate the case of debugging a problem with a real pipeline that didn’t have precautionary logging added in advance.
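To make the filtering step concrete, here is a plain-Java sketch (not the actual DebuggingWordCount code, and not the SDK's DoFn API) of the logic inside a step like FilterText: each word is matched against the filter pattern, and matching words are counted, mirroring what the pipeline's matchedWords and unmatchedWords aggregators track. The names `countMatches` and `filterPattern` are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java sketch of the filtering logic; in the real pipeline this
// lives inside a DoFn and the counts are reported via aggregators.
public class FilterSketch {
    public static long countMatches(List<String> words, String filterPattern) {
        Pattern filter = Pattern.compile(filterPattern);
        long matched = 0;
        for (String word : words) {
            if (filter.matcher(word).matches()) {
                matched++; // pipeline analogue: emit the element, bump matchedWords
            }              // otherwise: log it and bump unmatchedWords
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Flourish", "stomach", "the");
        System.out.println(countMatches(words, "Flourish|stomach")); // 2
    }
}
```

Note that the filter is a regular expression matched against the whole word, which will matter later in this post.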

I’ve run the pipeline with --filterPattern=Flourish,stomach and noticed that there was no output produced! Looking in the Dataflow UI, I see that my aggregators report that none of the words matched the filter and that while there's output from the WordCount.CountWords step, there's no output from the ParDo(FilterText) step. Also, the assertion in the pipeline failed.

Since there was input to ParDo(FilterText) but no output (as indicated by the custom counters and the input counts for each step), I decided to take a snapshot in the branch of the conditional that performs filtering (where unmatchedWords is incremented).

Step 1: Run my pipeline with the debugger

All I need to do is run my pipeline on the Cloud Dataflow service with the --enableCloudDebugger option. When I do so, it prints out a message like:

To debug your job, visit Google Cloud Debugger at: <...>&dbgee=<...>

I can reach the Debugger UI either by clicking that link, or by navigating to the Debugger page in the Cloud Console and selecting my project. If multiple pipelines are running, I may need to select the one I want from the menu; by default it picks the most recent.

Step 2: Access my source code

I need to make my source code available to the Debugger UI before I can see the code in the UI. This also makes it easier to take snapshots, since I can click on a line rather than needing to type in a filename and line number. There are a variety of options described in Source Code Options, including pointing at a directory in the local filesystem or a Cloud Source Repository.

For simplicity (and since I don’t have the source code uploaded anywhere), I choose to “View source code from local files” and point it at the directory with my pipeline source code.

Step 3: Take a snapshot

I can take a snapshot by typing a location in the text box on the right and hitting "Enter," or by clicking a source code line number. Since I already have the source code visible in the UI, I navigate to the file, and click on line number 148 to request a snapshot at that line:

Since I’m especially interested in words that I thought should have passed the filter, I make the snapshot conditional using the following expression:

"Flourish".equals(((KV) c.element()).getKey())

Note that in the above condition we're able to access private fields, but any time we rely on generics (e.g., element() returning a KV) we need an explicit cast.
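To see why the cast is needed, here is a small illustrative sketch: when a value is only known as an Object, as in an expression the debugger evaluates without generic type information, its getKey() method is not visible until we cast. Map.Entry here merely stands in for the Dataflow KV type, and the helper name keyEquals is my own.

```java
import java.util.AbstractMap;
import java.util.Map;

// Illustrative only: Map.Entry plays the role of the Dataflow KV type.
public class CastDemo {
    // 'element' is statically typed as Object, like an expression evaluated
    // without generic type information in a debugger condition.
    static boolean keyEquals(Object element, String key) {
        // Without the explicit cast, element.getKey() would not compile:
        // the static type Object has no getKey() method.
        return key.equals(((Map.Entry<?, ?>) element).getKey());
    }

    public static void main(String[] args) {
        Object element = new AbstractMap.SimpleEntry<>("Flourish", 1L);
        System.out.println(keyEquals(element, "Flourish")); // true
    }
}
```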

Note: Normally, the evaluation of all conditions is limited to less than 1% (0.01) of the CPU time. Since this condition is evaluated for every element, and the actual work per element is relatively cheap (just discarding it), the condition would exceed that limit and the debugger would cancel the snapshot. Therefore, I needed to run with --maxConditionCost=0.6 to get the debugger to evaluate this condition. You can specify this parameter when submitting the Dataflow job.

Step 4: Wait for snapshot to be captured

Now that I’ve set a snapshot, the Stackdriver Debugger service communicates that to all of the running Dataflow workers. As soon as one of the workers hits the line that I put my snapshot on, I’ll be able to see the snapshot capture.

Step 5: Explore the snapshot to figure out what is wrong

If I want to see what value is being processed, I can expand the ProcessContext (called c) and dig into the context. I’m looking for the element, which is stored as a WindowedValue inside the context. Poking around a bit, I can also expand the this$0 to see what filter is being used.

Here I see that the filterPattern I specified (Flourish,stomach) didn’t work because the pattern is a regular expression: the comma is matched literally rather than separating alternatives, so no single word ever matches. I change it to Flourish|stomach, rerun, and everything looks green!
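A quick check with java.util.regex confirms the diagnosis: the comma form matches nothing we care about, while the alternation form matches either word.

```java
import java.util.regex.Pattern;

public class RegexFix {
    public static void main(String[] args) {
        // "Flourish,stomach" only matches the literal string "Flourish,stomach".
        System.out.println(Pattern.matches("Flourish,stomach", "Flourish")); // false
        System.out.println(Pattern.matches("Flourish,stomach", "stomach"));  // false
        // "Flourish|stomach" is alternation: either word matches on its own.
        System.out.println(Pattern.matches("Flourish|stomach", "Flourish")); // true
        System.out.println(Pattern.matches("Flourish|stomach", "stomach"));  // true
    }
}
```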

That’s It!

Today we saw how to use Stackdriver Debugger to request snapshots from a Dataflow pipeline. We used a relatively simple example running on a few workers, but this could just as easily have been running on 500 workers over gigabytes of data, which is not an uncommon scenario in cloud-scale extract, transform and load (ETL) jobs.
