Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Visualization and large-scale processing of historical weather radar (NEXRAD Level II) data

Thursday, June 15, 2017
By Lak Lakshmanan, Technical Lead, Data & Machine Learning Professional Services

One of the highest-resolution sources of weather data in the United States comes from the NEXRAD network of 160 weather radars operated by the NOAA National Weather Service (NWS), the Federal Aviation Administration (FAA) and the U.S. Air Force (USAF). These 10 cm S-Band Doppler radars, located in the contiguous United States, Alaska, Hawaii, U.S. territories and at military base sites, detect atmospheric precipitation and winds.

The historical archive of that weather radar data is now available as a public dataset on Google Cloud Storage.

You can browse and download files using the console here or programmatically interact with the Google Cloud Storage bucket gs://gcp-public-data-nexrad-l2.

Sample program to visualize volume scans

In order to take advantage of high sustained reads offered within Google’s data centers, the volume scan data in the radar archive is bundled up into suitably large tar files. Thus, for example, gs://gcp-public-data-nexrad-l2/2012/07/23/KMUX/NWS_NEXRAD_NXL2DP_KMUX_20120723000000_20120723005959.tar contains volume scans from KMUX whose collection started on July 23, 2012 in the hour starting at 23:00:00UTC.

Because radar-visualization software typically does not read tar files and does not read from cloud storage buckets, you'll first have to download the tar files to the machine on which you want to visualize it. Here, I want the data from the KIWA radar on July 23, 2013 for the hour starting at 23:00 UTC:

#!/bin/bash
RADAR=KIWA
YEAR=2013
MONTH=07
DAY=23
HOUR=23
gsutil cp gs://gcp-public-data-nexrad-l2/$YEAR/$MONTH/$DAY/$RADAR/*_$RADAR_${YEAR}${MONTH}${DAY}${HOUR}0000_${YEAR}${MONTH}${DAY}${HOUR}5959.tar temp.tar
tar xvf temp.tar
rm temp.tar

The script above uses gsutil, which comes with the gcloud SDK and the standard UNIX utility tar. Alternately, you could use the Python Cloud Storage API to get the files you need. Running the above bash program yields all the volume scans in that hour:

KIWA20130723_230242_V06.gz
KIWA20130723_230726_V06.gz
etc.
KIWA20130723_235451_V06.gz
KIWA20130723_235936_V06.gz

With the files in place, we can use the Py-ART library from the U.S. Department of Energy ARM program to create diagrams:

import matplotlib.pyplot as plt
import pyart
def plot_data(infilename, outpng, maxrange):
  radar = pyart.io.read_nexrad_archive(infilename)
  display = pyart.graph.RadarDisplay(radar)
  fig = plt.figure(figsize=(10, 10))

  ax = fig.add_subplot(221)
  display.plot('velocity', 1, ax=ax, title='Doppler Velocity',
             colorbar_label='',
             axislabels=('', 'North South distance from radar (km)'))
  display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)
  ax = fig.add_subplot(222)
  display.plot('reflectivity', 0, ax=ax,
             title='Reflectivity lowest', colorbar_label='',
             axislabels=('', ''))
  display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

  ax = fig.add_subplot(223)
  display.plot('reflectivity', 1, ax=ax,
             title='Reflectivity second', colorbar_label='')
  display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

  ax = fig.add_subplot(224)
  display.plot('cross_correlation_ratio', 0, ax=ax,
             title='Correlation Coefficient', colorbar_label='',
             axislabels=('East West distance from radar (km)', ''))
  display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

  plt.show()
Using the above code on KIWA20130723_235451_V06.gz yields:

If you zoom in (change the maxrange to 30 km), you'll notice speckle-like high reflectivity scatter near the radar.

The high-reflectivity (yellow and red) pixels in the figure above are not precipitation. Instead, the high reflectivity is caused by ground clutter.

Which radars in Arizona are most affected by such ground clutter? Answering this question calls for large-scale data analysis, and this is where cloud computing really shines.

Sample program to do large-scale analysis

While you can work with individual volume scans as shown above, one key benefit of having all the NEXRAD data immediately available on a public cloud is the ability to analyze long time periods of data at scale. Thanks to GCP’s “serverless” approach to infrastructure, it's possible to do data processing, data analysis and machine learning without having to manage low-level resources.

Cloud Dataflow, GCP’s fully-managed service for stream and batch processing, allows you to write a data processing and analysis pipeline that will be executed in a distributed manner. The pipeline will autoscale different steps to run on multiple machines in a fault-tolerant way.

For example, let’s say that you have an algorithm to identify cases of anomalous propagation where the ducting of a radar beam causes ground-clutter artifacts to appear in radar data. You wish to determine which range-gates are affected, and when. But you wish to do this over several months, several years and several radars.

This is the sort of data analysis that might involve cobbling together monitoring, error checking and file-handling code. Instead, we can simply code up the actual logic in Apache Beam (Cloud Dataflow SDK versions 2.0 and later are based on Beam) and submit the pipeline to Cloud Dataflow. Cloud Dataflow will then take care of infrastructure wrangling, scaled execution and fault tolerance.

Pipeline details

The pipeline being executed in Cloud Dataflow is shown in the diagram above. Notice that we managed to carry out 4 days worth of processing in less than an hour, thanks to doing the work in parallel on 15 workers with 8 CPUs each! The code to do massively scalable processing on the NEXRAD data is quite straightforward — the full code for the pipeline is on GitHub, but the key steps are shown below:

Step Description of processing Java code
getParams Which radars/year/months to process?

Create.of(getParams(options))  where getParams() is a loop through radars, years and months: for (String radar : radars) {
 for (int year : years) {
   for (int month : months) {
     for (int day : days(month)) {
       params.add(radar + "," + year + "," +                   month + "," + day);
}}}
getArchives Get tar files for each radar/year/month/day(done in parallel, for each parameter combination)

List files = GcsNexradL2List.getFiles(radar, year, month, day);
for (String file : files) {
   c.output(file);
} where GcsNexradL2List.getFiles() uses Google Cloud Storage API to find the list of tar files (see GitHub for full code):
String bucket = "gcp-public-data-nexrad-l2";
String date =   String.format("%d/%02d/%02d", year, month, day);
Iterable blobs =   storage.list(bucket,       BlobListOption.prefix( date + "/" + radar)).iterateAll();
List result = new ArrayList<>();
for (Blob blob : blobs) {
 result.add("gs://" + blob.getBucket() + "/" + blob.getName());
}
return result;
processTar Process each tar file and find AP pixels in all the volume scans in that tar file(done in parallel, for each tar file)
try (GcsNexradL2Read hourly =        new GcsNexradL2Read(tarFile)) {
 for (RadialDatasetSweep volume :         hourly.getVolumeScans()) {
    List apPixels = ...;
    for (AnomalousPropagation ap : apPixels) {
                 c.output(ap);
    }
 }
} catch (Exception e) {
 log.error("Skipping " + tarFile, e);
}
AP->String Convert each AP detection to a comma-separated string
AnomalousPropagation ap = c.element();
c.output(ap.toCsv())
Write out Write to text file on Google Cloud Storage
TextIO.write().to(options.getOutput() +
"allDetections").withSuffix(".csv")

Beam supports two languages: Python and Java. I chose to write my pipeline in Java because, at present, the Java runtime is faster, but you could do the above processing in Python as well. (If you're interested in what a Beam Python pipeline looks like, see this example of distributed processing of Landsat scenes in Python.) Note the exception-handling code in the processTar step: Every so often, you'll run into corrupt volume scan files in the archive, and it's important to be able to skip them. Here I am skipping the entire hour of data, but you could skip just the corrupt volume scan by changing where the exception is caught.

The code uses the Netcdf Java library to read the NEXRAD volume scan files. Because this Java library can read only individual volume scan files on disk (not tar files and not from Cloud Storage), we download the tar files from Cloud Storage to the local disk of the machine doing the computation, untar the file and delete the temporary files when we're done. The code to do these tasks (see GitHub) uses the Google Cloud Storage API. For example, downloading from Cloud Storage:

private static File downloadFromGcs(String bucketName, String blobName, File tmpDir) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    File fileName = File.createTempFile("download", "bytes", tmpDir);
    try (ReadChannel reader = storage.reader(bucketName, blobName);
        FileOutputStream writer = new FileOutputStream(fileName)) {
      ByteBuffer bytes = ByteBuffer.allocate(64 * 1024);
      while (reader.read(bytes) > 0) {
        bytes.flip();
        writer.getChannel().write(bytes);
        bytes.clear();
      }
    }
    return fileName;
  }

Besides scaling out processing, it's also possible to perform computations, grouping, etc. within the Cloud Dataflow pipeline. The sample (full code on GitHub) also demonstrates finding which radars are most affected by anomalous propagation/ground clutter:

Step Description of processing Java code
ByRadar Split AP detections by radar (remember that we processed tar files in parallel, so they will be in arbitrary order)
AnomalousPropagation ap = c.element();// once for each pixel with AP
c.output(KV.of(ap.radarId, 1));
TotalAP Number of AP pixels for each radar
Sum.integersPerKey()
KV->String Convert each key-value pair to a comma-separated string
String radar = c.element().getKey();
int numPixelsAP = c.element().getValue();
c.output(radar + "," + numPixelsAP);
Write total by radar Write to text file on Cloud Storage
TextIO.write().to(    
    options.getOutput() + " totalByRadar")    
    .withSuffix(".csv").withoutSharding()

The entire pipeline took about 4 hours to complete. An example AP detection looks like this:

KIWA,2013-07-23T23:54:53,43.736572,20250.0

This result indicates that for the KIWA radar in Phoenix, AZ, there was anomalous propagation at the range gate located at azimuth=43.7, range=20.25km on July 23, 2013 at 23:54:53 UTC.

This is one of the high-reflectivity clutter points we noticed in the image in the previous section. The total number of AP/ground-clutter range-gates by radar for June-August in 2012-2014 are:

KYUX,8542
KFSX,1484
KIWA,7906

The radars in Yuma and Phoenix (KYUX and KIWA) experience AP/ground clutter 5 to 6 times more often than the radar in Flagstaff (KFSX).

Next steps

The ability to do this kind of large-scale processing, without any human intervention and without having to move data around, is a key benefit of having your data in a cloud environment that supports serverless data processing. To explore more use cases for Cloud Dataflow, take advantage of our free trial!

  • Big Data Solutions

  • Product deep dives, technical comparisons, how-to's and tips and tricks for using the latest data processing and machine learning technologies.

  • Learn More

12 Months FREE TRIAL

Try BigQuery, Machine Learning and other cloud products and get $300 free credit to spend over 12 months.

TRY IT FREE