
Using Google Cloud Vision API from within a Data Fusion Pipeline

November 9, 2021
Aaron Pestel

Customer Engineer - Data Management Specialist, Google Cloud


Cloud Data Fusion (CDF) provides enormous opportunity to help cultivate new data pipelines and integrations. With over 200 plugins, Data Fusion gives you the tools to wrangle, coalesce, and integrate with many data providers like Salesforce, Amazon S3, BigQuery, Azure, Kafka Streams, and more. Deploying scalable, resilient data pipelines based upon open source CDAP gives organizations the flexibility to enrich data at scale. Sometimes, though, the integration you need is not already in the plugin library, and you may need to connect your own REST API or other tool.

Many modern REST APIs (like Google's AI APIs and other Google APIs) and data sources use OAuth 2.0 authorization. OAuth 2.0 is a great authorization protocol; however, it can be challenging to figure out how to interface with it when integrating with tools like Cloud Data Fusion (CDF). So, we thought we would show you how to configure a CDF HTTP source that calls a Google Vision AI API using OAuth 2.0.

First, let’s look at the Vision AI API itself. Specifically, we will be using the Vision AI “annotate” API. We will pass the API an HTTP URL to an image and the API will return a JSON document that provides AI-generated information about the image. Here are the official docs for the API: https://cloud.google.com/vision/docs/reference/rest/v1/projects.images/annotate

Let’s start with an example of how we would call the API interactively with curl on Google Cloud Shell where we can authenticate with a gcloud command.

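A minimal sketch of that call looks like this (the imageUri and the feature list below are placeholders; substitute any publicly reachable image URL and the features you care about):

# Call the Vision AI annotate API from Cloud Shell, using gcloud for the OAuth access token.
# NOTE: the imageUri is a placeholder -- replace it with your own image URL.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
        "requests": [
          {
            "image": { "source": { "imageUri": "https://example.com/sample.jpg" } },
            "features": [
              { "type": "LABEL_DETECTION" },
              { "type": "FACE_DETECTION" }
            ]
          }
        ]
      }' \
  https://vision.googleapis.com/v1/images:annotate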

This will produce the AI API’s JSON response output like this:

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_DataFusion_oAuth.max-1000x1000.png
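For reference, the response body has roughly this shape (the values below are placeholders, not real results):

{
  "responses": [
    {
      "labelAnnotations": [
        { "mid": "/m/placeholder", "description": "some-label", "score": 0.97, "topicality": 0.97 }
      ]
    }
  ]
}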

Now that we can call the API with curl, it’s time to figure out how we translate that API call to a CDF HTTP source. Some things will be the same, like the API URL and the request body. Some things will be different, like the authorization process.

The CDF HTTP source can’t get an authorization token from calling “gcloud auth print-access-token” like we did above. Instead, we will need to create an OAuth 2.0 Client ID in our GCP project and we will need to get a refresh token for that Client ID that CDF will be able to use to generate a new OAuth 2.0 token when CDF needs to make a request.
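Behind the scenes, that refresh-token flow is just a POST to Google’s token endpoint. Roughly, CDF does the equivalent of this on each run (all values below are placeholders):

# Exchange a long-lived refresh token for a short-lived access token (placeholder values).
curl -X POST \
  -d "client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&refresh_token=YOUR_REFRESH_TOKEN&grant_type=refresh_token" \
  https://oauth2.googleapis.com/token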

Let’s get started by filling in all the properties of the HTTP Source. The first few are simple, same as we used with curl:

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_DataFusion_oAuth.max-700x700.png

The next setting is Format. You might think we should pick JSON here, and we could — since JSON is what is returned. However, CDF sources will expect a JSON record per line, and we really want the entire response to be a single record. So, we will mark the Format as blob and will convert the blob to string in the pipeline later (and could even split out records like object detections, faces, etc.):

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_DataFusion_oAuth.max-500x500.png

The next and final section is the hardest — the OAuth 2.0 properties. Let’s look at the properties we will need to find and then start finding them:

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_DataFusion_oAuth.max-700x700.png

The documentation for getting most of these settings is here:

https://developers.google.com/identity/protocols/oauth2/web-server

Auth URL and Token URL…

The first two properties are listed in the doc above:

Auth URL:  https://accounts.google.com/o/oauth2/v2/auth

Token URL: https://oauth2.googleapis.com/token

Client ID and Client Secret…

For the Client ID and Client Secret, we will need to create those credentials here: https://console.cloud.google.com/apis/credentials. It may seem odd to specify a redirect URI of http://localhost:8080, but that URI is only used later, when we retrieve the refresh token.

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_DataFusion_oAuth.max-700x700.png

After specifying these options above and clicking Create, we will get our Client ID and Client Secret:

https://storage.googleapis.com/gweb-cloudblog-publish/images/6_DataFusion_oAuth.max-700x700.png

Scopes…

For the Scopes, we can use either of these two scopes as mentioned in the API docs that are linked and screenshotted below:

https://www.googleapis.com/auth/cloud-platform

https://www.googleapis.com/auth/cloud-vision

https://cloud.google.com/vision/docs/reference/rest/v1/projects.images/annotate

https://storage.googleapis.com/gweb-cloudblog-publish/images/7_DataFusion_oAuth.max-700x700.png

Refresh Token…

Lastly, we need the refresh token, which is the hardest property to get. There are two steps to this process. First, we have to authenticate and authorize with the Google Auth server to get an authorization “code”, and then we have to use that authorization code with the Google Token server to get an “access token” and a “refresh token” that CDF will use to get future access tokens. The access token has a short life, so it wouldn’t be useful to give to CDF. Instead, CDF will use the refresh token so that it can get its own access tokens whenever the pipeline is run.

To get the authorization “code”, you can copy the URL below, change it to use your client_id, and then open that URL in a browser window:

https://accounts.google.com/o/oauth2/v2/auth?scope=https%3A//www.googleapis.com/auth/cloud-platform&access_type=offline&include_granted_scopes=true&response_type=code&state=state_parameter_passthrough_value&redirect_uri=http%3A//localhost:8080&client_id=199375159079-st8toco9pfu1qi5b45fkj59unc5th2v1.apps.googleusercontent.com

Initially, this will prompt you to log in, then prompt you to authorize this client for the specified scopes, and then redirect to http://localhost:8080. It will look like an error page, but notice that the URL of the error page you were redirected to includes the “code” (circled in green below). In a normal web application, that is how the authorization code is returned to the requesting web application.

NOTE: You may see an error like “Authorization Error - Error 400: admin_policy_enforced”. If so, your GCP user’s organization has a policy that restricts you from using Client IDs for third-party products. In that case, you’ll need to get that restriction lifted, or use a different GCP user in a different org.

https://storage.googleapis.com/gweb-cloudblog-publish/images/8_DataFusion_oAuth.max-700x700.png

With that authorization code (circled in green above), we can now call the Google Token server to get the “access token” and the “refresh token”. Just set your “code”, “client_id”, and “client_secret” in the curl command below and run it in a Cloud Shell terminal.

curl -X POST -d "code=4/0AX4XfWjgRdrWXuNxqXOOtw_9THZlwomweFrzcoHMBbTFkrKLMvo8twSXdGT9JramIYq86w&client_id=199375159079-st8toco9pfu1qi5b45fkj59unc5th2v1.apps.googleusercontent.com&client_secret=q2zQ-vc3wG5iF5twSwBQkn68&redirect_uri=http%3A//localhost:8080&grant_type=authorization_code" \

https://oauth2.googleapis.com/token
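If the exchange succeeds, the token endpoint responds with a JSON body along these lines (values shortened to placeholders):

{
  "access_token": "ya29.a0...",
  "expires_in": 3599,
  "refresh_token": "1//0g...",
  "scope": "https://www.googleapis.com/auth/cloud-platform",
  "token_type": "Bearer"
}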

At long last, you will have your “refresh_token”, which is the last OAuth 2.0 property that the CDF HTTP source needs to authorize with the Google Vision API!

https://storage.googleapis.com/gweb-cloudblog-publish/images/9_DataFusion_oAuth.max-600x600.png

Now, we have all the information needed to populate the OAuth 2.0 properties of the CDF HTTP Source:

https://storage.googleapis.com/gweb-cloudblog-publish/images/10_DataFusion_oAuth.max-1600x1600.png

Next, we need to set the output schema of the HTTP Source to have a column called “body” with a type of “bytes” (since we selected the blob format in the properties, the response arrives as a byte array), and then we can validate and close the HTTP source properties:
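Under the hood, that output schema is just an Avro-style record. In an exported pipeline JSON it would look roughly like this (the record name shown here is only an illustration of what CDF typically generates):

{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    { "name": "body", "type": "bytes" }
  ]
}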

https://storage.googleapis.com/gweb-cloudblog-publish/images/12_DataFusion_oAuth.max-700x700.png

In the Projection properties, we simply convert the body from bytes to string and then validate:

https://storage.googleapis.com/gweb-cloudblog-publish/images/13_DataFusion_oAuth.max-700x700.png

Now, we can add a BigQuery sink (or any sink) in CDF Studio and run a preview:

https://storage.googleapis.com/gweb-cloudblog-publish/images/14_DataFusion_oAuth.max-700x700.png

If we click Preview Data on the Projection step, we can see our Vision AI response both as a byte array (on the left) and projected as a string (on the right):

https://storage.googleapis.com/gweb-cloudblog-publish/images/15_DataFusion_oAuth.max-700x700.png

Lastly, we can name the pipeline, deploy it, and run it. Here are the results of the run, as well as a screenshot of the data in BigQuery:

https://storage.googleapis.com/gweb-cloudblog-publish/images/16_DataFusion_oAuth.max-700x700.png

Final thoughts…

Further processing…

This example just stored the Vision AI response JSON in a string column of a BigQuery table. The pipeline could easily be extended to use a Wrangler transform to parse the Vision AI JSON response into more fine-grained columns, or even to split parts of the response JSON into multiple rows/records (for example, a row for each face or object found in the image).

We also hard-coded the image URL in the pipeline above. That’s not terribly useful for reuse. We could have used a runtime parameter like ${image_url} and then specified a different image URL for each pipeline run.
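For example, the HTTP source’s request body could reference the macro instead of a literal URL (a sketch; ${image_url} would then be supplied as a runtime argument when the pipeline is started):

{
  "requests": [
    {
      "image": { "source": { "imageUri": "${image_url}" } },
      "features": [ { "type": "LABEL_DETECTION" } ]
    }
  ]
}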

Other OAuth 2.0 APIs…

This example focused on Google APIs, so the same approach works for any other Google API that needs to be called with OAuth 2.0 authorization. But the HTTP plugin is generic (not specific to Google APIs), so it can also work with OAuth 2.0 protected services outside of Google. Of course, the auth server and token server will be different (since they are not Google’s), but hopefully this at least gives an example of using an OAuth 2.0 protected service.

The final example pipeline…

Below is a link to the finished pipeline in case you want to import it and look it over in more detail.

my-pipeline-cdap-data-pipeline.json
