Automating IoT Machine Learning: Bridging Cloud and Device Benefits with AI Platform

[Image: rendered images]

This tutorial shows how to automate a workflow that delivers new or updated Machine Learning (ML) models directly to IoT (Internet of Things) devices.

Here's the workflow:

  • Train ML model versions by using AI Platform.
  • Use photorealistic CAD-rendered data to train the ML model.
  • Automate the packaging and delivery of the new or modified model to a remote IoT device where the inference (model prediction) runs locally.

When you automate the combined ML training and deployment process, you can evolve models more easily and deploy those models faster to a large number of devices.

This tutorial also demonstrates how you can use cloud-based 3D rendering to automatically generate training data for vision-based image models.

Part detection

Your workflow detects parts by using a custom ML visual detection model. As your parts inventory changes over time, you periodically update the ML model to recognize newly added parts and incorporate improvements to the model design. You then deliver updated models to authorized deployed devices in multiple field sites.

See also the reference code for this tutorial.

Objectives


  • Combine AI Platform, Cloud Functions, and Cloud Build to automate the training and packaging of an ML model.
  • Use Container Registry and patterns from Kubernetes to help automate the delivery of packaged models to groups of devices more securely and scalably.
  • Use ZYNC Render to automatically generate pre-labeled training data.

Costs


This tutorial uses several billed services. For a simple run-through, the cost is minimal. In production, the final cost depends on which parts of the tutorial you use the most.

Before you begin

  1. Create a Google Cloud project:

    Go to the Projects Page

  2. Start a Cloud Shell instance. You run all the terminal commands in this tutorial from Cloud Shell.

    Open Cloud Shell

  3. Enable the AI Platform Training and Prediction API, Compute Engine API, Container Registry API, Cloud Build API, Cloud Functions API, and Dataflow API:

    Enable APIs

    This step can take several minutes to complete.

  4. Skip the step of creating credentials. Instead, select the project in the Google Cloud Console banner menu.

Application background and architecture

This tutorial addresses the following scenario: A camera attached to a connected device visually identifies mechanical parts moving along a conveyor belt or other mechanism. The tutorial focuses on delivery to a camera-enabled, Linux-based IoT device, but you can build similar systems for other types of devices with different sensor inputs.

Given the high reliability requirements of this application, the part detection device must continue to work even if network connectivity is interrupted. To help achieve this reliability, you train TensorFlow models on Google Cloud but run the models locally on the connected device. The deployed model does not require cloud connectivity to make predictions. While offline, the device can store its recorded predictions and then transmit them when it is back online.
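The store-and-forward behavior described above could be sketched as follows. This is only an illustration, not the tutorial's implementation: the PredictionBuffer class, its method names, and the send callback are hypothetical.

```python
import json
import time
from collections import deque
from typing import Callable, Deque, Dict


class PredictionBuffer:
    """Holds predictions locally until they can be transmitted (hypothetical helper)."""

    def __init__(self, max_items: int = 1000) -> None:
        # A bounded deque drops the oldest entries if the device stays offline too long.
        self._pending: Deque[str] = deque(maxlen=max_items)

    def record(self, label: str, confidence: float) -> None:
        """Store one prediction; this never touches the network."""
        entry: Dict[str, object] = {
            "label": label,
            "confidence": confidence,
            "ts": time.time(),
        }
        self._pending.append(json.dumps(entry))

    def flush(self, send: Callable[[str], bool]) -> int:
        """Transmit pending entries in order and stop at the first failure."""
        sent = 0
        while self._pending:
            if not send(self._pending[0]):
                break  # still offline: keep the entry for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

A device loop would call record() after every local inference and call flush() whenever connectivity is detected; a failed send leaves the remaining entries queued.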

The solution architecture has these main components:

  • Cloud rendering of training data
  • Model training and packaging
  • Deployment and delivery of new models
  • Execution of the trained model on an edge device

The following diagram shows a high-level view of the architecture.


Cloud rendering of training data

Training effective ML models requires large collections of well-labeled images. However, manually gathering and labeling images can be tedious and time-consuming. The process often introduces labeling errors, such as classifying a frog as a cat. Transfer learning, which the Inception vision model uses, reduces the number of required labeled images and speeds up the training process. Still, ML training always needs large numbers of accurately labeled images.

For certain use cases, such as the mechanical parts shown in this image, you might already have a complete collection of CAD-designed 3D shape files.

[Image: mechanical parts]

You can use photorealistic rendering software to automatically generate a vast number of image variations for these parts. These variations can include background features, orientation, lighting sources, lighting intensity, focal length, and simulated dirt on the lens—whatever features are needed to accurately train a model for a given application. Because rendering software generates these images based on a specific part model, output image labeling is automatic and accurate. No manual labeling is involved.

Rendering is a computationally intensive task, and training datasets benefit from having a large number of views of the same part. To help with rendering and training, Google Cloud provides the ZYNC Render service. ZYNC is an online 3D rendering tool that quickly generates auto-labeled datasets for custom image-based ML models. CAD-based parts are a natural fit for this approach. However, Hollywood has shown that nearly anything can be rendered in a photorealistic way. Cloud-based rendering makes it possible to consider this approach for nearly any scenario where gathering and labeling training images is difficult.

This tutorial provides pre-rendered training images. For details on how to automate image rendering using ZYNC, see this tutorial.

The automated image generation process produces a set of rendered images. To indicate image labels to the ML training task, you store these sets in Cloud Storage in a directory name that corresponds to the set's classification label.
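The directory-as-label convention described above can be illustrated with a short sketch. The bucket name and file names in the usage note are hypothetical, and the helper functions are not part of the tutorial's trainer code.

```python
from typing import Iterable, List, Tuple


def label_from_path(object_path: str) -> str:
    """Return the name of the directory that directly contains the image."""
    path = object_path
    if path.startswith("gs://"):
        # Drop the scheme and the bucket name, keeping only the object name.
        _, _, path = path[len("gs://"):].partition("/")
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"no label directory in {object_path!r}")
    return parts[-2]


def labeled_examples(object_paths: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair every image path with the label implied by its directory."""
    return [(path, label_from_path(path)) for path in object_paths]
```

For example, label_from_path("gs://example-training/part-1/view_0001.png") yields the label part-1, while an image stored at the bucket root raises an error because it has no label directory.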

Model training and packaging

AI Platform provides a fully managed environment for TensorFlow model training. By taking advantage of parallel training, this environment enables you to rapidly iterate and tune model performance, which leads to more accurate models. AI Platform Training outputs a TensorFlow SavedModel. You deploy this artifact to an IoT device and load it into a local prediction server.

Although this tutorial demonstrates the use of cloud-rendered image training datasets, the following architecture applies to any kind of ML model, including models based on IoT sensor datasets.

[Image: general model]

Defining a model identifier

When defining your model, consider how to handle large numbers of training images that are required to train a model version.

Handling training input files

A trained model is the unique combination of 1) the training algorithm code and configuration parameters, and 2) a specific set of training input. You can use source code management tools like Git to manage and version algorithm and configuration code. However, these types of tools can't handle the large number of training images required to generate a specific model version.

Depending on the nature of your training data, the following guidelines can help manage your training input. These guidelines can also help you re-create the specific collection of images used for a training job:

In a production environment, it's important to determine the right data snapshot approach. In this tutorial, you are working only with a demonstration dataset, so you don't need to manage the training data.

Model versions

A model version consists of the specific training dataset and the code used to train it. Like software versions, model versions benefit from following clear conventions such as semantic versioning. With semantic versioning, a major (breaking) change alters the shape of the model input, whereas a minor change improves the accuracy of the model or adds classifications to the output, such as new parts added to a catalog.

A model version does not capture a release channel such as alpha or beta. Instead, a model version resembles a Git hash. In Git, the implicit release channel is referred to as master. GitHub will rename the implicit GitHub release channel to main on October 1, 2020. In Docker, the implicit release channel is referred to as latest. You can use a release channel to route models of different stages to a specific group of IoT devices—for example, an alpha testing group.

The following image shows the movement of a specific build, represented by green or pink, between and across release channels. A channel, represented by a heavily outlined circle, always directs clients to the most recent build in that channel. You can use the same build in multiple channels, as for the pink release shown here.

[Image: builds in multiple channels]

This solution uses [MODEL_NAME]_[MAJOR_VERSION]_[WORKFLOW_START_TIMESTAMP_AS_MINOR_VERSION] as a model version, for example:

  • EquipmentParts_1_1501089026

    • [MODEL_NAME] is EquipmentParts.
    • [MAJOR_VERSION] is 1.
    • [WORKFLOW_START_TIMESTAMP_AS_MINOR_VERSION] is 1501089026.

If a model is also deployed to AI Platform Prediction for online prediction, you can use the model version string in the name field for the online resource. If you use the model with TensorFlow Serving, which supports only an integer version, you can use the Unix timestamp component.
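The naming scheme above can be made concrete with a small helper that builds, parses, and bumps version strings. This is a sketch under assumptions: the ModelVersion type and the function names are hypothetical, and the rule that a breaking change increments the major version follows the semantic versioning convention described earlier.

```python
import re
import time
from typing import NamedTuple, Optional


class ModelVersion(NamedTuple):
    name: str
    major: int
    timestamp: int  # Unix seconds at workflow start, used as the minor version

    def __str__(self) -> str:
        return f"{self.name}_{self.major}_{self.timestamp}"

    @property
    def serving_version(self) -> int:
        """Integer-only version for servers that cannot take a string."""
        return self.timestamp


_VERSION_RE = re.compile(r"^(?P<name>.+)_(?P<major>\d+)_(?P<timestamp>\d+)$")


def parse_model_version(text: str) -> ModelVersion:
    """Split a version string into its name, major version, and timestamp."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a model version string: {text!r}")
    return ModelVersion(match["name"], int(match["major"]), int(match["timestamp"]))


def next_version(previous: ModelVersion, breaking: bool,
                 now: Optional[int] = None) -> ModelVersion:
    """Bump the major version for breaking changes; always take a fresh timestamp."""
    stamp = int(time.time()) if now is None else now
    major = previous.major + 1 if breaking else previous.major
    return ModelVersion(previous.name, major, stamp)
```

Parsing EquipmentParts_1_1501089026 gives the name EquipmentParts, major version 1, and timestamp 1501089026; the serving_version property returns only the timestamp, which is the integer form that a server such as TensorFlow Serving needs.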

Setting up the environment and storage buckets

In this section, you use Cloud Shell to clone the repo, configure it, and customize it for your project. You use the same environment and shell for the rest of the tutorial. Update the REGION setting if you want to run in a region other than the default, us-central1.

Complete the following steps:

  1. First, set the environment variables:

    export FULL_PROJECT=$(gcloud config list project --format "value(core.project)")
    export PROJECT="$(echo $FULL_PROJECT | cut -f2 -d ':')"
    export REGION='us-central1' #OPTIONALLY CHANGE THIS
  2. Create several buckets to be used in different stages:

    gsutil mb -l $REGION gs://$PROJECT-training
    gsutil mb -l $REGION gs://$PROJECT-model-output
    gsutil mb -l $REGION gs://$PROJECT-deploy
  3. Clone the demo repo:

    git clone
    cd cloudml-edge-automation
  4. Run an automated search and replace on some repository files to configure your project. Then push these changes to your project's cloud repository. You need a copy of this repository online because some platform tools directly pull code from this repository:


    If you have not configured git before in Cloud Shell:

    git config --global user.email "[YOUR_EMAIL]"
    git config --global user.name "Your Name"
    git commit -a -m "custom project"
    gcloud source repos create ml-automation
    git config credential.helper
    git remote add google$FULL_PROJECT/r/ml-automation
    git push google main
  5. Finally, push your now-modified template files into a bucket:

    gsutil -m rsync -r model-deployment/deploy-bucket-bootstrap/ gs://$PROJECT-deploy

Running the training job

This tutorial uses image-based training data, but you can apply the automated principles of training on the cloud and distributing to the edge for predictions to any TensorFlow model. With images as the input, it makes sense to take advantage of the transfer learning capabilities of the Inception-v3 vision model. This approach creates a custom model for a distinct image classification task by retraining on just the latter layers of a deep neural network model. For more about Inception transfer learning, see this blog post.

Using the provided rendered training data in Cloud Storage, you use the following commands to initiate a standard AI Platform Training job:

cd trainer
pip install --user -r requirements.txt

export MODEL_NAME=equipmentparts
export MODEL_VERSION="${MODEL_NAME}_1_$(date +%s)"


The training consists of image preprocessing and ML training in managed services. You can track training progress in the Dataflow console for preprocessing tasks, and then in the ML section of the console.

Packaging models in containers for device delivery

The result of a training job is a TensorFlow SavedModel stored in a Cloud Storage bucket at a path specific to a model ID. You can run this portable model in a variety of ways. This tutorial focuses on getting the model deployed to devices. The tutorial also shows how to deploy the trained model online in parallel so that other services can use it for batch and online predictions.

Even on Google Cloud, ML jobs can take some time to complete. The provided sample training set and configuration take about 15–20 minutes for image preprocessing in Dataflow, and 10 minutes of training in AI Platform.

After training completes, you don't want to perform a series of manual tasks to push the trained model to a device through multiple release stages.

First, you automate packaging of the trained model and model runner code into a form that can be easily delivered to a device. To accomplish this packaging, borrow a common technique used for server-based applications and package the model data and runner engine in a Docker container.

The packaging step uses Cloud Build. Cloud Build is a managed container builder service that can compose a final Docker image from multiple build steps. The resulting container is stored in Container Registry, which offers a private container registry that helps to securely store sensitive models. Container Registry also provides authenticated container downloads to devices. With global network edge caching, Container Registry can reliably deliver packaged models to globally distributed devices.

To automate the Cloud Build build, you use a Cloud Function to watch for the final parts of the saved model to be written to Cloud Storage. Use the built-in trigger channel between Cloud Storage and Cloud Functions to monitor all file writes in your output bucket. Then trigger the Cloud Build build after the last exported model file is written.

To deploy the Cloud Function, run the following commands:

cd ../model-packaging/build-trigger-function/
gcloud functions deploy modelDoneWatcher --trigger-bucket $PROJECT-model-output

The Cloud Function performs the following tasks:

  • Reacts to the last file written by the training output.
  • Extracts the overall model output location and the model-version encoded in the path.
  • Uses this data to populate variables in a build submission, which is sent to the Cloud Build service. One build submission is dispatched for each of the multiple runtime architectures of the model-serving package.
  • Deploys a named version of the resulting model to AI Platform for online prediction. This feature is optional and can be disabled.

Refer to the Cloud Function source in the solution repository for additional details.
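As a rough illustration of the first three tasks in the list above, the following Python sketch filters storage events and derives build parameters. It is not the deployed function's source, which lives in the repository and may differ. The object layout (model name, model version, model, TRAINER-DONE) is inferred from the Cloud Storage paths used elsewhere in this tutorial, and the substitution keys are hypothetical.

```python
from typing import Dict, List, Optional

DONE_MARKER = "TRAINER-DONE"  # marker object name taken from this tutorial's paths
ARCHITECTURES = ["x86_64", "armv7l"]  # target architectures named in this tutorial


def parse_done_event(bucket: str, object_name: str) -> Optional[Dict[str, str]]:
    """Return model details if the written object is the completion marker, else None."""
    parts = object_name.split("/")
    # Assumed layout: <model_name>/<model_version>/model/TRAINER-DONE
    if len(parts) != 4 or parts[2] != "model" or parts[3] != DONE_MARKER:
        return None
    model_name, model_version = parts[0], parts[1]
    return {
        "model_name": model_name,
        "model_version": model_version,
        "model_path": f"gs://{bucket}/{model_name}/{model_version}/model",
    }


def build_requests(details: Dict[str, str]) -> List[Dict[str, str]]:
    """Prepare one build submission per architecture (substitution keys are hypothetical)."""
    return [
        {
            "_ARCH": arch,
            "_MODEL_VERSION": details["model_version"],
            "_MODEL_PATH": details["model_path"],
        }
        for arch in ARCHITECTURES
    ]
```

Returning None for every object other than the marker keeps the function cheap, because a bucket trigger fires for each file that the training job writes.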

This package requires a base image to exist. Create it with these commands:

cd ../model_base_container/
gcloud builds submit --config cloudbuild-model-base-x86_64.yaml .

After receiving the build requests from the Cloud Function, the Cloud Build service executes the following steps:

  1. Copies the model output into the container.
  2. Copies the model running service code into the container.
  3. Applies the model version and latest tags to the built container.

Automating deployment of the packaged model container

This section explains how to automate the process of taking a model packaged by the preceding steps and delivering it to an edge device. The following diagram shows the workflow.


Using containers as the packaging type allows you to take advantage of a well-understood, encapsulated, and portable runtime packaging system. It also lets you use some of the tools and patterns for server container delivery to provide over-the-air (OTA) updates.

Kubernetes is a widely deployed and sophisticated container orchestration system. Running the full Kubernetes control plane is possible on substantial (server-sized) edge infrastructure, or in hybrid environments, but it is impractical on most IoT devices with limited resources. These devices might have only 1 or 2 GB of RAM and run on slower processors than typical servers.

However, you can still use various Kubernetes system components, such as Pods and the Kubelet, on limited IoT devices. The Kubelet is the primary agent sitting on each node of a Kubernetes cluster. Some of its key tasks are:

  • Downloading specified containers from the registry.
  • Mounting the pod's required volumes.
  • Running the pod's containers using a container runtime.

The solution in this tutorial is not trying to run a fleet of many IoT devices as a large distributed Kubernetes cluster; that's a Kubernetes anti-pattern. The Kubernetes control plane and services are designed around a datacenter-class low-latency network that connects nodes within a cluster to each other. Instead, the solution uses Kubernetes building blocks as a declarative-style container download agent and process manager.

Normally, specifying which Kubernetes pods to run on a machine comes directly from the API server running on the control plane cluster. However, the Kubelet can use a local manifests folder containing PodSpec-formatted YAML files, which the solution uses for conveying which models to load and serve on the device.

One result of using the Kubelet alone is that you can deploy pods but not higher cluster-level abstractions such as deployments or replication sets. For this solution, a couple of container-based services is all that's required to run on an IoT device, so using pods is sufficient. You have already built your package as a container and stored it in Container Registry. Container Registry helps securely and reliably deliver layers that compose a Docker image.

Next, learn what you need to install on remote devices for them to start receiving model updates.

Image tags and release tracks

The Model versions section described the idea of using release channels, or tracks, to target different populations of devices with different model versions—for example, testing a new model version on a set of alpha devices.

Docker images support the concept of associating one or more tags with an image. Each image contains a set of layer-versions marked by a digest. A specific tag can be associated with only one digest at a time. By convention, the latest tag is applied to the most recently added digest layer of the image.
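The rule that a tag points at exactly one digest, while a digest can carry several tags, can be modeled with a toy index. This is a conceptual sketch only; the TagIndex class is hypothetical and does not call any registry API.

```python
from typing import Dict, List


class TagIndex:
    """Toy model of tag-to-digest bookkeeping for a single image repository."""

    def __init__(self) -> None:
        self._tag_to_digest: Dict[str, str] = {}

    def tag(self, digest: str, tag: str) -> None:
        # Assigning an existing tag moves it: the previous digest loses the tag.
        self._tag_to_digest[tag] = digest

    def digest_for(self, tag: str) -> str:
        return self._tag_to_digest[tag]

    def tags_for(self, digest: str) -> List[str]:
        return sorted(t for t, d in self._tag_to_digest.items() if d == digest)
```

Because the mapping is keyed by tag, applying latest to a second digest automatically removes it from the first, while that second digest can still hold other tags such as alpha at the same time.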

You use image tags to denote which version of an image is associated with which release track. This solution uses the following tracks:

  • latest
  • alpha
  • beta
  • stable

After the automated Cloud Build service builds the image versions, it applies the latest tag to each image version. You can apply other tags through the Container Registry console or with other CLI tools.

Kubernetes does not re-pull a given image tag even if its underlying digest has changed. You need a way to automatically generate new or updated Kubernetes pod manifests any time a new model version is built or whenever a tag is otherwise updated.
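One way to make a changed digest visible to the Kubelet is to reference the image by digest inside the manifest, so that every new build produces different manifest text. The following is a minimal sketch of such a renderer. The pod name, image name, and port are hypothetical parameters, and the repository's actual manifest template may contain more fields.

```python
def render_pod_manifest(pod_name: str, image: str, digest: str, port: int) -> str:
    """Return a minimal Pod manifest that references the image by digest, not by tag."""
    if not digest.startswith("sha256:"):
        raise ValueError("expected a digest of the form sha256:<hex>")
    return (
        "apiVersion: v1\n"
        "kind: Pod\n"
        "metadata:\n"
        f"  name: {pod_name}\n"
        "spec:\n"
        "  containers:\n"
        f"  - name: {pod_name}\n"
        f"    image: {image}@{digest}\n"
        "    ports:\n"
        f"    - containerPort: {port}\n"
    )
```

Two builds that share the latest tag but have different digests render to different files, which gives the device-side agent a concrete change to act on.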

You can accomplish this by using Cloud Functions as your automation glue. Container Registry publishes build notifications to Cloud Pub/Sub. You use Cloud Functions to monitor these image changes in the Container Registry and automatically write the new or modified Pod manifests into your deployment bucket. Because your model packager builds an image for each target device architecture, you have a subfolder for each architecture. That architecture folder contains a folder for each release channel. You can use a special all folder to hold common base packages that need to be deployed to all release channels.

A typical directory structure looks like this:

  • Deployment bucket

    • x86_64

      • all
      • latest
      • alpha
      • beta
      • stable
    • armv7l

      • all
      • latest
      • alpha
      • beta
      • stable
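The folder layout above maps directly to object names in the deployment bucket. The sketch below shows one possible way to compute them; the function names and the .yaml file naming are assumptions, although the latter matches the sync-pod.yaml path used later in this tutorial.

```python
from typing import Iterable, List

ARCHITECTURES = ("x86_64", "armv7l")
RELEASE_CHANNELS = ("latest", "alpha", "beta", "stable")
COMMON_FOLDER = "all"  # packages deployed to every release channel


def manifest_object_name(arch: str, folder: str, pod_name: str) -> str:
    """Build the bucket object name for one manifest."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if folder not in RELEASE_CHANNELS and folder != COMMON_FOLDER:
        raise ValueError(f"unknown release channel: {folder!r}")
    return f"{arch}/{folder}/{pod_name}.yaml"


def manifests_for_tags(arch: str, image_tags: Iterable[str], pod_name: str) -> List[str]:
    """Map an image's tags to manifest locations, ignoring tags that are not channels."""
    channels = [tag for tag in image_tags if tag in RELEASE_CHANNELS]
    return [manifest_object_name(arch, channel, pod_name) for channel in channels]
```

An image tagged with both a model version and latest produces a manifest only under the latest folder; the model version tag is skipped because it does not name a release channel.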

Deploy the Cloud Function to monitor image changes in Container Registry and write out updated PodSpec-formatted YAML files to the deployment bucket folders:

cd ../../model-deployment/tag-monitor-function/
gcloud pubsub topics create gcr
gcloud functions deploy tagMonitor --trigger-topic gcr

Running device-side components

This solution supports running the model on multiple compute architectures: x86_64, which is common to laptops and servers, and ARM, which is common to many IoT devices.

To demonstrate the automation workflow in a way testable to the widest audience, this section uses a Compute Engine virtual machine (VM) as a mock IoT device. If you have the required hardware, you can deploy the ARM containers to a smaller constrained device such as the common Raspberry Pi.

Two major pieces of management infrastructure exist for the target device:

  • Kubelet

    You need to get the Kubelet installed and running.

  • Configuration

    You need to bootstrap the YAML pod folder with a pointer to a pod sync service container.

Creating a mock device on a VM instance

  1. Run the following command to create an Ubuntu-based VM instance that functions as a mock IoT device:

    gcloud compute --project $FULL_PROJECT \
        instances create "mock-device" \
        --zone "us-central1-f" \
        --machine-type "g1-small" \
        --subnet "default" \
        --scopes "" \
        --image "ubuntu-1710-artful-v20180109" \
        --image-project "ubuntu-os-cloud" \
        --boot-disk-size "10" \
        --boot-disk-type "pd-standard" \
        --boot-disk-device-name "mock-device-disk"
  2. Copy the setup script to the device:

    cd ../../client-files/
    gcloud compute scp --zone "us-central1-f" kubelet.service mock-device:/tmp
    # note you might need to generate GCP SSH keys if this is the first time
    gcloud compute scp --zone "us-central1-f" mock-device:
    gcloud compute scp --zone "us-central1-f" mock-device:
  3. Connect to and execute the setup script. To distinguish between Cloud Shell and the mock device's shell, open an SSH terminal to the device directly, not from within Cloud Shell. You can do this from the Cloud Console, or your local terminal.

    # click the SSH button in the Compute Engine list or run
    # gcloud compute ssh --zone "us-central1-f" mock-device
    # in the mock-device shell:
    chmod +x

    This script installs Docker and the Kubelet as a service.

Building your sync service pods

Next, you create a container responsible for syncing your pod manifests between the updated deployment bucket and the target device. A different configuration of this synchronization tool was bootstrapped for each release channel. Which version of the tool you use determines the device's release channel. For example, if you use the beta sync tool on a device, you pull down beta versions of the model. This tool uses gsutil rsync to download and keep current the latest manifest versions.

Create the sync pods

These Cloud Build tasks build containers running scripts that periodically call the gsutil rsync command. The container picks up environment variable settings specified in the PodSpec-formatted YAML file specific to each release channel. Run the following code in Cloud Shell:

cd ../model-deployment/sync-pod
gcloud builds submit . --config=cloudbuild-x86_64.yaml &
gcloud builds submit . --config=cloudbuild-armv7l.yaml
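The periodic synchronization that these containers perform could look roughly like the following Python sketch. The actual scripts in the repository may be written differently. The environment variable names (DEPLOY_BUCKET, DEVICE_ARCH, RELEASE_CHANNEL, POD_DIR) and the 60-second interval are assumptions; the default destination matches the /opt/device/pods/ folder used elsewhere in this tutorial.

```python
import subprocess
import time
from typing import List, Mapping


def rsync_command(env: Mapping[str, str]) -> List[str]:
    """Build the gsutil rsync invocation for one device configuration."""
    source = f"gs://{env['DEPLOY_BUCKET']}/{env['DEVICE_ARCH']}/{env['RELEASE_CHANNEL']}"
    destination = env.get("POD_DIR", "/opt/device/pods")
    # -r makes the synchronization recursive.
    return ["gsutil", "rsync", "-r", source, destination]


def sync_forever(env: Mapping[str, str], interval_seconds: float = 60.0) -> None:
    """Call gsutil rsync repeatedly; a failed run is retried on the next cycle."""
    while True:
        subprocess.run(rsync_command(env), check=False)
        time.sleep(interval_seconds)
```

Calling sync_forever(os.environ) from the container entry point would pick up the settings supplied by the pod manifest, so changing RELEASE_CHANNEL is enough to move a device to a different release channel.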

Install a sync pod onto the device

Now in the mock device's shell, run the following code to install the sync pod for the latest release channel:

export PROJECT=`curl -s "" -H "Metadata-Flavor: Google"`

sudo gsutil cp gs://$PROJECT-deploy/x86_64/latest/sync-pod.yaml /opt/device/pods/

After a few moments, you should be able to see the pod running by using the following command:

sudo docker ps

As soon as this container is running, it downloads the containers of the specified release channel.

Exercising the deployment automation

Before you close the loop between the ML job training and the packager, it's useful to step back and get a sense of how the automated deployment works. The odds are that the first ML training job you submitted completed before all of the Cloud Function–based automation was deployed. Before you submit and test another model version, try the automation you deployed with something that builds much faster.

For this approach, you use a simple web server, which you can quickly iterate to create new versions. When you have generated a new version, you test the effect of manipulating image tags in the resulting OTA updates.

Generate the web server by running the following code in Cloud Shell:

cd ../simple-web-test
gcloud builds submit . --config=cloudbuild-x86_64.yaml

When this web server is built:

  • A new container is created.
  • The container is tagged with latest.
  • A pod manifest file is rendered with the hash of that image.
  • The device's sync pod pulls down that new manifest.
  • The Kubelet pulls down the container and ensures that it is running.

To verify this behavior, complete the following steps:

  1. In the device's shell, determine the current deployed version:

    curl localhost:8080/version/

    The output is the current image hash being served by this container, consisting of a long random string.

  2. Next, repeat the cloud build command in Cloud Shell:

    gcloud builds submit . --config=cloudbuild-x86_64.yaml
  3. All the previous steps are repeated, and after a minute or so the device is running the latest image. Verify this by repeating the version check in the device's shell:

    curl localhost:8080/version/
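Instead of repeating the curl command by hand, you could poll the version endpoint until the reported hash changes. The following sketch assumes that the endpoint returns the hash as plain text, which this tutorial describes but does not specify in detail; the helper names and timeout values are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def fetch_version(url: str = "http://localhost:8080/version/") -> str:
    """Read the image hash reported by the running container."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8").strip()


def wait_for_new_version(previous: str,
                         fetch: Callable[[], str] = fetch_version,
                         timeout_seconds: float = 300.0,
                         poll_seconds: float = 5.0,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the reported version differs from the previous one."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            current = fetch()
        except OSError:
            current = previous  # the container may be restarting; try again
        if current != previous:
            return current
        sleep(poll_seconds)
    raise TimeoutError("version did not change before the timeout")
```

Passing the fetch and sleep functions as parameters keeps the polling logic testable without a running device.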

In the Container Registry console, you can see the different build versions of the container. You can change which specific image is deployed to the latest track by editing the tags. You can apply a specific tag to only one image version. Verify this by editing the tags of an earlier build and adding the latest tag. The tag is removed from the true latest image version and applied to the one you're editing.

You can also add the alpha tag to a specific image version. If you look in the deployment bucket, you see a new pod manifest for that hash in the alpha deployment subfolder.

This process allows you to easily migrate different model builds from latest, to alpha, beta, and so on. These versions are automatically delivered to the corresponding target device populations.

To show the full end-to-end automation working, you need to kick off another training job. You can either emulate the completion of training by modifying a marker file, or rerun the model training to create a new model version.

To emulate the completion of your earlier training:

gsutil mv -p gs://$PROJECT-model-output/equipmentparts/$MODEL_VERSION/model/TRAINER-DONE gs://$PROJECT-model-output/equipmentparts/$MODEL_VERSION/model/TRAINER-DONEx
gsutil mv -p gs://$PROJECT-model-output/equipmentparts/$MODEL_VERSION/model/TRAINER-DONEx gs://$PROJECT-model-output/equipmentparts/$MODEL_VERSION/model/TRAINER-DONE

To rerun the model training, first set a new, unique model version environment variable in Cloud Shell:

export MODEL_VERSION="${MODEL_NAME}_1_$(date +%s)"

If you have disconnected and reconnected, you must reinstall the Python requirements as described previously, because they are not persisted in the Cloud Shell environment. Move to the trainer directory of the repo and execute the following command:


When this emulation or training completes, you see the following automatically generated artifacts:

  1. Container registry images for each target architecture. These images contain the packaged model along with the small model server example.
  2. Updated pod YAML manifests in the latest subfolder corresponding to each architecture in the [PROJECT]-deploy bucket.

Running the model inference

The model runner is implemented as a small web app that is packaged with the specific model version into a container.

You can also distribute a client as a container, but in this case you enter commands directly into the device's shell.

  1. Verify an update by checking the version endpoint:

    curl localhost:5000/version/

    This command returns the current image hash.

  2. Run an inference on the current model run by the container:

    wget -P /tmp/images
    python /images/part1-test.jpeg

    The output looks similar to the following:


    This output shows the sorted predictions of the model. In this case, the model reports 96% confidence that the pictured part is part-1. You now have an updated TensorFlow model automatically deployed to the edge. From here, you can act on that data locally or use IoT Core to report the inference telemetry to cloud-hosted applications.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next