# Running Inception on Cloud TPU


This tutorial shows you how to train the Inception model on Cloud TPU.

## Disclaimer

This tutorial uses a third-party dataset. Google provides no representation, warranty, or other guarantees about the validity, or any other aspects, of this dataset.

## Model description

Inception v3 is an image recognition model that can attain significant accuracy. The model is the culmination of many ideas developed by multiple researchers over the years. It is based on the original paper "Rethinking the Inception Architecture for Computer Vision" by Szegedy et al.

The model is made up of symmetric and asymmetric building blocks, including:

* convolutions
* average pooling
* max pooling
* concatenations
* dropouts
* fully connected layers

Loss is computed with softmax.

The following image shows a high-level view of the model:

[Image: high-level diagram of the Inception v3 model]

You can find more information about the model on GitHub.

The model is built using the Estimator API.

The API simplifies model creation by encapsulating most low-level functions, allowing you to focus on model development rather than the inner workings of the underlying hardware that runs it.

## Objectives

* Create a Cloud Storage bucket to hold your dataset and model output.
* Run the training job.
* Verify the output results.

## Costs

This tutorial uses the following billable components of Google Cloud:

* Compute Engine
* Cloud TPU
* Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

## Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.


4. This walkthrough uses billable components of Google Cloud. To estimate your costs, see the Cloud TPU pricing page. Be sure to clean up the resources you create when you're done with them to avoid unnecessary charges.

## Set up your resources

This section provides information on setting up the Cloud Storage bucket, VM, and Cloud TPU resources for this tutorial.

1. Open a Cloud Shell window.

    Open Cloud Shell

2. Create a variable for your project's ID.

    export PROJECT_ID=project-id
    
3. Configure the Google Cloud CLI to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_ID}
    

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make Google Cloud API calls with your credentials.

4. Create a Service Account for the Cloud TPU project.

    gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
    

    The command returns a Cloud TPU Service Account with the following format:

    service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
    

5. Create a Cloud Storage bucket using the following command. Replace bucket-name with a name for your bucket.

    gsutil mb -p ${PROJECT_ID} -c standard -l us-central1 -b on gs://bucket-name
    

    This Cloud Storage bucket stores the data you use to train your model and the training results. The ctpu up tool used in this tutorial sets up default permissions for the Cloud TPU Service Account. If you need finer-grain permissions, review the access level permissions.

    The bucket location must be in the same region as your virtual machine (VM) and your TPU node. VMs and TPU nodes are located in specific zones, which are subdivisions within a region.

6. Launch the Compute Engine resources using the ctpu up command.

    ctpu up --project=${PROJECT_ID} \
     --zone=us-central1-b \
     --vm-only \
     --machine-type=n1-standard-8 \
     --tf-version=1.15.5 \
     --name=inception-tutorial
    

    Command flag descriptions

    `project`
    Your Google Cloud project ID.
    `zone`
    The zone where you plan to create your Cloud TPU.
    `vm-only`
    Create a VM only. By default, the ctpu up command creates both a VM and a Cloud TPU.
    `machine-type`
    The machine type of the Compute Engine VM to create.
    `tf-version`
    The version of TensorFlow that ctpu installs on the VM.
    `name`
    The name of the Cloud TPU to create.

    For more information on the CTPU utility, see the CTPU Reference.

7. When prompted, press y to create your Cloud TPU resources.

    To verify that you are logged in to your Compute Engine VM, check that your shell prompt has changed from username@projectname to username@vm-name. If you are not connected to the Compute Engine instance, you can connect by running the following command:

    gcloud compute ssh inception-tutorial --zone=us-central1-b
    

    From this point on, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance.

8. Create an environment variable for your storage bucket. Replace bucket-name with the name of your Cloud Storage bucket.

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    
9. Create an environment variable for your TPU name.

    (vm)$ export TPU_NAME=inception-tutorial
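
As an optional sanity check (not part of the original steps, and assuming `gsutil` is available on the VM), you can confirm that the bucket exists and that its location matches the region of the zone you chose (`us-central1`):

    (vm)$ gsutil ls -L -b ${STORAGE_BUCKET} | grep -i location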

## Training dataset

Your training application should be able to access your training data in Cloud Storage. The training application also uses your Cloud Storage bucket to store checkpoints during training.

ImageNet is an image database. The images in the database are organized into a hierarchy, with each node of the hierarchy depicted by hundreds and thousands of images.

This tutorial uses a demonstration version of the full ImageNet dataset, referred to as the fake_imagenet dataset. This demonstration version lets you test the tutorial without the storage and time required to download the full ImageNet database and run a model against it. Alternatively, you can use the full ImageNet dataset.

The DATA_DIR environment variable specifies the dataset to train on.

Note: The fake_imagenet dataset is only useful for understanding how to use a Cloud TPU. The accuracy numbers and saved model are not meaningful.

The fake_imagenet dataset is at this location on Cloud Storage:

gs://cloud-tpu-test-datasets/fake_imagenet
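
To confirm that the dataset is readable from your VM, you can list a few of its files. This is an optional check, not a step from the original tutorial, and it assumes `gsutil` is available on the VM:

    (vm)$ gsutil ls gs://cloud-tpu-test-datasets/fake_imagenet | head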

## (Optional) Set up TensorBoard

TensorBoard offers a suite of tools designed to present TensorFlow data visually. When used for monitoring, TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.

If you don't need to monitor the model's output, you can skip the TensorBoard setup steps.

If you want to monitor the model's output and performance, follow the guide to setting up TensorBoard.
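
If you do set it up, a typical invocation points TensorBoard at the model directory used later in this tutorial. This is a sketch, not a step from the guide; it assumes TensorBoard is installed on the VM and that you view it through Cloud Shell's web preview on port 8080:

    (vm)$ tensorboard --logdir=${STORAGE_BUCKET}/inception --port=8080 &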

## Run the model

You are now ready to train and evaluate the Inception v3 model using ImageNet data.

The Inception v3 model is pre-installed on your Compute Engine VM, in the /usr/share/tpu/models/experimental/inception/ directory.

In the following steps, a prefix of (vm)$ means you should run the command on your Compute Engine VM:

1. Set up a DATA_DIR environment variable containing one of the following values:

    * If you are using the fake_imagenet dataset:

      (vm)$ export DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
      
    * If you have uploaded a set of training data to your Cloud Storage bucket:

      (vm)$ export DATA_DIR=${STORAGE_BUCKET}/data
      
2. Run the Inception v3 model:

    (vm)$ python /usr/share/tpu/models/experimental/inception/inception_v3.py \
        --tpu=$TPU_NAME \
        --learning_rate=0.165 \
        --train_steps=250000 \
        --iterations=500 \
        --use_tpu=True \
        --use_data=real \
        --mode=train_and_eval \
        --train_steps_per_eval=2000 \
        --data_dir=${DATA_DIR} \
        --model_dir=${STORAGE_BUCKET}/inception
    * `--tpu` specifies the name of the Cloud TPU. ctpu passes this name to the Compute Engine VM as an environment variable (TPU_NAME).
    * `--use_data` specifies which type of data the program must use during training, either fake or real. The default value is fake.
    * `--data_dir` specifies the Cloud Storage path for training input. The application ignores this parameter when you're using fake_imagenet data.
    * `--model_dir` specifies the directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the model_dir must be a Cloud Storage path (gs://...). You can reuse an existing folder to load current checkpoint data and to store additional checkpoints. You must use the same version of TensorFlow to write and load checkpoints.
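
While training runs, you can watch checkpoints and summaries accumulate in the model directory from another shell. This is an optional check, assuming the `model_dir` shown above:

    (vm)$ gsutil ls ${STORAGE_BUCKET}/inception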

## Operating modes

Inception v3 operates on 299x299 images. The default training batch size is 1024, which means that each iteration operates on 1024 of those images.

You can use the --mode flag to select one of three modes of operation: train, eval, and train_and_eval.

* --mode=train or --mode=eval specifies a training-only or an evaluation-only job.
* --mode=train_and_eval specifies a hybrid job that does both training and evaluation.

Train-only jobs run for the number of steps defined in train_steps, and can go through the entire training set if desired.

Train_and_eval jobs cycle through training and evaluation segments. Each training cycle runs for train_steps_per_eval steps, followed by an evaluation job that uses the weights trained up to that point.

The number of training cycles is the floor of train_steps divided by train_steps_per_eval:

    floor(train_steps / train_steps_per_eval)

For example, with the --train_steps=250000 and --train_steps_per_eval=2000 values from the command above, the model runs floor(250000 / 2000) = 125 training-and-evaluation cycles.

By default, Estimator API-based models report loss values every set number of steps, in the following format: `step = 15440, loss = 12.6237`

## Discussion: Modifying the model for TPUs

Only minimal modifications are needed to make an Estimator API-based model run on a TPU. The program imports the following libraries:

    from google.third_party.tensorflow.contrib.tpu.python.tpu import tpu_config
    from google.third_party.tensorflow.contrib.tpu.python.tpu import tpu_estimator
    from google.third_party.tensorflow.contrib.tpu.python.tpu import tpu_optimizer

The CrossShardOptimizer function wraps the optimizer, as follows:

    if FLAGS.use_tpu:
      optimizer = tpu_optimizer.CrossShardOptimizer(optimizer)

The function that defines the model returns an Estimator specification using:

    return tpu_estimator.TPUEstimatorSpec(
        mode=mode, loss=loss, train_op=train_op, eval_metrics=eval_metrics)

The main function defines an Estimator-compatible configuration using:

    run_config = tpu_config.RunConfig(
        master=tpu_grpc_url,
        evaluation_master=tpu_grpc_url,
        model_dir=FLAGS.model_dir,
        save_checkpoints_secs=FLAGS.save_checkpoints_secs,
        save_summary_steps=FLAGS.save_summary_steps,
        session_config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement),
        tpu_config=tpu_config.TPUConfig(
            iterations_per_loop=iterations,
            num_shards=FLAGS.num_shards,
            per_host_input_for_training=per_host_input_for_training))

The program uses this defined configuration and the model definition function to create an Estimator object:

    inception_classifier = tpu_estimator.TPUEstimator(
        model_fn=inception_model_fn,
        use_tpu=FLAGS.use_tpu,
        config=run_config,
        params=params,
        train_batch_size=FLAGS.train_batch_size,
        eval_batch_size=eval_batch_size,
        batch_axis=(batch_axis, 0))

Train-only jobs call only the train function:

    inception_classifier.train(
        input_fn=imagenet_train.input_fn, steps=FLAGS.train_steps)

Evaluation-only jobs get their data from available checkpoints and wait until a new checkpoint becomes available:

    for checkpoint in get_next_checkpoint():
      eval_results = inception_classifier.evaluate(
          input_fn=imagenet_eval.input_fn,
          steps=eval_steps,
          hooks=eval_hooks,
          checkpoint_path=checkpoint)

When you choose the option train_and_eval, the training and evaluation jobs run in parallel. During evaluation, trainable variables are loaded from the latest available checkpoint. Training and evaluation cycles repeat as specified in the flags:

    for cycle in range(FLAGS.train_steps // FLAGS.train_steps_per_eval):
      inception_classifier.train(
          input_fn=imagenet_train.input_fn, steps=FLAGS.train_steps_per_eval)

      eval_results = inception_classifier.evaluate(
          input_fn=imagenet_eval.input_fn, steps=eval_steps, hooks=eval_hooks)

If you used the fake\_imagenet dataset to train the model, proceed to
[clean up](#clean-up).

## Using the full ImageNet dataset {: #full-dataset }

The ImageNet dataset consists of three parts: training data, validation data,
and image labels.

The training data contains 1000 categories and 1.2 million images, packaged for
easy downloading. The validation and test data are not contained in the ImageNet
training data (duplicates have been removed).

The validation and test data consists of 150,000 photographs, collected from
[Flickr](https://www.flickr.com/) and other search engines, hand labeled with
the presence or absence of 1000 object categories. The 1000 object categories
contain both internal nodes and leaf nodes of ImageNet, but do not overlap with
each other. A random subset of 50,000 of the images with labels has been
released as validation data along with a list of the 1000 categories. The
remaining images are used for evaluation and have been released without labels.

### Steps to pre-process the full ImageNet dataset

There are five steps to preparing the full ImageNet dataset for use by a machine
learning model:

1. Verify that you have space on the download target.
1. Set up the target directories.
1. Register on the ImageNet site and request download permission.
1. Download the dataset to local disk or Compute Engine VM.

   Note: Downloading the Imagenet dataset to a Compute Engine VM takes
   considerably longer than downloading to your local machine (approximately 40
   hours versus 7 hours). If you download the dataset to your local
   machine, you must copy the files to a Compute Engine VM to pre-process them.
   You must then upload the files to Cloud Storage before using them to train
   your model. Copying the training and validation files from your local machine to the VM
   takes about 13 hours. The recommended approach is to download the dataset to
   a VM.

1. Run the pre-processing and upload script.

### Verify space requirements

Whether you download the dataset to your local machine or to a Compute Engine
VM, you need about 300 GB of space available on the download target. On a VM, you
can check your available storage with the `df -ha` command.

Note: If you use `gcloud compute` to set up your VM, it will allocate 250 GB by
default.

You can increase the size of the VM disk using one of the following methods:

*  Specify the `--disk-size` flag on the `gcloud compute` command line with the
   size, in GB, that you want allocated.
*  Follow the Compute Engine guide to [add a disk][add-disk] to your
   VM.
   * Set **When deleting instance** to **Delete disk** to ensure that the
     disk is removed when you remove the VM.
   * Make a note of the path to your new disk. For example: `/mnt/disks/mnt-dir`.
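
For example, the boot disk of an existing VM can also be resized in place. This is a hedged sketch rather than a step from this tutorial; it assumes the boot disk shares the VM's name (`inception-tutorial`) and the zone used earlier, and the filesystem may still need to grow to match (many Compute Engine images do this automatically on reboot):

<pre class="prettyprint lang-sh tat-dataset">
gcloud compute disks resize inception-tutorial --size=400GB --zone=us-central1-b
</pre>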

### Set up the target directories

On your local machine or Compute Engine VM, set up the directory structure to
store the downloaded data.

*  Create and export a home directory for the ImageNet dataset.

   Create a directory, for example, `imagenet` under your home directory on
   your local machine or VM. Under this directory, create two sub directories:
   `train` and `validation`. Export the home directory as IMAGENET_HOME:

   <pre class="prettyprint lang-sh tat-dataset">
   export IMAGENET_HOME=~/imagenet
   </pre>
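
   For example, a minimal sketch matching the layout described above:

   <pre class="prettyprint lang-sh tat-dataset">
   mkdir -p $IMAGENET_HOME/train $IMAGENET_HOME/validation
   </pre>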

### Register and request permission to download the dataset

*  Register on the [Imagenet website](http://image-net.org/). You cannot
   download the dataset until ImageNet confirms your registration and sends you
   a confirmation email. If you do not get the confirmation email within a
   couple of days, contact [ImageNet support](mailto:support@image-net.org) to
   see why your registration has not been confirmed. Once your registration is
   confirmed, you can download the dataset. The Cloud TPU tutorials that use the
   ImageNet dataset use the images from the ImageNet Large Scale Visual
   Recognition Challenge 2012 (ILSVRC2012).

### Download the ImageNet dataset

1. From the [LSVRC 2012 download site](https://image-net.org/challenges/LSVRC/2012/2012-downloads.php),
   go to the Images section on the page and right-click
   "Training images (Task 1 & 2)" to get the URL for downloading
   the largest part of the training set. Save the URL.

   Right-click "Training images (Task 3)" to get the URL for the second
   training set. Save the URL.

   Right-click "Validation images (all tasks)" to get the URL for the
   validation dataset. Save the URL.

   If you download the ImageNet files to your local machine, you need to copy
   the directories on your local machine to the corresponding `$IMAGENET_HOME`
   directory on your Compute Engine VM. Copying the ImageNet dataset from
   local host to your VM takes approximately 13 hours.

   The following command copies the files under
   $IMAGENET_HOME on your local machine to <var>~/imagenet</var> on your VM (<var>username@vm-name</var>):

   <pre class="prettyprint lang-sh tat-dataset">
   gcloud compute scp --recurse $IMAGENET_HOME <var>username@vm-name</var>:~/imagenet
   </pre>

1. From $IMAGENET_HOME, use `wget` to download the training and validation files
   using the saved URLs.

   The "Training images (Task 1 & 2)" file is the large training set. It is
   138 GB and if you are downloading to a Compute Engine VM using the Cloud
   Shell, the download takes approximately 40 hours. If the Cloud Shell loses its
   connection to the VM, you can prepend `nohup` to the command or use
   [screen](https://linuxize.com/post/how-to-use-linux-screen/).

   <pre class="prettyprint lang-sh tat-dataset">
   cd $IMAGENET_HOME \
   nohup wget http://image-net.org/challenges/LSVRC/2012/dd31405981ef5f776aa17412e1f0c112/ILSVRC2012_img_train.tar
   </pre>

   This command downloads a large tar file: ILSVRC2012_img_train.tar.

   From $IMAGENET_HOME on the VM, extract the individual training directories
   into the `$IMAGENET_HOME/train` directory using the following command. The
   extraction takes between 1 and 3 hours.

   <pre class="prettyprint lang-sh tat-dataset">
   tar xf ILSVRC2012_img_train.tar
   </pre>

   Extract the individual training tar files located in the $IMAGENET_HOME/train
   directory, as shown in the following script:

   <pre class="prettyprint lang-sh tat-dataset">
   cd $IMAGENET_HOME/train

   for f in *.tar; do
     d=`basename $f .tar`  # name the directory after the tar file (a WordNet ID)
     mkdir $d
     tar xf $f -C $d       # extract each class archive into its own directory
   done
   </pre>

   Delete the tar files after you have extracted them to free up disk space.
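
   For example, assuming the paths used above and that you no longer need the
   archives:

   <pre class="prettyprint lang-sh tat-dataset">
   rm $IMAGENET_HOME/train/*.tar
   rm $IMAGENET_HOME/ILSVRC2012_img_train.tar
   </pre>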

   The "Training images (Task 3)" file is 728 MB and takes just a few minutes
   to download so you do not need to take precautions against losing the Cloud
   Shell connection.

   When you extract this file, it adds the individual training directories to
   the existing `$IMAGENET_HOME/train` directory.

   <pre class="prettyprint lang-sh tat-dataset">
   wget http://www.image-net.org/challenges/LSVRC/2012/dd31405981ef5f776aa17412e1f0c112/ILSVRC2012_img_train_t3.tar
   </pre>
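
   As with the larger training archive, extract it into the training directory.
   This is a sketch under the same path assumptions as above; the exact layout
   of the archive may differ:

   <pre class="prettyprint lang-sh tat-dataset">
   tar xf ILSVRC2012_img_train_t3.tar -C $IMAGENET_HOME/train
   </pre>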

   When downloading the "Validation images (all tasks)" file, your Cloud Shell may disconnect.
   You can use `nohup` or [screen](https://linuxize.com/post/how-to-use-linux-screen/) to
   prevent Cloud Shell from disconnecting.

   <pre class="prettyprint lang-sh tat-dataset">
   wget http://www.image-net.org/challenges/LSVRC/2012/dd31405981ef5f776aa17412e1f0c112/ILSVRC2012_img_val.tar
   </pre>

   This download takes about 30 minutes. When you extract this file, it places
   the individual validation files in the
   `$IMAGENET_HOME/validation` directory.
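
   A sketch of the corresponding extraction, under the same path assumptions as
   above:

   <pre class="prettyprint lang-sh tat-dataset">
   mkdir -p $IMAGENET_HOME/validation
   tar xf ILSVRC2012_img_val.tar -C $IMAGENET_HOME/validation
   </pre>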

   If you downloaded the validation files to your local machine, you need to
   copy the `$IMAGENET_HOME/validation` directory on your local machine to the
   `$IMAGENET_HOME/validation` directory on your Compute Engine VM. This copy
   operation takes about 30 minutes.

   Download the labels file.

   <pre class="prettyprint lang-sh tat-dataset">
   wget -O $IMAGENET_HOME/synset_labels.txt \
https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_2012_validation_synset_labels.txt
   </pre>

   If you downloaded the labels file to your local machine, you need to copy it
   from the `$IMAGENET_HOME` directory on your local machine to `$IMAGENET_HOME`
   on your Compute Engine VM. This copy operation takes a few seconds.

   The training subdirectory names (for example, n03062245) are "WordNet IDs"
   (wnid). The [ImageNet API](https://image-net.org/download-attributes.php)
   shows the mapping of WordNet IDs to their associated validation labels in the
   `synset_labels.txt` file. A synset in this context is a visually similar
   group of images.
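
   As an optional check (not part of the original steps), the labels file
   should contain one synset per validation image, 50,000 lines in total:

   <pre class="prettyprint lang-sh tat-dataset">
   wc -l $IMAGENET_HOME/synset_labels.txt
   </pre>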

### Process the ImageNet dataset and, optionally, upload to Cloud Storage

1. Download the `imagenet_to_gcs.py` script from GitHub:

   <pre class="prettyprint lang-sh tat-dataset">
   wget https://raw.githubusercontent.com/tensorflow/tpu/master/tools/datasets/imagenet_to_gcs.py
   </pre>

1. If you are uploading the dataset to Cloud Storage, specify the storage
   bucket location to upload the ImageNet dataset:

   <pre class="lang-sh prettyprint tat-client-exports">
   export STORAGE_BUCKET=gs://<var>bucket-name</var>
   </pre>
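
   The conversion command below also references `$PROJECT`. If you have been
   following this tutorial, you can reuse the project ID you exported earlier
   (an assumption; otherwise substitute your project ID directly):

   <pre class="lang-sh prettyprint tat-client-exports">
   export PROJECT=$PROJECT_ID
   </pre>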

1. If you are keeping the dataset on your local machine or VM, specify a data
   directory to hold the dataset:

   <pre class="lang-sh prettyprint tat-client-exports">
   <span class="no-select">(vm)$ </span>export DATA_DIR=$IMAGENET_HOME/<var>dataset-directory</var>
   </pre>

1. Run the script to pre-process the raw dataset as TFRecords and upload it to
   Cloud Storage using the following command:

   Note: If you don't want to upload to Cloud Storage, specify `--nogcs_upload`
   as another parameter and leave off the `--project` and `--gcs_output_path`
   parameters.

   <pre class="prettypring lang-sh tat-dataset">
     python3 imagenet_to_gcs.py \
      --project=$PROJECT \
      --gcs_output_path=$STORAGE_BUCKET  \
      --raw_data_dir=$IMAGENET_HOME \
      --local_scratch_dir=$IMAGENET_HOME/tf_records
   </pre>
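
   If you are not uploading to Cloud Storage, the invocation described in the
   note above would look something like this sketch (same paths assumed):

   <pre class="prettyprint lang-sh tat-dataset">
   python3 imagenet_to_gcs.py \
    --raw_data_dir=$IMAGENET_HOME \
    --local_scratch_dir=$IMAGENET_HOME/tf_records \
    --nogcs_upload
   </pre>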

Note: Downloading and preprocessing the data can take 10 or more hours,
depending on your network and computer speed. Do not interrupt the script.

The script generates a set of directories (for both training and validation) of
the form:

    ${DATA_DIR}/train-00000-of-01024
    ${DATA_DIR}/train-00001-of-01024
     ...
    ${DATA_DIR}/train-01023-of-01024

and

    ${DATA_DIR}/validation-00000-of-00128
    ${DATA_DIR}/validation-00001-of-00128
     ...
    ${DATA_DIR}/validation-00127-of-00128

After the data has been uploaded to your Cloud bucket, run your model and set
`--data_dir=${DATA_DIR}`.
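
If you pre-processed the dataset locally with `--nogcs_upload`, you still need to copy the generated TFRecords into your bucket before training. A hedged sketch, assuming the scratch directory used above and the `${STORAGE_BUCKET}/data` convention from earlier in this tutorial (the script's exact output layout may differ):

<pre class="prettyprint lang-sh tat-dataset">
gsutil -m cp -r $IMAGENET_HOME/tf_records/* ${STORAGE_BUCKET}/data/
</pre>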

## Clean up {: #clean-up }

To avoid incurring charges to your GCP account for the resources used
in this topic:

1. Disconnect from the Compute Engine VM:

    <pre class="lang-sh prettyprint tat-skip">
    <span class="no-select">(vm)$ </span>exit
    </pre>

     Your prompt should now be `username@projectname`, showing you are in the
     Cloud Shell.

1. In your Cloud Shell, run `ctpu delete` with the `--zone` flag you used when
   you set up the Cloud TPU to delete your Compute Engine VM and your
   Cloud TPU:

    <pre class="lang-sh prettyprint tat-resource-setup">
    <span class="no-select">$ </span>ctpu delete [optional: --zone]
    </pre>

    Important: If you set the TPU resources name when you ran `ctpu up`, you must
    specify that name with the `--name` flag when you run `ctpu delete` in
    order to shut down your TPU resources.

1. Run `ctpu status` to make sure you have no instances allocated to avoid
   unnecessary charges for TPU usage. The deletion might take several minutes.
   A response like the one below indicates there are no more allocated
   instances:

    <pre class="lang-sh prettyprint tat-skip">
    <span class="no-select">$ </span>ctpu status --zone=europe-west4-a
    </pre>
    <pre class="lang-sh prettyprint tat-skip">
    2018/04/28 16:16:23 WARNING: Setting zone to "--zone=europe-west4-a"
    No instances currently exist.
        Compute Engine VM:     --
        Cloud TPU:             --
    </pre>

1. Run `gsutil` as shown, replacing <var>bucket-name</var> with the name of the
   Cloud Storage bucket you created for this tutorial:

    <pre class="lang-sh prettyprint tat-resource-setup">
    <span class="no-select">$ </span>gsutil rm -r gs://<var>bucket-name</var>
    </pre>

Note: For free storage limits and other pricing information, see the
[Cloud Storage pricing guide](/storage/pricing).

## Inception v4

The Inception v4 model is a deep neural network model that uses Inception v3
building blocks to achieve higher accuracy than Inception v3. It is described in
the paper "Inception-v4, Inception-ResNet and the Impact of Residual Connections
on Learning" by Szegedy et al.

The Inception v4 model is pre-installed on your Compute Engine VM, in
the `/usr/share/tpu/models/experimental/inception/` directory.

In the following steps, a prefix of `(vm)$` means you should run the command on
your Compute Engine VM:

1. If you have TensorBoard running in your Cloud Shell tab, you need another tab
   to work in. Open another tab in your Cloud Shell, and use `ctpu` in the new
   shell to connect to your Compute Engine VM:

    <pre class="lang-sh prettyprint">
    <span class="no-select">$ </span>ctpu up --project=${PROJECT_ID} </pre>

1. Set up a `DATA_DIR` environment variable containing one of the following
   values:

    * If you are using the fake\_imagenet dataset:

        <pre class="prettyprint lang-sh">
        <span class="no-select">(vm)$ </span>export DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
        </pre>

    * If you have uploaded a set of training data to your Cloud Storage
      bucket:

        <pre class="prettyprint lang-sh">
        <span class="no-select">(vm)$ </span>export DATA_DIR=${STORAGE_BUCKET}/data
        </pre>

1. Run the Inception v4 model:

    <pre class="lang-sh prettyprint">
    <span class="no-select">(vm)$ </span>python /usr/share/tpu/models/experimental/inception/inception_v4.py \
        --tpu=$TPU_NAME \
        --learning_rate=0.36 \
        --train_steps=1000000 \
        --iterations=500 \
        --use_tpu=True \
        --use_data=real \
        --train_batch_size=256 \
        --mode=train_and_eval \
        --train_steps_per_eval=2000 \
        --data_dir=${DATA_DIR} \
        --model_dir=${STORAGE_BUCKET}/inception</pre>

    * `--tpu` specifies the name of the Cloud TPU. `ctpu`
      passes this name to the Compute Engine VM as an environment
      variable (`TPU_NAME`).
    * `--use_data` specifies which type of data the program must use during
      training, either fake or real. The default value is fake.
    * `--train_batch_size` sets the training batch size to 256. Because the
      Inception v4 model is larger than Inception v3, it must be run at a
      smaller batch size per TPU core.
    * `--data_dir` specifies the Cloud Storage path for training input.
      The application ignores this parameter when you're using fake\_imagenet
      data.
    * `--model_dir` specifies the directory where checkpoints and
      summaries are stored during model training. If the folder is missing, the
      program creates one. When using a Cloud TPU, the `model_dir`
      must be a Cloud Storage path (`gs://...`). You can reuse an existing
      folder to load current checkpoint data and to store additional
      checkpoints as long as the previous checkpoints were created using TPU of
      the same size and TensorFlow version.


## What's next {: #whats-next }

The TensorFlow Cloud TPU tutorials generally train the model using a sample dataset. The results of this training are not usable for inference. To use a model for inference, you can train the data on a publicly available dataset or your own data set. TensorFlow models trained on Cloud TPUs generally require datasets to be in [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) format.

You can use the [dataset conversion tool sample](https://cloud.google.com/tpu/docs/classification-data-conversion) to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) format yourself. For more information, see [TFRecord and tf.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord).

### Hyperparameter tuning

To improve the model's performance with your dataset, you can tune the model's hyperparameters. You can find information about hyperparameters common to all TPU supported models on [GitHub](https://github.com/tensorflow/tpu/tree/master/models/hyperparameters). Information about model-specific hyperparameters can be found in the [source code](https://github.com/tensorflow/tpu/tree/master/models/official) for each model. For more information on hyperparameter tuning, see [Overview of hyperparameter tuning](https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview), [Using the Hyperparameter tuning service](https://cloud.google.com/ai-platform/training/docs/using-hyperparameter-tuning), and [Tune hyperparameters](https://developers.google.com/machine-learning/guides/text-classification/step-5).

### Inference

Once you have trained your model you can use it for inference (also called prediction). [AI Platform](https://cloud.google.com/ai-platform/docs/technical-overview) is a cloud-based solution for developing, [training](https://cloud.google.com/ai-platform/training/docs), and [deploying](https://cloud.google.com/ai-platform/prediction/docs/deploying-models) machine learning models. Once a model is deployed, you can use the [AI Platform Prediction service](https://cloud.google.com/ai-platform/prediction/docs).

* Go in depth with an [advanced view](/tpu/docs/inception-v3-advanced) of Inception v3 on Cloud TPU.
* Learn more about [`ctpu`](https://github.com/tensorflow/tpu/tree/master/tools/ctpu), including how to install it on a local machine.
* Explore the [TPU tools in TensorBoard](/tpu/docs/cloud-tpu-tools){: track-type="gettingStarted" track-name="tutorialLink" track-metadata-position="nextSteps"}.