Downloading, preprocessing, and uploading the COCO dataset
COCO is a large-scale object detection, segmentation, and captioning dataset.
Machine learning models that use the COCO dataset include:
Mask-RCNN
Retinanet
ShapeMask
Before you can train a model on a Cloud TPU, you must prepare the training data.
This document describes how to prepare the COCO dataset for
models that run on Cloud TPU. The COCO dataset can only be prepared after you
have created a Compute Engine VM. The script used to prepare the data,
download_and_preprocess_coco.sh,
is installed on the VM and must be run on the VM.
After preparing the data by running the download_and_preprocess_coco.sh script, you can bring up the Cloud TPU and run the training.
Fully downloading, preprocessing, and uploading the COCO dataset to a Cloud Storage bucket takes approximately 2 hours.
In Cloud Shell, configure gcloud with your project ID.
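Replace project-id with your Google Cloud project ID:
export PROJECT_ID=project-id
gcloud config set project ${PROJECT_ID}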
Connect to the Compute Engine VM using SSH. When you connect to the VM, your shell prompt changes from
username@projectname to username@vm-name.
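For example, assuming the VM was created in zone us-central2-b:
$ gcloud compute ssh vm-name --zone=us-central2-b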
Set up two variables, one for the storage bucket you
created earlier and one for the directory that holds
the training data (DATA_DIR) on the storage bucket.
(vm)$ export STORAGE_BUCKET=gs://bucket-name
(vm)$ export DATA_DIR=${STORAGE_BUCKET}/coco
Install the packages needed to preprocess the data.
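The following commands install the system packages and Python libraries that the preprocessing script depends on:
(vm)$ sudo apt-get update && \
  sudo apt-get install python3-pip && \
  sudo apt-get install -y python3-tk && \
  pip3 install --user Cython matplotlib opencv-python-headless pyyaml Pillow numpy absl-py tensorflow && \
  pip3 install --user "git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI" && \
  pip3 install protobuf==3.19.0 tensorflow==2.11.0 numpy==1.26.4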
Run the download_and_preprocess_coco.sh script to convert
the COCO dataset into a set of TFRecord files (*.tfrecord) that the training
application expects.
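Clone the repository that contains the script and run it, passing the local directory where the converted data should be written (./data/dir/coco in this example):
(vm)$ git clone https://github.com/tensorflow/tpu.git
(vm)$ sudo -E bash tpu/tools/datasets/download_and_preprocess_coco.sh ./data/dir/coco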
This installs the required libraries and then runs the preprocessing script. It outputs *.tfrecord files in your local data directory.
The COCO download and conversion script takes approximately one hour to complete.
Copy the data to your Cloud Storage bucket.
After you convert the data into the TFRecord format, copy it from local storage
to your Cloud Storage bucket using the gcloud CLI. You must also
copy the annotation files. These files help validate the model's
performance.
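Assuming the script wrote its output to ./data/dir/coco, the following commands upload the TFRecord files and the annotation files to the DATA_DIR location set earlier:
(vm)$ gcloud storage cp ./data/dir/coco/*.tfrecord ${DATA_DIR}
(vm)$ gcloud storage cp ./data/dir/coco/raw-data/annotations/*.json ${DATA_DIR}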
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-09-04 UTC."],[],[],null,["# Downloading, preprocessing, and uploading the COCO dataset\n==========================================================\n\nCOCO is a large-scale object detection, segmentation, and captioning dataset.\nMachine learning models that use the COCO dataset include:\n\n- Mask-RCNN\n- Retinanet\n- ShapeMask\n\nBefore you can train a model on a Cloud TPU, you must prepare the training\ndata.\n\nThis document describes how to prepare the [COCO](http://cocodataset.org) dataset for\nmodels that run on Cloud TPU. The COCO dataset can only be prepared after you\nhave created a Compute Engine VM. The script used to prepare the data,\n`download_and_preprocess_coco.sh`,\nis installed on the VM and must be run on the VM.\n\nAfter preparing the data by running the `download_and_preprocess_coco.sh`\nscript, you can bring up the Cloud TPU and run the training.\n\nTo fully download and preprocess and upload the COCO dataset to a\nCloud Storage bucket takes approximately 2 hours.\n\n1. In your [Cloud Shell](https://console.cloud.google.com/), configure `gcloud` with your project\n ID.\n\n ```bash\n export PROJECT_ID=project-id\n gcloud config set project ${PROJECT_ID}\n ```\n2. In your [Cloud Shell](https://console.cloud.google.com/),\n create a Cloud Storage bucket using the following command:\n\n **Note:** In the following command, replace \u003cvar translate=\"no\"\u003ebucket-name\u003c/var\u003e with the name you want to assign to your bucket. \n\n ```bash\n gcloud storage buckets create gs://bucket-name --project=${PROJECT_ID} --location=us-central2\n ```\n3. Create a Compute Engine VM to download and preprocess the dataset. For more\n information, see\n [Create and start a Compute Engine instance](/compute/docs/instances/create-start-instance).\n\n ```bash\n $ gcloud compute instances create vm-name \\\n --zone=us-central2-b \\\n --image-family=ubuntu-2204-lts \\\n --image-project=ubuntu-os-cloud \\\n --machine-type=n1-standard-16 \\\n --boot-disk-size=300GB \\\n --scopes=https://www.googleapis.com/auth/cloud-platform\n ```\n4. Connect to the Compute Engine VM using SSH:\n\n ```bash\n $ gcloud compute ssh vm-name --zone=us-central2-b\n ```\n\n When you connect to the VM, your shell prompt changes from\n `username@projectname` to `username@vm-name`.\n5. Set up two variables, one for the storage bucket you\n created earlier and one for the directory that holds\n the training data (`DATA_DIR`) on the storage bucket.\n\n ```bash\n (vm)$ export STORAGE_BUCKET=gs://bucket-name\n ``` \n\n ```bash\n (vm)$ export DATA_DIR=${STORAGE_BUCKET}/coco\n ```\n6. 
Install the packages needed to pre-process the data.\n\n ```bash\n (vm)$ sudo apt-get update && \\\n sudo apt-get install python3-pip && \\\n sudo apt-get install -y python3-tk && \\\n pip3 install --user Cython matplotlib opencv-python-headless pyyaml Pillow numpy absl-py tensorflow && \\\n pip3 install --user \"git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI\" && \\\n pip3 install protobuf==3.19.0 tensorflow==2.11.0 numpy==1.26.4\n ```\n7. Run the `download_and_preprocess_coco.sh` script to convert\n the COCO dataset into a set of TFRecord files (`*.tfrecord`) that the training\n application expects.\n\n ```bash\n (vm)$ git clone https://github.com/tensorflow/tpu.git\n (vm)$ sudo -E bash tpu/tools/datasets/download_and_preprocess_coco.sh ./data/dir/coco\n ```\n\n This installs the required libraries and then runs the preprocessing\n script. It outputs `*.tfrecord` files in your local data directory.\n The COCO download and conversion script takes approximately one hour to complete.\n8. Copy the data to your Cloud Storage bucket.\n\n After you convert the data into the TFRecord format, copy the data from local storage\n to your Cloud Storage bucket using the gcloud CLI. You must\n also copy the annotation files. These files help validate the model's\n performance.\n\n\n ```bash\n (vm)$ gcloud storage cp ./data/dir/coco/*.tfrecord ${DATA_DIR}\n (vm)$ gcloud storage cp ./data/dir/coco/raw-data/annotations/*.json ${DATA_DIR}\n ```\n\n \u003cbr /\u003e\n\nClean up\n--------\n\nFollow these steps to clean up your Compute Engine and Cloud Storage resources.\n\n1. Disconnect from the Compute Engine VM:\n\n ```bash\n (vm)$ exit\n ```\n2. Delete your Compute Engine VM:\n\n ```bash\n $ gcloud compute instances delete vm-name \\\n --zone=us-central2-b\n ```\n3. Delete your Cloud Storage bucket and its contents:\n\n ```bash\n $ gcloud storage rm -r gs://bucket-name\n $ gcloud storage buckets delete gs://bucket-name\n ```"]]