Run a genomics analysis in a JupyterLab notebook on Dataproc

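This tutorial shows you how to run a single-cell genomics analysis using Dask, NVIDIA RAPIDS, and GPUs, all of which you can configure on Dataproc. You can configure Dataproc to run Dask either with its standalone scheduler or with YARN for resource management.

This tutorial configures Dataproc with a hosted JupyterLab instance to run a notebook that features a single-cell genomics analysis. Using a Jupyter notebook on Dataproc lets you combine the interactive capabilities of Jupyter with the workload scaling that Dataproc supports. With Dataproc, you can scale your workloads out from one machine to many machines and configure as many GPUs as you need.

This tutorial is intended for data scientists and researchers. It assumes that you have experience with Python and basic familiarity with the technologies used in this tutorial.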

Objectives

Create a Dataproc instance that is configured with GPUs, JupyterLab, and open source components.
Run a notebook on Dataproc.

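Costs

In this document, you use the following billable components of Google Cloud:

Dataproc
Cloud Storage
GPUs

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.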

Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project:
- Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Verify that billing is enabled for your Google Cloud project.
Enable the Dataproc API.
Roles required to enable APIs: To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission.

Prepare your environment

Select a location for your resources.
```
REGION=REGION
```

Create a Cloud Storage bucket.

```
gcloud storage buckets create gs://BUCKET --location=REGION
```

Copy the following initialization actions to your bucket.

```
SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-REGION
gcloud storage cp ${SCRIPT_BUCKET}/gpu/install_gpu_driver.sh gs://BUCKET/gpu/install_gpu_driver.sh
gcloud storage cp ${SCRIPT_BUCKET}/dask/dask.sh gs://BUCKET/dask/dask.sh
gcloud storage cp ${SCRIPT_BUCKET}/rapids/rapids.sh gs://BUCKET/rapids/rapids.sh
gcloud storage cp ${SCRIPT_BUCKET}/python/pip-install.sh gs://BUCKET/python/pip-install.sh
```
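The four copy commands follow one pattern, so they can be generated from a list of script paths. The following is a minimal sketch, not part of the original tutorial: the region `us-central1` and the bucket `gs://my-example-bucket` are hypothetical placeholders, and the functions only print text, so nothing is copied until you run the printed commands yourself.

```shell
# Hypothetical placeholder values; replace them with your own region and bucket.
REGION="us-central1"
BUCKET="gs://my-example-bucket"

SCRIPT_BUCKET="gs://goog-dataproc-initialization-actions-${REGION}"
SCRIPTS="gpu/install_gpu_driver.sh dask/dask.sh rapids/rapids.sh python/pip-install.sh"

# Print one copy command per script (dry run only; nothing is executed).
copy_commands() {
  for script in $SCRIPTS; do
    echo "gcloud storage cp ${SCRIPT_BUCKET}/${script} ${BUCKET}/${script}"
  done
}

# Join the destination paths with commas, which is the form that the
# --initialization-actions flag takes.
init_actions_list() {
  list=""
  for script in $SCRIPTS; do
    list="${list:+${list},}${BUCKET}/${script}"
  done
  echo "$list"
}

copy_commands
init_actions_list
```

The comma-separated value printed by `init_actions_list` can be passed to `--initialization-actions` when you create the cluster, which keeps the copy step and the cluster definition in agreement.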

Create a Dataproc cluster with JupyterLab and open source components

Create a Dataproc cluster.

```
gcloud dataproc clusters create CLUSTER_NAME \
    --region REGION \
    --image-version 2.0-ubuntu18 \
    --master-machine-type n1-standard-32 \
    --master-accelerator type=nvidia-tesla-t4,count=4 \
    --initialization-actions gs://BUCKET/gpu/install_gpu_driver.sh,gs://BUCKET/dask/dask.sh,gs://BUCKET/rapids/rapids.sh,gs://BUCKET/python/pip-install.sh \
    --initialization-action-timeout=60m \
    --metadata gpu-driver-provider=NVIDIA,dask-runtime=yarn,rapids-runtime=DASK,rapids-version=21.06,PIP_PACKAGES="scanpy==1.8.1,wget" \
    --optional-components JUPYTER \
    --enable-component-gateway \
    --single-node
```

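The cluster has the following properties:

--region: the region where the cluster is located.
--image-version: 2.0-ubuntu18, the cluster image version.
--master-machine-type: n1-standard-32, the machine type of the master node.
--master-accelerator: the type and number of GPUs on the master node, here four nvidia-tesla-t4 GPUs.
--initialization-actions: the Cloud Storage paths of the installation scripts that install GPU drivers, Dask, RAPIDS, and extra dependencies.
--initialization-action-timeout: the timeout for the initialization actions.
--metadata: passed to the initialization actions to configure the cluster with NVIDIA GPU drivers, Dask running on YARN (dask-runtime=yarn), and RAPIDS version 21.06.
--optional-components: configures the cluster with the Jupyter optional component.
--enable-component-gateway: allows access to the web UIs on the cluster.
--single-node: configures the cluster as a single node (no workers).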
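Because Dask can run with either the standalone scheduler or YARN, the `--metadata` value is the part of the command that changes between the two setups. The following sketch builds that value for a chosen runtime. It assumes that the Dask initialization action accepts `yarn` and `standalone` as values for `dask-runtime`, so check the README of the initialization action before relying on it. The function name `build_metadata` is hypothetical, and `PIP_PACKAGES` is left out for brevity.

```shell
# Build the --metadata value for a chosen Dask runtime.
# Assumption: the Dask initialization action accepts "yarn" or "standalone".
build_metadata() {
  runtime="$1"
  case "$runtime" in
    yarn|standalone) ;;
    *)
      echo "unsupported dask-runtime: $runtime" >&2
      return 1
      ;;
  esac
  echo "gpu-driver-provider=NVIDIA,dask-runtime=${runtime},rapids-runtime=DASK,rapids-version=21.06"
}

# Print the metadata string for each runtime.
build_metadata yarn
build_metadata standalone
```

Rejecting unknown runtime names in the helper surfaces a typo before the cluster is created, rather than partway through a long initialization action.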

Access the Jupyter notebook

In the Google Cloud console, open the Dataproc Clusters page.
Click your cluster, and then click the Web Interfaces tab.
Click JupyterLab.
Open a new terminal in JupyterLab.

Clone the clara-parabricks/rapids-single-cell-examples repository, and then check out the dataproc/multi-gpu branch.

```
git clone https://github.com/clara-parabricks/rapids-single-cell-examples.git
cd rapids-single-cell-examples
git checkout dataproc/multi-gpu
```

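In JupyterLab, go to the rapids-single-cell-examples/notebooks directory, and then open the 1M_brain_gpu_analysis_uvm.ipynb Jupyter notebook.
To clear all the outputs in the notebook, select Edit > Clear All Outputs.
Read the instructions in the cells of the notebook. The notebook uses Dask and RAPIDS on Dataproc to guide you through a single-cell RNA-seq workflow on 1 million cells, including processing and visualizing the data. To learn more, see "Accelerating Single Cell Genomic Analysis using RAPIDS".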

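Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.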

Delete the project

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete individual resources

Delete your Dataproc cluster.

```
gcloud dataproc clusters delete CLUSTER_NAME \
    --region=REGION
```

Delete the bucket:
```
gcloud storage buckets delete gs://BUCKET
```
Important: The bucket must be empty before you can delete it.
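Because the bucket must be empty first, cleanup is a two-step sequence: remove the objects, then delete the bucket. The following sketch prints both commands for a given bucket without running them. The bucket name `gs://my-example-bucket` and the function name `cleanup_bucket_commands` are hypothetical. The `**` wildcard in `gcloud storage rm` matches every object in the bucket, so review the printed commands before you run them.

```shell
# Hypothetical placeholder; replace it with the bucket you created.
BUCKET="gs://my-example-bucket"

# Print the commands that empty a bucket and then delete it (dry run only).
cleanup_bucket_commands() {
  bucket="$1"
  # Step 1: remove every object but keep the bucket itself.
  echo "gcloud storage rm ${bucket}/**"
  # Step 2: delete the now-empty bucket.
  echo "gcloud storage buckets delete ${bucket}"
}

cleanup_bucket_commands "$BUCKET"
```

When you run the first printed command, quote the URL (for example, `"gs://my-example-bucket/**"`) so that your shell does not try to expand the wildcard itself.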

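What's next

Learn more about Dataproc.
Explore reference architectures, diagrams, tutorials, and best practices. Take a look at our Cloud Architecture Center.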