Run a genomics analysis in a JupyterLab notebook on Dataproc
This tutorial shows you how to run a single-cell genomics analysis using Dask, NVIDIA RAPIDS, and GPUs, which you can configure on Dataproc. You can configure Dataproc to run Dask either with its standalone scheduler or with YARN for resource management.
This tutorial configures Dataproc with a hosted JupyterLab instance to run a notebook featuring a single-cell genomics analysis. Using a Jupyter Notebook on Dataproc lets you combine the interactive capabilities of Jupyter with the workload scaling that Dataproc enables. With Dataproc, you can scale out your workloads from one to many machines, which you can configure with as many GPUs as you need.
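Inside the notebook, that scale-out works by connecting a Dask client to the cluster's scheduler. The following is a minimal sketch, not part of the tutorial's notebook: the scheduler address is an assumption (with the standalone Dask runtime, the scheduler typically listens on port 8786 of the main node), and the actual address depends on how the Dask initialization action configures your cluster.

```
# Sketch: connect a Dask client from the JupyterLab notebook and fan out
# an array computation across the cluster's workers.
from dask.distributed import Client
import dask.array as da

client = Client("localhost:8786")  # hypothetical scheduler address

# A chunked array is processed in parallel across workers.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
mean = x.mean().compute()
```

Once connected, any Dask collection (arrays, dataframes, delayed tasks) is scheduled onto the cluster transparently.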
This tutorial is intended for data scientists and researchers. It assumes that you are experienced with Python and have basic knowledge of the following:

- Dataproc
- Dask
- RAPIDS
- Jupyter Notebooks

Objectives

- Create a Dataproc instance that is configured with GPUs, JupyterLab, and open source components.
- Run a notebook on Dataproc.

Costs

In this document, you use the following billable components of Google Cloud:

- Dataproc
- Cloud Storage
- GPUs
To generate a cost estimate based on your projected usage, use the pricing calculator.
New Google Cloud users might be eligible for a free trial.
When you finish the tasks that are described in this document, you can avoid
continued billing by deleting the resources that you created. For more information, see
Clean up.
Before you begin
1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
2. Verify that billing is enabled for your Google Cloud project.
3. Enable the Dataproc API:

```
gcloud services enable dataproc
```

Prepare your environment

1. Select a location for your resources:

```
REGION=REGION
```

2. Create a Cloud Storage bucket:

```
gcloud storage buckets create gs://BUCKET --location=REGION
```

3. Copy the following initialization actions to your bucket:

```
SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-REGION
gcloud storage cp ${SCRIPT_BUCKET}/gpu/install_gpu_driver.sh BUCKET/gpu/install_gpu_driver.sh
gcloud storage cp ${SCRIPT_BUCKET}/dask/dask.sh BUCKET/dask/dask.sh
gcloud storage cp ${SCRIPT_BUCKET}/rapids/rapids.sh BUCKET/rapids/rapids.sh
gcloud storage cp ${SCRIPT_BUCKET}/python/pip-install.sh BUCKET/python/pip-install.sh
```

Create a Dataproc cluster with JupyterLab and open source components

Create a Dataproc cluster:

```
gcloud dataproc clusters create CLUSTER_NAME \
    --region REGION \
    --image-version 2.0-ubuntu18 \
    --master-machine-type n1-standard-32 \
    --master-accelerator type=nvidia-tesla-t4,count=4 \
    --initialization-actions BUCKET/gpu/install_gpu_driver.sh,BUCKET/dask/dask.sh,BUCKET/rapids/rapids.sh,BUCKET/python/pip-install.sh \
    --initialization-action-timeout=60m \
    --metadata gpu-driver-provider=NVIDIA,dask-runtime=yarn,rapids-runtime=DASK,rapids-version=21.06,PIP_PACKAGES="scanpy==1.8.1,wget" \
    --optional-components JUPYTER \
    --enable-component-gateway \
    --single-node
```

The cluster has the following properties:
- `--region`: the region where your cluster is located.
- `--image-version`: `2.0-ubuntu18`, the cluster image version.
- `--master-machine-type`: `n1-standard-32`, the main machine type.
- `--master-accelerator`: the type and count of GPUs on the main node, four `nvidia-tesla-t4` GPUs.
- `--initialization-actions`: the Cloud Storage paths to the installation scripts that install GPU drivers, Dask, RAPIDS, and extra dependencies.
- `--initialization-action-timeout`: the timeout for the initialization actions.
- `--metadata`: passed to the initialization actions to configure the cluster with NVIDIA GPU drivers, the standalone scheduler for Dask, and RAPIDS version `21.06`.
- `--optional-components`: configures the cluster with the Jupyter optional component.
- `--enable-component-gateway`: allows access to web UIs on the cluster.
- `--single-node`: configures the cluster as a single node (no workers).
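Note that `--metadata` packs several key=value pairs into a single comma-separated string, and that the `PIP_PACKAGES` value itself contains a comma, so it must be quoted. The following is a standalone sketch for sanity-checking such a string before passing it to `gcloud`; `parse_metadata` is a hypothetical helper, not part of the gcloud CLI.

```python
import re

def parse_metadata(s: str) -> dict:
    """Parse a gcloud-style comma-separated key=value string into a dict.

    Hypothetical helper for sanity-checking --metadata values; quoted
    values (which may contain commas) are unwrapped.
    """
    # A value is either a double-quoted string or a run without commas.
    pairs = re.findall(r'([\w.-]+)=("[^"]*"|[^,]*)', s)
    return {key: value.strip('"') for key, value in pairs}

metadata = ('gpu-driver-provider=NVIDIA,dask-runtime=yarn,'
            'rapids-runtime=DASK,rapids-version=21.06,'
            'PIP_PACKAGES="scanpy==1.8.1,wget"')
parsed = parse_metadata(metadata)
```

If a key is missing or a quoted value was split unexpectedly, the resulting dict makes it obvious before you spend time creating the cluster.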
Access the Jupyter Notebook

1. In the Google Cloud console, open the Dataproc Clusters page.
2. Click your cluster, and then click the Web Interfaces tab.
3. Click JupyterLab.
4. Open a new terminal in JupyterLab.
5. Clone the clara-parabricks/rapids-single-cell-examples repository and check out the `dataproc/multi-gpu` branch:

```
git clone https://github.com/clara-parabricks/rapids-single-cell-examples.git
cd rapids-single-cell-examples
git checkout dataproc/multi-gpu
```

6. In JupyterLab, go to the rapids-single-cell-examples/notebooks directory and open the 1M_brain_gpu_analysis_uvm.ipynb Jupyter Notebook.
7. To clear all the outputs in the notebook, select Edit > Clear All Outputs.
Read the instructions in the cells of the notebook. The notebook uses Dask and RAPIDS on Dataproc to guide you through a single-cell RNA-seq workflow on 1 million cells, including processing and visualizing the data. To learn more, see Accelerating Single Cell Genomic Analysis using RAPIDS.
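The preprocessing that the notebook performs (scaling each cell's counts to a common total, then applying a log transform) can be illustrated in miniature. This is a pure-Python sketch of the standard normalize-then-log1p step; real single-cell pipelines run it on sparse GPU arrays with scanpy and RAPIDS, not Python lists.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell's counts to a common total, then apply log1p.

    Minimal sketch of typical single-cell RNA-seq preprocessing;
    `counts` is a list of per-cell gene-count lists.
    """
    normalized = []
    for cell in counts:
        total = sum(cell) or 1  # guard against empty cells
        normalized.append([math.log1p(c * target_sum / total) for c in cell])
    return normalized

# Three cells, four genes each.
counts = [[0, 1, 3, 0], [2, 2, 2, 2], [10, 0, 0, 0]]
result = normalize_log1p(counts, target_sum=4)
```

Normalizing to a shared target sum makes expression values comparable across cells with different sequencing depths, and the log transform tames the heavy-tailed count distribution before downstream clustering and visualization.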
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
In the Google Cloud console, go to the Manage resources page and delete the project, or delete it from the command line:

```
gcloud projects delete PROJECT_ID
```

Delete individual resources

1. Delete your Dataproc cluster:

```
gcloud dataproc clusters delete CLUSTER_NAME \
    --region=REGION
```

2. Delete the Cloud Storage bucket. The bucket must be empty before you can delete it:

```
gcloud storage buckets delete BUCKET_NAME
```

What's next

- Discover more about Dataproc.
- Explore reference architectures, diagrams, and best practices in the Cloud Architecture Center.
Last updated 2025-09-04 (UTC)