
Use the Cloud Storage connector with Apache Spark

This tutorial shows you how to run example code that uses the Cloud Storage connector with Apache Spark.

Objectives

Write a simple word-count Spark job in Java, Scala, or Python, then run the job on a Dataproc cluster.

Costs

In this document, you use the following billable components of Google Cloud:

Compute Engine
Dataproc
Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

Run the steps below to prepare to run the code in this tutorial.

Set up your project. If necessary, set up a project with the Dataproc, Compute Engine, and Cloud Storage APIs enabled and the Google Cloud CLI installed on your local machine.
1. Create a Cloud Storage bucket. You need a Cloud Storage bucket to hold the tutorial data. If you don't have one, create a bucket in your project.
  1. In the Google Cloud console, go to the Cloud Storage Buckets page.
  2. Click Create.
  3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
    1. In the Get started section, do the following:
      - Enter a globally unique name that meets the bucket naming requirements.
      - To add a bucket label, expand the Labels section, click Add label, and specify a key and a value for your label.
    2. In the Choose where to store your data section, do the following:
      1. Select a Location type.
      2. From the Location type drop-down menu, choose the location where your bucket's data is permanently stored.
        
        If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
      3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
        1. In the Bucket menu, select a bucket.
        2. In the Replication settings section, click Configure to configure settings for the replication job.
          The Configure cross-bucket replication pane appears.
        3. To filter objects to replicate by object name prefix, enter a prefix to include or exclude, then click Add a prefix.
        4. To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects use the destination bucket's storage class by default.
        5. Click Done.
    3. In the Choose how to store your data section, do the following:
      1. Select a default storage class for the bucket or Autoclass for automatic storage class management of your bucket's data.
      2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
        Note: You cannot enable hierarchical namespace in existing buckets.
    4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
      Note: You cannot change the Prevent public access setting if this setting is enforced by an organization policy.
    5. In the Choose how to protect object data section, do the following:
      - Select any of the options under Data protection that you want to set for your bucket.
        - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
        - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
        - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
          - To enable Object Retention Lock, click the Enable object retention checkbox.
          - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
      - To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
  4. Click Create.
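As an alternative to the console steps above, the bucket can also be created from the command line. The following is a minimal sketch, not the tutorial's own method: the helper names `is_valid_bucket_name` and `create_tutorial_bucket` are hypothetical, the name check is a simplified version of the bucket naming requirements, and it assumes the shell variables `PROJECT`, `BUCKET_NAME`, and `REGION` already hold your values (they are set in the next step). Using the cluster region as the bucket location is also an assumption.

```shell
# Simplified bucket-name check (a sketch; it does not cover every documented
# rule, such as names containing "google" or longer dotted names).
is_valid_bucket_name() {
  # Names cannot begin with the "goog" prefix.
  case "$1" in
    goog*) return 1 ;;
  esac
  # 3-63 characters; lowercase letters, digits, dashes, underscores, and dots;
  # must start and end with a letter or digit.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Hypothetical helper: create the bucket with default settings.
# Assumes PROJECT, BUCKET_NAME, and REGION are set in the current shell.
create_tutorial_bucket() {
  if ! is_valid_bucket_name "${BUCKET_NAME}"; then
    echo "Invalid bucket name: ${BUCKET_NAME}" >&2
    return 1
  fi
  gcloud storage buckets create "gs://${BUCKET_NAME}" \
      --project="${PROJECT}" \
      --location="${REGION}"
}
```

After setting the variables, you would run `create_tutorial_bucket`. The bucket is created with default storage class and access settings; use the console steps if you need the other options.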
2. Set local environment variables. On your local machine, set variables for your Google Cloud project ID and the name of the Cloud Storage bucket you will use for this tutorial. Also provide the name and region of an existing or new Dataproc cluster. You can create a cluster to use in this tutorial in the next step.
```
PROJECT=project-id
```
```
BUCKET_NAME=bucket-name
```
```
CLUSTER=cluster-name
```
```
REGION=cluster-region  # Example: "us-central1"
```
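Because the later commands expand these variables directly, an unset variable produces a confusing gcloud error. The sketch below shows one way to check them up front; the helper name `check_tutorial_vars` is hypothetical and not part of the tutorial.

```shell
# Hypothetical helper: report any tutorial variable that is unset or empty.
# Returns 0 when all four variables are set, 1 otherwise.
check_tutorial_vars() {
  missing=0
  for name in PROJECT BUCKET_NAME CLUSTER REGION; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "Variable ${name} is not set" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Run `check_tutorial_vars` in the same shell session where you set the variables, before you continue with the next step.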
3. Create a Dataproc cluster. Run the command below to create a single-node Dataproc cluster in the specified region.
```
gcloud dataproc clusters create ${CLUSTER} \
    --project=${PROJECT} \
    --region=${REGION} \
    --single-node
```
  The command above installs the default cluster image version. You can use the --image-version flag to select an image version for your cluster. Each image version installs specific versions of the Spark and Scala library components. If you prepare the Spark word-count job in Java or Scala, you reference the Spark and Scala versions installed on your cluster when you prepare the job package.
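The `--image-version` option described above can be sketched as follows. The helper names `create_single_node_cluster` and `show_cluster_image_version` are hypothetical, and both assume that `PROJECT`, `CLUSTER`, and `REGION` are set as in the earlier step.

```shell
# Hypothetical helper: create the single-node cluster, optionally pinning
# the image version.
# Usage: create_single_node_cluster [IMAGE_VERSION]
create_single_node_cluster() {
  image_version="${1:-}"
  # Rebuild the argument list with the flags used in this tutorial.
  set -- dataproc clusters create "${CLUSTER}" \
      --project="${PROJECT}" \
      --region="${REGION}" \
      --single-node
  if [ -n "${image_version}" ]; then
    set -- "$@" --image-version="${image_version}"
  fi
  gcloud "$@"
}

# Hypothetical helper: print the image version that an existing cluster runs.
show_cluster_image_version() {
  gcloud dataproc clusters describe "${CLUSTER}" \
      --project="${PROJECT}" \
      --region="${REGION}" \
      --format="value(config.softwareConfig.imageVersion)"
}
```

For example, `create_single_node_cluster 2.2-debian12` would request that image; the version string is only an example value, so check the list of supported Dataproc image versions before using it. Called without an argument, the helper runs the same command as the tutorial step.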
4. Copy public data to your Cloud Storage bucket. Copy a public Shakespeare text snippet into the input folder of your Cloud Storage bucket:
```
gcloud storage cp gs://pub/shakespeare/rose.txt \
    gs://${BUCKET_NAME}/input/rose.txt
```
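To confirm that the copy succeeded before you submit a job, you can list and print the object. This is a minimal sketch; the helper name `verify_input_copied` is hypothetical, and it assumes `BUCKET_NAME` is set as in the earlier step.

```shell
# Hypothetical helper: list the copied object, then print its contents.
# The listing fails with a non-zero exit code if the object is missing,
# in which case the contents are not printed.
verify_input_copied() {
  gcloud storage ls "gs://${BUCKET_NAME}/input/rose.txt" && \
  gcloud storage cat "gs://${BUCKET_NAME}/input/rose.txt"
}
```

Running `verify_input_copied` should print the object URL followed by the text that the word-count job will read.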
5. Set up a Java (Apache Maven), Scala (SBT), or Python development environment.
  You can use Cloud Shell, which includes the tools used in this tutorial, including Apache Maven, Python, and the Google Cloud CLI.
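If you work on your own machine rather than in Cloud Shell, a quick check like the one below can confirm that the tools for your chosen language are on your PATH. This is a sketch only; the helper name `check_tools` is hypothetical, and the command names passed to it (`gcloud`, `mvn`, `sbt`, `python3`) are the usual executables for the Google Cloud CLI, Apache Maven, SBT, and Python.

```shell
# Hypothetical helper: report which of the named commands are on the PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  rc=0
  for tool in "$@"; do
    if command -v "${tool}" >/dev/null 2>&1; then
      echo "found: ${tool}"
    else
      echo "missing: ${tool}" >&2
      rc=1
    fi
  done
  return "${rc}"
}
```

For a Java setup you might run `check_tools gcloud mvn`, for Scala `check_tools gcloud sbt`, and for Python `check_tools gcloud python3`.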
