Back up a Dataproc Metastore service

This page explains how to create a backup of a Dataproc Metastore service.

A backup takes a snapshot of your service, saving its current configuration settings and all stored metadata.

After you create a backup, you can use the Restore from a backup feature to populate a new Dataproc Metastore service with the data saved in the snapshot.

Before you begin

Required roles

To get the permissions that you need to back up a Dataproc Metastore service, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access.

These predefined roles contain the permissions required to back up a Dataproc Metastore service. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to back up a Dataproc Metastore service:

  • To back up a metadata service: metastore.backups.create
  • To use the Cloud Storage object:
    • orgpolicy.policy.get
    • resourcemanager.projects.get
    • resourcemanager.projects.list
    • storage.managedFolders.create
    • storage.managedFolders.delete
    • storage.managedFolders.get
    • storage.managedFolders.list
    • storage.multipartUploads.*
    • storage.objects.create
    • storage.objects.delete
    • storage.objects.get
    • storage.objects.list
    • storage.objects.restore
    • storage.objects.update

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about specific Dataproc Metastore roles and permissions, see Dataproc Metastore IAM overview.

Backup considerations

Before running a backup operation, note the following considerations:

  • For each Dataproc Metastore service, you can create and store up to seven backups at a time. If you try to exceed seven backups, the backup process fails. If you want to create another backup, you must first manually delete one of your stored backup files.
  • While a backup operation is running, you can't update your Dataproc Metastore service. For example, you can't change configuration settings. However, you can still use your service for normal operations, such as accessing metadata from attached Dataproc or self-managed clusters.
  • You can create scheduled backups that run at various cron intervals, such as every day.
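The seven-backup limit can be checked before you start a new backup. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The function names count_backups and has_backup_capacity are hypothetical helpers, not part of the gcloud CLI, and SERVICE and LOCATION are placeholders.

```shell
# Hypothetical helpers; not part of the gcloud CLI.

MAX_BACKUPS=7  # per-service limit described in the considerations above

# Prints the number of backups currently stored for a service.
# Usage: count_backups SERVICE LOCATION
count_backups() {
  gcloud metastore services backups list \
      --service="$1" \
      --location="$2" \
      --format="value(name)" | grep -c .
}

# Succeeds (exit status 0) if a service that holds COUNT backups
# has room for one more.
# Usage: has_backup_capacity COUNT
has_backup_capacity() {
  [ "$1" -lt "$MAX_BACKUPS" ]
}
```

For example, has_backup_capacity "$(count_backups SERVICE LOCATION)" || echo "Delete a backup first" prints a reminder when the service already holds seven backups.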

Create a backup

To back up a Dataproc Metastore service, complete the steps in one of the following tabs:

Console

  1. In the Google Cloud console, open the Dataproc Metastore page:

    Open Dataproc Metastore

  2. On the Dataproc Metastore page, click the name of the service you want to back up.

    The Service detail page opens.

    Figure 1. The Dataproc Metastore service detail page
  3. At the top of the page, click Backup.

    The Backup page opens.

  4. Enter the Backup name.

  5. Optional: Enter a Description of the backup.

  6. To start the backup operation, click Backup.

    Return to the Dataproc Metastore page, and verify that your service was successfully backed up.

    When the backup completes, Dataproc Metastore automatically returns to the active state regardless of whether or not the backup succeeded.

gcloud CLI

  1. To back up a Dataproc Metastore service, run the following gcloud metastore services backups create command:

    gcloud metastore services backups create BACKUP \
        --location=LOCATION \
        --service=SERVICE \
        --description=DESCRIPTION
    

    Replace the following:

    • BACKUP: the ID or fully qualified identifier for the backup.
    • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
    • SERVICE: the name of your Dataproc Metastore service.
    • DESCRIPTION: a description of your backup.
  2. Verify that your service was successfully backed up.

    When the backup completes, Dataproc Metastore automatically returns to the active state regardless of whether or not the backup succeeded.
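Because the service returns to the active state whether or not the backup succeeded, it can help to check the state of the backup itself. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The helper names backup_state and backup_verdict are hypothetical, and the state names (ACTIVE, FAILED, CREATING) are assumed from the backup resource in the Dataproc Metastore API.

```shell
# Hypothetical helpers; not part of the gcloud CLI.

# Prints the state reported for a single backup.
# Usage: backup_state BACKUP SERVICE LOCATION
backup_state() {
  gcloud metastore services backups describe "$1" \
      --service="$2" \
      --location="$3" \
      --format="value(state)"
}

# Maps a state string to a short verdict.
# The state names handled here are assumptions; adjust as needed.
# Usage: backup_verdict STATE
backup_verdict() {
  case "$1" in
    ACTIVE)   echo "succeeded" ;;
    FAILED)   echo "failed" ;;
    CREATING) echo "in progress" ;;
    *)        echo "unknown" ;;
  esac
}
```

For example, backup_verdict "$(backup_state BACKUP SERVICE LOCATION)" prints a one-word summary for the named backup.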

REST

Follow the API instructions to back up metadata from a service by using the APIs Explorer.

When the backup completes, Dataproc Metastore automatically returns to the active state regardless of whether or not the backup succeeded.

View backup history

To view the backup history of a Dataproc Metastore service in the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, open the Dataproc Metastore page.
  2. In the navigation bar, click Backup/Restore.

    Your backup history appears in a table under Backups.

    The history displays up to the seven most recent backups.

    Deleting a Dataproc Metastore service also deletes all associated backup history.

Delete a backup

To delete a Dataproc Metastore backup in the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, open the Dataproc Metastore page.
  2. In the navigation bar, click Backup/Restore.
  3. Find the backup you want to delete and click the settings button.
  4. Click Delete.
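Backups can also be deleted from the command line, which is useful when a service has reached the seven-backup limit. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The helper names oldest_backup and delete_backup are hypothetical, and the sketch assumes that creation timestamps share one format so that they sort correctly as text.

```shell
# Hypothetical helpers; not part of the gcloud CLI.

# Reads "CREATE_TIME<TAB>NAME" lines on stdin and prints the NAME
# that has the earliest CREATE_TIME.
oldest_backup() {
  LC_ALL=C sort | head -n 1 | cut -f 2
}

# Deletes one backup without an interactive confirmation prompt.
# Usage: delete_backup BACKUP SERVICE LOCATION
delete_backup() {
  gcloud metastore services backups delete "$1" \
      --service="$2" \
      --location="$3" \
      --quiet
}
```

One way to feed oldest_backup is to pipe in the output of gcloud metastore services backups list with --format="value(createTime,name)", which prints tab-separated values. Review the selected name before passing it to delete_backup, because deletion can't be undone.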

Schedule a backup

Backups can be scheduled to run at user-specified cron job intervals, including daily, weekly, or monthly. A cron schedule uses the unix-cron string format (* * * * *), a set of five fields on one line that indicates when the job should run.

For example, you can set a custom interval to create a backup every week, such as creating a backup every Wednesday at 2:00 PM PST.
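As an illustration, the Wednesday 2:00 PM schedule corresponds to the unix-cron string 0 14 * * 3, assuming the schedule's time zone is set to a Pacific time zone rather than the UTC default. The following sketch labels the five fields of a cron string. The function name describe_cron is a hypothetical helper that only splits the string and doesn't validate it.

```shell
# Hypothetical helper: labels the five fields of a unix-cron string.
# It splits on whitespace only and performs no validation.
# Usage: describe_cron "CRON_STRING"
describe_cron() {
  printf '%s\n' "$1" | {
    read -r minute hour day_of_month month day_of_week
    printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s\n' \
        "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
  }
}

# Every Wednesday (day_of_week 3) at 14:00 in the configured time zone.
describe_cron "0 14 * * 3"
# prints: minute=0 hour=14 day_of_month=* month=* day_of_week=3
```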

Scheduled backup considerations

  • Scheduled backups need to specify a backup location, which must be a Cloud Storage path.
  • Scheduled backups are always created in the Avro file format.
  • Scheduled backups are configured in the UTC timezone by default. You can change the timezone when creating the backup for the first time.
  • Scheduled backups can be set to run at daily, weekly, or monthly intervals.

Create a scheduled backup

Backup schedules can be set when you first create your service or added later when you update your service.

To create a Dataproc Metastore 2 service with a scheduled backup, complete the steps in one of the following tabs:

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the Dataproc Metastore page, click the Create button.

    The Create service page opens.

  3. Select Dataproc Metastore 2.

  4. Under Scheduled Backups, set the toggle to Enable.

  5. Under Location, select the Cloud Storage location where you want to store your scheduled backup.

  6. Optional: Under Schedule, select the following:

    1. For Repeats, select the recurrence, such as Daily or Weekly.
    2. For At time, select the time of recurrence, such as 12:00 AM.
    3. For Timezone, select the appropriate time zone, such as UTC-8.
  7. For the remaining service configuration options, use the provided defaults.

  8. Click Submit.

gcloud CLI

  1. To create a Dataproc Metastore service with a scheduled backup, run the following gcloud metastore services create command:

    gcloud metastore services create SERVICE \
       --location=LOCATION \
       --enable-scheduled-backup \
       --scheduled-backup-cron=SCHEDULED_BACKUP_CRON \
       --scheduled-backup-location=SCHEDULED_BACKUP_LOCATION
    

    Replace the following:

    • SERVICE: the ID or fully qualified identifier for your Dataproc Metastore service.
    • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
    • SCHEDULED_BACKUP_CRON: the frequency of your backup, specified in the cron time format. For example, a cron value of 0 0 * * * schedules a daily backup.
    • SCHEDULED_BACKUP_LOCATION: the Cloud Storage location of your backup. For example: gs://my-bucket/path/to/location.

    Alternatively, you can schedule a backup by storing the preceding values in a configuration file:

    gcloud metastore services create SERVICE \
       --location=LOCATION \
       --scheduled-backup-configs-from-file=SCHEDULED_BACKUP_CONFIGS_FROM_FILE
    

    Replace the following:

    • SCHEDULED_BACKUP_CONFIGS_FROM_FILE: a path to a JSON file containing the backup configuration values enabled, cron_schedule, time_zone, and backup_location.

    The following example shows a backup configuration file that enables scheduled backups, sets the backup schedule to every day at midnight, specifies the time zone as PST, and defines the backup location as a Cloud Storage path. You can choose time zones from the list of common tz database time zones.

    {
    "enabled": true,
    "cron_schedule": "0 0 * * *",
    "time_zone": "PST",
    "backup_location": "gs://my-bucket/path/to/location"
    }
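A configuration file like the preceding one can also be generated from a script. The following is a minimal sketch. The function name write_backup_config and the file name backup-config.json are hypothetical, and the values shown (a weekly Wednesday schedule, the America/Los_Angeles time zone, and the example bucket path) are placeholders.

```shell
# Hypothetical helper: writes a scheduled-backup configuration file.
# Values are inserted as-is, so they must not contain double quotes.
# Usage: write_backup_config FILE CRON_SCHEDULE TIME_ZONE BACKUP_LOCATION
write_backup_config() {
  cat > "$1" <<EOF
{
  "enabled": true,
  "cron_schedule": "$2",
  "time_zone": "$3",
  "backup_location": "$4"
}
EOF
}
```

For example, write_backup_config backup-config.json "0 14 * * 3" "America/Los_Angeles" "gs://my-bucket/path/to/location" writes a file that you can pass to the --scheduled-backup-configs-from-file flag.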
    

REST

Follow the API instructions to create a scheduled backup by using the APIs Explorer.

Update a scheduled backup

To update a Dataproc Metastore 2 service configured with a scheduled backup, complete the steps in one of the following tabs:

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. On the Dataproc Metastore page, click the name of the service you want to schedule a backup for.

  3. Under Scheduled Backups, set the toggle to Enabled.

  4. Under Location, select the Cloud Storage location where you want to store your scheduled backup.

  5. Optional: Under Schedule, select values for the following fields:

    1. For Repeats, select the recurrence, such as Daily or Weekly.
    2. For At time, select the time of recurrence, such as 12:00 AM.
    3. For Timezone, select the appropriate time zone, such as UTC-8.

gcloud CLI

  1. To update the scheduled backup of a Dataproc Metastore service, run the following gcloud metastore services update command:

    gcloud metastore services update SERVICE \
       --location=LOCATION \
       --enable-scheduled-backup \
       --scheduled-backup-cron=SCHEDULED_BACKUP_CRON \
       --scheduled-backup-location=SCHEDULED_BACKUP_LOCATION
    

    Replace the following:

    • SERVICE: the ID or fully qualified identifier for your Dataproc Metastore service.
    • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
    • SCHEDULED_BACKUP_CRON: the frequency of your backup, specified in the cron time format. For example, a cron value of 0 0 * * * schedules a daily backup.
    • SCHEDULED_BACKUP_LOCATION: the Cloud Storage location of your scheduled backup. For example: gs://my-bucket/path/to/location.

    You can also update a scheduled backup using the preceding values stored in a configuration file:

    gcloud metastore services update SERVICE \
       --location=LOCATION \
       --scheduled-backup-configs-from-file=SCHEDULED_BACKUP_CONFIGS_FROM_FILE
    

    Replace the following:

    • SCHEDULED_BACKUP_CONFIGS_FROM_FILE: a path to a JSON file containing the backup configuration.

    The following example shows a backup config file that disables a scheduled backup.

    {
    "enabled": false
    }
    

REST

Follow the API instructions to update a scheduled backup by using the APIs Explorer.

View a scheduled backup

To view a Dataproc Metastore 2 service configured with a scheduled backup, complete the steps in one of the following tabs:

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Backup.

    The Backup page opens and displays your scheduled backups. Note that the backups are actually stored in the Cloud Storage bucket that you provided in the scheduled backup configuration.

gcloud CLI

  1. Run the following gsutil ls command:

    gsutil ls gs://BUCKET_NAME/SERVICE/LOCATION
    

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that stores the scheduled backup that you want to view.
    • SERVICE: the name of your Dataproc Metastore service.
    • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

REST

Follow the API instructions to view a scheduled backup by using the APIs Explorer.

Troubleshoot common issues

What's next