Working with notebooks

This guide explains different tasks associated with Cloud Datalab notebooks.

Source control

When you run datalab create VM-instance-name for the first time, it adds a datalab-notebooks Cloud Source Repository to the project (referred to below as the "cloud remote repo"). This is a remote repository for the /content/datalab/notebooks git repository created in the docker container running in your Cloud Datalab VM instance (referred to below as the "Cloud Datalab VM repo"). You can browse the cloud remote repo from the Google Cloud console Repositories page.

You can use git or ungit to manage the notebooks in the Cloud Datalab VM repo.

Using ungit in your browser

The Cloud Datalab container includes ungit, a web-based git client, which allows you to make commits to the Cloud Datalab VM repo and push notebooks to the cloud remote repo from the Cloud Datalab browser UI.

To open ungit on the Cloud Datalab /content/datalab/notebooks repo, select the repository icon in the top-right section of the Google Cloud Datalab menu bar.

A browser window opens on the Cloud Datalab VM repo.

Adding a notebook to the cloud remote repo

  1. Navigate to the /datalab/notebooks folder in your Cloud Datalab notebook browser window.

  2. Open a new notebook from the /datalab/notebooks folder by selecting the "+ Notebook" icon.

    1. Add one or more cells to the notebook.
    2. Rename the notebook by clicking "Untitled Notebook" in the menu bar and changing the name to "New Notebook".
    3. Select Notebook→Save and Checkpoint (Ctrl-s), or wait for the notebook to be autosaved.
  3. Return to the Cloud Datalab notebook browser window, and click on the ungit icon to open an ungit browser page (see Using ungit in your browser). After providing a commit title, New Notebook.ipynb is ready to be committed to the Cloud Datalab VM repo.

  4. After committing the notebook, push it to the datalab-notebooks cloud remote repo from the ungit browser page.
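Behind the ungit UI, steps 3 and 4 are ordinary git operations. The sketch below replays the commit against a throwaway local repository standing in for /content/datalab/notebooks; the identity, remote name, and branch are placeholder assumptions, so the push is left commented out.

```shell
# Hedged sketch of the git operations behind ungit's commit step, run
# against a throwaway repo standing in for /content/datalab/notebooks.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "datalab@example.com"  # placeholder identity
git config user.name "Datalab User"          # placeholder identity
echo '{"cells": []}' > "New Notebook.ipynb"  # stand-in for the saved notebook
git add "New Notebook.ipynb"
git commit -q -m "Add New Notebook"          # the commit ungit records
git log --oneline                            # shows the new commit
# To push to the cloud remote repo (remote/branch names are assumptions,
# verify with `git remote -v` and `git branch`):
# git push origin master
```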

Using git from the command line

Instead of using ungit from the Cloud Datalab UI for source control (see Using ungit in your browser), you can SSH into the Cloud Datalab VM and run git from a terminal running in your VM or from Cloud Shell. Here are the steps:

  1. SSH to the Cloud Datalab VM using the gcloud CLI or the console:

    gcloud command

    Run the following command after replacing project-id, zone, and instance-name:

    gcloud compute --project project-id ssh \
      --zone zone instance-name

    Console

    Go to the console VM instances section, expand the SSH menu at the right of your Cloud Datalab VM row, and then select View gcloud command. The gcloud command line window opens, showing the gcloud SSH command that you can copy and paste to run in a local terminal.
  2. After SSHing into the Cloud Datalab VM, run the sudo docker ps command to list the containers running in the VM. Copy the Container ID associated with the /datalab/ command and the datalab_datalab-server name.
    sudo docker ps
    CONTAINER ID   ...   COMMAND       ...   NAMES
    b228e3392374   ...   "/datalab/"   ...   datalab_datalab-server-...
  3. Open an interactive shell session inside the container, using the Container ID from the previous step.
    sudo docker exec -it container-id bash
  4. Change to the /content/datalab/notebooks directory in the container.
    cd /content/datalab/notebooks
    This is the root directory of the Cloud Datalab VM git repo, from which you can issue git commands. For example:
    git status
    On branch master
    nothing to commit, working directory clean
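Taken together, steps 1 through 4 compose into a single command sequence. The sketch below only prints the composed commands with placeholder values (the project, zone, instance name, and container ID are all assumptions), so you can review them and paste them into a real session.

```shell
# Prints the command sequence from steps 1-4 with placeholder values;
# nothing here contacts a real VM.
PROJECT="my-project"      # placeholder project-id
ZONE="us-central1-a"      # placeholder zone
INSTANCE="mydatalab"      # placeholder instance-name
CONTAINER="b228e3392374"  # placeholder Container ID from `sudo docker ps`

printf 'gcloud compute --project %s ssh --zone %s %s\n' "$PROJECT" "$ZONE" "$INSTANCE"
printf 'sudo docker exec -it %s bash\n' "$CONTAINER"
printf 'cd /content/datalab/notebooks && git status\n'
```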

Copying notebooks from the Cloud Datalab VM

You can copy files from your Cloud Datalab VM instance using the gcloud compute scp command. For example, to copy the contents of your Cloud Datalab VM's datalab/notebooks directory to an instance-name-notebooks directory on your local machine, run the following command after replacing instance-name with the name of your Cloud Datalab VM (the instance-name-notebooks directory will be created if it doesn't exist).

gcloud compute scp --recurse \
  datalab@instance-name:/mnt/disks/datalab-pd/content/datalab/notebooks \
  instance-name-notebooks
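Filled in with a concrete (placeholder) instance name, the command reads as in the sketch below; it only prints the command rather than contacting any VM.

```shell
# Prints a filled-in scp command; "mydatalab" is a placeholder instance name.
INSTANCE="mydatalab"
printf 'gcloud compute scp --recurse datalab@%s:/mnt/disks/datalab-pd/content/datalab/notebooks %s-notebooks\n' \
  "$INSTANCE" "$INSTANCE"
```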
Cloud Datalab backup

Cloud Datalab instances periodically back up user content to a Google Cloud Storage bucket in the user's project to prevent accidental loss in case of a failed or deleted VM disk. By default, a Cloud Datalab instance stores all of the user's content on an attached disk, and the backup utility works on this disk's root. The backup job runs every ten minutes: it creates a zip file of the entire disk, compares it to the last backup zip file, and uploads the new zip to Google Cloud Storage if the two differ and sufficient time has elapsed since the last backup.

Cloud Datalab retains the last 10 hourly backups, 7 daily backups, and 20 weekly backups, and deletes older backup files to preserve space. Backups can be turned off by passing the --no-backups flag when creating a Cloud Datalab instance with the datalab create command.

Each backup file is named using the VM instance zone, instance name, notebook backup directory path within the instance, timestamp, and a tag that is either hourly, daily, or weekly. By default, Cloud Datalab will try to create the backup path $. If this path cannot be created or the user does not have sufficient permissions, the creation of a $project_id/datalab_backups path is attempted. If that attempt also fails, backups to Google Cloud Storage will fail.
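The naming scheme just described can be sketched by assembling a backup object path from its parts. Every value below is a placeholder (they mirror the sample path shown in the Restoring backups section); the exact bucket and path layout in your project may differ.

```shell
# Composes a backup object path from the parts named above; all placeholders.
BUCKET="myproject"       # assumed backup bucket in the user's project
ZONE="us-central1-b"     # VM instance zone
INSTANCE="datalab0125"   # VM instance name
BACKUP_DIR="content"     # backup directory path within the instance
TAG="daily"              # one of: hourly, daily, weekly
STAMP="20170127102921"   # timestamp
BACKUP_PATH=$(printf 'gs://%s/datalab-backups/%s/%s/%s/%s-%s' \
  "$BUCKET" "$ZONE" "$INSTANCE" "$BACKUP_DIR" "$TAG" "$STAMP")
echo "$BACKUP_PATH"
```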

Restoring backups

To restore a backup, the user selects the backup file from Google Cloud Storage by examining the VM zone, VM name, notebook directory, and the human-readable timestamp.

Sample backup file path: gs://myproject/datalab-backups/us-central1-b/datalab0125/content/daily-20170127102921

This sample backup was created for the VM datalab0125 in zone us-central1-b, and it contains all content under the /content directory. It was created as a daily backup point on 01/27/2017 at 10:29:21.

A backup zip file can be downloaded from the browser or by using the gsutil tool that is installed as part of the Google Cloud CLI installation.

  • To use the browser, navigate to Google Cloud console, then select Storage from the left navigation sidebar. Browse to the Cloud Datalab backup bucket, then select and download the zip file to disk.

  • To use gsutil to download the backup file, run gsutil cp gs://backup_path destination_path. For example, to download and extract the sample zip file discussed above:

    gsutil cp gs://myproject/datalab-backups/us-central1-b/datalab0125/content/daily-20170127102921 /tmp/
    unzip -q /tmp/daily-20170127102921 -d /tmp/restore_location/
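The restore mechanics can be rehearsed without a real backup: the sketch below fabricates a stand-in zip with the same content/ layout and extracts it, assuming (as the sample above suggests) that backup zips preserve paths under content. python3 -m zipfile is used purely so the sketch is self-contained; the equivalent CLI step is unzip -q backup.zip -d restore_location/.

```shell
# Rehearses the restore step locally with a stand-in backup zip.
set -e
work=$(mktemp -d)
mkdir -p "$work/content/datalab/notebooks"
echo '{"cells": []}' > "$work/content/datalab/notebooks/New Notebook.ipynb"
# Fabricate the "backup" zip (stand-in for a downloaded backup file).
(cd "$work" && python3 -m zipfile -c backup.zip content)
# Extract it, as `unzip -q ... -d ...` would.
python3 -m zipfile -e "$work/backup.zip" "$work/restore_location"
ls "$work/restore_location/content/datalab/notebooks"
```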

Working with data

Cloud Datalab can access data located in any of the following places:

  • Google Cloud Storage: files and directories in Cloud Storage can be programmatically accessed using the APIs (see the /datalab/docs/tutorials/Storage/Storage APIs.ipynb notebook tutorial)

  • BigQuery: tables and views can be queried using SQL and the datalab.bigquery APIs (see the /datalab/docs/tutorials/BigQuery/BigQuery APIs.ipynb notebook tutorial)

  • Local file system on the persistent disk: you can create or copy files to the file system on the persistent disk attached to your Cloud Datalab VM.

If your data is in a different location, on-premises or in another cloud, you can transfer it to Cloud Storage using the gsutil tool or the Cloud Storage Transfer Service.
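As a sketch of the gsutil route, a one-off recursive copy or an incremental sync might look like the following; the bucket and local path are placeholder assumptions, and the commands are printed rather than executed here.

```shell
# Prints example transfer commands; the bucket and local path are placeholders.
SRC="/local/data"          # placeholder on-premises path
DST="gs://my-bucket/data"  # placeholder Cloud Storage destination
printf 'gsutil -m cp -r %s %s\n' "$SRC" "$DST"     # one-off recursive copy
printf 'gsutil -m rsync -r %s %s\n' "$SRC" "$DST"  # incremental sync
```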