This page describes the key concepts and components of Cloud Datalab. You can find additional overview documentation in the datalab/docs/notebooks/intro notebook directory.
Cloud Datalab and notebooks
Cloud Datalab is packaged as a container and runs in a VM (virtual machine) instance. The quickstart explains how to create the VM, run the container in that VM, and establish a connection from your browser to the Cloud Datalab container, which allows you to open existing Cloud Datalab notebooks and create new ones. Read through the introductory notebooks in the /docs/intro directory to get a sense of how a notebook is organized and executed.
Cloud Datalab notebooks can be stored in Google Cloud Source Repository, a git repository. This git repository is cloned onto a persistent disk attached to the VM. The clone forms your workspace, where you can add, remove, and modify files. To share your work with other users of the repository, commit your changes and use the git client to push them from this local workspace to the repository. Notebooks are automatically saved to the persistent disk periodically, and you can save them whenever you wish. Note that if you delete the persistent disk, any notebooks that have not been explicitly pushed to the git repository may be lost. Therefore, we strongly recommend that you do NOT delete the persistent disk.
When you open a notebook, a backend “kernel” process is launched to manage the variables defined during the session and execute your notebook code. When the executed code accesses Google Cloud services such as BigQuery or Google Machine Learning Engine, it uses the service account available in the VM. Hence, the service account must be authorized to access the data or request the service. To display the cloud project and service account names, click the user icon in the top-right corner of the Cloud Datalab notebook or notebook listing page in your browser (you may need to resize the browser window). The VM used for running Cloud Datalab is a shared resource accessible to all the members of the associated cloud project. Therefore, using an individual's personal cloud credentials to access data is strongly discouraged.
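As a complement to the user icon, the service account attached to the VM can also be looked up from notebook code. The following is a minimal sketch, assuming the notebook runs on a Compute Engine VM where the metadata server is reachable at metadata.google.internal. The function names are illustrative and are not part of Cloud Datalab.

```python
import urllib.request

# Compute Engine metadata server path for the VM's default service account.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_request():
    """Build the metadata request; the Metadata-Flavor header is required."""
    return urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )


def fetch_service_account_email(timeout=5):
    """Return the email of the service account the VM runs as.

    This only works on a Compute Engine VM; elsewhere the host name
    does not resolve and urlopen raises an error.
    """
    with urllib.request.urlopen(build_request(), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling fetch_service_account_email() in a notebook cell should return the account that needs to be authorized for the data or service your code accesses.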
As you execute code in the notebook, the state of the process executing the code changes. If you assign or reassign a variable, its value is used for subsequent computations as a side effect. Each running notebook is shown as a session in Cloud Datalab. You can click on the sessions icon on the Cloud Datalab notebook listing page to list and stop sessions. While a session is running, the underlying process consumes memory resources. If you stop a session, the underlying process goes away along with its in-memory state, and memory used by the session is freed. Results saved in the notebook remain in persistent format on the disk.
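The way session state carries over between cells can be illustrated with a small simulation. In this sketch a plain dictionary stands in for the kernel's in-memory state and run_cell is a hypothetical helper; neither reflects how the kernel is actually implemented.

```python
# A dict stands in for the kernel's in-memory state (illustration only).
session_state = {}


def run_cell(source, state):
    """Execute one cell's source code against the shared session state."""
    exec(source, state)


run_cell("rate = 0.5", session_state)
run_cell("total = 200 * rate", session_state)
first_total = session_state["total"]

# Reassigning a variable affects later computations, but values computed
# earlier stay as they were until the cell that produced them is rerun.
run_cell("rate = 0.25", session_state)
stale_total = session_state["total"]
run_cell("total = 200 * rate", session_state)
fresh_total = session_state["total"]

# Stopping the session discards the in-memory state.
session_state = {}
rate_survives = "rate" in session_state
```

Here stale_total still holds the value computed with the old rate, which is why rerunning dependent cells after a reassignment matters.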
Cloud Datalab Usage Scenarios
Cloud Datalab is an interactive data analysis and machine learning environment designed for Google Cloud Platform. You can use it to explore, analyze, transform, and visualize your data interactively and to build machine learning models from your data. In the Cloud Datalab /docs folder, you will find a number of tutorials and samples that illustrate some of the tasks you can perform. Cloud Datalab includes a set of commonly used open source Python libraries for data analysis, visualization, and machine learning. It also adds libraries for accessing key Google Cloud Platform services, such as Google BigQuery, Google Machine Learning Engine, Google Dataflow, and Google Cloud Storage. See Included Libraries for more information.
For Python library information, see the pydatalab Reference Documentation.
Here are a few ideas to get you started:
- Write a few SQL queries to explore the data in BigQuery. Put the results in a pandas DataFrame and visualize them as a histogram or a line chart.
- Read data from a CSV file in Google Cloud Storage and put it in a DataFrame to compute statistical measures such as mean, standard deviation, and quantiles using Python.
- Try a TensorFlow or scikit-learn model to predict results or classify data.
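The second idea can be sketched as follows. To keep the example self-contained, a small in-memory CSV with made-up values stands in for a file read from Google Cloud Storage; the column names and data are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file stored in Cloud Storage.
csv_text = """city,temperature
Oslo,4.0
Lima,19.0
Cairo,22.0
Delhi,25.0
Quito,15.0
"""

# Load the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))
temperatures = df["temperature"]

mean_temperature = temperatures.mean()
# pandas computes the sample standard deviation (ddof=1) by default.
std_temperature = temperatures.std()
# Quartiles, using pandas' default linear interpolation.
quartiles = temperatures.quantile([0.25, 0.5, 0.75])

print(mean_temperature)
print(std_temperature)
print(quartiles)
```

From here, df.describe() gives a similar summary in a single call, and temperatures.plot(kind="hist") draws a histogram when matplotlib is available.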
Included Libraries
The following libraries are included with Cloud Datalab and available to you in notebooks (the library list and version information are subject to change).
Installed with Conda:
- crcmod at version 1.7
- dask at version 0.17.1
- dill at version 0.2.6
- future at version 0.16.0
- futures at version 3.2.0
- google-api-python-client at version 1.6.2
- httplib2 at version 0.10.3
- h5py at version 2.7.1
- ipykernel at version 4.8.2
- ipywidgets at version 7.2.1
- jinja2 at version 2.8
- jsonschema at version 2.6.0
- matplotlib at version 2.1.2
- mock at version 2.0.0
- nltk at version 3.2.1
- numpy at version 1.14.0
- oauth2client at version 2.2.0
- pandas-gbq at version 0.3.0
- pandas at version 0.22.0
- pandocfilters at version 1.4.2
- pillow at version 5.0.0
- pip at version 18.1
- plotly at version 1.12.5
- psutil at version 4.3.0
- pygments at version 2.1.3
- python-dateutil at version 2.5.0
- python-snappy at version 0.5.1
- pytz at version 2018.4
- pyzmq at version 17.1.0
- requests at version 2.18.4
- scikit-image at version 0.13.0
- scikit-learn at version 0.19.1
- scipy at version 1.0.0
- seaborn at version 0.7.0
- six at version 1.11.0
- statsmodels at version 0.8.0
- sympy at version 0.7.6.1
- tornado at version 4.5.1
- widgetsnbextension at version 3.2.1
- xgboost at version 0.6a2
Installed with pip:
- apache-airflow at version 1.9.0
- apache-beam[gcp] at version 2.7.0
- bs4 at version 0.0.1
- ggplot at version 0.6.8
- google-cloud-monitoring at version 0.28.0
- lime at version 0.1.1.23
- protobuf at version 3.5.2
- tensorflow at version 1.8.0
apache-beam[gcp] is only installed for Python 2 kernels, while notebook is only installed for Python 3 kernels.
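Because the library list and versions change over time, it can help to check what is installed in the kernel you are actually running. The following is a minimal sketch, assuming a Python 3.8 or later kernel where importlib.metadata is available (on older kernels, pkg_resources.get_distribution offers an equivalent lookup). The installed_version helper and the package selection are illustrative.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Illustrative selection; any distribution name from the lists above works.
packages_to_check = ["pandas", "numpy", "tensorflow"]
versions = {name: installed_version(name) for name in packages_to_check}

for name, version in sorted(versions.items()):
    print(name, version or "not installed")
```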