This page describes key concepts and component details for Cloud Datalab. You can find additional overview documentation in the datalab/docs/notebooks/intro notebook directory.
Cloud Datalab and notebooks
Cloud Datalab is packaged as a container and runs in a VM (virtual machine) instance. The quickstart explains how to create the VM, run the container in that VM, and establish a connection from your browser to the Cloud Datalab container, which allows you to open existing Cloud Datalab notebooks and create new ones. Read through the introductory notebooks in the /docs/intro directory to get a sense of how a notebook is organized and executed.
Cloud Datalab notebooks can be stored in Google Cloud Source Repository, a git repository. This git repository is cloned onto the persistent disk attached to the VM. This clone forms your workspace, where you can add, remove, and modify files. To share your work with other users of the repository, commit your changes and use the git client to push them from this local workspace to the repository. Notebooks are automatically saved to the persistent disk periodically, and you can also save them whenever you wish. Note that if you delete the persistent disk, any notebooks that have not been explicitly pushed to the git repository may be lost. Therefore, we strongly recommend that you do NOT delete the persistent disk.
When you open a notebook, a backend “kernel” process is launched to manage the variables defined during the session and execute your notebook code. When the executed code accesses Google Cloud services such as BigQuery or Google Machine Learning Engine, it uses the service account available in the VM. Hence, the service account must be authorized to access the data or request the service. To display the cloud project and service account names, click the user icon in the top-right corner of the Cloud Datalab notebook or notebook listing page in your browser. The VM used for running Cloud Datalab is a shared resource accessible to all the members of the associated cloud project. Therefore, using an individual's personal cloud credentials to access data is strongly discouraged.
As you execute code in the notebook, the state of the process executing the code changes. If you assign or reassign a variable, its value is used for subsequent computations as a side effect. Each running notebook is shown as a session in Cloud Datalab. You can click on the sessions icon on the Cloud Datalab notebook listing page to list and stop sessions. While a session is running, the underlying process consumes memory resources. If you stop a session, the underlying process goes away along with its in-memory state, and memory used by the session is freed. Results saved in the notebook remain in persistent format on the disk.
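The way session state carries over between cells can be sketched in plain Python. This is only an illustration, not how Cloud Datalab implements its kernel: the `kernel_namespace` dictionary and the `run_cell` helper are hypothetical names that stand in for the kernel process and for executing one notebook cell.

```python
# A stand-in for the kernel: one namespace shared by every "cell" in a session.
# (Hypothetical model for illustration; the real kernel is a separate process.)
kernel_namespace = {}


def run_cell(source):
    """Execute a cell's source code in the shared session namespace."""
    exec(source, kernel_namespace)


run_cell("scale = 2")
run_cell("total = 100 * scale")        # uses the value assigned by the earlier cell
run_cell("scale = 3")                  # reassignment changes the session state
run_cell("total_after = 100 * scale")  # later cells see the new value

first_total = kernel_namespace["total"]
second_total = kernel_namespace["total_after"]

# Stopping a session discards its in-memory state; only what was saved
# in the notebook file on disk would remain.
kernel_namespace.clear()
state_after_stop = dict(kernel_namespace)
```

Note that `total` is not recomputed when `scale` is reassigned. Only cells executed after the reassignment pick up the new value, which is why the order in which cells are run matters.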
Cloud Datalab Usage Scenarios
Cloud Datalab is an interactive data analysis and machine learning environment designed for Google Cloud Platform. You can use it to explore, analyze, transform, and visualize your data interactively and to build machine learning models from your data. In the Cloud Datalab /docs folder, you will find a number of tutorials and samples that illustrate some of the tasks you can perform. Cloud Datalab includes a set of commonly used open source Python libraries used for data analysis, visualization, and machine learning. It also adds libraries for accessing key Google Cloud Platform services, such as Google BigQuery, Google Machine Learning Engine, Google Dataflow, and Google Cloud Storage. See Included Libraries for more information.
Here are a few ideas to get you started:
- Write a few SQL queries to explore the data in BigQuery. Put the results in a DataFrame and visualize them as a histogram or a line chart.
- Read data from a CSV file in Google Cloud Storage and put it in a DataFrame to compute statistical measures such as mean, standard deviation, and quantiles using Python.
- Try a TensorFlow or scikit-learn model to predict results or classify data.
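As a rough sketch of the second idea, the following computes the mean, standard deviation, and quartiles of one CSV column. It makes several assumptions: the `CSV_TEXT` sample data is made up and stands in for a file read from Google Cloud Storage, the `summarize` helper is a hypothetical name, and only the Python standard library is used so that the example runs anywhere (`statistics.quantiles` requires Python 3.8 or later). In a notebook you would more likely load the file into a pandas DataFrame with `pandas.read_csv` and call `DataFrame.describe()` to get similar measures.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a CSV file stored in Cloud Storage.
CSV_TEXT = """city,temperature
A,10
B,20
C,30
D,40
"""


def summarize(csv_text, column):
    """Return basic statistical measures for one numeric CSV column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),                 # sample standard deviation
        "median": statistics.median(values),
        "quartiles": statistics.quantiles(values, n=4),    # three cut points
    }


summary = summarize(CSV_TEXT, "temperature")
print(summary)
```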
Included Libraries
The following is a list of libraries included with and available to you in Cloud Datalab notebooks (the library list and version information are subject to change):
- argparse at version 1.2.1
- bs4 at version 0.0.1
- crcmod at version 1.7
- future at version 0.16.0
- futures at version 3.0.5
- ggplot at version 0.6.8
- google-api-python-client at version 1.5.1
- google-cloud at version 0.19.0
- httplib2 at version 0.9.2
- ipykernel at version 4.4.1
- ipywidgets at version 5.2.2
- jinja2 at version 2.8
- jsonschema at version 2.6.0
- matplotlib at version 1.5.3
- mock at version 2.0.0
- nltk at version 3.2.1
- notebook at version 4.2.3
- numpy at version 1.11.2
- oauth2client at version 2.2.0
- pandas at version 0.19.1
- pandas-profiling at version at least 1.0.0a2
- pandocfilters at version 1.3.0
- pillow at version 3.4.1
- plotly at version 1.12.5
- psutil at version 4.3.0
- pygments at version 2.1.3
- python-dateutil at version 2.5.0
- pytz at version 2016.7
- PyYAML at version 3.11
- pyzmq at version 16.0.2
- requests at version 2.9.1
- scikit-learn at version 0.17.1
- scipy at version 0.18.0
- seaborn at version 0.7.0
- six at version 1.10.0
- statsmodels at version 0.6.1
- sympy at version 0.7.6.1
- tornado at version 4.4.2
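Because the list and versions can change, it may help to check what is actually installed in the environment where a notebook runs. The following is a minimal sketch that assumes a Python 3.8 or later interpreter, where the standard library module `importlib.metadata` is available; older environments would need a different approach. The `installed_version` and `report` helpers and the deliberately nonexistent package name are hypothetical and used only for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def report(package_names):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in package_names}


# The last name is intentionally not a real package, to show the None case.
versions = report(["numpy", "pandas", "no-such-package-for-this-example"])
for name, version in versions.items():
    print(name, version if version is not None else "not installed")
```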