Running RStudio® Server on a Cloud Dataproc Cluster

This tutorial shows you how to run RStudio Server on a Cloud Dataproc cluster and access the RStudio web user interface (UI) from your local machine.

This tutorial assumes that you are familiar with the R language and the RStudio web UI, and that you have some basic understanding of using Secure Shell (SSH) tunnels, Apache Spark, and Apache Hadoop running on Cloud Dataproc.

Objectives

This tutorial walks you through the following procedures:

  • Connect R through Apache Spark to Apache Hadoop YARN running on a Cloud Dataproc cluster.
  • Connect your browser through an SSH tunnel to access the RStudio, Spark, and YARN UIs.
  • Run an example query on Cloud Dataproc using RStudio.

Costs

This tutorial uses the following billable components of Google Cloud Platform:

  • Cloud Dataproc
  • Cloud Storage

You can use the pricing calculator to generate a cost estimate based on your projected usage. New GCP users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  4. Enable the Cloud Dataproc and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. See Cleaning up for more information.

Creating a Cloud Dataproc cluster

  1. In the GCP Console, go to the Cloud Dataproc Clusters page:

    GO TO THE CLUSTERS PAGE

  2. Click Create Cluster.

  3. Name your cluster, and click Create.

For this tutorial, the default cluster sizes are adequate. Note the zone that you created the cluster in, because you will need that information in later steps.
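
If you prefer the command line, you can create an equivalent cluster with the gcloud tool. The following is a minimal sketch that accepts the default cluster sizes; rstudio-cluster is only an example name, and the bracketed placeholders match those used later in this tutorial:

gcloud dataproc clusters create rstudio-cluster \
    --project=[PROJECT_ID] \
    --zone=[CLUSTER_ZONE]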

Installing RStudio Server and its dependencies on the master node

Linux or macOS

  1. On your local machine, connect through SSH to the master node of your Cloud Dataproc cluster:

    gcloud compute ssh \
        --zone=[CLUSTER_ZONE] \
        --project=[PROJECT_ID] \
        [CLUSTER_NAME]-m
    

    Where:

    • [CLUSTER_ZONE] is the zone where your cluster was created.
    • [PROJECT_ID] is the ID of your project.
    • [CLUSTER_NAME] is the name of your cluster.
    • [CLUSTER_NAME]-m is the master node name of the cluster.
  2. On the master node, install the required packages and dependencies:

    sudo apt-get update
    sudo apt-get install -y \
        r-base r-base-dev \
        libcurl4-openssl-dev libssl-dev libxml2-dev
    
  3. Follow the instructions on the RStudio website to download and install the latest RStudio Server version for 64-bit Debian Linux.

Windows

  1. On your local machine, connect through SSH to the master node of your Cloud Dataproc cluster:

    gcloud compute ssh ^
        --zone=[CLUSTER_ZONE] ^
        --project=[PROJECT_ID] ^
        [CLUSTER_NAME]-m
    

    Where:

    • [CLUSTER_ZONE] is the zone where your cluster was created.
    • [PROJECT_ID] is the ID of your project.
    • [CLUSTER_NAME] is the name of your cluster.
    • [CLUSTER_NAME]-m is the master node name of the cluster.
  2. On the master node, install the required packages and dependencies:

    sudo apt-get update
    sudo apt-get install -y \
        r-base r-base-dev \
        libcurl4-openssl-dev libssl-dev libxml2-dev
    
  3. Follow the instructions on the RStudio website to download and install the latest RStudio Server version for 64-bit Debian Linux.
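
Whichever platform you connect from, the RStudio Server installation on the master node typically follows the pattern below. This is a sketch rather than exact instructions: [VERSION] stands for the current release shown on the RStudio download page, and gdebi is one common way to install a .deb package together with its dependencies:

sudo apt-get install -y gdebi-core
wget https://download2.rstudio.org/rstudio-server-[VERSION]-amd64.deb
sudo gdebi rstudio-server-[VERSION]-amd64.deb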

Creating a user account on the master node

To create a user account to log in to the RStudio UI, follow these steps.

  1. Create a new user account, replacing [USER_NAME] with the new username:

    sudo adduser [USER_NAME]
  2. When you are prompted, enter a password for the new user.

    You can create multiple user accounts on the master node to give users their own RStudio environment. For each user that you create, follow the sparklyr and Spark installation steps.

Connecting to the RStudio web UI

RStudio Server runs on the Cloud Dataproc master node and is accessible only from the GCP internal network. To access the server, you need a network path between your local machine and the master node on the GCP internal network.

You can connect by port forwarding through an SSH tunnel, which is more secure than opening a firewall port to the master node. The SSH tunnel encrypts your connection to the web UI even though RStudio Server itself serves unencrypted HTTP.

There are two options for port forwarding: dynamic port forwarding using SOCKS, or TCP port forwarding.

Using SOCKS, you can view all internal web interfaces that are running on the Cloud Dataproc master node; however, you need to use a custom browser configuration to redirect all browser traffic over the SOCKS proxy.

TCP port forwarding does not require a custom browser configuration, but you can only view the RStudio web interface.

Connect through an SSH SOCKS tunnel

To create an SSH SOCKS tunnel and connect by using a specially configured browser profile, follow the steps in Connecting to the web interfaces.
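
For reference, the tunnel typically looks like the following sketch. Port 1080 is an arbitrary local port, and the Chrome invocation is only one example of a specially configured browser (the binary path varies by platform):

gcloud compute ssh \
    --zone=[CLUSTER_ZONE] \
    --project=[PROJECT_ID] \
    [CLUSTER_NAME]-m -- \
    -D 1080 -N

The -D option requests dynamic (SOCKS) port forwarding, and -N tells ssh not to run a remote command. Then launch a browser that routes its traffic through the proxy, for example:

/usr/bin/google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir=/tmp/[CLUSTER_NAME]-m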

After you connect, use the following URLs to access the web interfaces.

  • To load the RStudio web UI, connect your specially configured browser to http://[CLUSTER_NAME]-m:8787. Then log in by using the username and password that you created.

  • To load the YARN resource manager web UI, connect your specially configured browser to http://[CLUSTER_NAME]-m:8088.

  • To load the HDFS NameNode web UI, connect your specially configured browser to http://[CLUSTER_NAME]-m:9870.

Connect through SSH port forwarding

Linux or macOS

  1. On your local machine, connect to the Cloud Dataproc master node:

    gcloud compute ssh \
        --zone=[CLUSTER_ZONE] \
        --project=[PROJECT_ID] \
        [CLUSTER_NAME]-m -- \
        -L 8787:localhost:8787
    

    The -- parameter separates arguments for the gcloud command from arguments that are sent to the ssh command. The -L option sets up TCP port forwarding from port 8787 on the local machine to port 8787 on the cluster master node where RStudio Server is listening.

  2. To load the RStudio web UI, connect your browser to http://localhost:8787.

  3. Log in by using the username and password that you created.

Windows

  1. On your local machine, connect to the Cloud Dataproc master node:

    gcloud compute ssh ^
        --zone=[CLUSTER_ZONE] ^
        --project=[PROJECT_ID] ^
        [CLUSTER_NAME]-m -- ^
        -L 8787:localhost:8787
    

    The -- parameter separates arguments for the gcloud command from arguments that are sent to the ssh command. The -L option sets up TCP port forwarding from port 8787 on the local machine to port 8787 on the cluster master node where RStudio Server is listening.

  2. To load the RStudio web UI, connect your browser to http://localhost:8787.

  3. Log in by using the username and password that you created.

Installing the sparklyr package and Spark

To install the sparklyr package and Spark, in the RStudio R console, run the following commands:

install.packages("sparklyr")
sparklyr::spark_install()

These commands download, compile, and install the required R packages and a compatible Spark instance. Each command takes several minutes to complete.
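
By default, spark_install() downloads a recent Spark release. If you want to pin the version, for example to match the Spark version that your cluster runs, sparklyr can list the versions it knows how to install; the version number below is only an example:

# List the Spark versions that sparklyr can download and install.
sparklyr::spark_available_versions()

# Install a specific version instead of the default (example value).
sparklyr::spark_install(version = "2.3")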

Connecting R to Spark on YARN

Each time you restart an R session, follow these steps:

  1. Load the libraries and set up the necessary environment variables:

    library(sparklyr)
    library(dplyr)
    spark_home_set()
    Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf')
    Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf')
    
  2. Connect to Spark on YARN, using the default settings:

    sc <- spark_connect(master = "yarn-client")

    The sc object references your Spark connection, which you can use to manage data and execute queries in R.

    If the command succeeds, skip ahead to Checking the status of the Spark connection.

    If the command fails with an error message starting with:

    Error in force(code) :
    Failed during initialize_connection: java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

    This error indicates an incompatibility between the version of YARN on the cluster and the version of Spark that RStudio uses. You can avoid the incompatibility by disabling the YARN timeline service.

  3. In the RStudio web UI, navigate to Tools > Shell to open a new Terminal tab.

  4. In the Terminal tab, enter the following command to disable the service causing the incompatibility.

    echo "spark.hadoop.yarn.timeline-service.enabled false" \
        >> $SPARK_HOME/conf/spark-defaults.conf
  5. Close the Terminal tab, and in the menu navigate to Session > Restart R.

Now repeat steps 1 and 2; the connection to Spark should succeed.

Checking the status of the Spark connection

The sc object created earlier references your Spark connection. To confirm that the R session is connected, run the following command:

spark_connection_is_open(sc)

If your connection is established, the command returns the following:

[1] TRUE

You can tune the connection parameters by using a configuration object that can be passed to spark_connect().
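
For example, here is a minimal sketch of such a configuration object; the executor memory and core values are placeholders that you should adjust for your cluster size:

library(sparklyr)

# Build a configuration object and override selected Spark settings.
conf <- spark_config()
conf$spark.executor.memory <- "2g"   # example value
conf$spark.executor.cores  <- 2      # example value

# Pass the configuration when connecting to Spark on YARN.
sc <- spark_connect(master = "yarn-client", config = conf)

# Close the connection when you are finished.
spark_disconnect(sc)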

For more details on sparklyr connection parameters and on tuning Spark on YARN, see the sparklyr documentation and the Spark documentation.

Optional: Verifying your installation

To verify that everything is working, you can load a table onto the Cloud Dataproc cluster and perform a query.

  1. In the R console, install the example dataset, which lists all flights that departed New York City in 2013, and copy it into Spark:

    install.packages("nycflights13")
    flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
    
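    You can also confirm from the R console that the table is registered with the Spark connection; src_tbls() from dplyr lists the registered tables, and "flights" should appear in the output:

    # List the tables registered in the Spark connection.
    dplyr::src_tbls(sc)
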
  2. If you are not using SOCKS port forwarding, skip to step 3. Otherwise, use the Spark UI to verify that the table was created.

    1. In the browser that you configured, load the YARN resource manager:

      http://[CLUSTER_NAME]-m:8088

      In the application list, a row for the sparklyr application appears in the table.

    2. In the Tracking UI column, on the right side of the table, click the ApplicationMaster link to access the Spark UI.

      In the Jobs tab of the Spark UI, you will see entries for the jobs that copied the data to Spark. In the Storage tab, you will see an entry for In-memory table 'flights'.

  3. In the R console, run the following query:

    flights_tbl %>%
      select(carrier, dep_delay) %>%
      group_by(carrier) %>%
      summarize(count = n(), mean_dep_delay = mean(dep_delay)) %>%
      arrange(desc(mean_dep_delay))
    

    This query computes the number of flights and the mean departure delay for each airline, sorted in descending order of mean delay, and produces the following result:

    # Source:     lazy query [?? x 3]
    # Database:   spark_connection
    # Ordered by: desc(mean_dep_delay)
       carrier  count mean_dep_delay
       <chr>    <dbl>          <dbl>
     1 F9        685.           20.2
     2 EV      54173.           20.0
     3 YV        601.           19.0
     4 FL       3260.           18.7
     5 WN      12275.           17.7
    ...
    

If you go back to the Jobs tab in the Spark UI, you can see the jobs that are used to execute this query. For longer-running jobs, you can use this tab to monitor progress.
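
Because sparklyr also provides a DBI backend for the connection, you can express the same aggregation as a raw Spark SQL query. The following equivalent sketch assumes the "flights" table created above:

library(DBI)

# Run the same aggregation as Spark SQL against the "flights" table.
dbGetQuery(sc, "
  SELECT carrier,
         COUNT(*)       AS count,
         AVG(dep_delay) AS mean_dep_delay
  FROM flights
  GROUP BY carrier
  ORDER BY mean_dep_delay DESC
  LIMIT 5")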

Acknowledgments

Thanks to Mango Solutions for their assistance in preparing certain technical content for this article.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  • Delete the Cloud Dataproc cluster.
  • If you have no other Cloud Dataproc clusters in the same region, you also need to delete the Cloud Storage bucket that was automatically created for the region.
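
If you prefer the command line to the console steps that follow, the cleanup looks like this sketch; [REGION] is the region of your cluster, and [BUCKET_NAME] is the staging bucket name described below:

gcloud dataproc clusters delete [CLUSTER_NAME] \
    --project=[PROJECT_ID] \
    --region=[REGION]

gsutil rm -r gs://[BUCKET_NAME]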

Delete the cluster

  1. In the GCP Console, go to the Cloud Dataproc Clusters page:

    GO TO THE CLOUD DATAPROC CLUSTERS PAGE

  2. In the cluster list, find the row for the Cloud Dataproc cluster that you created, and in the Cloud Storage staging bucket column, make a note of the bucket name, which begins with the word dataproc.

  3. Select the checkbox next to the name of your cluster, and click Delete.

  4. When you are prompted to delete the cluster, confirm the deletion.

Delete the bucket

  1. To delete the Cloud Storage bucket, go to the Cloud Storage Browser:

    GO TO THE CLOUD STORAGE BROWSER

  2. Find the bucket that is associated with the Cloud Dataproc cluster that you just deleted.

  3. Select the checkbox next to the bucket name, and click Delete.

  4. When you are prompted to delete the storage bucket, confirm the deletion.

What's next

  • For other ways of interacting with Cloud Dataproc, see Samples and Tutorials.
  • Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.