This tutorial shows you how to run RStudio Server on a Dataproc cluster and access the RStudio web user interface (UI) from your local machine.
This tutorial assumes that you are familiar with the R language and the RStudio web UI, and that you have some basic understanding of using Secure Shell (SSH) tunnels, Apache Spark, and Apache Hadoop running on Dataproc.
Objectives
This tutorial walks you through the following procedures:
- Connect R through Apache Spark to Apache Hadoop YARN running on a Dataproc cluster.
- Connect your browser through an SSH tunnel to access the RStudio, Spark, and YARN UIs.
- Run an example query on Dataproc using RStudio.
Costs
This tutorial uses the following billable components of Google Cloud:
- Dataproc
- Cloud Storage
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Dataproc and Cloud Storage APIs.
- Install and initialize the Cloud SDK.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. See Cleaning up for more information.
Creating a Dataproc cluster
In the Cloud Console, go to the Dataproc Clusters page.
Click Create Cluster.
Name your cluster, and click Create.
For this tutorial, the default cluster sizes are adequate. Note the zone that you created the cluster in, because you will need that information in later steps.
Installing RStudio Server and its dependencies on the controller (master) node
Linux or macOS
On your local machine, connect through SSH to the controller (master) node of your Dataproc cluster:
gcloud compute ssh \
    --zone=CLUSTER_ZONE \
    --project=PROJECT_ID \
    CLUSTER_NAME-m
Where:
- CLUSTER_ZONE is the zone where your cluster was created.
- PROJECT_ID is the ID of your project.
- CLUSTER_NAME is the name of your cluster.
- CLUSTER_NAME-m is the controller node name of the cluster.
On the controller node, install the required packages and dependencies:
sudo apt-get update
sudo apt-get install -y \
    r-base r-base-dev \
    libcurl4-openssl-dev libssl-dev libxml2-dev
Follow the instructions on the RStudio website to download and install the latest RStudio Server version for 64-bit Debian Linux.
Windows
On your local machine, connect through SSH to the controller node of your Dataproc cluster:
gcloud compute ssh ^
    --zone=CLUSTER_ZONE ^
    --project=PROJECT_ID ^
    CLUSTER_NAME-m
Where:
- CLUSTER_ZONE is the zone where your cluster was created.
- PROJECT_ID is the ID of your project.
- CLUSTER_NAME is the name of your cluster.
- CLUSTER_NAME-m is the controller node name of the cluster.
On the controller node, install the required packages and dependencies:
sudo apt-get update
sudo apt-get install -y \
    r-base r-base-dev \
    libcurl4-openssl-dev libssl-dev libxml2-dev
Follow the instructions on the RStudio website to download and install the latest RStudio Server version for 64-bit Debian Linux.
Creating a user account on the controller node
To create a user account to log in to the RStudio UI, follow these steps.
Create a new user account, replacing USER_NAME with the new username:

sudo adduser USER_NAME
When you are prompted, enter a password for the new user.
You can create multiple user accounts on the controller node to give users their own RStudio environment. For each user that you create, follow the sparklyr and Spark installation steps.
Connecting to the RStudio web UI
RStudio Server runs on the Dataproc controller node and is accessible only from the Google Cloud internal network. To access the server, you need a network path between your local machine and the controller node on the Google Cloud internal network.
You can connect by port forwarding through an SSH tunnel, which is more secure than opening a firewall port to the controller node. Using an SSH tunnel encrypts your connection to the web UI, even though the server uses simple HTTP.
There are two options for port forwarding: dynamic port forwarding using SOCKS, or TCP port forwarding.
Using SOCKS, you can view all internal web interfaces that are running on the Dataproc controller node; however, you need to use a custom browser configuration to redirect all browser traffic over the SOCKS proxy.
TCP port forwarding does not require a custom browser configuration, but you can only view the RStudio web interface.
Connect through an SSH SOCKS tunnel
To create an SSH SOCKS tunnel and connect by using a specially configured browser profile, follow the steps in Connecting to the web interfaces.
After you connect, use the following URLs to access the web interfaces.
- To load the RStudio web UI, connect your specially configured browser to http://CLUSTER_NAME-m:8787. Then log in by using the username and password that you created.
- To load the YARN resource manager web UI, connect your specially configured browser to http://CLUSTER_NAME-m:8088.
- To load the HDFS NameNode web UI, connect your specially configured browser to http://CLUSTER_NAME-m:9870.
Connect through SSH port forwarding
Linux or macOS
On your local machine, connect to the Dataproc controller node:
gcloud compute ssh \
    --zone=CLUSTER_ZONE \
    --project=PROJECT_ID \
    CLUSTER_NAME-m -- \
    -L 8787:localhost:8787

The -- parameter separates arguments for the gcloud command from arguments that are sent to the ssh command. The -L option sets up TCP port forwarding from port 8787 on the local machine to port 8787 on the cluster controller node, where RStudio Server is listening.

To load the RStudio web UI, connect your browser to http://localhost:8787.

Log in by using the username and password that you created.
Windows
On your local machine, connect to the Dataproc controller node:
gcloud compute ssh ^
    --zone=CLUSTER_ZONE ^
    --project=PROJECT_ID ^
    CLUSTER_NAME-m -- ^
    -L 8787:localhost:8787

The -- parameter separates arguments for the gcloud command from arguments that are sent to the ssh command. The -L option sets up TCP port forwarding from port 8787 on the local machine to port 8787 on the cluster controller node, where RStudio Server is listening.

To load the RStudio web UI, connect your browser to http://localhost:8787.

Log in by using the username and password that you created.
Installing the sparklyr package and Spark
To install the sparklyr package and Spark, in the RStudio R console, run the following commands:
install.packages("sparklyr") sparklyr::spark_install()
These commands download, compile, and install the required R packages and a compatible Spark instance. Each command takes several minutes to complete.
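By default, spark_install() installs a recent Spark version that sparklyr supports. If you want to match a specific Spark version, for example the one running on your cluster, you can list the installable versions first and then request one. The following is a minimal sketch; the version string shown is only an example, not a recommendation:

# List the Spark versions that sparklyr can download and install.
sparklyr::spark_available_versions()

# Install a specific version from that list (example value only).
sparklyr::spark_install(version = "2.4")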
Connecting R to Spark on YARN
Each time you restart an R session, follow these steps:
Load the libraries and set up the necessary environment variables:
library(sparklyr)
library(dplyr)
spark_home_set()
Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf')
Connect to Spark on YARN, using the default settings:
sc <- spark_connect(master = "yarn-client")
The sc object references your Spark connection, which you can use to manage data and execute queries in R. If the command succeeds, skip to Checking the status of the Spark connection.
If the command fails with an error message starting with:
Error in force(code) : Failed during initialize_connection: java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
Then there is an incompatibility between the version of YARN and the version of Spark that RStudio uses. You can avoid this incompatibility by disabling the YARN timeline service.
In the menu of the RStudio web UI, navigate to Tools > Shell to open a new Terminal tab.
In the Terminal tab, enter the following command to disable the service causing the incompatibility.
echo "spark.hadoop.yarn.timeline-service.enabled false" \ >> $SPARK_HOME/conf/spark-defaults.conf
Close the Terminal tab, and in the menu navigate to Session > Restart R.
Repeat steps 1 and 2. This time, the connection to Spark should succeed.
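Alternatively, instead of editing spark-defaults.conf on disk, you can disable the timeline service for a single session through the sparklyr connection configuration. This is a minimal sketch that uses the standard spark_config() mechanism:

library(sparklyr)

# Build a connection configuration and turn off the YARN timeline service
# client for this session only, without changing spark-defaults.conf.
config <- spark_config()
config[["spark.hadoop.yarn.timeline-service.enabled"]] <- FALSE

# Connect by using the custom configuration.
sc <- spark_connect(master = "yarn-client", config = config)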
Checking the status of the Spark connection
The sc object created above is the reference to your Spark connection. To confirm that the R session is connected, execute the following command:
spark_connection_is_open(sc)
If your connection is established, the command returns the following:
[1] TRUE
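Conversely, when you eventually finish working with Spark (for example, at the end of this tutorial), you can close the connection explicitly; the same check then reports the closed state:

# Close the Spark connection when you are done with it.
spark_disconnect(sc)

# The status check now returns FALSE. Reconnect with spark_connect()
# if you want to continue working.
spark_connection_is_open(sc)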
You can tune the connection parameters by using a configuration object that is passed to spark_connect().
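For example, the following sketch overrides a few common YARN resource settings; the values shown are placeholders that you would adjust for your cluster size:

library(sparklyr)

# Create a configuration object and override selected Spark settings.
config <- spark_config()
config$spark.executor.memory    <- "2g"  # memory per executor
config$spark.executor.cores     <- 2     # cores per executor
config$spark.executor.instances <- 4     # number of executors

# Pass the configuration object when connecting.
sc <- spark_connect(master = "yarn-client", config = config)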
For more details on sparklyr connection parameters and on tuning Spark on YARN, see the sparklyr documentation and the Apache Spark documentation for running on YARN.
Optional: Verifying your installation
To verify that everything is working, you can load a table onto the Dataproc cluster and perform a query.
In the R console, install the example dataset, a list of all New York City flights in 2013, and copy it into Spark:
install.packages("nycflights13") flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
If you are not using SOCKS port forwarding, skip to step 3. Otherwise, use the Spark UI to verify that the table was created.
In the browser that you configured, load the YARN resource manager:
http://CLUSTER_NAME-m:8088
In the application list, a row for the sparklyr app will appear in the table.
In the Tracking UI column, on the right side of the table, click the ApplicationMaster link to access the Spark UI.
In the Jobs tab of the Spark UI, you will see entries for the jobs that copied the data to Spark. In the Storage tab, you will see an entry for In-memory table 'flights'.
In the R console, run the following query:
flights_tbl %>% select(carrier, dep_delay) %>% group_by(carrier) %>% summarize(count = n(), mean_dep_delay = mean(dep_delay)) %>% arrange(desc(mean_dep_delay))
This query computes the number of flights and the average departure delay for each airline, sorted in descending order of average delay, and produces the following result:
# Source: lazy query [?? x 3]
# Database: spark_connection
# Ordered by: desc(mean_dep_delay)
  carrier  count mean_dep_delay
  <chr>    <dbl>          <dbl>
1 F9        685.           20.2
2 EV      54173.           20.0
3 YV        601.           19.0
4 FL       3260.           18.7
5 WN      12275.           17.7
...
If you go back to the Jobs tab in the Spark UI, you can see the jobs that are used to execute this query. For longer-running jobs, you can use this tab to monitor progress.
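The result above is a lazy Spark query, so the rows still live on the cluster. If you want to analyze or plot the summary with ordinary R tools, you can pull it into a local data frame; a minimal sketch:

library(dplyr)

# Run the aggregation on Spark and copy only the small summary table
# (not the full flights data) into local R memory.
delay_summary <- flights_tbl %>%
  select(carrier, dep_delay) %>%
  group_by(carrier) %>%
  summarize(count = n(), mean_dep_delay = mean(dep_delay)) %>%
  arrange(desc(mean_dep_delay)) %>%
  collect()

# delay_summary is now an ordinary tibble in the local R session.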
Acknowledgments
Thanks to Mango Solutions for their assistance in preparing certain technical content for this article.
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- Delete the Dataproc cluster.
- If you have no other Dataproc clusters in the same region, you also need to delete the Cloud Storage bucket that was automatically created for the region.
Delete the cluster
In the Cloud Console, go to the Dataproc Clusters page:
In the cluster list, find the row for the Dataproc cluster that you created, and in the Cloud Storage staging bucket column, make a note of the bucket name, which begins with the word dataproc.
Select the checkbox next to your cluster's name, and click Delete.
When you are prompted to delete the cluster, confirm the deletion.
Delete the bucket
To delete the Cloud Storage bucket, go to the Cloud Storage browser:
Find the bucket that is associated with the Dataproc cluster that you just deleted.
Select the checkbox next to the bucket name, and click Delete.
When you are prompted to delete the storage bucket, confirm the deletion.
What's next
- For other ways of interacting with Dataproc, see Samples and Tutorials.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.