Some of the core open source components included with Google Cloud Dataproc clusters, such as Apache Hadoop and Apache Spark, provide web interfaces. These web interfaces can be used to manage and monitor different cluster resources and facilities, such as the YARN resource manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark.
The interfaces listed below are available on a Cloud Dataproc cluster master node (replace master-host-name with the name of your master node).
* In earlier Cloud Dataproc releases (pre-1.2), the HDFS NameNode web UI port was 50070.
The YARN ResourceManager provides links, under the "Tracking UI" column, to the web interfaces of all currently running and completed MapReduce and Spark applications.
Connecting to the web interfaces
To connect to the web interfaces, the best practice is to use an SSH tunnel to create a secure connection to your master node. The SSH tunnel supports traffic proxying using the SOCKS protocol. This means that you can send network requests through your SSH tunnel in any browser that supports the SOCKS protocol. This method allows you to transfer all of your browser data over SSH, eliminating the need to open firewall ports to access the web interfaces.
Connecting to the web interfaces with SSH and SOCKS is a two-step process:
Create an SSH tunnel. Use an SSH client or utility to create the SSH tunnel. Use the SSH tunnel to securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster.
Use a SOCKS proxy to connect with your browser. Configure your browser to use the SOCKS proxy. The SOCKS proxy routes data intended for the Cloud Dataproc cluster through the SSH tunnel.
Directions for performing each step are provided below.
Step 1 - Create an SSH tunnel
Run the following command to set up an SSH tunnel to the Hadoop master instance on port 1080 of your local machine. Note that 1080 is an arbitrary but typical choice since it is likely to be open on your local machine. Replace master-host-name with the name of the master node in your Cloud Dataproc cluster and master-host-zone with the zone of your Cloud Dataproc cluster.

gcloud compute ssh --zone=master-host-zone master-host-name -- \
    -D 1080 -N -n
The -- separator allows you to add arguments to the gcloud compute ssh command, as follows:

- -D specifies dynamic application-level port forwarding. Port 1080 is shown in the example, but another available port on your local machine can be used.
- -N instructs gcloud not to open a remote shell.
- -n instructs gcloud not to read from stdin.
This command creates an SSH tunnel that operates independently from other SSH shell sessions, keeps tunnel-related errors out of the shell output, and helps prevent inadvertent closures of the tunnel.
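As background, gcloud compute ssh wraps a standard OpenSSH dynamic port forward: it resolves the instance address and supplies your credentials, then invokes ssh with the arguments after the -- separator. A minimal sketch of the equivalent raw ssh invocation, using a hypothetical username and external IP (in practice gcloud fills these in for you):

```shell
# Hypothetical values -- gcloud looks these up for you in a real session.
SSH_USER="dataproc-user"    # your account on the master node (example)
MASTER_IP="203.0.113.10"    # external IP of the master node (example address)

# -D 1080: open a SOCKS proxy on local port 1080
# -N: do not run a remote command; -n: do not read from stdin
TUNNEL_CMD="ssh -D 1080 -N -n ${SSH_USER}@${MASTER_IP}"
echo "${TUNNEL_CMD}"
```

Because of -N and -n, the command blocks without producing a shell prompt; that is expected, and the tunnel stays up for as long as the command runs.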
Step 2 - Connect with your web browser
Your SSH tunnel supports traffic proxying using the SOCKS protocol. You must configure your browser to use the proxy when connecting to your cluster.
The application (executable) location of your browser on your machine/device depends on its operating system. The following are standard Google Chrome application locations for popular operating systems:
| Operating System | Google Chrome Executable Path |
| --- | --- |
| Mac OS X | /Applications/Google Chrome.app/Contents/MacOS/Google Chrome |
To configure your browser, start a new browser session with proxy server parameters. Here's an example that uses the Google Chrome browser:
"Google Chrome executable path" \
    --proxy-server="socks5://localhost:1080" \
    --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
    --user-data-dir=/tmp/master-host-name
This command uses the following Google Chrome flags:
- --proxy-server="socks5://localhost:1080" tells Chrome to send all URL requests through the SOCKS proxy server localhost:1080, using version 5 of the SOCKS protocol. Hostnames for these URLs are resolved by the proxy server, not locally by Chrome.
- --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" prevents Chrome from sending any DNS requests over the network.
- --user-data-dir=/tmp/master-host-name forces Chrome to open a new window that is not tied to an existing Chrome session. Without this flag, Chrome may open a new window attached to an existing Chrome session, ignoring your --proxy-server setting. The value set for --user-data-dir can be any nonexistent path.
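Putting the pieces together, here is a worked example for macOS, assuming Chrome's standard install location and a hypothetical cluster whose master node is named mycluster-m (substitute your own path and master node name):

```shell
# Standard macOS install location (adjust for your operating system).
CHROME="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
MASTER="mycluster-m"    # hypothetical master node name

# Launch an isolated Chrome session that routes all traffic
# through the SOCKS tunnel on localhost:1080.
"${CHROME}" \
    --proxy-server="socks5://localhost:1080" \
    --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
    --user-data-dir=/tmp/"${MASTER}"
```

Run this in a second terminal while the SSH tunnel from Step 1 is still running in the first.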
Once your browser is configured to use the proxy, you can navigate to one of the web interface URLs on your Cloud Dataproc cluster (see Available interfaces).
FAQ and debugging tips
What if I don't see the UI in my browser?
If you don't see the UIs in your browser, the two most common reasons are:
- You have a network connectivity issue, possibly due to a firewall.
Run the following command to see if you can SSH to the master instance.
If you can't, it signals a connectivity issue.
gcloud compute ssh cluster-name-m
- Another proxy is interfering with the SOCKS proxy. To check the proxy, run the following curl command (available on Linux and Mac OS X):

curl -Is --socks5-hostname localhost:1080 http://cluster-name-m:8088

If you see an HTTP response, the proxy is working, so it's possible that the SOCKS proxy is being interrupted by another proxy or browser extension.
Why should I use a SOCKS proxy instead of local port forwarding?
Instead of the SOCKS proxy, it's possible to access web application UIs running on your master instance with SSH local port forwarding, which forwards a port on the master to a local port. For example, the following command lets localhost:1080 reach cluster-name-m:8088 without SOCKS:

gcloud compute ssh cluster-name-m -- -L 1080:cluster-name-m:8088 -N -n
You can also use Google Cloud Shell to implement local port forwarding, then use the Cloud Shell Web Preview feature to access the web interface (for an example, see Install and run a Cloud Datalab notebook in a Cloud Dataproc cluster→Open the Cloud Datalab notebook).
Using a SOCKS proxy is recommended over local port forwarding since the proxy:
- allows you to access all web application ports without having to set up a port forward tunnel for each UI port
- allows the Spark and Hadoop web UIs to correctly resolve DNS hosts
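To illustrate the first point: with local port forwarding, each UI needs its own -L flag and its own spare local port. A sketch covering just two UIs (cluster-name is a placeholder; 8088 is the YARN ResourceManager port used in the examples above, and 9870 is assumed here for the HDFS NameNode UI on post-1.2 images, so verify the ports for your image version):

```shell
# One -L flag (and one free local port) per remote UI port;
# the SOCKS proxy needs none of this bookkeeping.
gcloud compute ssh cluster-name-m -- \
    -L 8088:cluster-name-m:8088 \
    -L 9870:cluster-name-m:9870 \
    -N -n
```

With the SOCKS approach, a single tunnel on one local port covers every UI port on the master, and links between the UIs (which use cluster-internal hostnames) resolve correctly through the proxy.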