Architecture for connecting visualization software to Hadoop on Google Cloud

Last reviewed 2024-04-17 UTC

This document is intended for operators and IT administrators who want to set up secure data access for data analysts using business intelligence (BI) tools such as Tableau and Looker. It doesn't offer guidance on how to use BI tools, or interact with Dataproc APIs.

This document is the first part of a series that helps you build an end-to-end solution to give data analysts secure access to data using BI tools. This document explains the following concepts:

  • A proposed architecture.
  • A high-level view of the component boundaries, interactions, and networking in the architecture.
  • A high-level view of authentication and authorization in the architecture.

The second part of this series, Connecting your visualization software to Hadoop on Google Cloud, shows you how to set up the architecture on Google Cloud.

Architecture

The following diagram shows the architecture and the flow of events that are explained in this document. For more information about the products that are used in this architecture, see Architectural components.

The flow of events in the architecture.

  1. Client applications connect through Java Database Connectivity (JDBC) to a single entry point on a Dataproc cluster. The entry point is provided by Apache Knox, which is installed on the cluster master node. The communication with Apache Knox is secured by TLS.
  2. Apache Knox delegates authentication through an authentication provider to a system such as an LDAP directory.
  3. After authentication, Apache Knox routes the user request to one or more backend clusters. You define the routes and configuration as custom topologies.
  4. A data processing service, such as Apache Hive, listens on the selected backend cluster and takes the request.
  5. Apache Ranger intercepts the request and determines whether processing should go ahead, depending on if the user has valid authorization.
  6. If the validation succeeds, the data processing service analyzes the request and returns the results.

Architectural components

The architecture is made up of the following components.

  • The managed Hadoop platform:
    • Dataproc. Dataproc is Google Cloud-managed Apache Spark, an Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc is the platform that underpins the solution described in this document.
  • User authentication and authorization:
    • Apache Knox. Apache Knox acts as a single HTTP access point for all the underlying services in a Hadoop cluster. Apache Knox is designed as a reverse proxy with pluggable providers for authentication, authorization, audit, and other services. Clients send requests to Knox, and, based on the request URL and parameters, Knox routes the request to the appropriate Hadoop service. Because Knox is an entry point that transparently handles client requests and hides complexity, it's at the center of the architecture.
    • Apache Ranger. Apache Ranger provides fine-grained authorization for users to perform specific actions on Hadoop services. It also audits user access and implements administrative actions.
  • Processing engines:
    • Apache Hive. Apache Hive is data warehouse software that enables access and management of large datasets residing in distributed storage using SQL. Apache Hive parses the SQL queries, performs semantic analysis, and builds a directed acyclic graph (DAG) of stages for a processing engine to execute. In the architecture shown in this document, Hive acts as the translation point between the user requests. It can also act as one of the multiple processing engines. Apache Hive is ubiquitous in the Hadoop ecosystem and it opens the door to practitioners familiar with standard SQL to perform data analysis.
    • Apache Tez. Apache Tez is the processing engine in charge of executing the DAGs prepared by Hive and of returning the results.
    • Apache Spark. Apache Spark is a unified analytics engine for large-scale data processing that supports the execution of DAGs. The architecture shows the Spark SQL component of Apache Spark to demonstrate the flexibility of the approach presented in this document. One restriction is that Spark SQL doesn't have official Ranger plugin support. For this reason, authorization must be done through the coarse-grained ACLs in Apache Knox instead of using the fine-grained authorization that Ranger provides.

Components overview

In the following sections, you learn about each of the components in more detail. You also learn how the components interact with each other.

Client applications

Client applications include tools that can send requests to an HTTPS REST endpoint, but don't necessarily support the Dataproc Jobs API. BI tools such as Tableau and Looker have HiveServer2 (HS2) and Spark SQL JDBC drivers that can send requests through HTTP.

Request flow from Tableau, Looker, and Beeline to HTTPS REST endpoint.

This document assumes that client applications are external to Google Cloud, executing in environments such as an analyst workstation, on-premises, or on another cloud. So, the communication between the client applications and Apache Knox must be secured with either a CA-signed or self-signed SSL/TLS certificate.

Entry point and user authentication

The proxy clusters are one or more long-lived Dataproc clusters that host the Apache Knox Gateway.

The proxy clusters.

Apache Knox acts as the single entry point for client requests. Knox is installed on the proxy cluster master node. Knox performs SSL termination, delegates user authentication, and forwards the request to one of the backend services.

In Knox, each backend service is configured in what is referred to as a topology. The topology descriptor defines the following actions and permissions:

  • How authentication is delegated for a service.
  • The URI the backend service forwards requests to.
  • Simple per-service authorization access control lists (ACLs).

Knox lets you integrate authentication with enterprise and cloud identity management systems. To configure user authentication for each topology, you can use authentication providers. Knox uses Apache Shiro to authenticate against a local demonstration ApacheDS LDAP server by default.

Alternatively, you can opt for Knox to use Kerberos. In the preceding diagram, as an example, you can see an Active Directory server hosted on Google Cloud outside of the cluster.

For information on how to connect Knox to an enterprise authentication services such as an external ApacheDS server or Microsoft Active Directory (AD), see the Apache Knox user guide and the Google Cloud Managed Active Directory and Federated AD documentation.

For the use case in this document, as long as Apache Knox acts as the single gatekeeper to the proxy and backend clusters, you don't have to use Kerberos.

Processing engines

The backend clusters are the Dataproc clusters hosting the services that process user requests. Dataproc clusters can autoscale the number of workers to meet the demand from your analyst team without manual reconfiguration.

The Dataproc backend clusters.

We recommend that you use long-lived Dataproc clusters in the backend. Long-lived Dataproc clusters allow the system to serve requests from data analysts without interruption. Alternatively, if the cluster only needs to serve requests for a brief time, you can use job-specific clusters, which are also known as ephemeral clusters. Ephemeral clusters can also be more cost effective than long-lived clusters.

If you use ephemeral clusters, to avoid modifying the topology configuration, make sure that you recreate the clusters in the same zone and under the same name. Using the same zone and name lets Knox route the requests transparently using the master node internal DNS name when you recreate the ephemeral cluster.

HS2 is responsible for servicing user queries made to Apache Hive. HS2 can be configured to use various execution engines such as the Hadoop MapReduce engine, Apache Tez, and Apache Spark. In this document, HS2 is configured to use the Apache Tez engine.

Spark SQL is a module of Apache Spark that includes a JDBC/ODBC interface to execute SQL queries on Apache Spark. In the preceding architectural diagram, Spark SQL is shown as an alternative option for servicing user queries.

A processing engine, either Apache Tez or Apache Spark, calls the YARN Resource Manager to execute the engine DAG on the cluster worker machines. Finally, the cluster worker machines access the data. For storing and accessing the data in a Dataproc cluster, use the Cloud Storage connector, not Hadoop Distributed File System (HDFS). For more information about the benefits of using the Cloud Storage connector, see the Cloud Storage connector documentation.

The preceding architectural diagram shows one Apache Knox topology that forwards requests to Apache Hive, and another that forwards requests to Spark SQL. The diagram also shows other topologies that can forward requests to services in the same or different backend clusters. The backend services can process different datasets. For example, one Hive instance can offer access to personally identifiable information (PII) for a restricted set of users while another Hive instance can offer access to non-PII data for broader consumption.

User authorization

Apache Ranger can be installed on the backend clusters to provide fine-grained authorization for Hadoop services. In the architecture, a Ranger plugin for Hive intercepts the user requests and determines whether a user is allowed to perform an action over Hive data, based on Ranger policies.

Ranger fine-grained authorization.

As an administrator, you define the Ranger policies using the Ranger Admin page. We strongly recommend that you configure Ranger to store these policies in an external Cloud SQL database. Externalizing the policies has two advantages:

  • It makes them persistent in case any of the backend clusters are deleted.
  • It enables the policies to be centrally managed for all groups or for custom groups of backend clusters.

To assign Ranger policies to the correct user identities or groups, you must configure Ranger to sync the identities from the same directory that Knox is connected to. By default, the user identities used by Ranger are taken from the operating system.

Apache Ranger can also externalize its audit logs to Cloud Storage to make them persistent. Ranger uses Apache Solr as its indexing and querying engine to make the audit logs searchable.

Unlike HiveServer2, Spark SQL doesn't have official Ranger plugin support, so you need to use the coarse-grained ACLs available in Apache Knox to manage its authorization. To use these ACLs, add the LDAP identities that are allowed to use each service, such as Spark SQL or Hive, in the corresponding topology descriptor for the service.

For more information, see Best practices to use Apache Ranger on Dataproc.

High availability

Dataproc provides a high availability (HA) mode. In this mode, there are several machines configured as master nodes, one of which is in active state. This mode allows uninterrupted YARN and HDFS operations despite any single-node failures or reboots.

However, if the master node fails, the single entry point external IP changes, so you must reconfigure the BI tool connection. When you run Dataproc in HA mode, you should configure an external HTTP(S) load balancer as the entry point. The load balancer routes requests to an unmanaged instance group that bundles your cluster master nodes. As an alternative to a load balancer, you can apply a round-robin DNS technique instead, but there are drawbacks to this approach. These configurations are outside of the scope of this document.

Cloud SQL also provides a high availability mode, with data redundancy made possible by synchronous replication between master instances and standby instances located in different zones. If there is an instance or zone failure, this configuration reduces downtime. However, note that an HA-configured instance is charged at double the price of a standalone instance.

Cloud Storage acts as the datastore. For more information about Cloud Storage availability, see storage class descriptions.

Networking

In a layered network architecture, the proxy clusters are in a perimeter network. The backend clusters are in an internal network protected by firewall rules that only let through incoming traffic from the proxy clusters.

The proxy clusters are isolated from the other clusters because they are exposed to external requests. Firewall rules only allow a restricted set of source IP addresses to access the proxy clusters. In this case, the firewall rules only allow requests that come from the addresses of your BI tools.

The configuration of layered networks is outside of the scope of this document. In Connecting your visualization software to Hadoop on Google Cloud, you use the default network throughout the tutorial. For more information on layered network setups, see the best practices for VPC network security and the overview and examples on how to configure multiple network interfaces.

What's next