Run JanusGraph on GKE with Bigtable

Graph databases can help you to discover insights by modeling your data entities and the relationships between them. JanusGraph is a graph database that supports working with large amounts of data. This tutorial shows you how to run JanusGraph on Google Cloud with Google Kubernetes Engine (GKE) as the orchestration platform and Bigtable as the storage backend.

The tutorial is for system architects, database administrators, and DevOps professionals who are interested in running the JanusGraph graph database on Google Cloud by using a managed database as the storage backend. It assumes that you are familiar with GKE, Kubernetes Pods, Helm charts, Bigtable, and Elasticsearch. Knowledge of the Apache TinkerPop graph computing framework and of the Gremlin graph traversal machine and language is not required, but you will need that knowledge to use JanusGraph beyond the examples that this tutorial provides.

Overview

In graph terminology, entities are known as nodes or vertices and relationships are known as edges. In JanusGraph, both vertices and edges can have additional associated data that is made available through properties.

Example of a property graph.

The preceding illustration is an example of a property graph.
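
To make the model concrete, the following illustrative Gremlin (Groovy) sketch builds this kind of structure. The labels and property names here are invented for the example; statements like these run in the Gremlin console that you use later in this tutorial:

    // Illustrative only: two vertices, each with a property, connected by
    // an edge that carries its own property.
    v1 = graph.addVertex(label, 'person')
    v1.property('name', 'alice')
    v2 = graph.addVertex(label, 'person')
    v2.property('name', 'bob')
    e = v1.addEdge('follows', v2)
    e.property('since', 2021)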

Graph databases help you model a variety of domains and activities:

  • Social networks
  • Financial transactions (for fraud analysis)
  • Physical or virtual system networks

When you create graph databases, you sometimes create millions or even billions of vertices and edges. When you use JanusGraph with Bigtable as the underlying storage layer, you can both execute fast queries (known as graph traversals) and scale your storage layer independently according to the size and throughput that you need. JanusGraph also uses a pluggable indexing backend to provide full-text indexing for vertex and edge properties.

In this tutorial, you deploy a scalable JanusGraph infrastructure on GKE. You use Elasticsearch as the indexing backend running in Pods in a StatefulSet, and you use Bigtable as the storage backend. When you're done, you can traverse the relationships that exist in your graph data. The following diagram shows how these elements fit together.

JanusGraph deployment with Bigtable on GKE.

The preceding diagram shows the JanusGraph deployment on GKE with Elasticsearch and Bigtable.

JanusGraph data in Bigtable

JanusGraph stores graph data as an adjacency list. Each row represents a vertex, its adjacent vertices (edges), and property metadata about the vertices and edges. The row key is the unique identifier for the vertex. Each relationship between the vertex and another vertex, along with any properties that further define the relationship, is stored as an edge or edge-property column. Both the column qualifier and column value store data that defines the edge, in accordance with Bigtable best practices. Each vertex property is stored as a separate column, again using both the column qualifier and the column value to define the property.

The following diagram shows this storage structure.

JanusGraph adjacency list storage structure.

The diagram shows the logical storage structure for a small graph fragment with logical details for two vertex rows. In the diagram, the two example rows represent two vertices. The first vertex is labeled with a single vertex property and is related to two other vertices by two separate edges. The second vertex holds columns containing two properties and one edge.

The following illustration of the vertex edge logical data model provides some detail about the column qualifiers and values for an edge or edge-property column.

JanusGraph edge and edge property column.

For each adjacent vertex, a column stores the metadata about that edge. The column qualifier contains metadata about the edge relationship and about the edge direction, and a pointer to the adjacent vertex. The column value contains the edge label and any additional edge properties. Because traversals can be followed in either direction, edges are stored twice, once for each end of the edge relationship. Bidirectional edge storage significantly increases traversal performance, but comes with some trade-offs due to the redundancy of additional storage space and non-atomic edge mutations.

The following diagram is the logical data model of a vertex property column.

JanusGraph column values for a property column.

The preceding illustration provides details about the column qualifiers and values for a vertex property column.

Each vertex property is stored as a separate column. The column qualifier is a unique identifier for the property key. The column value contains both an identifier for the property and the value of the property.
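
On the query side, these property columns surface through Gremlin. As a brief sketch that uses the Graph of the Gods sample data that you load later in this tutorial, valueMap() returns all of the property keys and values for the matching vertices:

    g.V().has('name', 'jupiter').valueMap()

Each key-value pair in the result corresponds to one property column in the vertex's Bigtable row.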

JanusGraph also relies on Bigtable's lexicographical ordering of rows and column qualifiers to enhance query performance.

Objectives

  • Create a Bigtable instance to serve as the JanusGraph storage backend.
  • Deploy a GKE cluster and use Helm to install JanusGraph and Elasticsearch.
  • Create and traverse graph data by using the Gremlin console.
  • Verify that graph data is stored in Bigtable and indexed in Elasticsearch.

Costs

In this document, you use the following billable components of Google Cloud:

  • Bigtable
  • Compute Engine
  • Google Kubernetes Engine

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Prerequisites

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Bigtable, Compute Engine, and GKE APIs.

    Enable the APIs

Prepare your environment

In this tutorial, you use Cloud Shell to enter commands. Cloud Shell gives you access to the command line in the Google Cloud console and includes the Google Cloud CLI and other tools that you need to develop in Google Cloud. Cloud Shell appears as a window at the bottom of the Google Cloud console. The window appears immediately, but it can take several minutes for the session to initialize.

  1. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  2. In Cloud Shell, set environment variables for the Compute Engine zone where your Bigtable cluster and GKE cluster will be created, and for the name, node type, and version of your GKE cluster:

    export PROJECT_ID=PROJECT_ID
    export GCP_ZONE=ZONE
    export GKE_CLUSTER_NAME=GKE_CLUSTER_NAME
    export GKE_NODE_TYPE=n1-standard-4
    export GKE_VERSION=1.20
    

    Replace the following:

    • PROJECT_ID with your project identifier.
    • ZONE with the zone where your Bigtable cluster and GKE cluster will be created.
    • GKE_CLUSTER_NAME with the name of your GKE cluster.

    The command should look similar to the following example:

    export PROJECT_ID=bt-janusgraph-project-id
    export GCP_ZONE=us-central1-f
    export GKE_CLUSTER_NAME=janusgraph-gke
    export GKE_NODE_TYPE=n1-standard-4
    export GKE_VERSION=1.20
    
  3. Create a GKE cluster where JanusGraph will be deployed:

    gcloud container clusters create ${GKE_CLUSTER_NAME} \
        --zone=${GCP_ZONE} \
        --cluster-version=${GKE_VERSION} \
        --machine-type ${GKE_NODE_TYPE} \
        --scopes "https://www.googleapis.com/auth/cloud-platform"
    

Create a Bigtable instance

For the JanusGraph storage backend, this tutorial uses Bigtable, which can scale rapidly to meet your needs. This tutorial uses a single-node cluster, which is both economical and sufficient for the tutorial. You can start your projects with a smaller cluster and then move to a larger cluster when you're ready to work with production data. The Bigtable documentation includes a detailed discussion about performance and scaling to help you pick a cluster size for your own work.

  1. In Cloud Shell, set the environment variable for your Bigtable instance identifier:

    export BIGTABLE_INSTANCE_ID=BIGTABLE_INSTANCE_ID
    

    Replace BIGTABLE_INSTANCE_ID with the identifier for your Bigtable instance.

  2. Create the Bigtable instance:

    gcloud bigtable instances create ${BIGTABLE_INSTANCE_ID} \
        --cluster-config=id=${BIGTABLE_INSTANCE_ID}-${GCP_ZONE},zone=${GCP_ZONE},nodes=1 \
        --display-name=${BIGTABLE_INSTANCE_ID}-${GCP_ZONE}
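
Later, when you're ready for production-scale data, you can resize this cluster in place without redeploying JanusGraph. The following sketch assumes the cluster ID format that's used in the create command above:

    # Scale the tutorial's single-node cluster up to three nodes.
    gcloud bigtable clusters update ${BIGTABLE_INSTANCE_ID}-${GCP_ZONE} \
        --instance=${BIGTABLE_INSTANCE_ID} \
        --num-nodes=3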
    

Install and configure Helm

You use Helm to deploy applications to your Kubernetes cluster. In this tutorial, you use Helm to deploy both the JanusGraph and Elasticsearch services on your GKE cluster.

  1. In Cloud Shell, install Helm:

    curl -fsSL -o get_helm.sh \
        https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
    chmod 700 get_helm.sh
    DESIRED_VERSION=v3.5.0 ./get_helm.sh
    
  2. Add the elastic chart repository, so that the Elasticsearch chart dependency can be found during the JanusGraph chart deployment:

    helm repo add elastic https://helm.elastic.co
    

    This chart repository is hosted by Elastic, the creators of Elasticsearch.

Use Helm to install JanusGraph and Elasticsearch

In this section, you use a Helm chart to deploy JanusGraph and Elasticsearch to your Kubernetes cluster.

The Helm chart is pulled from GitHub. The chart deploys a set of three JanusGraph Pods behind a Service that provisions an internal passthrough Network Load Balancer. When the Pods are running, startup and liveness probes make HTTP requests in order to perform health checks on the JanusGraph server on each Pod. In addition, the chart includes a dependency chart provided by Elastic that deploys three Elasticsearch Pods in a StatefulSet.

  1. In Cloud Shell, set environment variables for Helm and JanusGraph names:

    export HELM_REPO=bigtable-janusgraph-helm
    export JANUSGRAPH_VERSION=0.5.3
    export HELM_CHART_RELEASE_VERSION=1
    export HELM_CHART_RELEASE_TAG=${JANUSGRAPH_VERSION}-${HELM_CHART_RELEASE_VERSION}
    export HELM_CHART_RELEASE_TAG_HASH=f8b271a4854d4a553dd5e9ba014d077fb098d9ab
    export HELM_CHART_NAME=janusgraph-bigtable
    
  2. Pull the Helm chart from GitHub:

    git clone https://github.com/GoogleCloudPlatform/${HELM_REPO} \
       --branch ${HELM_CHART_RELEASE_TAG}
    
  3. Navigate to the Helm chart directory:

    cd ${HELM_REPO}
    
  4. For security purposes, verify the integrity of the cloned chart by comparing its commit hash with the expected release hash:

    HEAD_COMMIT_HASH=$(git rev-parse --verify HEAD)
    if [ "_${HEAD_COMMIT_HASH}" == "_${HELM_CHART_RELEASE_TAG_HASH}" ]
    then
        echo "Commit hash verified"
    fi
    

    If the output does not match the following, do not proceed, because the integrity of the cloned tag has not been verified.

    Commit hash verified
    
  5. Update chart dependencies:

    helm dep update
    
  6. Navigate to the parent directory:

    cd ..
    
  7. Set environment variables for the names of Helm and JanusGraph entities:

    export HELM_RELEASE_NAME=janusgraph-bigtable-elastic
    export ELASTICSEARCH_CLUSTER_NAME=${HELM_RELEASE_NAME}-elasticsearch
    export BIGTABLE_JANUSGRAPH_TABLE=janusgraph-table
    
  8. Create a values.yaml file, which supplies Helm with the configuration properties to use when deploying the JanusGraph chart:

    cat > values.yaml << EOF
    
    image:
      repository: docker.io/janusgraph/janusgraph
      tag: 0.5.3
      pullPolicy: IfNotPresent
    
    replicaCount: 3
    
    service:
      type: LoadBalancer
      port: 8182
      serviceAnnotations:
        networking.gke.io/load-balancer-type: "Internal"
    
    elasticsearch:
      deploy: true
      clusterName: ${ELASTICSEARCH_CLUSTER_NAME}
    
    properties:
      storage.backend: hbase
      storage.directory: null
      storage.hbase.ext.google.bigtable.instance.id: ${BIGTABLE_INSTANCE_ID}
      storage.hbase.ext.google.bigtable.project.id: ${PROJECT_ID}
      storage.hbase.ext.hbase.client.connection.impl: com.google.cloud.bigtable.hbase2_x.BigtableConnection
      storage.hbase.short-cf-names: true
      storage.hbase.table: ${BIGTABLE_JANUSGRAPH_TABLE}
      index.search.backend: elasticsearch
      index.search.hostname: ${ELASTICSEARCH_CLUSTER_NAME}-master
      index.search.directory: null
      index.search.elasticsearch.health-request-timeout: 90s
      cache.db-cache: true
      cache.db-cache-clean-wait: 20
      cache.db-cache-time: 180000
      cache.db-cache-size: 0.5
      cluster.max-partitions: 1024
      graph.replace-instance-if-exists: true
    
    persistence:
      enabled: false
    
    debugLevel: INFO
    EOF
    
  9. Deploy the JanusGraph Helm chart by using the values.yaml file that you created:

    helm upgrade --install \
        --wait \
        --timeout 600s \
        ${HELM_RELEASE_NAME} \
        ./${HELM_REPO} \
        -f values.yaml
    

    The installation process waits until all of the resources are ready before it completes. This process might take several minutes.
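
To check on the release afterwards, the following sketch can help. It assumes that the chart names the JanusGraph Service after the Helm release; adjust the name if your deployment differs:

    # Summarize the deployed release.
    helm status ${HELM_RELEASE_NAME}

    # Print the IP address of the internal load balancer (assumes the
    # Service takes the release name).
    kubectl get service ${HELM_RELEASE_NAME} \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}'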

Verify your JanusGraph deployment

When the Helm installation process finishes, it displays a NOTES section that describes a getting-started experience. You can follow the steps that the NOTES section outlines to verify that your JanusGraph environment is working.

  1. In Cloud Shell, verify that the Helm chart components deployed to GKE:

    1. Check the JanusGraph deployment:

      kubectl get deployments
      

      If the deployment is successful, the output is as follows:

      NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
      janusgraph-bigtable-elastic   3/3     3            3           3m28s
      
    2. Check the Elasticsearch StatefulSet:

      kubectl get statefulsets
      

      If everything is working, the output is as follows:

      NAME                                               READY   AGE
      janusgraph-bigtable-elastic-elasticsearch-master   3/3     4m13s
      
  2. Set an environment variable to the name of a Kubernetes Pod that's running the JanusGraph Gremlin server. The app label for the Pod that's running the Gremlin server is derived from the Helm chart name that's defined in the Chart.yaml file.

    export APP_LABEL_FROM_CHART_NAME=${HELM_CHART_NAME}
    export POD_NAME=$(kubectl get pods \
                         --namespace default \
                         -l "app=${APP_LABEL_FROM_CHART_NAME}, \
                             release=${HELM_RELEASE_NAME}" \
                         -o jsonpath="{.items[0].metadata.name}")
    
  3. Connect to the Pod and run the Gremlin console, a read-eval-print loop (REPL) shell. The name of the container is also derived from the Helm chart name in Chart.yaml.

    export GREMLIN_CONTAINER=${HELM_CHART_NAME}
    kubectl exec \
            -c ${GREMLIN_CONTAINER} \
            -it $POD_NAME \
            -- /opt/janusgraph/bin/gremlin.sh
    
  4. In the Gremlin console, connect to the Apache TinkerPop server:

    1. Start the session:

      :remote connect tinkerpop.server conf/remote.yaml session
      

      The output looks similar to the following:

      ==>Configured localhost/127.0.0.1:8182-[b08972f2-a2aa-4312-8018-bcd11bc9812c]
      
    2. Connect to the server:

      :remote console
      

      The output looks similar to the following:

      ==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[b08972f2-a2aa-4312-8018-bcd11bc9812c] - type ':remote console' to return to local mode>
      
  5. In the Gremlin console, verify that the Gremlin server is running correctly by inspecting the graph variable, which represents the graph instance:

    graph
    

    The output indicates that the JanusGraph server is running with an HBase-compatible database (in this case, Bigtable) as the storage backend:

    ==>standardjanusgraph[hbase:[127.0.0.1]]
    
  6. In the Gremlin console, create two vertices:

    v1 = graph.addVertex(label, 'hello')
    v2 = graph.addVertex(label, 'world')
    

    If the console output is similar to the following, the two vertices have been added:

    ==>v[4344]
    ==>v[4152]
    
  7. Create an edge connecting the two vertices:

    v1.addEdge('followedBy', v2)
    

    If the console output is similar to the following, the edge between the two vertices has been added:

    ==>e[17j-3co-4fmd-oe054][4344-followedBy->4152]
    
  8. Commit the transaction:

    graph.tx().commit()
    

    If the console output is similar to the following, the transaction has been committed:

    ==>null
    

    The following diagram illustrates the graph that's created by the commands.

    JanusGraph example vertices and edge.

    The vertex labeled hello is connected by a directed edge labeled followedBy to the vertex labeled world.

  9. Issue a Gremlin query to see what the label is for the vertex that follows an edge labeled followedBy from the vertex that's labeled hello:

    g.V().has(label, 'hello').out('followedBy').label()
    

    The query syntax is explained in the next section. For now, you see the word world as the output from the query:

    ==>world
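
    Because JanusGraph stores each edge twice, once for each endpoint, you can also traverse the edge in the reverse direction by using in(). A quick sketch that uses the two vertices you just created:

    g.V().has(label, 'world').in('followedBy').label()

    The traversal should return hello.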
    

Load and query a sample dataset

Now that you've deployed JanusGraph and can connect to it by using Gremlin, you can begin loading and querying your own data. To see what that process looks like, load the sample dataset that comes bundled with JanusGraph—the Graph of the Gods, which depicts mythological deities of the Roman pantheon and their location properties.

  1. In the Gremlin console, load the Graph of the Gods dataset into the graph:

    GraphOfTheGodsFactory.load(graph)
    

    The output is as follows:

    ==>null
    
  2. Issue a graph traversal query that finds all brothers of Jupiter:

    g.V().has('name', 'jupiter').out('brother').values('name')
    

    The following table explains the steps that the query traverses.

    Traversal step            Explanation
    g.V()                     Start with the collection of vertices.
    has('name', 'jupiter')    Find a vertex that has the property name with the value jupiter.
    out('brother')            From there, follow any edges that are labeled brother.
    values('name')            For the vertices where those edges lead, get the name property.

    The output is as follows:

    ==>neptune
    ==>pluto
    

    To get more familiar with the traversal queries that are possible on this Graph of the Gods dataset, try out other sample queries from the JanusGraph docs.
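
    For example, the following full-text query from the JanusGraph getting-started documentation exercises the Elasticsearch indexing backend that you deployed. It finds edges whose reason property contains the word loves:

    g.E().has('reason', textContains('loves')).values('reason')

    The results should include reasons such as loves waves, which you also see later in the Elasticsearch index entries.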

Verify that data is stored in Bigtable

Now that you have created some sample data in your JanusGraph cluster, you can verify that Bigtable was used as the storage backend.

  1. Close the Gremlin console:

    :q
    
  2. In Cloud Shell, verify that the data has been persisted to the janusgraph table in Bigtable:

    cbt -project=${PROJECT_ID} \
        -instance=${BIGTABLE_INSTANCE_ID} \
        count ${BIGTABLE_JANUSGRAPH_TABLE}
    

    The output is similar to the following.

    2021/03/02 02:32:19 -creds flag unset, will use gcloud credential
    101
    

    The value 101 in the output represents the number of rows in the janusgraph table, and might be different for you.
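
    To see the adjacency-list rows described earlier in this tutorial, you can read a few rows directly. A sketch that uses the same cbt tool:

    # Read two rows from the JanusGraph table. The output is mostly binary,
    # because JanusGraph encodes row keys and column qualifiers compactly.
    cbt -project=${PROJECT_ID} \
        -instance=${BIGTABLE_INSTANCE_ID} \
        read ${BIGTABLE_JANUSGRAPH_TABLE} count=2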

Verify search index creation in Elasticsearch

  1. In Cloud Shell, set environment variables for the Elasticsearch Pod ordinal and name:

    export ELASTICSEARCH_POD_ORDINAL=0
    export ELASTICSEARCH_POD_NAME_ROOT=${ELASTICSEARCH_CLUSTER_NAME}-master
    export ELASTICSEARCH_POD=${ELASTICSEARCH_POD_NAME_ROOT}-${ELASTICSEARCH_POD_ORDINAL}
    

    The names of the Elasticsearch Pods are defined by the Elasticsearch Helm dependencies. The Pod names consist of the cluster name as supplied in the values.yaml file that you created, the word master, and a zero-indexed ordinal number, all separated by hyphens. For this step, you choose the first Pod, represented as zero (0).

  2. Use the Elasticsearch Aliases REST API to inspect the indexes:

    kubectl exec \
            -c elasticsearch \
            -it ${ELASTICSEARCH_POD} \
            --  \
            curl -XGET "127.0.0.1:9200/_aliases?pretty=true";
    

    The output shows that JanusGraph created two indexes, janusgraph_vertices and janusgraph_edges, to provide efficient lookups that use vertex and edge properties:

    {
      "janusgraph_vertices" : {
        "aliases" : {
          "janusgraph" : { }
        }
      },
      "janusgraph_edges" : {
        "aliases" : {
          "janusgraph" : { }
        }
      }
    }
    
  3. Query the values from one of the indexes using the Elasticsearch Search REST API:

    kubectl exec \
            -c elasticsearch \
            -it ${ELASTICSEARCH_POD} \
            --  \
            curl -XGET "127.0.0.1:9200/janusgraph_edges/_search?pretty=true&q=*";
    

    The search results show that the indexes that JanusGraph created contain entries. The output that you see is similar to the following truncated results from the janusgraph_edges index:

    {
     "took" : 94,
     "timed_out" : false,
     "_shards" : {
       "total" : 1,
       "successful" : 1,
       "skipped" : 0,
       "failed" : 0
     },
     "hits" : {
       "total" : {
         "value" : 6,
         "relation" : "eq"
       },
       "max_score" : 1.0,
       "hits" : [
         {
           "_index" : "janusgraph_edges",
           "_type" : "_doc",
           "_id" : "6bvp-5ovc-b2t-2yko",
           "_score" : 1.0,
           "_source" : {
             "reason" : "loves waves"
           }
         },
         {
    …
    

Delete the project

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next