Deploy a kernel-space NFS caching proxy in Compute Engine

Last reviewed 2023-10-03 UTC

This tutorial shows you how to deploy, configure, and test a Linux-based, kernel-space Network File System (NFS) caching proxy in Compute Engine. The architecture that's described in this tutorial is designed for a scenario in which read-only data is synchronized at a byte level from an NFS origin file server (such as an on-premises NFS file server) to Google Cloud, or synchronized on demand from one primary source of truth to multiple read-only replicas.

This tutorial assumes that you're familiar with the following:

  • Building custom versions of the Linux operating system.
  • Installing and configuring software with startup scripts in Compute Engine.
  • Configuring and managing an NFS file system.

This architecture doesn't support file locking. The architecture is best suited for pipelines that use unique filenames to track file versions.

Architecture

The architecture in this tutorial has a kernel-space NFS daemon (KNFSD) that acts as an NFS proxy and cache. This setup gives your cloud-based compute nodes access to fast, local storage by migrating data when an NFS client first requests it. NFS client nodes write data directly back to your NFS origin file server by using write-through caching. The following diagram shows this architecture:

Architecture using a KNFSD proxy in Google Cloud.

In this tutorial, you deploy and test the KNFSD proxy system. You create and configure a single NFS server, a single KNFSD proxy, and a single NFS client all in Google Cloud.

The KNFSD proxy system works by mounting a volume from the NFS server and re-exporting that volume. The NFS client mounts the re-exported volume from the proxy. When an NFS client requests data, the KNFSD proxy checks its various cache tables to determine whether the data resides locally. If the data is already in the cache, the KNFSD proxy serves it immediately. If the data requested isn't in the cache, the proxy migrates the data, updates its cache tables, and then serves the data. The KNFSD proxy caches both file data and metadata at a byte level, so only the bytes that are used are transferred as they are requested.
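
For illustration, the following commands sketch the general shape of such a re-export. The hostnames, paths, and export options are hypothetical, and in this tutorial the provided startup scripts perform the equivalent configuration for you:

    # Hypothetical example: mount a volume from the NFS origin server ...
    sudo mount -t nfs -o vers=3 nfs-origin:/data /srv/nfs/data

    # ... then re-export that mount so that NFS clients can mount it from the proxy.
    # Re-exported NFS mounts need an explicit fsid.
    echo '/srv/nfs/data *(rw,sync,no_subtree_check,fsid=10)' | sudo tee -a /etc/exports
    sudo exportfs -ra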

The KNFSD proxy has two layers of cache: L1 and L2. L1 is the standard block cache of the operating system that resides in RAM. When the volume of data exceeds available RAM, L2 cache is implemented by using FS-Cache, a Linux kernel module that caches data locally on disk. In this deployment, you use Local SSD as your L2 cache, although you can configure the system in several ways.
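
For example, enabling an FS-Cache-backed L2 cache on a Linux NFS client (here, the proxy acting as a client of the origin server) generally involves running the cachefilesd daemon and mounting with the fsc option. The following commands are a minimal, hypothetical sketch; in this tutorial, the proxy's startup script configures the cache on Local SSD for you:

    # Hypothetical sketch: run the FS-Cache user-space daemon. Its cache directory
    # (dir /var/cache/fscache in /etc/cachefilesd.conf) should live on fast local disk.
    sudo apt-get install -y cachefilesd
    echo 'RUN=yes' | sudo tee -a /etc/default/cachefilesd
    sudo systemctl restart cachefilesd

    # The fsc mount option tells the NFS client to cache file data through FS-Cache.
    sudo mount -t nfs -o vers=3,fsc nfs-origin:/data /srv/nfs/data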

To implement the architecture in this tutorial, you use standard NFS tools, which are compatible with NFS versions 2, 3, and 4.

KNFSD deployment in a hybrid architecture

In a hybrid architecture, NFS clients that are running in Google Cloud request data when it's needed. These requests are made to the KNFSD proxy, which serves data from its local cache if present. If the data isn't in the cache, the proxy manages communication back to the on-premises servers. The system can mount single or multiple NFS origin servers. The proxy manages all communication and data migration necessary through a VPN or Dedicated Interconnect back to the on-premises NFS origin servers. The following diagram shows this KNFSD deployment in a hybrid architecture:

Hybrid architecture using a KNFSD deployment.

Hybrid connectivity is beyond the scope of this tutorial. For information about advanced topics such as considerations for deploying to a hybrid architecture, scaling for high performance, and using metrics and dashboards for troubleshooting and tuning, see Advanced workflow topics.

Objectives

  • Deploy and test a KNFSD proxy system.
  • Create and configure the following components in Google Cloud:
    • A custom disk image
    • A KNFSD proxy
    • An NFS server
    • An NFS client
  • Mount an NFS proxy on an NFS client.
  • Copy a file from the NFS server through the NFS proxy to the NFS client.

Costs

In this document, you use billable components of Google Cloud, primarily Compute Engine (VM instances, persistent disks, Local SSDs, and custom image storage).

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

For your usage, also consider network egress costs for data that's written from Google Cloud back to on-premises storage, and the cost of hybrid connectivity.

Before you begin

For this tutorial, you need a Google Cloud project. You can create a new one, or select a project that you already created:

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the Compute Engine API.

    Enable the API

  4. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  5. Authenticate in the Cloud Shell terminal:

    gcloud auth application-default login
    

    The command line guides you through completing the authorization steps.

  6. Set environment variables:

    export GOOGLE_CLOUD_PROJECT=PROJECT_NAME
    gcloud config set project $GOOGLE_CLOUD_PROJECT
    

    Replace PROJECT_NAME with the name of the project that you created or selected earlier.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Download the tutorial configuration files

  1. In Cloud Shell, clone the GitHub repository:

    cd ~/
    git clone https://github.com/GoogleCloudPlatform/knfsd-cache-utils.git
    
  2. Check out a known good version of the code (in this case, v0.9.0):

    cd ~/knfsd-cache-utils
    git checkout tags/v0.9.0
    
  3. Navigate to the image directory in your code repository:

    cd ~/knfsd-cache-utils/image
    

Configure your network

For simplicity of deployment, this tutorial uses the default VPC network. To let you use SSH to connect to various resources for configuration and monitoring purposes, this tutorial also deploys external IP addresses.

Best practices and reference architectures for VPC design are beyond the scope of this tutorial. However, when you integrate these resources into a hybrid environment, we recommend that you follow established best practices for VPC design and hybrid networking.

To configure your network, do the following:

  • In Cloud Shell, set the following variables:

    export BUILD_MACHINE_NETWORK=default
    export BUILD_MACHINE_SUBNET=default
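
The default network already includes firewall rules that allow SSH and internal traffic between the instances in this tutorial. If you deploy into a custom VPC network instead, you might need a firewall rule that allows NFS-related traffic between the proxy, the clients, and the origin server. The following command is a hypothetical example only; adjust the rule name, network, ports, and source ranges for your environment:

    gcloud compute firewall-rules create allow-nfs-internal \
      --network=$BUILD_MACHINE_NETWORK \
      --allow=tcp:111,tcp:2049,udp:111,udp:2049 \
      --source-ranges=10.128.0.0/9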
    

Create the NFS proxy build machine

In this section, you create and then sign in to a VM that acts as your NFS proxy build machine. You then run a provided installation script to update the kernel version and install all of the software that's necessary for the KNFSD proxy system. The software installation script can take a few minutes to run, but you only have to run it once.

  1. In Cloud Shell, set the following variables:

    export BUILD_MACHINE_NAME=knfsd-build-machine
    export BUILD_MACHINE_ZONE=us-central1-a
    export IMAGE_FAMILY=knfsd-proxy
    export IMAGE_NAME=knfsd-proxy-image
    
  2. Launch the VM instance:

    gcloud compute instances create $BUILD_MACHINE_NAME \
      --zone=$BUILD_MACHINE_ZONE \
      --machine-type=n1-standard-16 \
      --project=$GOOGLE_CLOUD_PROJECT \
      --image=ubuntu-2004-focal-v20220615 \
      --image-project=ubuntu-os-cloud \
      --network=$BUILD_MACHINE_NETWORK \
      --subnet=$BUILD_MACHINE_SUBNET \
      --boot-disk-size=20GB \
      --boot-disk-type=pd-ssd \
      --metadata=serial-port-enable=TRUE
    

    You might receive a warning message that indicates a disk size discrepancy. You can ignore this message.

  3. Create a tar file of the required software to install, and then copy it to your build machine:

    tar -czf resources.tgz -C resources .
    gcloud compute scp resources.tgz build@$BUILD_MACHINE_NAME: \
      --zone=$BUILD_MACHINE_ZONE \
      --project=$GOOGLE_CLOUD_PROJECT
    
  4. After the VM starts, open an SSH session to it:

    gcloud compute ssh build@$BUILD_MACHINE_NAME \
      --zone=$BUILD_MACHINE_ZONE \
      --project=$GOOGLE_CLOUD_PROJECT
    
  5. After the SSH session is established and your command line is targeting the knfsd-build-machine instance, run the installation script:

    tar -zxf resources.tgz
    sudo bash scripts/1_build_image.sh
    

    The script clones the Ubuntu kernel code repository, updates the kernel version, and installs additional software. Because the script clones a large repository, it can take a long time to complete.

  6. After the installation script finishes and displays a SUCCESS prompt, reboot the build machine:

    sudo reboot
    

    When your build machine reboots, the following messages are displayed:

    WARNING: Failed to send all data from [stdin]
    ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]
    

    These errors appear because the reboot closes the SSH connection and returns your Cloud Shell session to your host machine. You can ignore these errors.

  7. After the VM reboots, open an SSH session to it again:

    gcloud compute ssh $BUILD_MACHINE_NAME \
      --zone=$BUILD_MACHINE_ZONE \
      --project=$GOOGLE_CLOUD_PROJECT
    
  8. After the SSH session is established and your command line is targeting the knfsd-build-machine instance, check your kernel version:

    uname -r
    

    The output is similar to the following, which indicates that the kernel update was successful:

    5.13.*-gcp
    

    If the output isn't similar to the preceding example, complete this process to create the NFS proxy build machine again.

  9. Clean up the local disk and shut down the build machine:

    sudo bash /home/build/scripts/9_finalize.sh
    

    The following warnings are displayed:

    userdel: user build is currently used by process 1431
    userdel: build mail spool (/var/mail/build) not found
    

    These warnings occur because the finalize script removes the build user while you're still signed in as that user. You can ignore these warnings.

Create the custom disk image

In this section, you create a custom image from the instance. The custom image is stored in a multi-regional Cloud Storage bucket that is located in the United States.

  1. In Cloud Shell, set the following variables:

    export IMAGE_NAME=knfsd-image
    export IMAGE_DESCRIPTION="first knfsd image from tutorial"
    export IMAGE_LOCATION=us
    
  2. Create the disk image:

    gcloud compute images create $IMAGE_NAME \
      --project=$GOOGLE_CLOUD_PROJECT \
      --description="$IMAGE_DESCRIPTION" \
      --source-disk=$BUILD_MACHINE_NAME \
      --source-disk-zone=$BUILD_MACHINE_ZONE \
      --storage-location=$IMAGE_LOCATION
    
  3. After the disk image is created, delete the instance:

    gcloud compute instances delete $BUILD_MACHINE_NAME \
      --zone=$BUILD_MACHINE_ZONE
    
  4. When you're prompted to continue, enter Y.

    When you delete the $BUILD_MACHINE_NAME instance, you see a prompt that indicates that attached disks on the VM will be deleted. Because you just saved a custom image, you no longer need this temporary disk and it's safe to delete it.
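
Optionally, in Cloud Shell, you can confirm that the custom image is available in your project; the image persists after you delete the build machine:

    gcloud compute images describe $IMAGE_NAME \
      --project=$GOOGLE_CLOUD_PROJECT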

Create the NFS origin server

As mentioned earlier, this architecture is designed to connect cloud-based resources to an on-premises file server. To simplify the process in this tutorial, you create a stand-in resource that runs in your Google Cloud project to simulate this connection. You name the stand-in resource nfs-server. The software installation and setup is contained in a startup script. For more information, examine the script, ~/knfsd-cache-utils/tutorial/nfs-server/1_build_nfs-server.sh.

  1. In Cloud Shell, go to the downloaded nfs-server scripts directory:

    cd ~/knfsd-cache-utils/tutorial
    
  2. Create your stand-in NFS server:

    gcloud compute \
      --project=$GOOGLE_CLOUD_PROJECT instances create nfs-server \
      --zone=$BUILD_MACHINE_ZONE \
      --machine-type=n1-highcpu-2 \
      --maintenance-policy=MIGRATE \
      --image-family=ubuntu-2004-lts \
      --image-project=ubuntu-os-cloud \
      --boot-disk-size=100GB \
      --boot-disk-type=pd-standard \
      --boot-disk-device-name=nfs-server \
      --metadata-from-file startup-script=nfs-server-startup.sh
    

    This script can take a few minutes to complete. You might see a warning message that indicates that your disk size is under 200 GB. You can ignore this warning.
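
    Optionally, after the startup script has had time to finish, you can confirm that the server is exporting a file share. The exact export path depends on the startup script, so treat this as a general verification step:

    gcloud compute ssh nfs-server \
      --zone=$BUILD_MACHINE_ZONE \
      --project=$GOOGLE_CLOUD_PROJECT \
      --command="showmount -e localhost"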

Create the NFS proxy

In this section, you create the NFS proxy. When the proxy starts, it configures local storage, prepares mount options for the NFS server, and exports the cached results. A provided startup script orchestrates much of this workflow.

  1. In Cloud Shell, set the following variable:

    export PROXY_NAME=nfs-proxy
    
  2. Create the nfs-proxy VM:

    gcloud compute instances create $PROXY_NAME \
      --machine-type=n1-highmem-16 \
      --project=$GOOGLE_CLOUD_PROJECT \
      --maintenance-policy=MIGRATE \
      --zone=$BUILD_MACHINE_ZONE \
      --min-cpu-platform="Intel Skylake" \
      --image=$IMAGE_NAME \
      --image-project=$GOOGLE_CLOUD_PROJECT \
      --boot-disk-size=20GB \
      --boot-disk-type=pd-standard \
      --boot-disk-device-name=$PROXY_NAME \
      --local-ssd=interface=NVME \
      --local-ssd=interface=NVME \
      --local-ssd=interface=NVME \
      --local-ssd=interface=NVME \
      --metadata-from-file startup-script=proxy-startup.sh
    

    You might see a warning message that your disk size is under 200 GB. You can ignore this warning.

    The startup script configures NFS mount commands and it lets you tune the system. The settings for NFS version, sync or async, nocto, and actimeo are some of the variables that you might want to optimize through the startup script. For more information about these settings, see optimizing your NFS file system.

    The command in this step includes the --metadata-from-file flag, which loads the startup script into the instance metadata so that the script runs when the VM boots. In this tutorial, you use a simple proxy-startup.sh script. The script includes some preset variables, and it doesn't include many options that you might want to use if you integrate it into your pipeline. For more advanced use cases, see the knfsd-cache-utils GitHub repository.
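
    As an illustration of the kind of tuning that these settings control, an NFS mount that combines them might look like the following sketch. The values shown are hypothetical; the actual variable names and defaults are set in proxy-startup.sh:

    # Hypothetical values, for illustration only.
    sudo mount -t nfs -o vers=3,async,nocto,actimeo=600,fsc \
      nfs-origin:/data /srv/nfs/data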

Create the NFS client

In this step, you create a single NFS client (named nfs-client) to stand in for what, at scale, would likely be a larger managed instance group (MIG).

  • In Cloud Shell, create your NFS client:

    gcloud compute \
      --project=$GOOGLE_CLOUD_PROJECT instances create nfs-client \
      --zone=$BUILD_MACHINE_ZONE \
      --machine-type=n1-highcpu-8 \
      --network-tier=PREMIUM \
      --maintenance-policy=MIGRATE \
      --image-family=ubuntu-2004-lts \
      --image-project=ubuntu-os-cloud \
      --boot-disk-size=10GB \
      --boot-disk-type=pd-standard \
      --boot-disk-device-name=nfs-client
    

    You might see a warning message that your disk size is under 200 GB. You can ignore this warning.

Mount the NFS proxy on the NFS client

In this step, you open a separate SSH session on your NFS client and then mount the NFS proxy. You use this same shell to test the system in the next section.

  1. In the Google Cloud console, go to the VM instances page.

    Go to VM instances

  2. To connect to nfs-client, in the Connect column, click SSH.

  3. In the nfs-client SSH window, install the necessary NFS tools on the nfs-client:

    sudo apt-get install nfs-common -y
    
  4. Create a mount point and mount the NFS proxy:

    sudo mkdir /data
    sudo mount -t nfs -o vers=3 nfs-proxy:/data /data
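
    To confirm that the mount succeeded, you can check the mounted file system before you continue:

    df -h /data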
    

Test the system

All of your resources are now created. In this section, you run a test by copying a file from the NFS server through the NFS proxy to the NFS client. The first time that you run this test, the data comes from the NFS origin server, so the transfer can take over a minute.

The second time that you run this test, the data is served from a cache that's stored in the Local SSDs of the NFS proxy. In this transfer, it takes much less time to copy data, which validates that caching is accelerating the data transfer.

  1. In the nfs-client SSH window that you opened in the previous section, copy the test file and view the corresponding output:

    time dd if=/data/test.data of=/dev/null iflag=direct bs=1M status=progress
    

    The output is similar to the following, which contains a line that shows the file size, time for transfer, and transfer speeds:

    10737418240 bytes (11 GB, 10 GiB) copied, 88.5224 s, 121 MB/s
    real    1m28.533s
    

    In this transfer, the file is served from the persistent disk of the NFS server, so it's limited by the speed of the NFS server's disk.

  2. Run the same command a second time:

    time dd if=/data/test.data of=/dev/null iflag=direct bs=1M status=progress
    

    The output is similar to the following, which contains a line that shows the file size, time for transfer, and transfer speeds:

    10737418240 bytes (11 GB, 10 GiB) copied, 9.41952 s, 948 MB/s
    real    0m9.423s
    

    In this transfer, the file is served from the cache in the NFS proxy, so it completes faster.
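
    Optionally, to see evidence of cache activity on the proxy, you can inspect the kernel's FS-Cache counters from Cloud Shell (where your environment variables are still set). This check assumes that FS-Cache statistics are enabled in the kernel that the build script installed:

    gcloud compute ssh $PROXY_NAME \
      --zone=$BUILD_MACHINE_ZONE \
      --project=$GOOGLE_CLOUD_PROJECT \
      --command="cat /proc/fs/fscache/stats"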

You have now completed deployment and testing of the KNFSD caching proxy.

Advanced workflow topics

This section includes information about deploying to a hybrid architecture, scaling for high performance, and using metrics and dashboards for troubleshooting and tuning.

Performance characteristics and resource sizing

As previously noted, this tutorial uses a single KNFSD proxy. Therefore, scaling the system involves modifying your individual proxy resources to optimize for CPU, RAM, networking, storage capacity, or performance. In this tutorial, you deployed KNFSD on a single Compute Engine VM with the following options:

  • 16 vCPUs, 104 GB of RAM (n1-highmem-16).
    • With 16 vCPUs and an architecture of Sandy Bridge or newer, you enable a maximum networking speed of 32 Gbps.
  • A 20 GB standard persistent disk as a boot disk.
  • 4 Local SSD disks. This configuration provides a high-speed, 1.5 TB file system.
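
The proxy startup script assembles the Local SSDs into the cache file system for you. One common approach, shown here only as a hypothetical sketch rather than the exact method that proxy-startup.sh uses, is to stripe the four NVMe devices into a single volume:

    # Hypothetical sketch: stripe four Local SSDs into one ~1.5 TB cache volume.
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
    sudo mkfs.ext4 -F /dev/md0
    sudo mkdir -p /var/cache/fscache
    sudo mount /dev/md0 /var/cache/fscache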

Although it's beyond the scope of this tutorial, you can scale this architecture by creating multiple KNFSD proxies in a MIG and by using a TCP load balancer to manage connections between the NFS clients and the NFS proxies. For more information, see the knfsd-cache-utils GitHub repository, which contains Terraform scripts, example deployment code, and FAQs that cover scaling workloads.

Considerations for a hybrid deployment

In many deployments, the bandwidth of your connection from on-premises to the cloud is a major factor to consider when configuring the system. Hybrid connectivity is beyond the scope of this tutorial. For an overview of available options, see the hybrid connectivity documentation. For guidance on best practices and design patterns, see the series Build hybrid and multicloud architectures using Google Cloud.

Exploring metrics

Dashboards can provide useful feedback on metrics for performance tuning and overall troubleshooting. Exploring metrics is beyond the scope of this tutorial; however, a metrics dashboard is available when you deploy the multi-node system that's defined in the knfsd-cache-utils GitHub repository.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next