The examples in this section show the commands we recommend for evaluating performance with the IOR benchmark tool, which is available on GitHub.
Before installing IOR, install MPI, which synchronizes the benchmarking processes. We recommend the HPC Image for client VMs; it includes tooling to install Intel MPI 2021. For Ubuntu clients, we recommend Open MPI.
Check network performance
Before running IOR it may be helpful to ensure your network has the expected throughput. If you have two client VMs, you can use a tool called iperf to test the network between them.
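Install iperf on both VMs, using the command that matches the client's package manager:
# Rocky Linux and other dnf-based images
sudo dnf -y install iperf
# Ubuntu and other apt-based images
sudo apt install -y iperf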
Start an iperf server on one of your VMs:
iperf -s -w 100m -P 30
Start an iperf client on the other VM:
iperf -c <IP ADDRESS OF iperf server VM> -w 100m -t 30s -P 30
Observe the network throughput number between the VMs. For the highest single-client performance, ensure that Tier_1 networking is used.
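With -P 30 the client prints one line per stream, so the aggregate figure is easy to miss. The following is a minimal sketch for pulling it out of saved client output. It assumes iperf version 2, which prints a summary line starting with [SUM] whose last two fields are the bandwidth value and unit; the function name and the iperf.log file name are illustrative, not part of iperf.

```shell
# Print the aggregate throughput from iperf (version 2) client output.
# Reads the named files, or standard input when no file is given.
# If several [SUM] lines are present (interval reports), the last one wins.
iperf_sum_bandwidth() {
  awk '/^\[SUM\]/ { value = $(NF-1); unit = $NF }
       END { if (value != "") print value, unit }' "$@"
}

# Example use on the client VM (hypothetical log file name):
#   iperf -c <IP ADDRESS OF iperf server VM> -w 100m -t 30s -P 30 | tee iperf.log
#   iperf_sum_bandwidth iperf.log
```

If the function prints nothing, the output contained no [SUM] line, which usually means only one stream ran.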
Single VM performance
The following instructions provide steps and benchmarks to measure single VM performance. The tests run multiple I/O processes into and out of Parallelstore with the intention of saturating the network interface card (NIC).
Install Intel MPI
sudo google_install_intelmpi --impi_2021
To specify the correct libfabric networking stack, set the following variable in your environment:
source /opt/intel/
On Ubuntu clients, install Open MPI and the build tools instead:
sudo apt install -y autoconf
sudo apt install -y pkg-config
sudo apt install -y libopenmpi-dev
sudo apt install -y make
Install IOR
To install IOR:
git clone
cd ior
sudo make install
Run the IOR commands
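Run the following IOR commands. Each scenario lists two variants of the same test: the first uses Intel MPI options (-genv, -ppn) and the second uses Open MPI options (-x, -n). To view expected performance numbers, see the Parallelstore overview.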
Max performance from a single client VM
mpirun -genv LD_PRELOAD="/usr/lib64/" -ppn 1 \
--bind-to socket ior \
-o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "1m" -b "8g"
mpirun --oversubscribe -x LD_PRELOAD="/usr/lib64/" -n 1 \
ior -o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "1m" -b "8g"
The commands use the following options:
- ior: actual benchmark. Ensure it is available in the path or provide the full path.
- -ppn: the number of processes (jobs) to run. We recommend starting with 1 and then increasing up to the number of vCPUs to achieve max aggregate performance.
- -O useO_DIRECT=1: force the use of direct I/O to bypass the page cache and avoid reading cached data.
- -genv LD_PRELOAD="/usr/lib64/": use the DAOS interception library. This option delivers the highest raw performance but bypasses the Linux page cache for data. Metadata is still cached.
- -w: perform writes to individual files.
- -r: perform reads.
- -e: perform fsync upon completion of writes.
- -F: use individual files.
- -t "1m": read and write data in chunks of the specified size. Larger chunk sizes result in better single-thread streaming I/O performance.
- -b "8g": size of each file.
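The recommendation to start at one process and increase up to the number of vCPUs can be scripted as a sweep. The sketch below only generates the process counts; the helper name ppn_sweep is hypothetical, and doubling at each step is an assumption chosen to keep the number of runs small.

```shell
# Print process counts 1, 2, 4, ... followed by MAX itself,
# where MAX is the number of vCPUs on the client VM.
ppn_sweep() {
  local max="$1" n=1 out=""
  while [ "$n" -lt "$max" ]; do
    out="$out$n "
    n=$((n * 2))
  done
  echo "$out$max"
}

ppn_sweep 8    # prints: 1 2 4 8
ppn_sweep 30   # prints: 1 2 4 8 16 30

# Example use on a client VM:
#   for n in $(ppn_sweep "$(nproc)"); do
#     # run the single-client ior command with -ppn "$n" (Intel MPI)
#     # or -n "$n" (Open MPI) and record the reported bandwidth
#   done
```

Aggregate throughput typically stops improving once the NIC is saturated, so the sweep can end at the first count that no longer raises the result.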
Max IOPS from a single client VM
mpirun -genv LD_PRELOAD="/usr/lib64/" -ppn 80 \
--bind-to socket ior \
-o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "4k" -b "1g"
mpirun --oversubscribe -x LD_PRELOAD="/usr/lib64/" -n 80 \
ior -o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "4k" -b "1g"
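The -t and -b values determine how much work each run does, which helps when sizing a test against available capacity. The following sketch works through that arithmetic. It assumes IOR interprets the k, m, and g suffixes as powers of 1024 and that each process writes one file of size -b, as described above; the function names are illustrative.

```shell
# Convert an IOR size such as 4k, 1m, or 8g to bytes.
ior_bytes() {
  case "$1" in
    *[kK]) echo $(( ${1%?} * 1024 )) ;;
    *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
    *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
    *)     echo "$1" ;;
  esac
}

# Transfers each process issues per file: block size / transfer size.
ior_transfers_per_file() {
  echo $(( $(ior_bytes "$2") / $(ior_bytes "$1") ))
}

# Total GiB written by all processes: process count * block size.
ior_total_gib() {
  echo $(( $1 * $(ior_bytes "$2") / 1024 / 1024 / 1024 ))
}

ior_transfers_per_file 4k 1g   # prints: 262144
ior_total_gib 80 1g            # prints: 80
```

For the IOPS test above, 80 processes each issue 262144 transfers of 4 KiB and write 80 GiB in total.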
Max performance from a single application thread
mpirun -genv LD_PRELOAD="/usr/lib64/" -ppn 1 \
--bind-to socket ior \
-o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "32m" -b "64g"
mpirun -x LD_PRELOAD="/usr/lib64/" -n 1 \
ior -o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "32m" -b "64g"
Small I/O latency from a single application thread
mpirun -genv LD_PRELOAD="/usr/lib64/" -ppn 1 \
--bind-to socket ior \
-o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-z -w -r -e -F -t "4k" -b "100m"
mpirun -x LD_PRELOAD="/usr/lib64/" -n 1 \
ior -o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-z -w -r -e -F -t "4k" -b "100m"
Multi-VM performance tests
To reach the limits of a Parallelstore instance, test the aggregate I/O achievable with parallel I/O from multiple VMs. The instructions in this section describe how to do this using mpirun and ior.
See the IOR guide for the full set of options that are useful to test on a larger set of nodes. There are several ways to launch client VMs for multi-client testing, including schedulers such as Batch and Slurm, or the Compute Engine bulk commands. The HPC Toolkit can also help build templates to deploy compute nodes.
This guide uses the following steps to deploy multiple client instances configured to use Parallelstore:
- Create an SSH key to use to set up a user on each client VM. You must disable the OS Login requirement on the project if it has been enabled.
- Get the access points of the Parallelstore instance.
- Create a startup script to deploy to all client instances.
- Bulk create the Compute Engine VMs using the startup script and key.
- Copy the necessary keys and host files needed to run the tests.
Details for each step are in the following sections.
Set environment variables
The following environment variables are used in the example commands in this document:
export SSH_USER="daos-user"
export CLIENT_PREFIX="daos-client-vm"
export NUM_CLIENTS=10
Update these to your desired values.
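Later commands fail in confusing ways if one of these variables is empty, for example by creating VMs with an empty name prefix. A small guard such as the following sketch can catch that early. The function name require_vars is hypothetical.

```shell
# Return non-zero and name the first variable that is unset or empty.
require_vars() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
}

# Example use before running the remaining steps:
#   require_vars SSH_USER CLIENT_PREFIX NUM_CLIENTS || exit 1
```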
Create an SSH key
Create an SSH key and save it locally so that it can be distributed to the client VMs. The key is associated with the SSH user specified in the environment variables; that user is created on each VM:
# Generate an SSH key for the specified user
ssh-keygen -t rsa -b 4096 -C "${SSH_USER}" -N '' -f "./id_rsa"
chmod 600 "./id_rsa"
#Create a new file in the format [user]:[public key] user
echo "${SSH_USER}:$(cat "./") ${SSH_USER}" > "./keys.txt"
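Because a malformed keys file only shows up later as a failed SSH login, it can be worth checking the file before creating VMs. The sketch below checks that every line starts with the user name and an ssh-rsa key (the key type generated above) and ends with the user name. The function name and the exact pattern are assumptions for illustration.

```shell
# Succeed only if FILE is non-empty and every line looks like
# "USER:ssh-rsa <base64 key> [comment] USER".
check_keys_file() {
  local user="$1" file="$2"
  [ -s "$file" ] &&
    ! grep -qvE "^${user}:ssh-rsa [A-Za-z0-9+/=]+( .*)? ${user}\$" "$file"
}

# Example use:
#   check_keys_file "${SSH_USER}" ./keys.txt || echo "keys.txt is malformed" >&2
```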
Get Parallelstore network details
Get the Parallelstore server IP addresses in a format consumable by the daos agent:
export ACCESS_POINTS=$(gcloud beta parallelstore instances describe INSTANCE_NAME \
--location LOCATION \
--format "value[delimiter=', '](format('{0}', accessPoints))")
Get the network name associated with the Parallelstore instance:
export NETWORK=$(gcloud beta parallelstore instances describe INSTANCE_NAME \
--location LOCATION \
--format "value[delimiter=', '](format('{0}', network))" | awk -F '/' '{print $NF}')
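The awk stage in the command above keeps only the last path component of the network resource name. A standalone sketch of that step follows; the sample path is a hypothetical value, and the function name is illustrative.

```shell
# Print the last "/"-separated component of each input line.
last_path_component() {
  awk -F '/' '{ print $NF }'
}

# Hypothetical sample value; real input comes from the gcloud command.
echo "projects/my-project/global/networks/default" | last_path_component
# prints: default
```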
Create the startup script
The startup script is attached to the VM and will be run every time the system starts. The startup script does the following:
- Configures the daos agent
- Installs required libraries
- Mounts your Parallelstore instance to /tmp/parallelstore on each VM
- Installs performance testing tools
This script can be used to deploy your custom applications to multiple machines. Edit the section of the script that contains application-specific code.
The following script works on VMs running HPC Rocky 8.
# Create a startup script that configures the VM
cat > ./startup-script << EOF
sudo tee /etc/yum.repos.d/parallelstore-v2-6-el8.repo << INNEREOF
name=Parallelstore EL8 v2.6
INNEREOF
sudo dnf makecache
# 2) Install daos-client
dnf install -y epel-release # needed for capstone
dnf install -y daos-client
# 3) Upgrade libfabric
dnf upgrade -y libfabric
systemctl stop daos_agent
mkdir -p /etc/daos
cat > /etc/daos/daos_agent.yml << INNEREOF
access_points: ${ACCESS_POINTS}
transport_config:
  allow_insecure: true
fabric_ifaces:
- numa_node: 0
  devices:
  - iface: eth0
    domain: eth0
INNEREOF
echo -e "Host *\n\tStrictHostKeyChecking no\n\tUserKnownHostsFile /dev/null" > /home/${SSH_USER}/.ssh/config
chmod 600 /home/${SSH_USER}/.ssh/config
usermod -u 2000 ${SSH_USER}
groupmod -g 2000 ${SSH_USER}
chown -R ${SSH_USER}:${SSH_USER} /home/${SSH_USER}
chown -R daos_agent:daos_agent /etc/daos/
systemctl enable daos_agent
systemctl start daos_agent
mkdir -p /tmp/parallelstore
dfuse -m /tmp/parallelstore --pool default-pool --container default-container --disable-wb-cache --thread-count=16 --eq-count=8 --multi-user
chmod 777 /tmp/parallelstore
#Application specific code
#Install Intel MPI:
sudo google_install_intelmpi --impi_2021
source /opt/intel/
#Install IOR
git clone
cd ior
make install
EOF
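The startup script is written through a heredoc with an unquoted delimiter (<< EOF), so variables such as ${ACCESS_POINTS} and ${SSH_USER} are substituted on your workstation when the file is created, not on the VM at boot. The following sketch shows the difference between an unquoted and a quoted delimiter. The ACCESS_POINTS value is a hypothetical placeholder.

```shell
# Hypothetical value for illustration only.
ACCESS_POINTS="10.0.0.1, 10.0.0.2"

# Unquoted delimiter: ${ACCESS_POINTS} is replaced when the heredoc is read.
expanded=$(cat << EOF
access_points: ${ACCESS_POINTS}
EOF
)

# Quoted delimiter: the text is kept literally.
literal=$(cat << 'EOF'
access_points: ${ACCESS_POINTS}
EOF
)

echo "$expanded"   # prints: access_points: 10.0.0.1, 10.0.0.2
echo "$literal"    # prints: access_points: ${ACCESS_POINTS}
```

One consequence is that any variable meant to be evaluated on the VM itself would need its dollar sign escaped as \$ inside the unquoted heredoc.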
Create the client VMs
The overall performance of your workloads depends on the client machine types. The following example uses c2-standard-30 VMs; modify the machine-type value to increase performance with faster NICs. See the Machine families resource and comparison guide for details of the available machine types.
To create VM instances in bulk, use the gcloud compute instances bulk create command:
gcloud compute instances bulk create \
--name-pattern="${CLIENT_PREFIX}-####" \
--zone="LOCATION" \
--machine-type="c2-standard-30" \
--network-interface=subnet=${NETWORK},nic-type=GVNIC \
--network-performance-configs=total-egress-bandwidth-tier=TIER_1 \
--create-disk=auto-delete=yes,boot=yes,device-name=client-vm1,image=projects/cloud-hpc-image-public/global/images/hpc-rocky-linux-8-v20240126,mode=rw,size=100,type=pd-balanced \
--metadata=enable-oslogin=FALSE \
--metadata-from-file=ssh-keys=./keys.txt,startup-script=./startup-script \
--count ${NUM_CLIENTS}
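The name pattern replaces each run of # characters with a zero-padded sequence number. The sketch below lists the names to expect, which can be compared against the created instances. It assumes numbering starts at 0001, as in the gcloud documentation's examples; the function name is hypothetical.

```shell
# Print the names expected for COUNT instances created from "PREFIX-####".
# Assumption: the sequence starts at 0001.
expected_client_names() {
  local prefix="$1" count="$2" i=1
  while [ "$i" -le "$count" ]; do
    printf '%s-%04d\n' "$prefix" "$i"
    i=$((i + 1))
  done
}

expected_client_names daos-client-vm 2
# prints:
# daos-client-vm-0001
# daos-client-vm-0002
```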
Copy keys and files
Retrieve and save the private and public IP addresses for all VMs.
Private IPs:
gcloud compute instances list --filter="name ~ '^${CLIENT_PREFIX}*'" --format="csv[no-heading](INTERNAL_IP)" > hosts.txt
Public IPs:
gcloud compute instances list --filter="name ~ '^${CLIENT_PREFIX}*'" --format="csv[no-heading](EXTERNAL_IP)" > external_ips.txt
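Both files feed later steps, so a missing or blank entry (for example, a VM without an external IP) causes failures that are hard to trace. The following sketch checks that a file holds the expected number of IPv4 addresses. The function name is hypothetical, and the pattern checks only the dotted-quad shape, not the numeric range.

```shell
# Succeed if FILE holds exactly EXPECTED non-empty lines and
# every one of them has the shape of an IPv4 address.
check_hosts_file() {
  local file="$1" expected="$2" total valid
  total=$(grep -c . "$file" || true)
  valid=$(grep -cE '^([0-9]{1,3}\.){3}[0-9]{1,3}$' "$file" || true)
  [ "$total" -eq "$expected" ] && [ "$valid" -eq "$expected" ]
}

# Example use:
#   check_hosts_file hosts.txt "${NUM_CLIENTS}" || echo "hosts.txt is incomplete" >&2
#   check_hosts_file external_ips.txt "${NUM_CLIENTS}" || echo "external_ips.txt is incomplete" >&2
```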
Copy the private key to allow for inter-node passwordless SSH. This is required for the IOR test using SSH to orchestrate machines.
while IFS= read -r IP
do
  echo "Copying id_rsa to ${SSH_USER}@$IP"
  scp -i ./id_rsa -o StrictHostKeyChecking=no ./id_rsa ${SSH_USER}@$IP:~/.ssh/
done < "./external_ips.txt"
Retrieve the IP of the first node, and copy the list of internal IPs to that node. This will be the head node for the test run.
export HEAD_NODE=$(head -n 1 ./external_ips.txt)
scp -i ./id_rsa -o "StrictHostKeyChecking=no" -o UserKnownHostsFile=/dev/null ./hosts.txt ${SSH_USER}@${HEAD_NODE}:~
Run IOR commands on multiple VMs
Connect to the head node with the specified user:
ssh -i ./id_rsa -o "StrictHostKeyChecking=no" -o UserKnownHostsFile=/dev/null ${SSH_USER}@${HEAD_NODE}
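On the head node, load the Intel MPI environment, remove old client logs, and set the DAOS client log file:
source /opt/intel/
rm -f /tmp/client.log.*
export D_LOG_FILE=/tmp/client.log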
Max performance from multiple client VMs
Test performance in a multi-process, maximum throughput scenario.
mpirun -f hosts.txt -genv LD_PRELOAD="/usr/lib64/" -ppn 30 \
--bind-to socket ior \
-o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "1m" -b "8g"
Max IOPS from multiple client VMs
Test performance in a multi-process, maximum IOPS scenario.
mpirun -f hosts.txt -genv LD_PRELOAD="/usr/lib64/" -ppn 30 \
--bind-to socket ior \
-o "/tmp/parallelstore/test" -O useO_DIRECT=1 \
-w -r -e -F -t "4k" -b "1g"
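To compare runs, it helps to extract the headline figures from saved IOR output. The sketch below assumes the IOR 3.x summary format, in which the output contains lines such as "Max Write: 2234.07 MiB/sec (2342.59 MB/sec)". It also converts a bandwidth figure to IOPS by dividing by the transfer size. The function names and the ior.log file name are illustrative.

```shell
# Print the operation, value, and unit from IOR "Max Write:" and
# "Max Read:" summary lines. Reads files or standard input.
ior_max_bandwidth() {
  awk '/^Max (Write|Read):/ { sub(":", "", $2); print $2, $3, $4 }' "$@"
}

# Convert bandwidth in MiB/sec to IOPS for a transfer size in KiB.
iops_from_bandwidth() {
  awk -v mib="$1" -v kib="$2" 'BEGIN { printf "%.0f\n", mib * 1024 / kib }'
}

iops_from_bandwidth 100 4   # prints: 25600

# Example use on the head node (hypothetical log file name):
#   mpirun ... ior ... | tee ior.log
#   ior_max_bandwidth ior.log
```

For the 4k test above, 100 MiB/sec corresponds to 25600 operations per second.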
Unmount the DAOS container:
sudo umount /tmp/parallelstore/
Delete the Parallelstore instance:
gcloud beta parallelstore instances delete INSTANCE_NAME --location=LOCATION
curl -X DELETE -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"
PROJECT_ID/locations/LOCATION/instances/INSTANCE_NAME
Delete the Compute Engine VMs: