Introduction
If your app experiences higher-than-usual latency, poor throughput, or timeouts with the Java Datastore client, the issue might stem from the client-side gRPC configuration rather than the Firestore/Datastore backend. This guide helps you diagnose and resolve common issues caused by client-side throttling, improper channel pool settings, and excessive channel churn.
Diagnosis
When using Java clients, enabling verbose gRPC and gax-java logging is crucial for monitoring channel pool dynamics and for diagnosing throttling, connection reuse problems, and excessive channel churn.
Enable logging for Java clients
To enable detailed logging, modify the logging.properties file as follows:
## Tracks the lifecycle events of each gRPC channel
io.grpc.ChannelLogger.level=FINEST
## Tracks channel pool events (expansion, shrinking) at the GAX level
com.google.api.gax.grpc.ChannelPool.level=FINEST
Additionally, update the output logging level in logging.properties to capture these logs:
# This could be changed to a file or other log output
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINEST
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
Apply the configuration
The logging.properties file can be applied using one of two methods:
1. Via a JVM system property
Add this argument when starting the Java application:
-Djava.util.logging.config.file=/path/to/logging.properties
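For example, assuming the application is packaged as an executable JAR named my-app.jar (a placeholder name), the full startup command would look like this:
java -Djava.util.logging.config.file=/path/to/logging.properties -jar my-app.jar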
2. Load programmatically
This method is useful for integration tests or applications where logging configuration is managed within the code. Ensure this code runs early in the application's lifecycle.
import java.io.InputStream;
import java.util.logging.LogManager;

// Run this early in the application's lifecycle, before any loggers are created.
LogManager logManager = LogManager.getLogManager();
try (InputStream is = Main.class.getResourceAsStream("/logging.properties")) {
  logManager.readConfiguration(is); // throws IOException; handle or declare it
}
Logging example
After enabling verbose logging, you'll see a mix of messages from com.google.api.gax.grpc.ChannelPool and io.grpc.ChannelLogger:
Under-provisioned channels, which triggered channel pool expansion:
09:15:30.123 [pool-1-thread-1] DEBUG com.google.api.gax.grpc.ChannelPool - Detected throughput peak of 40, expanding channel pool size: 4 -> 6.
09:15:30.124 [grpc-nio-worker-ELG-1-5] DEBUG io.grpc.ChannelLogger - [Channel<5>: (datastore.googleapis.com:443)] Entering IDLE state
09:15:30.124 [grpc-nio-worker-ELG-1-5] DEBUG io.grpc.ChannelLogger - [Channel<6>: (datastore.googleapis.com:443)] Entering IDLE state
09:15:30.125 [grpc-nio-worker-ELG-1-5] TRACE io.grpc.ChannelLogger - [Channel<5>: (datastore.googleapis.com:443)] newCall() called
09:15:30.126 [grpc-nio-worker-ELG-1-5] DEBUG io.grpc.ChannelLogger - [Channel<5>: (datastore.googleapis.com:443)] Entering CONNECTING state
09:15:30.127 [grpc-nio-worker-ELG-1-5] DEBUG io.grpc.ChannelLogger - [Channel<5>: (datastore.googleapis.com:443)] Entering READY state with picker: Picker{result=PickResult{subchannel=Subchannel<7>: (datastore.googleapis.com:443), streamTracerFactory=null, status=Status{code=OK, description=null, cause=null}, drop=false, authority-override=null}}
09:15:31.201 [grpc-nio-worker-ELG-1-6] TRACE io.grpc.ChannelLogger - [Channel<6>: (datastore.googleapis.com:443)] newCall() called
09:15:31.202 [grpc-nio-worker-ELG-1-6] DEBUG io.grpc.ChannelLogger - [Channel<6>: (datastore.googleapis.com:443)] Entering CONNECTING state
09:15:31.203 [grpc-nio-worker-ELG-1-6] DEBUG io.grpc.ChannelLogger - [Channel<6>: (datastore.googleapis.com:443)] Entering READY state with picker: Picker{result=PickResult{subchannel=Subchannel<8>: (datastore.googleapis.com:443), streamTracerFactory=null, status=Status{code=OK, description=null, cause=null}, drop=false, authority-override=null}}
Over-provisioned channels, which triggered channel pool shrinking:
09:13:59.609 [grpc-nio-worker-ELG-1-4] DEBUG io.grpc.ChannelLogger - [Channel<21>: (datastore.googleapis.com:443)] Entering READY state with picker: Picker{result=PickResult{subchannel=Subchannel<23>: (datastore.googleapis.com:443), streamTracerFactory=null, status=Status{code=OK, description=null, cause=null}, drop=false, authority-override=null}}
09:14:01.998 [pool-1-thread-1] DEBUG com.google.api.gax.grpc.ChannelPool - Detected throughput drop to 0, shrinking channel pool size: 8 -> 6.
09:14:01.999 [pool-1-thread-1] TRACE io.grpc.ChannelLogger - [Channel<13>: (datastore.googleapis.com:443)] shutdown() called
09:14:01.999 [pool-1-thread-1] DEBUG io.grpc.ChannelLogger - [Channel<13>: (datastore.googleapis.com:443)] Entering SHUTDOWN state
09:14:01.999 [pool-1-thread-1] DEBUG io.grpc.ChannelLogger - [Channel<13>: (datastore.googleapis.com:443)] Terminated
09:14:01.999 [pool-1-thread-1] TRACE io.grpc.ChannelLogger - [Channel<15>: (datastore.googleapis.com:443)] shutdown() called
09:14:01.999 [pool-1-thread-1] DEBUG io.grpc.ChannelLogger - [Channel<15>: (datastore.googleapis.com:443)] Entering SHUTDOWN state
09:14:01.999 [pool-1-thread-1] DEBUG io.grpc.ChannelLogger - [Channel<15>: (datastore.googleapis.com:443)] Terminated
These log entries are useful for:
- Monitoring channel resizing decisions
- Identifying channel reuse versus churn
- Spotting transport connectivity issues
Interpreting logs for symptom diagnosis
When troubleshooting performance issues like high latency or errors, start by enabling logging for com.google.api.gax.grpc.ChannelPool and io.grpc.ChannelLogger. Then, match the symptoms you're observing with the common scenarios in this guide to interpret the logs and find a resolution.
Symptom 1: High latency at startup or during traffic spikes
You may notice that the first few requests after your application starts are slow, or that latency increases dramatically whenever traffic suddenly increases.
Possible cause
This latency often happens when your channel pool is under-provisioned. Each gRPC channel can handle a limited number of concurrent requests (100, a limit enforced by Google middleware). Once this limit is reached, new RPCs are queued on the client side, waiting for an available slot. This queuing is the primary source of latency.
While the channel pool is designed to adapt to changes in load, its resizing logic runs periodically (by default, every minute) and expands gradually (by default, adding at most 2 channels at a time). There is therefore an inherent delay between the initial traffic spike and the pool's expansion, and latency occurs during this window while new requests wait either for the saturated channels to clear or for the pool to add channels in its next resize cycle.
Logs to look for
Look for logs indicating that the channel pool is expanding. This is the primary evidence that the client is reacting to a traffic peak that has exceeded its current capacity.
Channel Pool Expansion Log:
[pool-1-thread-1] DEBUG com.google.api.gax.grpc.ChannelPool - Detected throughput peak of 40, expanding channel pool size: 4 -> 6.
Interpretation: The pool has detected a traffic spike and is opening new connections to handle the increased load. Frequent expansion, especially near startup, is a clear sign that your initial configuration isn't sufficient for your typical workload.
How to resolve
The goal is to have enough channels ready before the traffic arrives, preventing the client from having to create them under pressure.
Increase initialChannelCount (high latency at startup):
If you observe high latency at startup, the most effective solution is to increase the initialChannelCount in the ChannelPoolSettings. By default, this value is set to 10, which could be too low for applications that handle significant traffic at startup.
To find the right value for the application, you should:
- Enable FINEST level logging for com.google.api.gax.grpc.ChannelPool.
- Run your application under a typical starting workload or traffic spike.
- Observe the channel pool expansion logs to see what size the pool stabilizes at.
For example, if you see logs like this:
[pool-1-thread-1] DEBUG com.google.api.gax.grpc.ChannelPool - Detected throughput peak of 80, expanding channel pool size: 4 -> 6.
and then, a minute later, you see logs like this:
[pool-1-thread-1] DEBUG com.google.api.gax.grpc.ChannelPool - Detected throughput peak of 95, expanding channel pool size: 6 -> 8.
If the channel pool size stabilizes at 8 channels and no further expansion logs appear, this indicates that your application needed 8 channels to handle its workload. A good starting point would be to set your initialChannelCount to 8. This ensures the connections are established upfront, reducing or eliminating latency caused by on-the-fly scaling.
The goal is to ensure the pool has enough capacity to handle peak load without queuing requests.
Increase minChannelCount (traffic spikes):
If you experience high latency specifically at the beginning of traffic spikes, it's a sign that your traffic is increasing faster than the channel pool can reactively create new connections. This is common for applications with predictable, sudden bursts of traffic.
During normal traffic, the pool stabilizes at a small number of channels and adds more as traffic increases to prevent saturation. This scaling process takes time, so the initial requests in a burst are queued, which results in high latency.
Increasing minChannelCount sets a higher baseline of permanently open channels. This ensures enough capacity is already available to handle the initial burst, eliminating the scaling delay and the associated latency spike.
Increase maxChannelCount (traffic spikes):
If you observe high latency during traffic spikes and your logs show the channel pool is consistently at its maximum size (maxChannelCount), the current limit is likely too low for your peak traffic. By default, maxChannelCount is set to 200. To determine a better value, use the connection pool configuration guide to calculate the optimal number of connections based on your application's peak Queries Per Second (QPS) and average request latency.
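As a rough sketch of how these three settings fit together, the snippet below builds a ChannelPoolSettings object with gax-grpc and attaches it to an InstantiatingGrpcChannelProvider. The channel counts are placeholder values to be tuned from your own expansion logs, and the exact way the provider is wired into the Datastore client depends on the client library version.
import com.google.api.gax.grpc.ChannelPoolSettings;
import com.google.api.gax.grpc.InstantiatingGrpcChannelProvider;

// Placeholder values; derive them from your expansion logs and peak traffic.
ChannelPoolSettings poolSettings =
    ChannelPoolSettings.builder()
        .setInitialChannelCount(8)   // channels opened eagerly at startup
        .setMinChannelCount(4)       // floor the pool never shrinks below
        .setMaxChannelCount(200)     // ceiling for peak traffic
        .build();

InstantiatingGrpcChannelProvider channelProvider =
    InstantiatingGrpcChannelProvider.newBuilder()
        .setChannelPoolSettings(poolSettings)
        .build();

// How channelProvider is passed to the Datastore client depends on the client
// library version; check the DatastoreOptions builder for the corresponding setter.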
Symptom 2: Intermittent timeouts or RPC failures
The application runs fine most of the time but occasionally suffers from intermittent timeouts or failed RPCs, even during periods of normal traffic.
Possible cause: network instability
Connections between the client and the Datastore service are being dropped. The gRPC client will try to reconnect automatically, but this process can cause temporary failures and latency.
Logs to look for
The most critical log for diagnosing network problems is TRANSIENT_FAILURE.
- Transient Failure Log:
[grpc-nio-worker-ELG-1-7] DEBUG io.grpc.ChannelLogger - [Channel<9>: (datastore.googleapis.com:443)] Entering TRANSIENT_FAILURE state
- Interpretation: This log is a major red flag indicating the channel has lost its connection. A single, isolated failure might just be a minor network blip. However, if you see these messages frequently, or a channel gets stuck in this state, it points to a significant underlying issue.
How to resolve
Investigate your network environment. Check for issues with firewalls, proxies, routers, or general network instability between the application and datastore.googleapis.com.
Symptom 3: High latency with over 20,000 concurrent requests per client
This specific symptom applies to applications running at a very high scale, typically when a single client instance must handle more than 20,000 concurrent requests. The application performs well under normal conditions, but as traffic increases to more than 20,000 concurrent requests per client, you observe a sharp and sudden degradation in performance. Latency jumps significantly and remains high for the entire duration of the peak traffic period. This occurs because the client's connection pool has reached its maximum size and cannot scale out further.
Possible cause
The client is suffering from channel saturation because the channel pool has reached its configured maxChannelCount. By default, the pool is configured with a limit of 200 channels. Since each gRPC channel can handle up to 100 concurrent requests, this limit is only reached when a single client instance is processing approximately 20,000 requests simultaneously.
Once the pool hits this ceiling, it cannot create more connections. All 200 channels become overloaded, and new requests are queued on the client side, causing the sharp spike in latency.
The following logs can help confirm that the maxChannelCount has been reached.
Logs to look for
The key indicator is observing that the channel pool stops expanding right at its configured limit, even as the load continues.
- Logs: You will see earlier logs of the pool expanding. With default settings, the last expansion log you see before latency spikes will be the one where the pool reaches 200 channels:
[pool-1-thread-1] DEBUG com.google.api.gax.grpc.ChannelPool - ... expanding channel pool size: 198 -> 200.
- Indicator: During the period of high latency, you will see no further "expanding channel pool size" logs. The absence of these logs, combined with the high latency, is a strong indicator that the maxChannelCount limit has been reached.
How to resolve
The goal is to ensure the pool has enough capacity to handle peak load without queuing requests.
- Increase maxChannelCount: The primary solution is to increase the maxChannelCount setting to a value that can support the application's peak traffic. Refer to the connection pool configuration guide to calculate the optimal number of connections based on the application's peak Queries Per Second (QPS) and average request latency; a rough sizing example follows this list.
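As a rough back-of-the-envelope estimate (the connection pool configuration guide remains the authoritative reference), the number of in-flight requests is approximately peak QPS multiplied by average request latency in seconds, and each channel handles up to 100 concurrent requests. For example, a client serving a peak of 50,000 QPS with an average latency of 60 ms has roughly 50,000 × 0.06 = 3,000 requests in flight, which needs about 3,000 ÷ 100 = 30 channels; set maxChannelCount comfortably above that estimate.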
Appendix
The following sections provide supplementary information to aid in troubleshooting.
Understand channel states
The following channel states can appear in the logs, providing insights into connection behavior:
| State | Description |
|---|---|
| IDLE | The channel is created but has no active connections or RPCs. It's waiting for traffic. |
| CONNECTING | The channel is actively trying to establish a new network transport (connection) to the gRPC server. |
| READY | The channel has an established and healthy transport and is ready to send RPCs. |
| TRANSIENT_FAILURE | The channel encountered a recoverable failure (e.g., network blip, temporary server unavailability). It will automatically attempt to reconnect. |
| SHUTDOWN | The channel has been closed, either manually (e.g., shutdown() called) or due to an idle timeout. No new RPCs can be initiated. |
Tips
- If you use a structured logging framework like SLF4J or Logback, you must configure equivalent log levels in logback.xml or other logger configuration files; the java.util.logging levels are mapped by the logging facade. A sketch of a logback.xml configuration follows below.
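As a minimal sketch, assuming the jul-to-slf4j bridge is installed so that java.util.logging output from gRPC and gax reaches Logback, the equivalent logger levels would look roughly like this (FINEST on the java.util.logging side maps to TRACE); appenders and the root logger are omitted for brevity:
<configuration>
  <logger name="com.google.api.gax.grpc.ChannelPool" level="TRACE"/>
  <logger name="io.grpc.ChannelLogger" level="TRACE"/>
</configuration>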