Optimizing Apache Tomcat to Handle 300+ Concurrent Connections with Slow Backend Services



When dealing with external web services that have response times as high as 300 seconds during peak hours, even robust EC2 instances (m2.xlarge with 34GB RAM) can struggle. The key symptoms we observed:

  • Server choking at ~300 httpd processes
  • 1000+ TCP connections in TIME_WAIT state
  • mod_jk errors showing backend connection failures

While you've already made several good tweaks, here's what actually solved the issue:

<Connector port="8009" protocol="AJP/1.3"
    maxThreads="500"
    minSpareThreads="50"
    acceptCount="300"
    connectionTimeout="600000"
    redirectPort="8443"
    enableLookups="false"/>

The missing piece was proper AJP tuning on both ends: the connector above handles the Tomcat side, while worker.properties controls the connection pool mod_jk keeps open to it. Key parameters:

# In worker.properties
# (mod_jk has no maxThreads directive; the thread count comes from
#  maxThreads="500" on the AJP connector above)
worker.tom1.connection_pool_size=200
# Both timeouts below are in seconds, not milliseconds
worker.tom1.connection_pool_timeout=600
worker.tom1.socket_timeout=600
worker.tom1.socket_keepalive=true

These settings work in conjunction with your existing sysctl.conf:

# For handling many keepalive connections
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

# For TIME_WAIT optimization
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1

For truly resilient performance with slow backends:

  1. Implement async servlets (Servlet 3.0+)
  2. Use Nginx as a reverse proxy in front of Tomcat:
# tomcat_backend is a placeholder upstream; point it at your Tomcat HTTP connectors
upstream tomcat_backend {
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}

location / {
    proxy_pass http://tomcat_backend;
    proxy_read_timeout 600s;       # must outlast the 300 s+ backend responses
    proxy_connect_timeout 60s;
    proxy_buffer_size 16k;
    proxy_buffers 4 64k;
}

Key metrics to watch:

Metric        Command                               Healthy range
Thread usage  jconsole / jvisualvm                  < 80% of maxThreads
TIME_WAIT     netstat -n | grep TIME_WAIT | wc -l   < 30% of tcp_max_tw_buckets

When dealing with external web services that take 300+ seconds to respond during peak hours, traditional Tomcat configurations simply won't cut it. The root issue isn't just connection limits; it's resource starvation caused by blocked worker threads. By Little's law, 500 worker threads each held for ~300 seconds sustain at most 500/300 ≈ 1.7 requests per second before the entire pool is tied up.

Your worker.properties is missing the per-worker connection pool configuration. The thread count itself belongs on the Tomcat AJP connector (maxThreads), since mod_jk workers have no thread directive; the worker side only needs its pool sized to match:

worker.tom1.connection_pool_size=200
worker.tom2.connection_pool_size=200
worker.loadbalancer.retries=3

The TIME_WAIT connections indicate TCP stack misconfiguration. Update /etc/sysctl.conf with:

net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 15

Your HTTP connector needs these critical NIO parameters. Note that once the executor attribute is set, the connector's own maxThreads is ignored, so the thread pool is declared in a shared <Executor>:

<Executor name="tomcatThreadPool"
    namePrefix="catalina-exec-"
    maxThreads="500"
    minSpareThreads="50"/>

<Connector port="8080"
    protocol="org.apache.coyote.http11.Http11NioProtocol"
    executor="tomcatThreadPool"
    acceptCount="1000"
    maxConnections="10000"
    processorCache="2000"
    tcpNoDelay="true"
    socketBuffer="8192"/>

For slow external services, implement async servlets:

import java.util.concurrent.CompletableFuture;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.*;
@WebServlet(urlPatterns = "/slow-service", asyncSupported = true) // mapping is a placeholder
public class AsyncServiceServlet extends HttpServlet {
  protected void doGet(HttpServletRequest request,
                       HttpServletResponse response) {
    AsyncContext asyncCtx = request.startAsync();
    asyncCtx.setTimeout(600000); // default async timeout is 30 s; match the 600 s used elsewhere
    CompletableFuture.runAsync(() -> {
      try {
        // Call the slow external service here, then write its result
        asyncCtx.getResponse().getWriter().write("result");
      } catch (Exception e) {
        // log the failure; the finally block still completes the request
      } finally {
        asyncCtx.complete(); // releases the request back to the container
      }
    });
  }
}
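
One caveat: with no executor argument, CompletableFuture.runAsync runs on the JVM-wide ForkJoinPool.commonPool(), which a few hundred 300-second calls can starve just as easily as Tomcat's own pool. A minimal sketch of handing the work to a dedicated, bounded pool instead (the class name and pool size are illustrative):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.servlet.AsyncContext;

public class SlowServiceDispatcher {
  // Bounded pool reserved for slow external calls; size it to the number of
  // backend requests you expect in flight at once (300 is illustrative).
  private final ExecutorService slowCallPool = Executors.newFixedThreadPool(300);

  public void dispatch(AsyncContext asyncCtx, Runnable slowCall) {
    CompletableFuture.runAsync(slowCall, slowCallPool)
        .whenComplete((ignored, error) -> asyncCtx.complete());
  }
}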

Enable JMX monitoring and watch these MBeans and attributes:

Catalina:type=ThreadPool,name="http-nio-8080" 
  - currentThreadCount
  - currentThreadsBusy
  
Catalina:type=GlobalRequestProcessor,name="http-nio-8080"
  - requestCount
  - errorCount
  - processingTime
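
To poll those attributes from a script rather than jconsole, a small JMX client like the sketch below works; it assumes remote JMX has been enabled on the Tomcat JVM, the service URL and port 9010 are placeholders, and the MBean name matches the http-nio-8080 connector above:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TomcatThreadPoolProbe {
  public static void main(String[] args) throws Exception {
    // Placeholder service URL; requires Tomcat started with remote JMX enabled
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection conn = connector.getMBeanServerConnection();
      ObjectName pool = new ObjectName(
          "Catalina:type=ThreadPool,name=\"http-nio-8080\"");
      int busy = ((Number) conn.getAttribute(pool, "currentThreadsBusy")).intValue();
      int max = ((Number) conn.getAttribute(pool, "maxThreads")).intValue();
      // Alert well before saturation (the "< 80% of maxThreads" rule above)
      System.out.printf("busy=%d / max=%d (%.0f%% used)%n", busy, max, 100.0 * busy / max);
    }
  }
}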

Modify your worker.properties for better failover:

worker.loadbalancer.retry_interval=30
worker.loadbalancer.recover_time=300
worker.tom1.retries=3
worker.tom2.retries=3
worker.tom1.recovery_options=3

For extreme cases, consider these architectural changes:

  • Implement circuit breakers (Hystrix/Resilience4j); see the sketch after this list
  • Add message queue buffering
  • Implement edge caching
  • Consider service mesh for external calls
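
For the circuit-breaker option, a minimal Resilience4j sketch (the class name, thresholds, and fallback value are illustrative; the point is that once the breaker opens, callers fail fast instead of holding a Tomcat thread for 300 seconds):

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class ExternalServiceClient {
  private final CircuitBreaker breaker = CircuitBreaker.of("externalService",
      CircuitBreakerConfig.custom()
          .failureRateThreshold(50)                          // open at 50% failures
          .slowCallDurationThreshold(Duration.ofSeconds(60)) // calls over 60 s count as slow
          .slowCallRateThreshold(80)                         // open when 80% of calls are slow
          .waitDurationInOpenState(Duration.ofSeconds(30))   // fail fast for 30 s, then probe
          .build());

  public String fetch() {
    try {
      return breaker.executeSupplier(this::callSlowBackend);
    } catch (CallNotPermittedException e) {
      // Breaker is open: return a cached/degraded response instead of blocking a thread
      return "fallback";
    }
  }

  private String callSlowBackend() {
    // Placeholder for the real external web service call
    return "real response";
  }
}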