We have a popular iPhone app where people duel each other a la Wordfeud. We have almost 1 M registered users today.
During peak hours the app gets really long response times, and there are also quite a lot of time outs. We have tried to find the bottleneck, but have had a hard time doing so. CPU, memory and I/O are all under 50 % on all servers. The problem ONLY appears during peak hours.
Our setup
1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
The database logs doesn't show enough long request times to explain all the timeouts shown in the nginx error log.
We have also tried to build and run the app directly against the front servers (during peak hour when all other users are running against the load balancer). The surprising thing is that the app bypassing the load balancer is quick as a bullet even under peak hours.
NGINX SETTINGS
worker_processes=16
worker_connections=4096
multi_accept=on
LINUX SETTINGS
fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"
Why is the app bypassing the load balancer so fast? Can nginx as load balancer be the bottle neck? Is there any good way to compare timeouts in nginx with timeouts in the unicorns to see where the problem resides?