I am running a fairly large-scale Node.js 0.8.8 app using Cluster with 16 worker processes on a 16-processor box with hyperthreading (so 32 logical cores). We are finding that since moving to the Linux 3.2.0 kernel (from 2.6.32), the balancing of incoming requests between the worker child processes is heavily weighted toward 5 or so processes, with the other 11 doing very little work. This may be more efficient for throughput, but it appears to increase request latency and is not optimal for us, because many of these are long-lived websocket connections that can all start doing work at the same time.
The child processes are all accepting on a single shared socket (using epoll), and while this problem has a fix in Node 0.9 (https://github.com/bnoordhuis/libuv/commit/be2a2176ce25d6a4190b10acd1de9fd53f7a6275), that fix does not seem to help in our tests. Is anyone aware of kernel tuning parameters or build options that could help, or are we best off moving back to the 2.6 kernel or load-balancing across the worker processes using a different approach?
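For context, the "different approach" we have in mind is accepting in the master and handing connections to the workers round-robin, instead of letting the workers race on accept(). This is only a rough sketch, assuming Node 0.8's worker.send(message, handle) API; the port and worker count are placeholders and we have not load-tested it:

// master + workers in one file: accept in the master, dispatch round-robin
var cluster = require('cluster'),
    http = require('http'),
    net = require('net');

if (cluster.isMaster) {
  var workers = [], next = 0;
  for (var i = 0; i < 16; i++) workers.push(cluster.fork());

  // The master owns the listening socket, so the kernel never has to
  // choose among sleeping workers; we pick the worker ourselves.
  net.createServer(function (socket) {
    workers[next].send('connection', socket);
    next = (next + 1) % workers.length;
  }).listen(8000);
} else {
  var server = http.createServer(function (req, res) {
    res.end(String(process.pid) + '\n');
  });
  // Sockets arrive from the master; inject them into the HTTP server.
  process.on('message', function (msg, socket) {
    if (msg === 'connection' && socket) server.emit('connection', socket);
  });
}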
We boiled it down to a simple HTTP Siege test, though note that it runs with 12 worker processes accepting on the socket on a 12-core box with hyperthreading (so 24 logical cores), as opposed to our 16 workers in production.
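For reference, each test run uses a trivial cluster HTTP server along the following lines (a minimal sketch, not our production app; the per-PID counts below come from aggregating the logged PIDs, e.g. with sort | uniq -c):

// app.js -- 12 workers all accepting on the same shared socket
var cluster = require('cluster'),
    http = require('http');

if (cluster.isMaster) {
  for (var i = 0; i < 12; i++) cluster.fork();
} else {
  http.createServer(function (req, res) {
    // Log which worker handled this request; the tables below are
    // these lines aggregated into per-PID request counts.
    console.log(process.pid);
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok\n');
  }).listen(8000);
}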
HTTP Siege with Node 0.9.3 on Debian Squeeze with the 2.6.32 kernel, on bare metal:
reqs pid
146 2818
139 2820
211 2821
306 2823
129 2825
166 2827
138 2829
134 2831
227 2833
134 2835
129 2837
138 2838
Everything the same, except with the 3.2.0 kernel:
reqs pid
99 3207
186 3209
42 3210
131 3212
34 3214
53 3216
39 3218
54 3220
33 3222
931 3224
345 3226
312 3228