amazon-web-services - Kafka: Understanding Broker failure

Question

I have a Kafka cluster with:

2 brokers b-1 and b-2.
2 topics with both: PartitionCount:1 ReplicationFactor:2 min.insync.replicas=1

Here is what happened:

%6|1613807298.974|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Disconnected (after 3829996ms in state UP)
%3|1613807299.011|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Connect to ipv4#172.31.18.172:9096 failed: Connection refused (after 36ms in state CONNECT)
%3|1613807299.128|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Connect to ipv4#172.31.18.172:9096 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%4|1613807907.225|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 1 partially-sent requests
%3|1613807907.225|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: 1 request(s) timed out: disconnect (after 343439ms in state UP)
%5|1613807938.942|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60767ms, timeout #0)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60459ms, timeout #1)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60342ms, timeout #2)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60305ms, timeout #3)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60293ms, timeout #4)
%4|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out 6 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1613807938.943|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: 6 request(s) timed out: disconnect (after 4468987ms in state UP)

Within the code, I got this error when my producer performed a poll around that time:

2021-02-20 07:59:08,174 - ERROR - Failed to deliver message due to error: KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}

Broker b-2 logs have this:

[2021-02-20 07:57:24,781] WARN Client session timed out, have not heard from server in 15103ms for sessionid 0x2000190b5d40001 (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,782] WARN Client session timed out, have not heard from server in 12701ms for sessionid 0x2000190b5d40000 (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,931] INFO Client session timed out, have not heard from server in 12701ms for sessionid 0x2000190b5d40000, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,932] INFO Client session timed out, have not heard from server in 15103ms for sessionid 0x2000190b5d40001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:32,884] INFO Opening socket connection to server INTERNAL_ZK_DNS/INTERNAL_IP. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:32,910] INFO Opening socket connection to server INTERNAL_ZK_DNS/INTERNAL_IP. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:33,032] INFO Socket connection established to INTERNAL_ZK_DNS/INTERNAL_IP, initiating session (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:33,032] INFO Socket connection established to INTERNAL_ZK_DNS/INTERNAL_IP, initiating session (org.apache.zookeeper.ClientCnxn

My understanding here is that (1) b-2 went down i.e. unable to connect to Zookeeper (2) Messages were produced to b-1 successfully during this time. (3) b-1 was also trying to forward messages to b-2during this downtime due to the replication factor set to 2 (4) All these forwarded messages (ProduceRequests) got timed-out after 600s

My question:

Is my understanding correct and how I can prevent this from happening again?
If I had 3 brokers here, would b-1 have tried to connect to b-3 right away rather than waiting for b-2? Is that a good workaround? (Assuming topic replication factor = 2 everywhere)

amazon-web-services - Kafka: Understanding Broker failure

0 に答える 0

Related

Reference