I have a Kubernetes cluster with two Windows worker nodes. When I run kubectl top nodes, the Windows nodes are reported as unknown. Looking into it, I see errors in the logs.
Logs from the metrics server:
E0529 12:04:50.809303 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:05:50.838175 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:06:50.815777 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:07:50.800927 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:08:50.821804 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:09:12.819567 1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-02": no metrics known for node
E0529 12:09:12.819592 1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-01": no metrics known for node
E0529 12:09:50.809012 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist."]
E0529 12:10:53.085842 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 29f403ebb265389ac1bcbe39f8a555045e1e461a2abf065f11d2f8b267f83b12 encountered an error during Properties: failure in a Windows system call: A system shutdown is in progress. (0x45b)"]
E0529 12:12:00.147458 1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-02": no metrics known for node
E0529 12:12:00.147485 1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-01": no metrics known for node
E0529 12:12:44.741135 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:13:44.740851 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:14:44.740965 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:15:44.740936 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
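Because every log line bundles the failures for both nodes, it can help to count them per node and per cause. Below is a minimal Python sketch that does this, assuming the metrics-server log format shown above. The function names and the category labels ("container not found", "timeout", "node shutting down") are my own, not anything metrics-server emits.

```python
from collections import Counter

# Separator that precedes each per-node failure in a metrics-server log line.
SOURCE_MARKER = "unable to fully scrape metrics from source kubelet_summary:"


def classify(detail: str) -> str:
    """Map the error text for one node to a coarse, hypothetical category."""
    if "context deadline exceeded" in detail:
        return "timeout"
    if "system shutdown is in progress" in detail:
        return "node shutting down"
    if "does not exist" in detail:
        return "container not found"
    return "other"


def summarize(log_text: str) -> Counter:
    """Count (node, category) pairs across all scrape-failure lines."""
    counts = Counter()
    for line in log_text.splitlines():
        if "unable to fully collect metrics" not in line:
            continue
        # The first piece is the log prefix; each later piece starts with "<node>: ...".
        for chunk in line.split(SOURCE_MARKER)[1:]:
            node = chunk.split(":", 1)[0]
            counts[(node, classify(chunk))] += 1
    return counts
```

Passing the saved log text to summarize() gives one count per node and cause, which makes it easier to see when qa-k8sw-win-02 switches from the hcsshim error to timeouts.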
I tried fetching the metrics from the node that runs the metrics server pod (qa-k8sm-02) and got an error. This is the command I ran and its output:
curl -v -k https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true
* About to connect() to 10.4.111.68 port 10250 (#0)
* Trying 10.4.111.68...
* Connected to 10.4.111.68 (10.4.111.68) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=qa-k8sw-win-02@1589888891
* start date: May 19 10:48:10 2020 GMT
* expire date: May 19 10:48:10 2021 GMT
* common name: qa-k8sw-win-02@1589888891
* issuer: CN=qa-k8sw-win-02-ca@1589888890
> GET /stats/summary?only_cpu_and_memory=true HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.4.111.68:10250
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Date: Fri, 29 May 2020 12:30:40 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.4.111.68 left intact
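The curl above sent no credentials, so the 401 only shows that the kubelet port is reachable and the TLS handshake succeeds. It does not reproduce the 500 that the metrics server receives. Below is a minimal Python sketch of an authenticated request. It assumes the kubelet accepts bearer tokens and that the token is allowed to read node stats, neither of which I have verified on this cluster. The token path and the function names are assumptions, not part of the original setup.

```python
import ssl
import urllib.error
import urllib.request

# Assumption: default location of a service account token inside a pod.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_summary_request(node_ip: str, token: str) -> urllib.request.Request:
    """Build the same /stats/summary request as the curl call, plus a bearer token."""
    url = f"https://{node_ip}:10250/stats/summary?only_cpu_and_memory=true"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_summary(node_ip: str, token: str, timeout: float = 10.0):
    """Return (status, body). Certificate checks are skipped, like curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = build_summary_request(node_ip, token)
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; keep the body for diagnosis.
        return err.code, err.read().decode("utf-8", "replace")
```

If the token is accepted, fetch_summary("10.4.111.68", token) should return the same 500 status and error body that appear in the metrics server log.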
The connection to the server works. Since the error I was investigating was a 500, I figured something was wrong with a pod on the Windows server, so I looked at the logs of cattle-node-agent-windows-gkrf9:
WARN: Default docker named pipe is not found
WARN: Please bind mount in the docker named pipe to //./pipe/docker_engine if docker errors occur
WARN: example: docker run -v //./pipe/custom_docker_named_pipe://./pipe/docker_engine ...
INFO: https://rancher.mycompany.com is accessible
time="2020-05-29T07:12:09-05:00" level=info msg="Rancher agent version v2.4.3 is starting"
time="2020-05-29T07:12:09-05:00" level=info msg="Listening on /tmp/log.sock"
time="2020-05-29T07:12:09-05:00" level=info msg="Option etcd=false"
time="2020-05-29T07:12:09-05:00" level=info msg="Option controlPlane=false"
time="2020-05-29T07:12:09-05:00" level=info msg="Option worker=true"
time="2020-05-29T07:12:09-05:00" level=info msg="Option requestedHostname=qa-k8sw-win-02"
time="2020-05-29T07:12:09-05:00" level=info msg="Option customConfig=map[address:10.4.111.68 internalAddress: label:map[rke.cattle.io/windows-build:17763 rke.cattle.io/windows-kernel-version:17763.1.amd64fre.rs5_release.180914-1434 rke.cattle.io/windows-major-version:10 rke.cattle.io/windows-minor-version:0 rke.cattle.io/windows-release-id:1809 rke.cattle.io/windows-version:10.0.17763.1098] roles:[worker] taints:[]]"
time="2020-05-29T07:12:09-05:00" level=info msg="Connecting to wss://rancher.mycompany.com/v3/connect with token qt6xtcslz7gwrjfw5tszj8r5tjrjpqk5kwnfqm85l7wc64tkjwfcqj"
time="2020-05-29T07:12:09-05:00" level=info msg="Connecting to proxy" url="wss://rancher.mycompany.com/v3/connect"
time="2020-05-29T07:12:09-05:00" level=info msg="Starting plan monitor, checking every 120 seconds"
I saw no errors there, so I logged in to the Windows server and started going through the container logs. The kubelet logs show a large number of errors...
E0529 18:01:50.824959 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 29f403ebb265389ac1bcbe39f8a555045e1e461a2abf065f11d2f8b267f83b12: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 18:02:50.797328 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 29f403ebb265389ac1bcbe39f8a555045e1e461a2abf065f11d2f8b267f83b12: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]