mongodb - MongoDB shard balancing not working properly, with many moveChunk errors reported

Question

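We have a MongoDB cluster with 3 shards. Each shard is a replica set containing 3 nodes, and the MongoDB version is 3.2.6. We have a large database of about 230 GB that contains about 5,500 collections. We found that about 2,300 of these collections are not balanced, while the other 3,200 are evenly distributed across the 3 shards.

Below is the output of sh.status() (the full output is too large, so I am posting only part of it):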

mongos> sh.status()
--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("57557345fa5a196a00b7c77a")
}
  shards:
    {  "_id" : "shard1",  "host" : "shard1/10.25.8.151:27018,10.25.8.159:27018" }
    {  "_id" : "shard2",  "host" : "shard2/10.25.2.6:27018,10.25.8.178:27018" }
    {  "_id" : "shard3",  "host" : "shard3/10.25.2.19:27018,10.47.102.176:27018" }
  active mongoses:
    "3.2.6" : 1
  balancer:
    Currently enabled:  yes
    Currently running:  yes
        Balancer lock taken at Sat Sep 03 2016 09:58:58 GMT+0800 (CST) by iZ23vbzyrjiZ:27017:1467949335:-2109714153:Balancer
    Collections with active migrations: 
        bdtt.normal_20131017 started at Sun Sep 18 2016 17:03:11 GMT+0800 (CST)
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours: 
        1490 : Failed with error 'aborted', from shard2 to shard3
        1490 : Failed with error 'aborted', from shard2 to shard1
        14 : Failed with error 'data transfer error', from shard2 to shard1
  databases:
    {  "_id" : "bdtt",  "primary" : "shard2",  "partitioned" : true }
      bdtt.normal_20160908
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard2  142
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160909
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard1  36
                shard2  42
                shard3  46
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160910
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard1  34
                shard2  32
                shard3  32
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160911
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard1  30
                shard2  32
                shard3  32
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160912
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard2  126
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160913
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard2  118
            too many chunks to print, use verbose if you want to force print
    }

The collection "normal_20160913" is not balanced. The getShardDistribution() result for this collection is below:

mongos> db.normal_20160913.getShardDistribution()

Shard shard2 at shard2/10.25.2.6:27018,10.25.8.178:27018
 data : 4.77GiB docs : 203776 chunks : 118
 estimated data per chunk : 41.43MiB
 estimated docs per chunk : 1726

Totals
 data : 4.77GiB docs : 203776 chunks : 118
 Shard shard2 contains 100% data, 100% docs in cluster, avg obj size on shard : 24KiB

The balancer process is running, and the chunk size is the default (64 MB):

mongos> sh.isBalancerRunning()
true
mongos> use config
switched to db config
mongos> db.settings.find()
{ "_id" : "chunksize", "value" : NumberLong(64) }
{ "_id" : "balancer", "stopped" : false }

I also found many moveChunk errors in the mongos log, which may be why some collections are not balanced. The most recent entries are:

2016-09-19T14:25:25.427+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.620+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.644+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.701+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.728+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.232+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.256+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.101+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.112+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:43:41.889+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }

I tried running the moveChunk command manually, but it returns the same error:

mongos> sh.moveChunk("bdtt.normal_20160913", {_id:ObjectId("57d6d107edac9244b6048e65")}, "shard3")
{
    "cause" : {
        "ok" : 0,
        "errmsg" : "Not starting chunk migration because another migration is already in progress",
        "code" : 117
    },
    "code" : 117,
    "ok" : 0,
    "errmsg" : "move failed"
}

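I am not sure whether too many collections are being created and overwhelming the migration process. About 60 to 80 new collections are created every day.

I need help answering the questions below; any hints would be great:

Why are some collections not balanced? Is it related to the large number of newly created collections?
Is there a command to check the details of migration jobs in progress? I get many error log entries saying that another migration is running, but I cannot find which one it is.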

score 3 · Accepted Answer

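Answering my own question: I finally found the root cause. It is exactly the same issue as "MongoDB balancer timeouts with delayed replica", and it was caused by an abnormal replica set configuration. When the problem occurred, our replica set configuration was as follows: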

shard1:PRIMARY> rs.conf()
{
    "_id" : "shard1",
    "version" : 3,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "10.25.8.151:27018",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "10.25.8.159:27018",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "10.25.2.6:37018",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 3,
            "host" : "10.47.114.174:27018",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : true,
            "priority" : 0,
            "tags" : {

            },
            "slaveDelay" : NumberLong(86400),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("5755464f789c6cd79746ad62")
    }
}

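There are 4 nodes in the replica set: one primary, one secondary, one arbiter, and one secondary delayed by 24 hours (slaveDelay 86400). That makes 3 nodes the majority. Because the arbiter holds no data, the balancer has to wait for the delayed secondary to satisfy the write concern (that is, to make sure the receiving shard has actually received the chunk).

There are several ways to solve the problem. We simply removed the arbiter, and the balancer now works fine.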
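
As a rough check of the reasoning above, here is a minimal sketch in plain JavaScript. It is a simplified model, not MongoDB's actual implementation: it assumes a majority write needs acknowledgement from a majority of voting members, and that only data-bearing members can acknowledge. The function name majorityNeedsDelayedMember is hypothetical, and the member list is trimmed from the rs.conf() output above with NumberLong values written as plain numbers.

```javascript
// Simplified model (an assumption, not MongoDB's real code):
// majority = floor(votingMembers / 2) + 1, and arbiters cannot acknowledge writes.
function majorityNeedsDelayedMember(members) {
  var voting = members.filter(function (m) { return m.votes > 0; });
  var majority = Math.floor(voting.length / 2) + 1;
  // Members that can acknowledge promptly: data-bearing and not delayed.
  var prompt = voting.filter(function (m) {
    return !m.arbiterOnly && Number(m.slaveDelay) === 0;
  });
  return {
    majority: majority,
    promptAckers: prompt.length,
    needsDelayed: prompt.length < majority
  };
}

// Relevant fields only, copied from the shard1 configuration above.
var shard1Members = [
  { host: "10.25.8.151:27018",   arbiterOnly: false, slaveDelay: 0,     votes: 1 },
  { host: "10.25.8.159:27018",   arbiterOnly: false, slaveDelay: 0,     votes: 1 },
  { host: "10.25.2.6:37018",     arbiterOnly: true,  slaveDelay: 0,     votes: 1 },
  { host: "10.47.114.174:27018", arbiterOnly: false, slaveDelay: 86400, votes: 1 }
];

// With the arbiter: majority 3, promptAckers 2, needsDelayed true.
var withArbiter = majorityNeedsDelayedMember(shard1Members);
console.log(withArbiter);

// Without the arbiter: majority 2, promptAckers 2, needsDelayed false.
var withoutArbiter = majorityNeedsDelayedMember(
  shard1Members.filter(function (m) { return !m.arbiterOnly; })
);
console.log(withoutArbiter);
```

Under this model, removing the arbiter lowers the majority from 3 to 2, so the primary and the normal secondary are enough and the 24-hour delayed member no longer blocks the migration.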

score 0 · Answer

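I'm going to speculate, but my guess is that your collections are very imbalanced and are currently being balanced by chunk migration (which can take a long time). Your manual chunk migration is therefore queued, but not executed immediately.

Here are a few points that might clarify things a bit:

One chunk at a time: MongoDB chunk migration works through a queue mechanism, and only one chunk is migrated at a time.
Balancer lock: the balancer lock information may give you some idea of what is being migrated. You should also be able to see chunk migration entries in your mongos log files.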
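
The lock and log checks mentioned above could be sketched as a small mongo shell helper. This assumes the 3.2-era config database layout, where config.locks marks a held lock with state 2 and config.changelog records moveChunk events; the function name recentMigrationActivity and its parameters are hypothetical.

```javascript
// Sketch: list currently held distributed locks and the latest moveChunk events.
// `configDb` is a handle to the "config" database, for example
// db.getSiblingDB("config") in a mongo shell connected to a mongos.
function recentMigrationActivity(configDb, limit) {
  // Assumption: in MongoDB 3.2, state 2 in config.locks means the lock is held.
  var heldLocks = configDb.locks.find({ state: 2 }).toArray();
  // config.changelog stores events such as moveChunk.start and moveChunk.commit.
  var events = configDb.changelog
    .find({ what: /^moveChunk/ })
    .sort({ time: -1 })
    .limit(limit)
    .toArray();
  return { heldLocks: heldLocks, events: events };
}

// Usage in a mongo shell connected to a mongos:
//   printjson(recentMigrationActivity(db.getSiblingDB("config"), 10));
```

The _id of each held lock names the resource it protects (the balancer itself or a collection namespace), which may help identify the migration that the "another migration is already in progress" errors refer to.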

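One option is to pre-split your collections. Pre-splitting essentially configures an empty collection so that it starts out balanced and never becomes imbalanced in the first place, because once collections are imbalanced, the chunk migration process might not be your friend.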
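
For a hashed shard key like the { "_id" : "hashed" } key used in the question, one way to pre-split is the numInitialChunks option of the shardCollection command, which applies only when the collection is still empty. The sketch below only builds the command document; the helper name buildPreSplitCommand, the namespace bdtt.normal_20160914, and the choice of 4 chunks per shard are illustrative assumptions.

```javascript
// Hypothetical helper: builds a shardCollection command for an EMPTY collection
// with a hashed _id key, requesting chunksPerShard initial chunks per shard.
function buildPreSplitCommand(namespace, shardCount, chunksPerShard) {
  return {
    shardCollection: namespace,
    key: { _id: "hashed" },
    numInitialChunks: shardCount * chunksPerShard
  };
}

// Example for a 3-shard cluster; the namespace is a made-up example.
var cmd = buildPreSplitCommand("bdtt.normal_20160914", 3, 4);

// In a mongo shell connected to a mongos, the command would be sent with:
//   db.adminCommand(cmd);
```

Because the chunks are created while the collection is empty, no data has to be copied between shards later, so the collection does not depend on moveChunk succeeding in order to become balanced.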

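You might also want to revisit your shard keys. There may be something wrong with the shard key that is causing so much imbalance.

In addition, your data size does not seem large enough to me to warrant a sharded configuration. Do not shard unless your data size or working set size forces you to, because sharding does not come for free (as you are probably already feeling).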
