I have a pair of drbd servers that have stopped syncing, and nothing I do gets them to sync again. Synchronization runs over a dedicated crossover cable (1 Gbps copper) between the two servers.

The logs on r02 show:

Aug  9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug  9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug  9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug  9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug  9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug  9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Aug  9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug  9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug  9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug  9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug  9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Aug  9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug  9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug  9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug  9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug  9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug  9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug  9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug  9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug  9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug  9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected ) 
Aug  9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated

And on r01:

Aug  9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug  9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug  9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug  9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug  9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug  9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug  9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Aug  9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource ) 
Aug  9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug  9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug  9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) 
Aug  9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug  9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug  9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug  9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected ) 
Aug  9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated

This just repeats over and over, endlessly.
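As a sanity check on the figures in these logs, here is a minimal sketch that converts the reported sizes into common units. It assumes the `s` suffix means 512-byte sectors and that each bitmap bit covers 4 KiB; both constants are assumptions, although the second is consistent with the log's own "128250804 KB [32062701 bits set]" line.

```python
# Figures copied from the kernel logs above.
SECTOR_BYTES = 512   # assumption: the "s" suffix means 512-byte sectors
KIB_PER_BIT = 4      # assumption: one bitmap bit covers 4 KiB

r01_sectors = 1344982880
r02_sectors = 256503768
bits_out_of_sync = 32062701


def sectors_to_gib(sectors):
    """Convert a sector count to GiB under the sector-size assumption."""
    return sectors * SECTOR_BYTES / 2**30


# Amount of data the resync wants to transfer, in KiB.
resync_kib = bits_out_of_sync * KIB_PER_BIT

print(f"r01 backing device: {sectors_to_gib(r01_sectors):.1f} GiB")
print(f"r02 backing device: {sectors_to_gib(r02_sectors):.1f} GiB")
print(f"resync size: {resync_kib} KiB ({resync_kib / 2**20:.1f} GiB)")
```

Under those assumptions the resync size matches the logged 128250804 KB and covers roughly the whole of r02's smaller backing device, so the logs describe a full resync that is cut off right after it starts.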

The configuration should be identical on both servers (a checksum dry run reports no differences):

r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/

sent 11 bytes  received 51 bytes  124.00 bytes/sec
total size is 615  speedup is 9.92 (DRY RUN)

The configuration looks like this:

r01:~$ cat /etc/drbd.conf
global {
   usage-count no;
}

resource drbd0 {
  protocol C;
  handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
  startup {
    degr-wfc-timeout 60;    # 1 minute.
    wfc-timeout 55;
  }

  disk {
    on-io-error   detach;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on r01.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p3;
    address    10.0.255.253:7788;
    meta-disk  internal;
  }

  on r02.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p6;
    address    10.0.255.254:7788;
    meta-disk  internal;
  }
}

The network configuration on each side looks like this:

r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:26:55:d6:f8:fc  
          inet addr:10.0.255.253  Bcast:10.0.255.255  Mask:255.255.255.0
          inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5512604514975 (5.0 TiB)  TX bytes:5820995499388 (5.2 TiB)
          Interrupt:24 Memory:fbe80000-fbea0000 

r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:1b:78:5c:a8:fd  
          inet addr:10.0.255.254  Bcast:10.0.255.255  Mask:255.255.255.252
          inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
          TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:332813827055 (309.9 GiB)  TX bytes:328142295363 (305.6 GiB)
          Interrupt:17 Memory:fdfa0000-fdfc0000 
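Note that the two interfaces carry different netmasks (255.255.255.0 on 10.0.255.253, 255.255.255.252 on 10.0.255.254). The following is a minimal sketch, using Python's standard `ipaddress` module, of how one might check whether that mismatch alone would stop either host from treating the other as directly reachable. The helper name `on_link` is made up for illustration.

```python
import ipaddress

# Addresses and masks copied from the ifconfig output above.
r01 = ipaddress.ip_interface("10.0.255.253/255.255.255.0")
r02 = ipaddress.ip_interface("10.0.255.254/255.255.255.252")


def on_link(local, remote):
    """True if remote's address falls inside local's configured subnet."""
    return remote.ip in local.network


print(f"r01 network: {r01.network}, sees r02 on-link: {on_link(r01, r02)}")
print(f"r02 network: {r02.network}, sees r01 on-link: {on_link(r02, r01)}")

# Both hosts should also agree on the broadcast address shown by ifconfig.
same_bcast = r01.network.broadcast_address == r02.network.broadcast_address
print(f"broadcast addresses match: {same_bcast}")
```

Each address falls inside the other side's subnet, which fits the successful handshake in the logs, so the mask mismatch looks untidy but is probably not what breaks the resync.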

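Originally, both r01 and r02 were running Debian Squeeze (drbd 8.3.7). I then rebuilt r02 with Debian Wheezy (drbd 8.3.13). Everything worked fine for a few days, and then this problem appeared after a restart of drbd. I have several other drbd clusters that I am upgrading in the same way. Some of them are fully upgraded to Wheezy; others are still half Squeeze, half Wheezy, and they work fine.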

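So far, here is what I have tried in order to resolve the problem:

  • Wiped the drbd volume on r02 and attempted a resync.
  • Wiped, reinstalled, and reconfigured r02.
  • Replaced r02 with different hardware and rebuilt it from scratch.
  • Replaced the crossover cable (twice).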

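Over the next few days I will be replacing r01 with 100% different hardware. But even if that works, I am still at a loss. I really want to understand what caused this problem and the proper way to resolve it.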
