Dropped a 434 GB database

Rough analysis: dropDatabase has to flush data from memory, and because the database was so large this took a long time, during which the instance-wide (global) write lock was held. While it was locked, the arbiter's heartbeats went unanswered, so the arbiter decided the primary was down and triggered a failover. After the switch, the old primary found that its oplog was ahead of the new primary's and tried to roll back. But a dropDatabase cannot be rolled back, so replication simply broke. The log follows:

Tue Jul 29 14:45:16.660 [conn685428] dropDatabase aaaaaaaaaa starting
Tue Jul 29 14:45:26.518 [conn685430]  authenticate db: local { authenticate: 1, nonce: "7bcf2abd12a61b34", user: "__system", key: "db0b67e198f39402f10362326163497f" }
Tue Jul 29 14:45:30.020 [conn685428] removeJournalFiles
Tue Jul 29 14:45:38.449 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:38.518 [conn685431]  authenticate db: local { authenticate: 1, nonce: "60dc3aa4ed7522f9", user: "__system", key: "d4df419a793870e3563950fb8e5a456b" }
Tue Jul 29 14:45:40.449 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:41.801 [conn685432]  authenticate db: local { authenticate: 1, nonce: "fb54914649052374", user: "__system", key: "fbd04c04366d97d6c257c11a1efd8466" }
Tue Jul 29 14:45:42.450 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:44.451 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:46.451 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:47.781 [conn685433]  authenticate db: local { authenticate: 1, nonce: "b6d124f7c797a134", user: "__system", key: "d3f0e0ec41152de349f0aef6fe8a0a31" }
Tue Jul 29 14:45:48.452 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:49.179 [conn685434] command denied: { serverStatus: 1 }
Tue Jul 29 14:45:49.185 [conn685434]  authenticate db: admin { authenticate: 1, user: "monitor", nonce: "c2dc12f60ba780e8", key: "7de7aeacd284c79765d8318ad48239e4" }
Tue Jul 29 14:45:50.452 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:50.519 [conn685435]  authenticate db: local { authenticate: 1, nonce: "c359896575e8b9ec", user: "__system", key: "12673c15769c2d864ea3e5a052d571a8" }
Tue Jul 29 14:45:52.453 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:53.829 [conn685436]  authenticate db: local { authenticate: 1, nonce: "983188cf1047cb28", user: "__system", key: "91db862c2eced12678499ffd602aec6b" }
Tue Jul 29 14:45:54.454 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:56.455 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:58.455 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:45:58.548 [rsHealthPoll] DBClientCursor::init call() failed
Tue Jul 29 14:45:58.548 [rsHealthPoll] replset info 192.168.1.3:27017 heartbeat failed, retrying
Tue Jul 29 14:45:59.549 [rsHealthPoll] DBClientCursor::init call() failed
Tue Jul 29 14:45:59.549 [rsHealthPoll] replSet info 192.168.1.3:27017 is down (or slow to respond): 
Tue Jul 29 14:45:59.549 [rsHealthPoll] replSet member 192.168.1.3:27017 is now in state DOWN
Tue Jul 29 14:46:00.456 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:46:02.456 [rsHealthPoll] replset info 192.168.1.2:17017 thinks that we are down
Tue Jul 29 14:46:02.520 [conn685437]  authenticate db: local { authenticate: 1, nonce: "b007206c66445c6d", user: "__system", key: "2c808979c32e64773ae05ff9b06435ee" }
Tue Jul 29 14:46:02.996 [conn685428] dropDatabase aaaaaaaaaa finished
Tue Jul 29 14:46:02.997 [conn685428] command aaaaaaaaaa.$cmd command: { dropDatabase: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:46336849 reslen:64 46336ms
Tue Jul 29 14:46:02.997 [conn245691] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1401518232000|2 } } cursorid:8986404439040780591 ntoreturn:0 keyUpdates:0 numYields: 1 locks(micros) r:316 nreturned:78 reslen:9747 46351ms
Tue Jul 29 14:46:02.997 [conn685430] command local.$cmd command: { authenticate: 1, nonce: "7bcf2abd12a61b34", user: "__system", key: "db0b67e198f39402f10362326163497f" } ntoreturn:1 keyUpdates:0 locks(micros) r:355 reslen:74 36479ms
Tue Jul 29 14:46:02.997 [conn685431] command local.$cmd command: { authenticate: 1, nonce: "60dc3aa4ed7522f9", user: "__system", key: "d4df419a793870e3563950fb8e5a456b" } ntoreturn:1 keyUpdates:0 locks(micros) r:298 reslen:74 24479ms
Tue Jul 29 14:46:02.997 [conn685435] command local.$cmd command: { authenticate: 1, nonce: "c359896575e8b9ec", user: "__system", key: "12673c15769c2d864ea3e5a052d571a8" } ntoreturn:1 keyUpdates:0 locks(micros) r:167 reslen:74 12478ms
Tue Jul 29 14:46:02.997 [conn685433] command local.$cmd command: { authenticate: 1, nonce: "b6d124f7c797a134", user: "__system", key: "d3f0e0ec41152de349f0aef6fe8a0a31" } ntoreturn:1 keyUpdates:0 locks(micros) r:206 reslen:74 15216ms
Tue Jul 29 14:46:02.997 [conn685436] command local.$cmd command: { authenticate: 1, nonce: "983188cf1047cb28", user: "__system", key: "91db862c2eced12678499ffd602aec6b" } ntoreturn:1 keyUpdates:0 locks(micros) r:165 reslen:74 9168ms
Tue Jul 29 14:46:02.997 [conn685432] command local.$cmd command: { authenticate: 1, nonce: "fb54914649052374", user: "__system", key: "fbd04c04366d97d6c257c11a1efd8466" } ntoreturn:1 keyUpdates:0 locks(micros) r:292 reslen:74 21196ms
Tue Jul 29 14:46:03.005 [conn685434] command admin.$cmd command: { authenticate: 1, user: "monitor", nonce: "c2dc12f60ba780e8", key: "7de7aeacd284c79765d8318ad48239e4" } ntoreturn:1 keyUpdates:0 locks(micros) r:434 reslen:73 13819ms
Tue Jul 29 14:46:03.009 [rsHealthPoll] replset info 192.168.1.3:27017 thinks that we are down
Tue Jul 29 14:46:03.009 [rsHealthPoll] replSet member 192.168.1.3:27017 is up
Tue Jul 29 14:46:03.009 [rsHealthPoll] replSet member 192.168.1.3:27017 is now in state SECONDARY
Tue Jul 29 14:46:05.010 [rsHealthPoll] replSet member 192.168.1.3:27017 is now in state PRIMARY
Tue Jul 29 14:46:05.010 [rsMgr] replSet relinquishing primary state
Tue Jul 29 14:46:05.010 [rsMgr] replSet SECONDARY
Tue Jul 29 14:46:05.010 [rsMgr] replSet closing client sockets after relinquishing primary
Tue Jul 29 14:46:05.229 [rsBackgroundSync] replSet syncing to: 192.168.1.3:27017
Tue Jul 29 14:46:07.023 [rsHealthPoll] replset info 192.168.1.3:27017 heartbeat failed, retrying
Tue Jul 29 14:46:07.429 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
Tue Jul 29 14:46:07.435 [rsBackgroundSync] replSet rollback 0
Tue Jul 29 14:46:07.435 [rsBackgroundSync] replSet ROLLBACK
Tue Jul 29 14:46:07.435 [rsBackgroundSync] replSet rollback 1
Tue Jul 29 14:46:07.435 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Tue Jul 29 14:46:07.452 [rsBackgroundSync] replSet info rollback our last optime:   Jul 29 14:46:03:1
Tue Jul 29 14:46:07.452 [rsBackgroundSync] replSet info rollback their last optime: Jul 29 14:45:16:5fd9
Tue Jul 29 14:46:07.452 [rsBackgroundSync] replSet info rollback diff in end of log times: 47 seconds
Tue Jul 29 14:46:07.452 [rsBackgroundSync] replSet error rollback : can't rollback drop database full resync will be required
Tue Jul 29 14:46:07.452 [rsBackgroundSync] replSet { dropDatabase: 1.0 }
Tue Jul 29 14:46:07.462 [rsBackgroundSync] replSet error fatal, stopping replication
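The slow-op line above (`locks(micros) W:46336849 ... 46336ms`) shows the global write lock was held for about 46 seconds, far longer than a heartbeat timeout. Below is a minimal sketch of how one might scan mongod logs for operations whose global write-lock hold time exceeds the heartbeat timeout and could therefore trigger exactly this kind of failover. The helper names and the 10-second threshold are my own assumptions, not anything from the incident itself; tune the threshold to your deployment.

```python
import re

# Assumed heartbeat timeout; a global write lock held longer than this
# risks the primary being declared down, as happened in the log above.
HEARTBEAT_TIMEOUT_MS = 10_000

def global_write_lock_ms(log_line: str):
    """Extract the global write-lock hold time (ms) from a slow-op log line.

    mongod reports lock hold times in microseconds, e.g. "locks(micros) W:46336849".
    Returns None if the line carries no global write-lock figure.
    """
    m = re.search(r"locks\(micros\) W:(\d+)", log_line)
    return int(m.group(1)) // 1000 if m else None

def may_trigger_failover(log_line: str) -> bool:
    """True if the line's global write-lock hold time exceeds the heartbeat timeout."""
    held = global_write_lock_ms(log_line)
    return held is not None and held > HEARTBEAT_TIMEOUT_MS

# The actual dropDatabase slow-op line from the log above:
line = ("Tue Jul 29 14:46:02.997 [conn685428] command aaaaaaaaaa.$cmd "
        "command: { dropDatabase: 1.0 } ntoreturn:1 keyUpdates:0 "
        "locks(micros) W:46336849 reslen:64 46336ms")

print(global_write_lock_ms(line))   # 46336
print(may_trigger_failover(line))   # True
```

Run against the dropDatabase line, this reports a 46336 ms hold, several times the heartbeat timeout, which is consistent with the arbiter marking the primary down mid-drop.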
1 reply

#1 ccj • 2014-07-29 22:05

And you of all people ran into this; dropping an entire database is pretty rare.
