![MongoDB frequently switches primary](https://rvso.com/image/658725/MongoDB%20%E9%A0%BB%E7%B9%81%E5%88%87%E6%8F%9B%E4%B8%BB%E7%AF%80%E9%BB%9E.png)
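We are running a 3-member Mongo 2.6 replica set: a primary, a secondary, and an arbiter. Almost every day our MongoDB switches primaries, which drops every connection to the database. That would be perfectly fine if one of the servers had really gone down, but the challenge is that in every case the server that was "down" does not appear to have actually been down at any point.
Here is what we know: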
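- The `mongod` process was not restarted or stopped on any of the 3 servers.
- The servers kept reporting to New Relic the whole time.
- In the mongo logs we see frequent heartbeat failures.
- The servers are never really under heavy load. I do see CPU spikes for roughly 10 minutes every hour, but they do not line up with the failures.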
Below is the output of `show log rs` while shelled into the current primary.
2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T15:05:49.358+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T15:05:56.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T22:11:36.638+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-17T22:11:37.495+0000 [rsMgr] not electing self, we are not freshest
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-17T22:11:39.140+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T22:11:39.147+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T23:05:47.876+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T10:05:46.821+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-18T10:05:46.822+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-18T10:05:51.014+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T22:12:11.433+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-18T22:12:11.507+0000 [rsMgr] replSet info electSelf 3
2015-05-18T22:12:14.708+0000 [rsMgr] replSet PRIMARY
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-18T22:12:21.610+0000 [rsHealthPoll] replSet member server1:27017 is now in state ROLLBACK
2015-05-18T22:12:23.612+0000 [rsHealthPoll] replSet member server1:27017 is now in state SECONDARY
2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x), connection attempt failed
2015-05-19T22:13:24.127+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x) failed, connection attempt failed
2015-05-19T22:13:29.267+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
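As you can see, we regularly get heartbeat failures and down notifications, but in every case the server comes back up from the down state within a few seconds. I'm not sure where to look next to track down what could be causing this.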
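One way to narrow this down is to bucket the failure lines by hour of day. In the excerpt, the DOWN and heartbeat-failure entries fall at 22:11, 22:12, 22:13 and 22:14 UTC on consecutive days, and the sync source errors all fall at five past the hour, which would be consistent with something scheduled. The following is a minimal sketch, not a definitive tool: the list of failure markers is an assumption drawn only from the messages shown here, and `SAMPLE` is a subset of the lines in the log excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp, thread name and message of a log line in the
# format shown in the excerpt (UTC timestamps with a +0000 suffix).
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d{3}\+0000 \[(\w+)\] (.*)$"
)

# Assumption: these substrings mark "the other member looked unreachable".
# They are taken from the excerpt and are not an exhaustive list.
FAILURE_MARKERS = (
    "is now in state DOWN",
    "couldn't connect to",
    "our heartbeat failed",
    "sync source problem",
)


def failure_times(lines):
    """Return the datetime of every log line that contains a failure marker."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group(3) for marker in FAILURE_MARKERS):
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return times


def failures_by_hour(lines):
    """Count failure lines per hour of day (UTC)."""
    return Counter(t.hour for t in failure_times(lines))


# A subset of the log lines quoted in the question.
SAMPLE = """\
2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-18T10:05:46.821+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x), connection attempt failed
2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
""".splitlines()

if __name__ == "__main__":
    for hour, count in sorted(failures_by_hour(SAMPLE).items()):
        print(f"{hour:02d}:00 UTC  {count} failure line(s)")
```

Running the same functions over the full `mongod` log files from every member, rather than this short sample, would show whether the daily cluster around 22:00 UTC holds up.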
Answer 1
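I have seen this a lot, but the cause has always been outside the `mongod` process: DNS resolver problems, TCP/IP stack problems, network links, physical hardware, and so on, rather than `mongod` itself. Check the host operating system for network errors, check the physical link (if a physical link is part of the equation), and if the servers are in different regions, check with the cloud provider about the path between the two servers. This is most likely a host operating system problem and has nothing to do with MongoDB itself.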
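To separate DNS trouble from connection trouble, it can help to time both steps from one member to another, independently of `mongod`. The sketch below uses only Python's standard `socket` module; the `probe` function and its result keys are hypothetical names chosen for illustration, and it measures only name resolution and a single TCP connect, not anything MongoDB-specific.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Resolve host and open one TCP connection; return timings in seconds.

    The result dict holds 'dns' and 'connect' durations. If a step fails,
    'error' is set to a string naming that step and its duration stays None.
    """
    result = {"dns": None, "connect": None, "error": None}

    # Step 1: name resolution, timed on its own.
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"dns: {exc}"
        return result
    result["dns"] = time.monotonic() - start

    # Step 2: TCP connect to the first resolved address.
    family, socktype, proto, _, addr = infos[0]
    start = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(addr)
    except OSError as exc:
        result["error"] = f"connect: {exc}"
        return result
    result["connect"] = time.monotonic() - start
    return result
```

Calling `probe("server1", 27017)` in a loop once a second from the other members, and logging the result with a timestamp, would show whether the failures in the `mongod` log coincide with slow or failed resolution, slow or failed connects, or neither. If the probe itself goes silent during the failure window, the machine running it may be the one that stalled.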
Answer 2
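This problem has been solved. The core issue was that our hosting provider was running VMWare snapshots as a backup mechanism. These snapshots caused the virtual machines to freeze temporarily; I believe the technical term is a VM stun.
Once those snapshots were disabled, we no longer had any problems.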
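A pause of this kind can sometimes be seen from inside the guest by sampling the clock at a short fixed interval and looking for gaps. The following is a rough sketch under stated assumptions: `find_stalls` and `watch` are hypothetical helper names, and it assumes the guest's wall clock jumps forward after the virtual machine resumes, which depends on how the hypervisor and time synchronization are configured. A gap can also come from ordinary scheduling delay, so it is a hint, not proof.

```python
import time


def find_stalls(timestamps, interval, tolerance):
    """Return (start, gap) pairs where two consecutive samples are further
    apart than interval + tolerance seconds."""
    stalls = []
    for previous, current in zip(timestamps, timestamps[1:]):
        gap = current - previous
        if gap > interval + tolerance:
            stalls.append((previous, gap))
    return stalls


def watch(interval=0.1, tolerance=0.5, duration=60.0):
    """Sample the wall clock every `interval` seconds for about `duration`
    seconds, then report gaps suggesting the process or the VM was paused."""
    samples = [time.time()]
    deadline = samples[0] + duration
    while samples[-1] < deadline:
        time.sleep(interval)
        samples.append(time.time())
    return find_stalls(samples, interval, tolerance)
```

Running `watch` on each member across the window in which the failovers occur, and comparing any reported gaps with the DOWN timestamps in the `mongod` log, would give some evidence for or against a whole-machine pause before asking the hosting provider about snapshot schedules.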