Patroni failover on interconnect failure

2024-6-28

3 data centers:

Patroni version: 2.1.4

PostgreSQL version: 14.4

etcd version: 3.3.11

DC	Server	Name	Host	Status
1st	Patroni	patroni-s11	172.16.0.2	Leader
1st	Patroni	patroni-s12	172.16.0.3	Sync Standby
1st	ETCD	etcd-s11	172.16.0.4	Leader
2nd	Patroni	patroni-s21	172.16.1.2	Replica
2nd	Patroni	patroni-s22	172.16.1.3	Replica
2nd	ETCD	etcd-s21	172.16.1.4	Follower
3rd	Patroni	patroni-s31	172.16.2.2	Replica
3rd	ETCD	etcd-s31	172.16.2.4	Follower

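I simulated an interconnect failure between the 1st and 2nd data centers: both DCs are up, but the 1st and 2nd data centers cannot "see" each other.

In this situation the Patroni leader stays in the 1st DC, but the servers in the 2nd DC are no longer synchronized with the cluster. If you believe the cluster health output, everything is fine and there is no replication lag between the servers. In reality, none of the changes made on the primary are replicated to the replicas in the 2nd data center.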

[user@patroni-s11 ~]$ sudo patronictl -c /etc/patroni/patroni.yml list
2022-12-01 16:00:00,015 - ERROR - Request to server 172.16.1.4:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.1.4', port=2379): Max retries exceeded with url: /v2/keys/service/patroni_cluster/?recursive=true (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))",)
+ Cluster: patroni_cluster (7117639577766255236) ---+---------+-----+-----------+
| Member          | Host          | Role         | State   |  TL | Lag in MB |
+-----------------+---------------+--------------+---------+-----+-----------+
| patroni-s11     | 172.16.0.2    | Leader       | running | 103 |           |
| patroni-s12     | 172.16.0.3    | Sync Standby | running | 103 |         0 |
| patroni-s21     | 172.16.1.2    | Replica      | running | 103 |         0 |
| patroni-s22     | 172.16.1.3    | Replica      | running | 103 |         0 |
| patroni-s31     | 172.16.2.2    | Replica      | running | 103 |         0 |
+-----------------+---------------+--------------+---------+-----+-----------+
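
One way to cross-check the "Lag in MB" column is to ask the primary itself which standbys are currently streaming, since patronictl builds its table from the cluster state stored in etcd. The following is only a minimal sketch: the helper `missing_replicas` is hypothetical, the sample rows are illustrative rather than captured output, and it assumes each standby connects with an `application_name` equal to its Patroni member name.

```python
# Query to run on the primary (patroni-s11). pg_stat_replication has one row
# per connected WAL sender; run it with psql or any PostgreSQL client.
STREAMING_QUERY = "SELECT application_name, state FROM pg_stat_replication"


def missing_replicas(expected, rows):
    """Return the expected replica names that have no row in state 'streaming'.

    `rows` is a list of (application_name, state) tuples, i.e. the result
    of STREAMING_QUERY.
    """
    streaming = {name for name, state in rows if state == "streaming"}
    return sorted(set(expected) - streaming)


# Replica names taken from the `patronictl list` output above.
expected = ["patroni-s12", "patroni-s21", "patroni-s22", "patroni-s31"]

# Illustrative rows only (NOT real output): they model the situation described
# in the question, where the 2nd DC replicas no longer receive changes.
# Assumption: application_name matches the Patroni member name.
rows = [("patroni-s12", "streaming"), ("patroni-s31", "streaming")]

print(missing_replicas(expected, rows))
```

If the list printed on the real primary is not empty while patronictl still reports a lag of 0, the lag column should not be trusted for those members.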

The same holds for the etcd servers: the leader also remains in the 1st DC.

[user@etcd-s11 ~]$ sudo etcdctl cluster-health
failed to check the health of member a85c06b926e6c6c8 on 172.16.1.4:2379: Get 172.16.1.4:2379/health: read tcp 10.220.0.3:38836->172.16.1.4:2379: read: connection reset by peer
member 261f8081db14d568 is healthy: got healthy result from 172.16.0.4:2379
member a85c06b926e6c6c8 is unreachable: [172.16.1.4:2379] are all unreachable
member b87bd1df518cc9e4 is healthy: got healthy result from 172.16.2.4:2379
cluster is degraded

[user@etcd-s11 ~]$ sudo etcdctl member list
261f8081db14d568: name=etcd-s11 peerURLs=172.16.0.4:2380 clientURLs=172.16.0.4:2379 isLeader=true
a85c06b926e6c6c8: name=etcd-s21 peerURLs=172.16.1.4:2380 clientURLs=172.16.1.4:2379 isLeader=false
b87bd1df518cc9e4: name=etcd-s31 peerURLs=172.16.2.4:2380 clientURLs=172.16.2.4:2379 isLeader=false

However, etcd in the 3rd data center sees the cluster as healthy:

[user@etcd-s31 ~]$ sudo etcdctl cluster-health
member 261f8081db14d568 is healthy: got healthy result from http://172.16.0.4:2379
member a85c06b926e6c6c8 is healthy: got healthy result from http://172.16.1.4:2379
member b87bd1df518cc9e4 is healthy: got healthy result from http://172.16.2.4:2379
cluster is healthy

I expected a server in the 3rd DC to become the leader.

Can Patroni/etcd change the leader in this situation?

Answer 1

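First of all, quorum is 5/2 rounded up, so the requirement of 3 servers is met as long as site 1 + site 3 are running, and the behaviour you see is expected.

It would be a different scenario if site 1 + site 3 did not meet the quorum.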
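
The quorum arithmetic can be sketched as follows. This is only an illustration under stated assumptions: it uses the usual majority rule (floor(n/2) + 1, which equals 5/2 rounded up for five servers), models the simulated fault as a single broken DC1-DC2 link, and the names `quorum`, `reachable_members` and `leader_keeps_quorum` are hypothetical helpers, not part of Patroni or etcd. Because the Patroni leader key is stored in etcd, the sketch applies the rule to the three etcd members from the question.

```python
def quorum(cluster_size):
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


# etcd members and their data centers, taken from the table in the question.
ETCD_MEMBERS = {"etcd-s11": "DC1", "etcd-s21": "DC2", "etcd-s31": "DC3"}

# The simulated fault: DC1 and DC2 cannot reach each other; DC3 reaches both.
BROKEN_LINKS = {frozenset({"DC1", "DC2"})}


def reachable_members(member):
    """Members that `member` can talk to directly, itself included."""
    own_dc = ETCD_MEMBERS[member]
    return {
        other
        for other, dc in ETCD_MEMBERS.items()
        if other == member or frozenset({own_dc, dc}) not in BROKEN_LINKS
    }


def leader_keeps_quorum(leader):
    """True if the leader still reaches a majority of the etcd cluster."""
    return len(reachable_members(leader)) >= quorum(len(ETCD_MEMBERS))


print(quorum(5))                                # 3, the figure in the answer
print(quorum(len(ETCD_MEMBERS)))                # 2 for the three etcd members
print(sorted(reachable_members("etcd-s11")))    # ['etcd-s11', 'etcd-s31']
print(leader_keeps_quorum("etcd-s11"))          # True
```

With site 1 and site 3 still connected, the etcd leader in the 1st DC keeps a majority, which is consistent with the answer that no leader change is expected here.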
