etcd集群故障处理
文章目录
[隐藏]
- 1. etcd安装
- 2. etcd配置
- 3. 故障报错
- 4. etcd恢复数据
- etcd节点加入集群
- 5. etcd常用命令
1. etcd安装
rpm -ivh etcd-3.2.15-1.el7.x86_64.rpm systemctl daemon-reload systemctl enable etcd systemctl start etcd export ETCDCTL_API=3 systemctl status etcd
hosts如下
192.168.0.100 etcd01 192.168.0.101 etcd02 192.168.0.102 etcd03
2. etcd配置
etcd02配置如下,详细见kubernetes1.9版本集群配置向导
# egrep -v "^$|^#" /etc/etcd/etcd.conf ETCD_DATA_DIR="/var/lib/etcd/" ETCD_LISTEN_PEER_URLS="https://192.168.0.101:2380" ETCD_LISTEN_CLIENT_URLS="https://192.168.0.101:2379,http://127.0.0.1:2379" ETCD_NAME="etcd02" ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.101:2380" ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.101:2379" ETCD_INITIAL_CLUSTER="etcd01=https://192.168.0.100:2380,etcd02=https://192.168.0.101:2380,etcd03=https://192.168.0.102:2380" ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster" ETCD_INITIAL_CLUSTER_STATE="existing" ETCD_CERT_FILE="/etc/kubernetes/ssl/etcd.pem" ETCD_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem" ETCD_CLIENT_CERT_AUTH="true" ETCD_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem" ETCD_AUTO_TLS="true" ETCD_PEER_CERT_FILE="/etc/kubernetes/ssl/etcd.pem" ETCD_PEER_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem" ETCD_PEER_CLIENT_CERT_AUTH="true" ETCD_PEER_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem" ETCD_PEER_AUTO_TLS="true"
3. 故障报错
3个节点做集群,直接关机后,etcd02故障,报错:
etcd: advertise client URLs = https://192.168.0.101:2379 etcd: read wal error (wal: crc mismatch) and cannot be repaired systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
wal的cec校验出错,谷歌了一下,没什么结果,于是移除这个etcd,再恢复
在正常的etcd节点移除
# etcdctl member list 1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379 9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379 ac2f188e97f50eb7, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379 # etcdctl member remove ac2f188e97f50eb7 Member ac2f188e97f50eb7 removed from cluster 194cd14a48430083
再启动etcd服务
# systemctl start etcd
报错:
etcd: error validating peerURLs {ClusterID:194cd14a48430083 Members:[&{ID:1ce6d6d01109192 RaftAttributes:{PeerURLs:[https://192.168.0.102:2380]} Attributes:{Name:etcd03 ClientURLs:[https://192.168.0.102:2379]}} &{ID:9b534175b46ea789 RaftAttributes:{PeerURLs:[https://192.168.0.100:2380]} Attributes:{Name:etcd01 ClientURLs:[https://192.168.0.100:2379]}}] RemovedMemberIDs:[]}: member count is unequal
报错:
etcd: the member has been permanently removed from the cluster the data-dir used by this member must be removed
4. etcd恢复数据
在etcd02节点恢复一下数据试试:
# mv /var/lib/etcd/member /var/lib/member # rm -rf /var/lib/etcd/* # etcdctl snapshot restore /var/lib/member/snap/db --skip-hash-check=true 2018-06-22 11:28:35.622666 I | mvcc: restore compact to 10177401 2018-06-22 11:28:35.659626 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 # systemctl start etcd
服务启动了,自己把自己选做主,服务倒是启动了,加入集群还是出错,用正常的节点备份再恢复
# etcdctl snapshot save etcdback.db
# etcdctl member add etcd02 http://192.168.0.101:2380 Error: member name not provided.
看看现在集群的其他2个etcd
# curl -k --key /etc/kubernetes/ssl/etcd-key.pem --cert /etc/kubernetes/ssl/etcd.pem https://192.168.0.100:2380/members [{"id":130161754177048978,"peerURLs":["https://192.168.0.102:2380"],"name":"etcd03","clientURLs":["https://192.168.0.102:2379"]},{"id":11192361472739944329,"peerURLs":["https://192.168.0.100:2380"],"name":"etcd01","clientURLs":["https://192.168.0.100:2379"]}]
参考文档:
etcdctl member add etcd_name –peer-urls=”https://peerURLs”
再次添加
# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380" Member 41c2a7b938a5e387 added to cluster 194cd14a48430083 ETCD_NAME="etcd02" ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380" ETCD_INITIAL_CLUSTER_STATE="existing"
查看etcd member状态:
# etcdctl member list 1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379 9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379 ad17c3da831c84c7, unstarted, , https://192.168.0.101:2380,
报错:
etcd: request cluster ID mismatch (got 194cd14a48430083 want cdf818194e3a8c32)
发现步骤顺序错误,应该是先添加到etcd集群,再启动etcd服务,我们现在先启动etcd服务,就是一个etcd单点
etcd节点加入集群
故障的etcd主机
# systemctl stop etcd
正常的etcd主机:
# etcdctl member remove ad17c3da831c84c7
# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380" Member 41c2a7b938a5e387 added to cluster 194cd14a48430083 ETCD_NAME="etcd02" ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380" ETCD_INITIAL_CLUSTER_STATE="existing"
故障的etcd主机,查看现在etcd状态:
# etcdctl endpoint health 127.0.0.1:2379 is healthy: successfully committed proposal: took = 24.52485ms
# etcdctl member list 1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379 41c2a7b938a5e387, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379 9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
到这里,etcd故障修复完毕
5. etcd常用命令
查看状态
# export ETCDCTL_API=3 # etcdctl endpoint status --write-out=table +----------------+------------------+---------+---------+-----------+-----------+------------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX | +----------------+------------------+---------+---------+-----------+-----------+------------+ | 127.0.0.1:2379 | 41c2a7b938a5e387 | 3.2.15 | 15 MB | true | 317 | 11051403 | +----------------+------------------+---------+---------+-----------+-----------+------------+
备份及恢复
etcdctl snapshot save etcdback.db etcdctl snapshot status etcdback.db --write-out=table etcdctl snapshot restore etcdback.db --skip-hash-check=true
etcd监控
# curl -L http://localhost:2379/metrics # HELP etcd_debugging_mvcc_keys_total Total number of keys. # TYPE etcd_debugging_mvcc_keys_total gauge etcd_debugging_mvcc_keys_total 776 # HELP etcd_debugging_mvcc_pending_events_total Total number of pending events to be sent. # TYPE etcd_debugging_mvcc_pending_events_total gauge etcd_debugging_mvcc_pending_events_total 0 # HELP etcd_debugging_mvcc_put_total Total number of puts seen by this member. # TYPE etcd_debugging_mvcc_put_total counter etcd_debugging_mvcc_put_total 9.548201e+06 # HELP etcd_debugging_mvcc_range_total Total number of ranges seen by this member. # TYPE etcd_debugging_mvcc_range_total counter etcd_debugging_mvcc_range_total 2.1052143e+07 # HELP etcd_debugging_mvcc_slow_watcher_total Total number of unsynced slow watchers. # TYPE etcd_debugging_mvcc_slow_watcher_total gauge etcd_debugging_mvcc_slow_watcher_total 0 # HELP etcd_debugging_mvcc_txn_total Total number of txns seen by this member. # TYPE etcd_debugging_mvcc_txn_total counter etcd_debugging_mvcc_txn_total 0 # HELP etcd_debugging_mvcc_watch_stream_total Total number of watch streams. # TYPE etcd_debugging_mvcc_watch_stream_total gauge etcd_debugging_mvcc_watch_stream_total 125 # HELP etcd_debugging_mvcc_watcher_total Total number of watchers. # TYPE etcd_debugging_mvcc_watcher_total gauge etcd_debugging_mvcc_watcher_total 125 # HELP etcd_debugging_server_lease_expired_total The total number of expired leases. # TYPE etcd_debugging_server_lease_expired_total counter etcd_debugging_server_lease_expired_total 3649
适合用prometheus监控
global: scrape_interval: 10s scrape_configs: - job_name: etcd static_configs: - targets: ['192.168.0.100:2379','192.168.0.101:2379','192.168.0.102:2379']
图解raft算法 http://thesecretlivesofdata.com/raft/
etcd获取kubernetes的数据
# export ETCDCTL_API=3 # etcdctl get /registry/namespaces/default --prefix -w json|python -m json.tool { "count": 1, "header": { "cluster_id": 1823062066148343939, "member_id": 11192361472739944329, "raft_term": 317, "revision": 10880816 }, "kvs": [ { "create_revision": 6, "key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==", "mod_revision": 6, "value": "azhzAAoPCgJ2MRIJTmFtZXNwYWNlEl8KRQoHZGVmYXVsdBIAGgAiACokOTVlNzdjMWEtM2Q1Ny0xMWU4LTk5YzItMDA1MDU2YmU3NWEzMgA4AEIICK7qttYFEAB6ABIMCgprdWJlcm5ldGVzGggKBkFjdGl2ZRoAIgA=", "version": 1 } ] }
查看key的内容 # echo L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA== |base64 -d /registry/namespaces/default
#!/bin/bash # Get kubernetes keys from etcd export ETCDCTL_API=3 keys=`etcdctl get /registry --prefix -w json|python -m json.tool|grep key|cut -d ":" -f2|tr -d '"'|tr -d ","` for x in $keys;do echo $x|base64 -d|sort done
获取etcd中kubernetes所有对象的key
原文出处:bbotte -> http://bbotte.com/server-config/etcd-cluster-troubleshooting/
本站所发布的一切资源仅限用于学习和研究目的;不得将上述内容用于商业或者非法用途,否则,一切后果请用户自负。本站信息来自网络,版权争议与本站无关。您必须在下载后的24个小时之内,从您的电脑中彻底删除上述内容。如果您喜欢该程序,请支持正版软件,购买注册,得到更好的正版服务。如果侵犯你的利益,请发送邮箱到 [email protected],我们会很快的为您处理。