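Overview
This post only records the steps and a cluster test, so that you can get a cluster environment up quickly. For the underlying principles, see the official documentation (Chinese edition):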
http://www.redis.cn/topics/cluster-spec.html
For disaster recovery, see the follow-up post "Redis Cluster (Part 2): Cluster Disaster Recovery".
Preparation
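- Download and compile Redis.
- Create six directories named after their port numbers, 7000 to 7005. Each one gets a redis.conf with its port set to that number.
- Edit the configuration:
# A node must be started as a cluster node. An ordinary Redis instance cannot become part of a Redis Cluster.
cluster-enabled yes
# Every cluster node has its own cluster configuration file. The node creates and updates this file itself, so it does not need to be edited by hand. Each node needs a different file, so make sure no two instances on the same system use the same cluster config file name (for example, use nodes-7000.conf for port 7000).
cluster-config-file nodes-7000.conf
# The node timeout is the number of milliseconds a node can be unreachable before it is considered to be failing. Most other internal time limits are multiples of it.
cluster-node-timeout 15000
# Log every write command to a separate append-only file (AOF) and replay it on restart to recover the data. Be sure to enable this.
appendonly yes
- Start each of the six instances with its own configuration: src/redis-server redis.conf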
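The directory and configuration steps above can be scripted. The Python sketch below is an illustration only. It writes a minimal redis.conf that contains just the directives discussed here, not the full file shipped with Redis. The helper name `write_node_configs` and whatever base directory you pass in are hypothetical choices.

```python
from pathlib import Path

# Ports used in this walkthrough: 7000..7005
PORTS = range(7000, 7006)


def write_node_configs(base: Path, ports=PORTS) -> list[Path]:
    """Create <base>/<port>/redis.conf for each port with the cluster settings above.

    Hypothetical helper: it writes a minimal config, not a copy of the stock redis.conf.
    """
    written = []
    for port in ports:
        node_dir = base / str(port)
        node_dir.mkdir(parents=True, exist_ok=True)
        conf = node_dir / "redis.conf"
        conf.write_text(
            f"port {port}\n"
            "cluster-enabled yes\n"
            # One cluster config file name per node, so no two instances share one
            f"cluster-config-file nodes-{port}.conf\n"
            "cluster-node-timeout 15000\n"
            "appendonly yes\n"
        )
        written.append(conf)
    return written
```

For example, `write_node_configs(Path("cluster-test"))` would create cluster-test/7000/redis.conf through cluster-test/7005/redis.conf. Start each redis-server from inside its own directory, so that its AOF and nodes-*.conf files land there.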
Install Ruby (not needed for Redis 5.0 and later):
sudo apt-get install ruby
Install the Ruby client for Redis:
sudo gem install redis
Without it, running the script later fails with:
custom_require.rb:36:in `require': cannot load such file -- redis (LoadError)
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from ./redis-trib.rb:25:in `<main>'
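Starting the cluster
In Redis's src directory, run:
Redis 5.0:
Redis Cluster requires at least 3 master nodes. For simplicity, only 3 nodes are started here and none of them gets a replica, so --cluster-replicas is 0.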
./redis-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 --cluster-replicas 0
Output:
>>> Performing hash slots allocation on 3 nodes...
Master[0] -> Slots 0 - 5460
Master[1] -> Slots 5461 - 10922
Master[2] -> Slots 10923 - 16383
M: 3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: 2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
M: 9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
Can I set the above configuration? (type 'yes' to accept):
Type yes to accept the proposed configuration. The command then prints:
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
.
>>> Performing Cluster Check (using node 127.0.0.1:7000)
M: 3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: 9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
M: 2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
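The three ranges above split the 16384 hash slots as evenly as possible, with sizes 5461, 5462 and 5461. The sketch below reproduces that split. Its rounding rule is an assumption modeled on this output rather than a quote of redis-cli's source: it advances a fractional cursor by 16384/N per master and rounds half away from zero. The function name `allocate_slots` is hypothetical.

```python
import math

TOTAL_SLOTS = 16384


def allocate_slots(masters: int) -> list[tuple[int, int]]:
    """Split the 16384 hash slots into contiguous (first, last) ranges, one per master."""
    if masters < 3:
        raise ValueError("Redis Cluster requires at least 3 master nodes")
    per_node = TOTAL_SLOTS / masters
    ranges = []
    first = 0
    cursor = 0.0
    for i in range(masters):
        # Round half away from zero (like C's lround) instead of Python's banker's rounding
        last = math.floor(cursor + per_node - 1 + 0.5)
        if last > TOTAL_SLOTS - 1 or i == masters - 1:
            last = TOTAL_SLOTS - 1  # the last master always ends at slot 16383
        ranges.append((first, last))
        first = last + 1
        cursor += per_node
    return ranges
```

`allocate_slots(3)` gives the same three ranges as the transcript. With 4 masters every range would hold exactly 4096 slots.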
Versions before 5.0:
./redis-trib.rb create --replicas 1 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005
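What the command means:
- The command given to redis-trib.rb is create, meaning we want to create a new cluster.
- The option --replicas 1 means we want one replica (slave) for every master in the cluster.
- The remaining arguments are the addresses of the instances from which the new cluster should be built.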
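With --replicas 1 and six addresses, the arithmetic leads to three masters with one replica each. The sketch below shows that arithmetic under a simplifying assumption. The first `len(nodes) // (replicas + 1)` addresses become masters, and the rest are attached to them round-robin. The real redis-trib.rb also tries to keep a master and its replicas on different hosts, so its actual pairing may differ. `plan_cluster` is a hypothetical name.

```python
def plan_cluster(nodes: list[str], replicas: int) -> dict[str, list[str]]:
    """Simplified master -> replicas layout for `create --replicas N`.

    Ignores host anti-affinity; only the master/replica counts are meant to match.
    """
    masters_count = len(nodes) // (replicas + 1)
    if masters_count < 3:
        raise ValueError("Redis Cluster requires at least 3 master nodes")
    masters = nodes[:masters_count]
    layout: dict[str, list[str]] = {m: [] for m in masters}
    # Hand out the remaining addresses to masters in round-robin order
    for i, replica in enumerate(nodes[masters_count:]):
        layout[masters[i % masters_count]].append(replica)
    return layout
```

For the six addresses above, this yields masters 7000, 7001 and 7002, each with one replica. Passing only four addresses with --replicas 1 would leave two masters, which is rejected.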
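Checking the cluster status
#-c: connect in cluster mode, so that redis-cli automatically follows MOVED and ASK redirections instead of just reporting them as errors.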
$ redis-cli -c -p 7000
127.0.0.1:7000> cluster info
Output:
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:3
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:1
cluster_stats_messages_ping_sent:774
cluster_stats_messages_pong_sent:762
cluster_stats_messages_sent:1536
cluster_stats_messages_ping_received:760
cluster_stats_messages_pong_received:774
cluster_stats_messages_meet_received:2
cluster_stats_messages_received:1536
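CLUSTER INFO replies with field:value lines, which are easy to check from a script. The sketch below assumes you already have the reply text, for example captured from `redis-cli -p 7000 cluster info`. `parse_cluster_info` and `cluster_healthy` are hypothetical helpers, and the three conditions that define "healthy" here are this sketch's own choice.

```python
def parse_cluster_info(text: str) -> dict[str, str]:
    """Turn CLUSTER INFO output (one field:value per line) into a dict of strings."""
    info = {}
    for line in text.strip().splitlines():  # splitlines() also handles \r\n
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info


def cluster_healthy(info: dict[str, str]) -> bool:
    # State ok, all 16384 slots assigned, and none of them on a PFAIL/FAIL node
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_assigned") == "16384"
        and info.get("cluster_slots_ok") == "16384"
    )


SAMPLE = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:3
cluster_size:3
"""
```

`parse_cluster_info(SAMPLE)["cluster_known_nodes"]` returns "3", and `cluster_healthy` accepts the sample above.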
Viewing basic node information:
127.0.0.1:7002> cluster nodes
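Output (fields: node ID, IP:port@cluster-bus-port, flags, master node ID ("-" for a master), ping sent, pong received, config epoch, link state, hash slots served):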
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542785512690 2 connected 5461-10922
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 master - 0 1542785511654 1 connected 0-5460
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 myself,master - 0 1542785510000 3 connected 10923-16383
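Each CLUSTER NODES line has eight fixed fields followed by the slot ranges the node serves. The sketch below parses that layout into a small record. `ClusterNode` and `parse_cluster_nodes` are hypothetical names. Entries in square brackets, which mark slots being migrated or imported, are simply skipped.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClusterNode:
    node_id: str
    address: str                 # ip:port, with the "@cluster-bus-port" suffix removed
    flags: set[str]              # e.g. {"myself", "master"}
    master_id: Optional[str]     # None when the field is "-" (the node is a master)
    config_epoch: int
    link_state: str              # "connected" or "disconnected"
    slots: list[tuple[int, int]] = field(default_factory=list)


def parse_cluster_nodes(text: str) -> list[ClusterNode]:
    """Parse CLUSTER NODES output: 8 fixed fields, then zero or more slot entries."""
    nodes = []
    for line in text.strip().splitlines():
        parts = line.split()
        node_id, addr, flags, master, _ping_sent, _pong_recv, epoch, link = parts[:8]
        slots = []
        for spec in parts[8:]:
            if spec.startswith("["):
                continue  # slot being migrated/imported, not plain ownership
            lo, _, hi = spec.partition("-")
            slots.append((int(lo), int(hi or lo)))  # a single slot has no "-"
        nodes.append(
            ClusterNode(
                node_id=node_id,
                address=addr.split("@")[0],
                flags=set(flags.split(",")),
                master_id=None if master == "-" else master,
                config_epoch=int(epoch),
                link_state=link,
                slots=slots,
            )
        )
    return nodes
```

Fed the three lines above, the parser returns the node flagged `myself` as 127.0.0.1:7002, serving slots 10923-16383. The slot ranges of all three nodes add up to 16384.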
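Testing writes and reads
Note: when connecting with redis-cli you must pass -c. This connects in cluster mode, so commands are automatically redirected to the node that holds the data. Otherwise you get errors such as: (error) MOVED 12182 127.0.0.1:7002.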
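The MOVED error quoted above carries the key's hash slot (12182 for foo) and the address of the node that owns that slot. The cluster specification linked at the top defines HASH_SLOT = CRC16(key) mod 16384, using the CRC-16/XMODEM variant. When a key contains a non-empty hash tag, only the substring inside {...} is hashed. The Python sketch below implements that formula. It then routes keys using the slot ranges of this three-master cluster, which is roughly what redis-cli -c achieves by following MOVED replies. `SLOT_MAP`, `node_for_key` and `parse_moved` are hypothetical names used for illustration.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """HASH_SLOT = CRC16(key) mod 16384, hashing only a non-empty {hash tag} if present."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


# Hypothetical slot map copied from this cluster's CLUSTER NODES output
SLOT_MAP = [
    (0, 5460, "127.0.0.1:7000"),
    (5461, 10922, "127.0.0.1:7001"),
    (10923, 16383, "127.0.0.1:7002"),
]


def node_for_key(key: bytes) -> str:
    slot = key_slot(key)
    for lo, hi, addr in SLOT_MAP:
        if lo <= slot <= hi:
            return addr
    raise LookupError(f"slot {slot} is not covered")


def parse_moved(error: str) -> tuple[int, str]:
    """Split a 'MOVED <slot> <host:port>' error reply into (slot, address)."""
    kind, slot, addr = error.split()
    if kind != "MOVED":
        raise ValueError(f"not a MOVED reply: {error}")
    return int(slot), addr
```

With this map, foo (slot 12182) resolves to 7002 and hello (slot 866) to 7000. The transcript's redirects below follow the same mapping.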
127.0.0.1:7000> set foo bar
-> Redirected to slot [12182] located at 127.0.0.1:7002
OK
127.0.0.1:7002> set hello world
-> Redirected to slot [866] located at 127.0.0.1:7000
OK
127.0.0.1:7000> get foo
-> Redirected to slot [12182] located at 127.0.0.1:7002
"bar"
127.0.0.1:7002> get hello
-> Redirected to slot [866] located at 127.0.0.1:7000
"world"
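Adding a master node to the cluster
- Create a new instance on port 7003 the same way as in the preparation steps, and start it.
- Add 7003 to the cluster:
The second argument, 127.0.0.1:7000, is a node that is already in the cluster. Any reachable node of the cluster will do; it does not have to be the first one.
The new node must not contain any data, otherwise you get: [ERR] Node 127.0.0.1:7003 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or contains some key in database 0. This usually happens when the directory of a Redis instance that is already in use has been copied. Connect to that instance with redis-cli and run FLUSHALL, then CLUSTER RESET.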
./redis-cli --cluster add-node 127.0.0.1:7003 127.0.0.1:7000
Output:
>>> Adding node 127.0.0.1:7003 to cluster 127.0.0.1:7000
>>> Performing Cluster Check (using node 127.0.0.1:7000)
M: 3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
M: 9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
M: 2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 127.0.0.1:7003 to make it join the cluster.
[OK] New node added correctly.
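The [ERR] message quoted earlier names two reasons a node counts as "not empty": it already knows other nodes, or database 0 holds keys. The sketch below checks those two conditions from the text of INFO keyspace and CLUSTER INFO. It follows the wording of the error message; that redis-cli checks exactly these fields internally is an assumption. `node_is_empty` is a hypothetical helper.

```python
def node_is_empty(info_keyspace: str, cluster_info: str) -> bool:
    """Return True if a node looks safe to add with `--cluster add-node`.

    Checks the two conditions named in the [ERR] message:
    - no "db0:" line in INFO keyspace (database 0 holds no keys), and
    - cluster_known_nodes is 1 in CLUSTER INFO (the node knows only itself).
    """
    has_db0 = any(line.startswith("db0:") for line in info_keyspace.splitlines())
    known_nodes = None
    for line in cluster_info.splitlines():
        if line.startswith("cluster_known_nodes:"):
            known_nodes = int(line.split(":", 1)[1])
    return not has_db0 and known_nodes == 1
```

The input strings would come from `redis-cli -p 7003 info keyspace` and `redis-cli -p 7003 cluster info`. Running FLUSHALL and then CLUSTER RESET on a copied instance should clear both conditions.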
Node information after the operation:
./redis-cli -c -p 7003
127.0.0.1:7003> cluster nodes
Output:
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542787865456 2 connected 5461-10922
a70d7fff6d6dde511cb7cb632a347be82dd34643 127.0.0.1:7003@17003 myself,slave 3bcdfbed858bbdd92dd760632b9cb4c649947fed 0 1542787863000 0 connected
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542787865000 3 connected 10923-16383
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 master - 0 1542787862000 1 connected 0-5460
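Node 7003 lists no hash slots after "connected". A newly added node is a master, but because it owns no slots it holds no data, and it does not take part in the vote when a replica wants to be promoted to master.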
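- Assign hash slots to the new node:
#The argument 127.0.0.1:7000 only tells redis-cli which cluster to connect to; the nodes to move slots between are asked for interactively afterwards.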
./redis-cli --cluster reshard 127.0.0.1:7000
Output:
>>> Performing Cluster Check (using node 127.0.0.1:7000)
M: 3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000
   slots:[0-5460] (5461 slots) master
   1 additional replica(s)
M: 9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
S: a70d7fff6d6dde511cb7cb632a347be82dd34643 127.0.0.1:7003
   slots: (0 slots) slave
   replicates 3bcdfbed858bbdd92dd760632b9cb4c649947fed
M: 2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
#Enter the number of slots to move (1000 here)
How many slots do you want to move (from 1 to 16384)? 1000
#Enter the ID of the node that should receive these slots (7003 here)
What is the receiving node ID? a70d7fff6d6dde511cb7cb632a347be82dd34643
#Choose where the slots come from:
#all takes slots from all masters,
#or enter the ID of each master to take slots from (7000 here), then finish with done
Please enter all the source node IDs.
  Type 'all' to use all the nodes as source nodes for the hash slots.
  Type 'done' once you entered all the source nodes IDs.
Source node #1: 3bcdfbed858bbdd92dd760632b9cb4c649947fed
Source node #2: done
#After the slots to be moved are listed, type yes to start moving the slots and their data.
Do you want to proceed with the proposed reshard plan (yes/no)? yes
#Done
- Check the result:
./redis-cli -c -p 7000
cluster nodes
Output:
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542790503483 3 connected 10923-16383
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 myself,master - 0 1542790503000 1 connected 1000-5460
e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003@17003 master - 0 1542790502458 4 connected 0-999
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542790504513 2 connected 5461-10922
The cluster information now shows that 7003 owns hash slots 0-999, while 7000 is left with 1000-5460.
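The result above matches taking the lowest-numbered 1000 slots from the single source master. The sketch below computes that outcome. It is an assumption that redis-cli always picks slots this way; with several sources it splits the count across them. `expand`, `compress` and `reshard_from_one_source` are hypothetical helpers.

```python
def expand(ranges: list[tuple[int, int]]) -> list[int]:
    """Turn (lo, hi) ranges into a flat list of slot numbers."""
    return [s for lo, hi in ranges for s in range(lo, hi + 1)]


def compress(slots: list[int]) -> list[tuple[int, int]]:
    """Collapse a sorted list of slot numbers back into (lo, hi) ranges."""
    ranges: list[tuple[int, int]] = []
    for s in slots:
        if ranges and ranges[-1][1] == s - 1:
            ranges[-1] = (ranges[-1][0], s)
        else:
            ranges.append((s, s))
    return ranges


def reshard_from_one_source(source_ranges, count):
    """Move `count` slots from one source master, lowest-numbered first.

    Returns (moved_ranges, remaining_ranges).
    """
    slots = sorted(expand(source_ranges))
    if count > len(slots):
        raise ValueError("source does not own that many slots")
    return compress(slots[:count]), compress(slots[count:])
```

Moving 1000 slots out of 7000's 0-5460 yields 0-999 for the receiver and 1000-5460 left on 7000, as shown above. The same helper describes emptying a master before removal: moving all 1000 of 7003's slots leaves it with none.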
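Adding a slave node to the cluster
- Create a new instance on port 7004 the same way, and start it.
- Add 7004 to the cluster:
No master is specified, so redis-cli picks one automatically; here it chose 7000 as 7004's master.
Note: the 127.0.0.1:7000 after add-node does not mean that 7000 becomes the new node's master.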
./redis-cli --cluster add-node 127.0.0.1:7004 127.0.0.1:7000 --cluster-slave
Output:
>>> Adding node 127.0.0.1:7004 to cluster 127.0.0.1:7000
>>> Performing Cluster Check (using node 127.0.0.1:7000)
M: 3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000
   slots:[1000-5460] (4461 slots) master
M: 9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002
   slots:[10923-16383] (5461 slots) master
M: e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003
   slots:[0-999] (1000 slots) master
M: 2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001
   slots:[5461-10922] (5462 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Automatically selected master 127.0.0.1:7000
>>> Send CLUSTER MEET to node 127.0.0.1:7004 to make it join the cluster.
Waiting for the cluster to join
>>> Configure node as replica of 127.0.0.1:7000.
[OK] New node added correctly.
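redis-cli printed "Automatically selected master 127.0.0.1:7000" above. As far as I understand, it picks the master that currently has the fewest replicas. Here all four masters had zero replicas, so 7000 won because it was checked first. The sketch below expresses that rule. Treat both the fewest-replicas rule and the tie-breaking by order as assumptions. `pick_master_for_replica` is a hypothetical name.

```python
def pick_master_for_replica(masters: list[str], replica_of: dict[str, str]) -> str:
    """Pick the master with the fewest replicas; ties go to the earliest master listed.

    masters:    master addresses, in the order they are checked
    replica_of: existing replica address -> its master's address
    """
    counts = {m: 0 for m in masters}
    for master in replica_of.values():
        if master in counts:
            counts[master] += 1
    # min() returns the first element with the smallest count
    return min(masters, key=lambda m: counts[m])
```

Under this rule, adding another replica after 7004 had attached to 7000 would select the next master with no replicas, 7002 in the order printed above.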
- You can also specify the master when adding the node; --cluster-master-id is the master's node ID:
./redis-cli --cluster add-node 127.0.0.1:7004 127.0.0.1:7000 --cluster-slave --cluster-master-id 2a8f29e22ec38f56e062f588e5941da24a2bafa0
- Change 7004's master to 7002:
./redis-cli -p 7004
127.0.0.1:7004> cluster replicate 9b022d79cf860c87dc2190cdffc55b282dd60e42
OK
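Removing a slave node
#Syntax: redis-cli --cluster del-node ip:port <node-id> (before 5.0: redis-trib.rb del-node ip:port '<node-id>')
#The node removed here is 7004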
./redis-cli --cluster del-node 127.0.0.1:7000 74957282ffa94c828925c4f7026baac04a67e291
Output:
>>> Removing node 74957282ffa94c828925c4f7026baac04a67e291 from cluster 127.0.0.1:7000
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.
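Removing a master node
Before removing a master, first use reshard to move all of its slots away, then delete the node. (Currently the removed master's slots can only be moved to one node per reshard.)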
./redis-cli --cluster reshard 127.0.0.1:7000
#Enter the number of slots to move (7003 has 1000 slots; move all of them)
How many slots do you want to move (from 1 to 16384)? 1000
#Enter the ID of the node that should receive these slots
What is the receiving node ID? 3bcdfbed858bbdd92dd760632b9cb4c649947fed
#Choose where the slots come from:
#all takes slots from all masters,
#or enter the ID of each master to take slots from, then finish with done
Please enter all the source node IDs.
  Type 'all' to use all the nodes as source nodes for the hash slots.
  Type 'done' once you entered all the source nodes IDs.
Source node #1: a70d7fff6d6dde511cb7cb632a347be82dd34643
Source node #2: done
#After the slots to be moved are listed, type yes to start moving the slots and their data.
Do you want to proceed with the proposed reshard plan (yes/no)? yes
#Done
#Delete the now-empty master node
./redis-cli --cluster del-node 127.0.0.1:7000 'a70d7fff6d6dde511cb7cb632a347be82dd34643'
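The order matters because a master that still owns slots cannot simply be dropped; its slots would become uncovered. The sketch below is a hypothetical pre-check, `check_removable`, run against CLUSTER NODES output. It refuses a node that still lists slot ranges. redis-cli performs its own check and prints its own message; this is only an illustration of the rule.

```python
def check_removable(cluster_nodes: str, node_id: str) -> None:
    """Raise if node_id still owns hash slots; return None if it owns none.

    Hypothetical pre-check; redis-cli does its own check with its own error text.
    """
    for line in cluster_nodes.strip().splitlines():
        fields = line.split()
        if fields[0] != node_id:
            continue
        # Fields after the first 8 are slot ranges; [..] entries are migration markers
        owned = [spec for spec in fields[8:] if not spec.startswith("[")]
        if owned:
            raise RuntimeError(
                f"{fields[1]} still owns slots {' '.join(owned)}; reshard them away first"
            )
        return
    raise KeyError(f"unknown node id {node_id}")
```

Before the reshard, 7003 still lists 0-999, so the check fails. After its slots have moved to 7000, the same node passes.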
Failure testing
Start a cluster in which 7004 is a replica (slave) of 7003:
127.0.0.1:7003> cluster nodes
ea4e0dcf8dbf6d4611659b5abbd6563926224f0f 127.0.0.1:7004@17004 slave e852e07181f20dd960407e5b08f7122870f67c89 0 1542793126295 4 connected
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 master - 0 1542793125260 1 connected 1000-5460
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542793124000 2 connected 5461-10922
e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003@17003 myself,master - 0 1542793124000 4 connected 0-999
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542793126000 3 connected 10923-16383
Add the key "hello" to the cluster; it is stored on node 7003:
127.0.0.1:7002> set hello world
-> Redirected to slot [866] located at 127.0.0.1:7003
OK
Crash node 7003:
./redis-cli -p 7003 debug segfault
Error: Server closed the connection
Check the node states: 7003 is marked fail, and 7004 has been promoted to master:
127.0.0.1:7000> cluster nodes
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542793571000 3 connected 10923-16383
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 myself,master - 0 1542793570000 1 connected 1000-5460
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542793571422 2 connected 5461-10922
ea4e0dcf8dbf6d4611659b5abbd6563926224f0f 127.0.0.1:7004@17004 master - 0 1542793572442 5 connected 0-999
e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003@17003 master,fail - 1542793477237 1542793474000 4 disconnected
Getting "hello" from the cluster is now redirected to 7004:
127.0.0.1:7000> get hello
-> Redirected to slot [866] located at 127.0.0.1:7004
"world"
Restart 7003: it rejoins the cluster automatically, now as a slave:
127.0.0.1:7004> cluster nodes
ea4e0dcf8dbf6d4611659b5abbd6563926224f0f 127.0.0.1:7004@17004 myself,master - 0 1542793764000 5 connected 0-999
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 master - 0 1542793765000 1 connected 1000-5460
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542793765560 2 connected 5461-10922
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542793764529 3 connected 10923-16383
e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003@17003 slave ea4e0dcf8dbf6d4611659b5abbd6563926224f0f 0 1542793766585 5 connected
Getting "hello" is redirected to the master, 7004:
127.0.0.1:7002> get hello
-> Redirected to slot [866] located at 127.0.0.1:7004
"world"
Take both 7003 and 7004 offline, then get "hello" again: the error says the cluster is down.
127.0.0.1:7000> cluster nodes
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542794095233 3 connected 10923-16383
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 myself,master - 0 1542794094000 1 connected 1000-5460
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542794096261 2 connected 5461-10922
ea4e0dcf8dbf6d4611659b5abbd6563926224f0f 127.0.0.1:7004@17004 master,fail - 1542794075628 1542794074000 5 disconnected 0-999
e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003@17003 slave,fail ea4e0dcf8dbf6d4611659b5abbd6563926224f0f 1542794070058 1542794067000 5 disconnected
127.0.0.1:7000> get hello
(error) CLUSTERDOWN The cluster is down
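The cluster is down because slots 0-999 are no longer served. Their only master, 7004, is flagged fail, and its replica 7003 is failed too. The default cluster-require-full-coverage yes makes nodes stop serving queries when any slot is uncovered. The sketch below, using the hypothetical `uncovered_slots`, finds such slots from CLUSTER NODES output. It is a simplification. Only the FAIL flag counts (a PFAIL "fail?" master is still treated as serving its slots). Other conditions that also take a node's view of the cluster down, such as losing contact with the majority of masters, are ignored.

```python
TOTAL_SLOTS = 16384


def uncovered_slots(cluster_nodes: str) -> list[int]:
    """Return the slots not served by any master that is not flagged 'fail'."""
    covered = [False] * TOTAL_SLOTS
    for line in cluster_nodes.strip().splitlines():
        fields = line.split()
        flags = fields[2].split(",")
        # Replicas never serve slots; a master in FAIL state no longer counts
        if "master" not in flags or "fail" in flags:
            continue
        for spec in fields[8:]:
            if spec.startswith("["):
                continue  # migrating/importing markers are not ownership
            lo, _, hi = spec.partition("-")
            for slot in range(int(lo), int(hi or lo) + 1):
                covered[slot] = True
    return [slot for slot, ok in enumerate(covered) if not ok]
```

For the output above, this returns slots 0 through 999. For the earlier state, where 7004 had taken over 0-999 after 7003 crashed, it returns an empty list.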
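Start 7003: its log shows errors. 7003 was a replica of 7004 when it went down, so after restarting it tries to sync from that master, which is still offline: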
Connecting to MASTER 127.0.0.1:7004
MASTER <-> REPLICA sync started
Error condition on socket for SYNC: Connection refused
Start 7004 and get "hello" again: everything is back to normal.
127.0.0.1:7002> get hello
-> Redirected to slot [866] located at 127.0.0.1:7004
"world"
Repairing a cluster failure
Now there is a 4-node cluster, 7000 to 7003, in which slots 0-999 are on 7003, so the key written by "set hello world" is stored on 7003.
127.0.0.1:7000> cluster nodes
e852e07181f20dd960407e5b08f7122870f67c89 127.0.0.1:7003@17003 master - 0 1542852396911 6 connected 0-999
9b022d79cf860c87dc2190cdffc55b282dd60e42 127.0.0.1:7002@17002 master - 0 1542852395887 3 connected 10923-16383
2a8f29e22ec38f56e062f588e5941da24a2bafa0 127.0.0.1:7001@17001 master - 0 1542852394863 2 connected 5461-10922
3bcdfbed858bbdd92dd760632b9cb4c649947fed 127.0.0.1:7000@17000 myself,master - 0 1542852395000 1 connected 1000-5460
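After 7003 is shut down, every read or write on the cluster fails with (error) CLUSTERDOWN The cluster is down, even for keys stored on other nodes such as foo. The reason is that slots 0-999 live only on 7003. With no node serving them, Redis considers the slot coverage incomplete and, with the default setting cluster-require-full-coverage yes, refuses all queries.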
127.0.0.1:7000> get hello
(error) CLUSTERDOWN The cluster is down
127.0.0.1:7000> get foo
(error) CLUSTERDOWN The cluster is down
Running the fix command fails to connect, because 7003 has already been shut down [Facepalm]
./redis-cli --cluster fix 127.0.0.1:7003
Could not connect to Redis at 127.0.0.1:7003: Connection refused
Restarting 7003 brings the cluster back to normal.