Official documentation:
https://github.com/FederatedAI/AnsibleFATE/blob/main/docs/ansible_deploy_FATE_manual.md
https://github.com/FederatedAI/AnsibleFATE/blob/main/docs/ansible_deploy_two_sides.md
Detailed documentation on Gitee:
docs/ansible_deploy_one_side.md · 亦一亦二/AnsibleFATE - Gitee.com
I. Preparation
1. Set the hostnames
On the first machine:
hostnamectl set-hostname fate01
On the second machine:
hostnamectl set-hostname fate02
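2. Disable SELinux
Check whether SELinux is installed.
On CentOS, run: rpm -qa | grep selinux
On Ubuntu, run: apt list --installed | grep selinux
If SELinux is installed, run: setenforce 0 (this switches SELinux to permissive mode until the next reboot)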
3. Adjust Linux system parameters
vi /etc/security/limits.conf
If the following lines are not present, add them:
* soft nofile 65535
* hard nofile 65535
* soft nproc 65535
* hard nproc 65535
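The check-then-append step above can be scripted so that it is safe to run more than once. The sketch below is only an illustration: the names `ensure_limit_line`, `apply_limits` and the `LIMITS_FILE` variable are assumptions made for this example and are not part of AnsibleFATE. Each line is appended only when an identical line is not already in the file.

```shell
#!/usr/bin/env bash
# LIMITS_FILE defaults to the real file; point it at a copy to try this safely.
LIMITS_FILE="${LIMITS_FILE:-/etc/security/limits.conf}"

# Hypothetical helper: append a line only if the exact line is missing.
ensure_limit_line() {
    local line="$1"
    # -x matches the whole line, -F treats the pattern as a fixed string
    if ! grep -qxF -- "$line" "$LIMITS_FILE" 2>/dev/null; then
        printf '%s\n' "$line" >> "$LIMITS_FILE"
    fi
}

# Hypothetical helper: apply the four limits listed above.
apply_limits() {
    ensure_limit_line '* soft nofile 65535'
    ensure_limit_line '* hard nofile 65535'
    ensure_limit_line '* soft nproc 65535'
    ensure_limit_line '* hard nproc 65535'
}
```

Calling `apply_limits` as root updates the file; calling it a second time leaves the file unchanged.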
4. Move the 20-nproc.conf file aside (on CentOS 7 it would otherwise override the nproc limits set above)
cd /etc/security/limits.d
ls -lrt 20-nproc.conf
# If the file exists:
mv 20-nproc.conf 20-nproc.conf_bak
5. Move the system MySQL configuration aside
mv /etc/my.cnf /etc/my.cnf_bak
6. Disable the firewall (optional)
systemctl disable firewalld.service
systemctl stop firewalld.service
systemctl status firewalld.service
## On Ubuntu:
ufw disable
ufw status
7. Create the user
groupadd apps
useradd -s /bin/bash -g apps -d /home/app app
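8. Create directories, grant sudo rights, and set up passwordless login
# Create the directories
mkdir -pv /data/projects /data/temp /data/logs
chown -R app:apps /data/projects /data/temp /data/logs
# Grant sudo rights
echo "app ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
# Passwordless login:
# Switch to the app user and set up passwordless SSH login: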
su - app
ssh-keygen -t rsa
ssh-copy-id -i app@192.168.0.1
9. Add swap space
cd /data
dd if=/dev/zero of=/data/swapfile128G bs=1024 count=134217728
mkswap /data/swapfile128G
swapon /data/swapfile128G
cat /proc/swaps
echo '/data/swapfile128G swap swap defaults 0 0' >> /etc/fstab
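The size of the swap file follows from the dd arguments: block size times block count. The short calculation below, a sketch that only reuses the two numbers from the dd command above, confirms that they add up to the 128G the file name promises.

```shell
#!/usr/bin/env bash
# Size written by dd = bs * count (values copied from the dd command above).
bs=1024
count=134217728
swap_bytes=$(( bs * count ))
swap_gib=$(( swap_bytes / 1024 / 1024 / 1024 ))
echo "swap file size: ${swap_gib} GiB"   # prints: swap file size: 128 GiB
```

To create a smaller or larger swap file, scale `count` accordingly while keeping `bs=1024`.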
10. Install Ansible
# If the yum repositories do not provide ansible, configure the EPEL repository:
mv /etc/yum.repos.d/epel.repo /etc/yum.repos.d/epel.repo.backup
mv /etc/yum.repos.d/epel-testing.repo /etc/yum.repos.d/epel-testing.repo.backup
wget -O /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
yum clean all && yum makecache
yum install ansible
# Edit the Ansible configuration in /etc/ansible/ansible.cfg:
remote_user = app
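II. Deployment
This chapter covers one of the single-side scenarios for deploying a FATE cluster with Ansible: deploying the host and the guest separately.
role | partyid | IP address | OS | Host spec | Storage | Deployed modules |
host | 10000 | 192.168.0.1 (has internet access) | CentOS 7.2/Ubuntu 18.04 | 8C16G | 500G | fate_flow,fateboard,clustermanager,nodemanager,rollsite,mysql |
guest | 9999 | 192.168.0.2 | CentOS 7.2/Ubuntu 18.04 | 8C16G | 500G | fate_flow,fateboard,clustermanager,nodemanager,rollsite,mysql |
Host resource and operating system requirements
Category | Description |
Host spec | At least 8 cores, 16 GB RAM, and 500 GB disk; gigabit NIC |
Operating system | CentOS Linux 7.2 or later but below 8, or Ubuntu 18.04 |
Dependencies | The following dependency packages must be installed: |
User | User: app, group: apps (the app user must be able to run sudo su root without a password) |
File system | 1. The data disk is mounted under /data. |
Swap | At least 128G |
System parameters | 1. The file handle limit is at least 65535. |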
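2 Deployment targets
(1) Host side
Party Id: 10000
Role | IP | Port | Description |
rollsite | 192.168.0.1 | 9370 | Cross-site (cross-party) communication component |
fate_flow | 192.168.0.1 | 9360;9380 | Federated learning task pipeline management module |
clustermanager | 192.168.0.1 | 4670 | The cluster manager manages the cluster |
nodemanager | 192.168.0.1 | 4671 | The node manager manages the resources of each machine |
fateboard | 192.168.0.1 | 8080 | Visualization module for the federated learning process |
mysql | 192.168.0.1 | 3306 | Data storage; required by clustermanager and fateflow |
(2) Guest side
Party Id: 9999
Role | IP | Port | Description |
rollsite | 192.168.0.2 | 9370 | Cross-site (cross-party) communication component |
fate_flow | 192.168.0.2 | 9360;9380 | Federated learning task pipeline management module |
clustermanager | 192.168.0.2 | 4670 | The cluster manager manages the cluster |
nodemanager | 192.168.0.2 | 4671 | The node manager manages the resources of each machine |
fateboard | 192.168.0.2 | 8080 | Visualization module for the federated learning process |
mysql | 192.168.0.2 | 3306 | Data storage; required by clustermanager and fateflow |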
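The port plan above can be checked from any machine that can reach both parties: before deployment to confirm the ports are free, and after deployment to confirm the services are listening. The following is a minimal sketch, assuming bash was built with /dev/tcp support and that the coreutils `timeout` command is available; the function names `port_open` and `check_party` are made up for this example.

```shell
#!/usr/bin/env bash
# Ports taken from the tables above.
FATE_PORTS=(9370 9360 9380 4670 4671 8080 3306)

# Returns 0 if a TCP connection to host:port succeeds within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device and coreutils' timeout.
port_open() {
    local host="$1" port="$2"
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical helper: print one status line per port for the given host.
check_party() {
    local host="$1" port
    for port in "${FATE_PORTS[@]}"; do
        if port_open "$host" "$port"; then
            echo "${host}:${port} open"
        else
            echo "${host}:${port} closed"
        fi
    done
}
```

For example, `check_party 192.168.0.1` reports the host side and `check_party 192.168.0.2` the guest side. A port reported as closed may also mean that a firewall is dropping the connection.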
3 Download the offline installation package
wget https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/AnsibleFATE_1.7.2_release-offline.tar.gz
tar -zxvf AnsibleFATE_1.7.2_release-offline.tar.gz
cd AnsibleFATE_1.7.2_release-offline
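4 Configuration (host)
4.1 Initialize the configuration
- Step 1:
# Generate the initial configuration with the helper script:
sh deploy/deploy.sh init -h="10000:192.168.0.1"
- Step 2: Adjust the configuration as needed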
vim deploy/conf/setup.conf
#base setup
env: prod
pname: fate
ssh_port: 22
deploy_user: app
deploy_group: apps
#
#deploy mode: deploy|install|config|uninstall
deploy_mode: deploy
#
#module list: mysql|eggroll|fate_flow|fateboard
modules:
  - mysql
  - eggroll
  - fate_flow
  - fateboard
#
#role list: host|guest|exchange
roles:
  - host:10000
#
#ssl role list: host && guest | host&&exchange | guest&&exchange
ssl_roles: []
#
polling: {}
#host ip lists
#host_ips: []
host_ips:
  - default:192.168.0.1
#
#extra host rules
host_special_routes:
  - default:192.168.0.2:9370   # guest IP; this entry must be added manually, and an extra route pointing at an exchange can also be set here
#guest ip lists
#guest_ips: []
guest_ips: []
#
#extra guest rules
guest_special_routes: []
#
#exchange ip lists
exchange_ips: []
#
#extra exchange rules
exchange_special_routes: []
default_engines: eggroll
- Step 3: Run the helper script to render the configuration
bash deploy/deploy.sh render
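A malformed special-route entry in setup.conf is easy to miss until the parties fail to reach each other. The sketch below splits an entry into its parts and rejects obviously broken ones before rendering. It assumes the entry format is id:ip:port, which is inferred from the example `default:192.168.0.2:9370` and not from a format specification; the function name `parse_route` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a special-route entry of the assumed form id:ip:port.
parse_route() {
    local entry="$1" id ip port
    IFS=':' read -r id ip port <<< "$entry"
    # Reject entries with a missing field or a non-numeric port.
    if [[ -z "$id" || -z "$ip" || ! "$port" =~ ^[0-9]+$ ]]; then
        echo "invalid route entry: $entry" >&2
        return 1
    fi
    echo "id=$id ip=$ip port=$port"
}

parse_route "default:192.168.0.2:9370"   # prints: id=default ip=192.168.0.2 port=9370
```

An entry with a missing port, such as `default:192.168.0.2`, makes the function print an error and return a non-zero status.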
4.2 Configure the host settings
Edit the following file if needed; by default no changes are required.
vi var_files/prod/fate_host
host:
  partyid: 10000
  rollsite:
    enable: true
    coordinator: fate
    ips:
    - 192.168.0.1
    port: 9370
    secure_port: 9371
    server_secure: false
    client_secure: false
    polling:
      enable: false
    route_tables:
    - id: default
      routes:
      - name: default
        ip: 192.168.0.2
        port: 9370
        is_secure: false
    - id: 10000
      routes:
      - name: default
        ip: 192.168.0.1
        port: 9370
        is_secure: false
      - name: fateflow
        ip: 192.168.0.1
        port: 9360
  clustermanager:
    enable: true
    ips:
    - 192.168.0.1
    port: 4670
    cores_per_node: 16
  nodemanager:
    enable: true
    ips:
    - 192.168.0.1
    port: 4671
  eggroll:
    dbname: eggroll_meta
    egg: 4
  fate_flow:
    enable: true
    ips:
    - 192.168.0.1
    grpcPort: 9360
    httpPort: 9380
    dbname: fate_flow
    proxy: rollsite
    http_app_key:
    http_secret_key:
    use_deserialize_safe_module: false
    default_engines: eggroll
  fateboard:
    enable: true
    ips:
    - 192.168.0.1
    port: 8080
    dbname: fate_flow
  mysql:
    enable: true
    type: inside
    ips:
    - 192.168.0.1
    port: 3306
    dbuser: fate
    dbpasswd: fate_deV2999
  zk:
    enable: false
    lists:
    - ip: 127.0.0.1
      port: 2181
    use_acl: false
    user: fate
    passwd: fate
  servings:
    ips:
    - 127.0.0.1
    port: 8000
4.3 Run the deployment
Deploy all services:
bash deploy/deploy.sh deploy
View the deployment log: tail -f logs/deploy-??.log
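5 Configuration (guest)
5.1 Initialize the configuration
- Step 1:
# Generate the initial configuration with the helper script:
sh deploy/deploy.sh init -g="9999:192.168.0.2"
- Step 2: Adjust the configuration as needed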
vim deploy/conf/setup.conf
#base setup
env: prod
pname: fate
ssh_port: 22
deploy_user: app
deploy_group: apps
#
#deploy mode: deploy|install|config|uninstall
deploy_mode: deploy
#
#module list: mysql|eggroll|fate_flow|fateboard
modules:
  - mysql
  - eggroll
  - fate_flow
  - fateboard
#
#role list: host|guest|exchange
roles:
  - guest:9999
#
#ssl role list: host && guest | host&&exchange | guest&&exchange
ssl_roles: []
#
polling: {}
#host ip lists
#host_ips: []
host_ips: []
#
#extra host rules
host_special_routes: []
#guest ip lists
#guest_ips: []
guest_ips:
  - default:192.168.0.2
#
#extra guest rules
guest_special_routes:
  - default:192.168.0.1:9370   # host IP; this entry must be added manually, and an extra route pointing at an exchange can also be set here
#
#exchange ip lists
exchange_ips: []
#
#extra exchange rules
exchange_special_routes: []
default_engines: eggroll
- Step 3: Run the helper script to render the configuration
bash deploy/deploy.sh render
5.2 Configure the guest settings
Edit the following file if needed; by default no changes are required.
vi var_files/prod/fate_guest
guest:
  partyid: 9999
  rollsite:
    enable: true
    coordinator: fate
    ips:
    - 192.168.0.2
    port: 9370
    secure_port: 9371
    server_secure: false
    client_secure: false
    polling:
      enable: false
    route_tables:
    - id: default
      routes:
      - name: default
        ip: 192.168.0.1
        port: 9370
        is_secure: false
    - id: 9999
      routes:
      - name: default
        ip: 192.168.0.2
        port: 9370
        is_secure: false
      - name: fateflow
        ip: 192.168.0.2
        port: 9360
  clustermanager:
    enable: true
    ips:
    - 192.168.0.2
    port: 4670
    cores_per_node: 16
  nodemanager:
    enable: true
    ips:
    - 192.168.0.2
    port: 4671
  eggroll:
    dbname: eggroll_meta
    egg: 4
  fate_flow:
    enable: true
    ips:
    - 192.168.0.2
    grpcPort: 9360
    httpPort: 9380
    dbname: fate_flow
    proxy: rollsite
    http_app_key:
    http_secret_key:
    use_deserialize_safe_module: false
    default_engines: eggroll
  fateboard:
    enable: true
    ips:
    - 192.168.0.2
    port: 8080
    dbname: fate_flow
  mysql:
    enable: true
    type: inside
    ips:
    - 192.168.0.2
    port: 3306
    dbuser: fate
    dbpasswd: fate_deV2999
  zk:
    enable: false
    lists:
    - ip: 127.0.0.1
      port: 2181
    use_acl: false
    user: fate
    passwd: fate
  servings:
    ips:
    - 127.0.0.1
    port: 8000
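5.3 Run the deployment
Deploy all services:
bash deploy/deploy.sh deploy
View the deployment log: tail -f logs/deploy-??.log
5.4 Check the processes
# Check that the processes in the deployment plan have started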
ps -ef | grep -i clustermanager
ps -ef | grep -i nodemanager
ps -ef | grep -i rollsite
ps -ef | grep -i fate_flow_server.py
ps -ef | grep -i fateboard
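The five checks above can be folded into one loop that prints a single status line per service. This is a sketch only: it assumes `pgrep` (from procps) is installed, and the function names `report_proc` and `report_all` are invented for this example. Note that `pgrep -f` matches against the full command line, so any other process whose command line happens to contain the pattern is also counted as a match.

```shell
#!/usr/bin/env bash
# Process patterns copied from the ps commands above.
FATE_PROCS=(clustermanager nodemanager rollsite fate_flow_server.py fateboard)

# Prints "<pattern>: running" or "<pattern>: NOT running" for one pattern.
report_proc() {
    local pattern="$1"
    # pgrep -f matches the pattern against the full command line
    if pgrep -f -- "$pattern" > /dev/null; then
        echo "${pattern}: running"
    else
        echo "${pattern}: NOT running"
    fi
}

# Hypothetical helper: report on every pattern in FATE_PROCS.
report_all() {
    local p
    for p in "${FATE_PROCS[@]}"; do
        report_proc "$p"
    done
}
```

Running `report_all` on each party gives a quick overview; any line ending in "NOT running" points at a service to look up in the deployment log.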
III. Post-deployment operations
1. Clean up the temporary deployment directory
bash /data/projects/tools/clean_tmp.sh
2. Start and stop services
/data/projects/common/supervisord/service.sh status|start|restart <service name>|all
# e.g.:
/data/projects/common/supervisord/service.sh status all
3. FATE-Board URL after deployment
http://192.168.0.1:8080
# Login: admin/admin
# The MySQL root password is stored by default in var_files/*/fate_init
# Route table: /data/projects/fate/eggroll/conf/route_table.json
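4. Add a node
(1) Copy all files (/data/projects) from a machine that runs only the nodemanager service to the new node, excluding the data under the data directory (/data/projects/data/fate/eggroll), then start the nodemanager service.
(2) In the eggroll* database, add a row to the server_node table for the new IP with the nodemanager role (use the existing rows in the table as a reference).
(3) The crontab entry that starts the supervisor service must also be copied over and set up on the new node.
Run /data/projects/common/supervisord/boot.sh (starts supervisor).
Run /data/projects/common/supervisord/service.sh status all to check whether nodemanager has started on this node.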