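I. Reproducing the scenario
(1) Business requirement
A web page needs to look up the IPs of service instances registered in Nacos, across different Nacos clusters and different namespaces.
(2) Symptoms
Config pushes on the Nacos cluster were timing out, and the Nacos logs showed a steady, uninterrupted 20~30 QPS of requests. Printing the requests that came from the source site showed that their parameters were almost identical, differing only in the port.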
2021-01-11 12:43:20,133 INFO /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=MusicService&udpPort=59393&encoding=UTF-8,172.16.2.21
2021-01-11 12:43:20,256 INFO /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=PhotoService&udpPort=52456&encoding=UTF-8,172.16.2.21
2021-01-11 12:43:20,155 INFO /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=FruitService&udpPort=30299&encoding=UTF-8,172.16.2.21
2021-01-11 12:43:20,236 INFO /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=pro&clientIP=172.16.2.21&serviceName=CityService&udpPort=48202&encoding=UTF-8,172.16.2.21
2021-01-11 12:43:20,742 INFO /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=dev&clientIP=172.16.2.21&serviceName=CountryService&udpPort=12920&encoding=UTF-8,172.16.2.21
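To confirm that such requests differ only in udpPort, the log lines can be grouped by namespace and service name while the port is ignored. The following is a minimal sketch that assumes the log format shown above; the class and method names (RequestLogGrouper, countByService, param) are hypothetical helpers, not part of Nacos.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups Nacos request log lines while ignoring udpPort. */
final class RequestLogGrouper {

    /** Extracts one query parameter from a log line, or "" if it is absent. */
    static String param(String line, String name) {
        // In the log a parameter follows '?', '&' or ',' and ends at '&', ',' or whitespace.
        Matcher m = Pattern.compile("[?&,]" + Pattern.quote(name) + "=([^&,\\s]*)").matcher(line);
        return m.find() ? m.group(1) : "";
    }

    /** Counts requests per "namespaceId/serviceName"; udpPort plays no part in the key. */
    static Map<String, Integer> countByService(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            String key = param(line, "namespaceId") + "/" + param(line, "serviceName");
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Feeding it the lines of the request log gives one counter per namespace and service. A handful of keys with very large counts indicates repeated polling rather than distinct user queries.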
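Source site code
Maven dependency
<dependency>
    <groupId>com.alibaba.nacos</groupId>
    <artifactId>nacos-client</artifactId>
    <version>1.3.3</version>
</dependency>
Code
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

private List<Instance> queryInstances(String serviceName, String ip, int port, String namespace) {
    try {
        // Point the client at the requested cluster and namespace.
        Properties properties = new Properties();
        properties.setProperty("namespace", namespace);
        properties.setProperty("serverAddr", ip + ":" + port);
        // A new NamingService is created on every call.
        NamingService naming = NamingFactory.createNamingService(properties);
        List<Instance> instances = naming.selectInstances(serviceName, true);
        log.info("nacos query result, ip:{}, size:{}", ip, instances.size());
        return instances;
    } catch (Exception e) {
        log.warn("nacos query failed", e);
    }
    return Collections.emptyList();
}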
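Searching the source site's logs by keyword showed only a dozen or so query calls, nowhere near the hundreds of thousands of requests seen on Nacos, which was baffling. Once the source site's process was stopped, the requests disappeared.
Two days later the IP lookup feature was needed again, so the source site's process was started again. Before long the 20~30 QPS of Nacos requests reappeared.
(3) Analysis
A tcpdump capture taken on the source site's machine showed that a single call to the page endpoint produced 20~30 Nacos query requests over TCP, some of which had no response body.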
No. Time Source Destination Protocol Length Info
5 0.291331 172.16.2.21 172.16.2.22 HTTP 372 GET /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=MusicService&udpPort=43212&encoding=UTF-8
8 0.223167 172.16.2.21 172.16.2.22 HTTP 372 GET /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=MusicService&udpPort=21345&encoding=UTF-8
10 0.182930 172.16.2.21 172.16.2.22 HTTP 372 GET /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=MusicService&udpPort=21245&encoding=UTF-8
11 0.308301 172.16.2.21 172.16.2.22 HTTP 371 GET /nacos/v1/ns/instance/list,app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=CityService&udpPort=23145&encoding=UTF-8
GET /nacos/v1/ns/instance/list?app=unknown&healthyOnly=false&namespaceId=test&clientIP=172.16.2.21&serviceName=CityService&udpPort=23145&encoding=UTF-8 HTTP/1.1
Client-Version: Nacos-Java-client:v1.3.3
User-Agent: Nacos-Java-Client:v1.3.3
Accept-Encoding: gzip,deflate,sdch
RequestId: ba341cd2-a5cd-d132-67cb-e1f34135a12c
Request-Module: Naming
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Accept-Charset: UTF-8
Host: 172.16.2.22:10812
Accept: text/html,image/gif,image/jpeg,*; q=.2,*/*;q=.2
Connection: keep-alive
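From this it can be inferred that the Nacos subscription mechanism is responsible: it keeps issuing requests continuously.
Reading the nacos client source code turned up many fields and classes with a Reactor suffix being initialized, which further supports the subscription hypothesis.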
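The accumulation can be pictured with a toy model. This is not Nacos code: ToyPollingClient and its counters are hypothetical. It only assumes, as the differing udpPort values suggest, that every client instance starts its own background refresh task, and that the task keeps running until the client is shut down.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for a subscribing client; NOT the Nacos implementation. */
final class ToyPollingClient implements AutoCloseable {
    /** Clients whose background poller is still scheduled. */
    static final AtomicInteger LIVE_POLLERS = new AtomicInteger();
    /** Polls issued by all clients, i.e. the load the server would see. */
    static final AtomicInteger POLLS_SENT = new AtomicInteger();

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "toy-poller");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

    ToyPollingClient(long periodMillis) {
        LIVE_POLLERS.incrementAndGet();
        // Each instance owns a periodic task that outlives the single query.
        scheduler.scheduleAtFixedRate(POLLS_SENT::incrementAndGet,
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** The one-off lookup the caller actually wanted. */
    String queryOnce(String serviceName) {
        return "instances-of-" + serviceName;
    }

    /** Stops the background task; without this call the poller leaks. */
    @Override
    public void close() {
        if (!scheduler.isShutdown()) {
            scheduler.shutdownNow();
            LIVE_POLLERS.decrementAndGet();
        }
    }
}
```

In this model, creating a client per page request and never closing it makes LIVE_POLLERS grow by one per request, so the server-side rate keeps climbing while the application log still records only a few lookups. Wrapping the client in try-with-resources brings the counter back down after each lookup.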
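II. Solution
Basic fix: NamingService.selectInstances(serviceName, healthy) subscribes by default. Instead of that overload, call NamingService.selectInstances(serviceName, healthy, subscribe) with subscribe set to false, and call NamingService.shutDown() promptly after use so that the SDK does not keep issuing subscription requests.
The SDK itself is subscription-based and reactive, so it is a poor fit for a one-off lookup of instance IPs. Querying through the RESTful API is the best fit and avoids the extra subscription sync requests.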
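A one-off REST lookup could look roughly like the sketch below. It uses the JDK's java.net.http client (Java 11+) and the /nacos/v1/ns/instance/list path seen in the captured requests. The class name NacosRestQuery, the timeouts, and the absence of authentication are assumptions for illustration. Parsing the JSON response is left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical helper for a one-off instance lookup over HTTP, without the SDK. */
final class NacosRestQuery {

    /** Builds the instance-list URL; no clientIP or udpPort is sent. */
    static URI instanceListUri(String ip, int port, String namespace,
                               String serviceName, boolean healthyOnly) {
        String query = "serviceName=" + enc(serviceName)
                + "&namespaceId=" + enc(namespace)
                + "&healthyOnly=" + healthyOnly;
        return URI.create("http://" + ip + ":" + port + "/nacos/v1/ns/instance/list?" + query);
    }

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Sends a single GET and returns the raw JSON body. */
    static String fetchInstancesJson(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3)) // assumed timeout
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(5)) // assumed timeout
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("nacos query failed, status " + response.statusCode());
        }
        return response.body();
    }
}
```

A caller would build the URI with the cluster address and namespace chosen on the page, for example instanceListUri("172.16.2.22", 10812, "test", "MusicService", true), and pass it to fetchInstancesJson. Each page request then maps to exactly one HTTP request, with no background task left behind.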