SkyWalking trace sampling: configuration and how it works

Goal

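By default, SkyWalking collects a large volume of trace data. This gives fairly complete coverage of every request's call chain, but it also puts heavy demands on ES storage and requires a large investment in storage nodes. Is there a way to report only a sample of requests? There is: by limiting how much trace data is sampled, we can strike a balance between resource cost and the number of traces collected.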

Current situation

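In day-to-day troubleshooting, we usually only need the trace of the request that went wrong, not the traces of every request. Once a sampling limit is set, part of the trace data is dropped. If the dropped data happens to include the trace of the failing request, we need to reproduce that request reliably and often enough that one of the attempts is sampled.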

Configuring the sample count in SkyWalking

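The agent.config file under the SkyWalking agent directory contains the sampling parameter sample_n_per_3_secs. A value of 0 or -1 turns sampling off, which is the default behavior. A value greater than 0 is the number of traces the agent collects and reports to the OAP in each 3-second window.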

agent.sample_n_per_3_secs=${SW_AGENT_SAMPLE:40}

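Note: a change to the value in this file takes effect only after the agent is repackaged. The ${SW_AGENT_SAMPLE:40} placeholder means the SW_AGENT_SAMPLE environment variable is used when it is set, with 40 as the fallback.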
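To get a feel for what a given limit means in practice, the fixed count per window can be converted into an approximate sampling ratio. The sketch below is a hypothetical helper, not part of SkyWalking: the class name, method name, and the request rates are assumptions for illustration. It only counts traces that go through the sampling check, and it ignores traces that are kept through forced sampling.

```java
// Illustrative helper, not part of SkyWalking.
final class SamplingRatioDemo {
    /** Length of the agent's sampling window in seconds. */
    static final int WINDOW_SECONDS = 3;

    /**
     * Approximate upper bound on the fraction of traces kept per window.
     * A limit of 0 or below means sampling is off, so everything is kept.
     */
    static double sampledFraction(int sampleNPer3Secs, double requestsPerSecond) {
        if (sampleNPer3Secs <= 0 || requestsPerSecond <= 0) {
            return 1.0;
        }
        double requestsPerWindow = requestsPerSecond * WINDOW_SECONDS;
        return Math.min(1.0, sampleNPer3Secs / requestsPerWindow);
    }

    public static void main(String[] args) {
        // 40 traces per 3 s against a hypothetical 100 requests/s (300 per window).
        System.out.println(sampledFraction(40, 100.0));
        // Low traffic: fewer requests per window than the limit, so nothing is dropped.
        System.out.println(sampledFraction(40, 5.0));
    }
}
```

With a limit of 40 and an assumed 100 requests per second, roughly 40 of every 300 traces are kept. When traffic is below the limit, the ratio is capped at 1 and nothing is dropped.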

How the sampling feature works

The handleSamplingRateChanged method of the SamplingService class starts a single-threaded scheduled executor that resets the sampling counter every 3 seconds:

    void handleSamplingRateChanged() {
        if (getSamplingRate() > 0) {
            if (!on) {
                on = true;
                this.resetSamplingFactor();
                ScheduledExecutorService service = Executors.newSingleThreadScheduledExecutor(
                    new DefaultNamedThreadFactory("SamplingService"));
                // Reset the counter immediately, then again every 3 seconds.
                scheduledFuture = service.scheduleAtFixedRate(
                    new RunnableWithExceptionProtection(
                        this::resetSamplingFactor,
                        t -> LOGGER.error("unexpected exception.", t)),
                    0, 3, TimeUnit.SECONDS);
                LOGGER.debug("Agent sampling mechanism started. Sample {} traces in 3 seconds.", getSamplingRate());
            }
        } else {
            // A rate of 0 or below turns sampling off and cancels the reset task.
            if (on) {
                if (scheduledFuture != null) {
                    scheduledFuture.cancel(true);
                }
                on = false;
            }
        }
    }

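The counter is an atomic class from the Java concurrency package, which keeps concurrent updates to the value safe in a multithreaded environment: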

    private void resetSamplingFactor() {
        samplingFactorHolder = new AtomicInteger(0);
    }

The class then provides a method that checks whether the sampling threshold has been reached:

    public boolean trySampling(String operationName) {
        if (on) {
            int factor = samplingFactorHolder.get();
            if (factor < getSamplingRate()) {
                return samplingFactorHolder.compareAndSet(factor, factor + 1);
            } else {
                return false;
            }
        }
        return true;
    }

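In this method, if the value returned by get on the AtomicInteger instance is below the threshold, a single CAS update is attempted. When the CAS succeeds, the trace context is allowed to be reported to the OAP. Otherwise the trace context is dropped, either because the sampling threshold has been reached or because the CAS lost a race with another thread; the method does not retry in that case. When sampling is off, the method always returns true.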
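The counter-and-reset behavior described above can be tried in isolation. The following is a minimal sketch, not the agent's actual class: WindowSampler, resetWindow, and the limit of 2 are made up for illustration, and the 3-second timer is replaced by a manual call so the example is deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the agent's SamplingService; names are illustrative.
final class WindowSampler {
    private final int limitPerWindow;
    private volatile AtomicInteger counter = new AtomicInteger(0);

    WindowSampler(int limitPerWindow) {
        this.limitPerWindow = limitPerWindow;
    }

    /** Driven by a 3-second timer in the real agent; called by hand here. */
    void resetWindow() {
        counter = new AtomicInteger(0);
    }

    /** True if the trace may be kept; false if the window is full or the CAS lost a race. */
    boolean trySampling() {
        if (limitPerWindow <= 0) {
            return true; // sampling disabled: keep everything
        }
        AtomicInteger current = counter;
        int seen = current.get();
        if (seen < limitPerWindow) {
            return current.compareAndSet(seen, seen + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        WindowSampler sampler = new WindowSampler(2);
        System.out.println(sampler.trySampling()); // first trace in the window
        System.out.println(sampler.trySampling()); // second trace in the window
        System.out.println(sampler.trySampling()); // limit of 2 already reached
        sampler.resetWindow();                     // simulate the 3-second tick
        System.out.println(sampler.trySampling()); // first trace of the new window
    }
}
```

Run on a single thread, the first two calls are accepted, the third is rejected, and the call after the reset is accepted again. One consequence of this design is that the traces kept are the first ones to arrive in each window, not a random subset.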

The decision whether to report or not is made in the createTraceContext method of the ContextManagerExtendService class:

    public AbstractTracerContext createTraceContext(String operationName, boolean forceSampling) {
        AbstractTracerContext context;
        /*
         * Don't trace anything if the backend is not available.
         */
        if (!Config.Agent.KEEP_TRACING && GRPCChannelStatus.DISCONNECT.equals(status)) {
            return new IgnoredTracerContext();
        }

        int suffixIdx = operationName.lastIndexOf(".");
        if (suffixIdx > -1 && Arrays.stream(ignoreSuffixArray)
                                    .anyMatch(a -> a.equals(operationName.substring(suffixIdx)))) {
            context = new IgnoredTracerContext();
        } else {
            SamplingService samplingService = ServiceManager.INSTANCE.findService(SamplingService.class);
            // If the request is flagged for forced sampling, or it passes the sampling check, report it to the OAP.
            if (forceSampling || samplingService.trySampling(operationName)) {
                context = new TracingContext(operationName);
            } else {
                // Otherwise ignore this trace context and do nothing with it.
                context = new IgnoredTracerContext();
            }
        }

        return context;
    }
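Before the sampling check is reached, createTraceContext also filters operations by suffix. The sketch below reproduces only that comparison so it can be run on its own. IgnoreSuffixDemo, isIgnored, and the suffix list are hypothetical; the real list comes from the agent's ignoreSuffixArray field.

```java
import java.util.Arrays;

// Illustrative only: mirrors the suffix comparison in createTraceContext.
final class IgnoreSuffixDemo {
    /** True if the text from the last '.' onward equals one of the given suffixes. */
    static boolean isIgnored(String operationName, String[] ignoreSuffixes) {
        int suffixIdx = operationName.lastIndexOf('.');
        if (suffixIdx < 0) {
            return false; // no dot, so there is no suffix to compare
        }
        String suffix = operationName.substring(suffixIdx);
        return Arrays.asList(ignoreSuffixes).contains(suffix);
    }

    public static void main(String[] args) {
        String[] ignoreSuffixes = {".jpg", ".css", ".js"}; // hypothetical list
        System.out.println(isIgnored("/static/logo.jpg", ignoreSuffixes));
        System.out.println(isIgnored("/api/orders", ignoreSuffixes));
        System.out.println(isIgnored("/api/v1.2/orders", ignoreSuffixes));
    }
}
```

Because the comparison takes everything after the last dot, an operation name such as /api/v1.2/orders yields the suffix ".2/orders", which matches nothing in the list and therefore still goes on to the sampling check.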