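On the way home we got into a discussion about TCP TLP, so with nothing better to do I'll sort it out here. This post draws on Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses and on the Linux kernel source.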
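TLP: first, why it exists. TCP introduced fast retransmit (FR) precisely to avoid RTO wherever it can, but if the tail of a run of segments is dropped, nothing arrives after it to trigger dupacks or SACKs. Hence TLP: when the tail of the original sequence is lost and FR never gets a chance to fire, the sender transmits a probe to trigger FR and avoid falling into RTO.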
So the question becomes how to choose the probe.
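If there is new data, sending new data is of course preferred; if there is none, retransmit the last segment in the retransmission queue. If the probe reaches the peer, every loss pattern is covered and recovery gets triggered. The draft summarizes the cases as follows: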
    number of    scoreboard after
    losses       TLP retrans ACKed   mechanism            final outcome
    ---------    -----------------   ------------------   -------------
    (1) AAAL     AAAA                TLP loss detection   all repaired
    (2) AALL     AALS                early retransmit     all repaired
    (3) ALLL     ALLS                early retransmit     all repaired
    (4) LLLL     LLLS                FACK fast recovery   all repaired
    (5) >=5 L    ..LS                FACK fast recovery   all repaired

    key:
    A = ACKed segment
    L = lost segment
    S = SACKed segment
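The probe-selection rule described above can be sketched in a few lines. This is only an illustration, not kernel code: `Segment`, `choose_probe`, and the two queue arguments are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    seq: int     # first sequence number carried by the segment
    length: int  # payload length in bytes


def choose_probe(unsent: List[Segment],
                 retrans_queue: List[Segment]) -> Optional[Segment]:
    """Pick the segment to send as a TLP probe (hypothetical helper).

    Prefer new, previously unsent data; otherwise fall back to the
    last segment that is still in the retransmission queue.
    """
    if unsent:
        return unsent[0]          # new data: nothing is retransmitted
    if retrans_queue:
        return retrans_queue[-1]  # tail segment: a retransmission probe
    return None                   # nothing outstanding, nothing to probe


# Usage: with no new data, the tail of the retransmission queue is probed.
queue = [Segment(0, 1000), Segment(1000, 1000), Segment(2000, 1000)]
probe = choose_probe([], queue)
```

Either choice puts a segment at or beyond the tail of the flight, which is what gives the receiver a chance to return a SACK for it.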
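Either way, probing with a segment as far back in the sequence as possible avoids wasted retransmissions. The point most worth noting is that TLP's core goal is to induce the peer to return enough SACK information to trigger FR, ER, or enhanced ER (not repeated here; see TCP-TLP, ER), not to fill the hole with the probe itself. In one sentence: its purpose is not retransmission but probing.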
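On the flip side, TLP has to single out the case where the probe itself happens to repair the loss. If the probe was new data, and that new data induced enough SACKs to trigger FR, no work was wasted. But if there was no new data and the last segment in the queue was retransmitted, and that segment happened to fill the hole exactly, then FR is not triggered even though a loss was in fact recovered. By congestion control principles the sender should reduce its window at this point: ssthresh = β*cwnd and cwnd = ssthresh.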
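So this "probe filled the hole" case has to be detected in order to honor the congestion control principle, that is, to reduce the window.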
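Probing with new data needs no extra check, since nothing was retransmitted; what needs checking is the case where the last segment was retransmitted. The TLP draft does not fix the number of times a retransmission probe is sent, but caps it at roughly 2 (why not 3?):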
    (2) Conditions for scheduling PTO:
        ...
        (c) Number of consecutive PTOs <= 2.

    (3) When PTO fires:
        ...
        (d) If conditions in (2) are satisfied:
            -> Reschedule next PTO.
            Else:
            -> Rearm RTO to fire at epoch 'now+RTO'.
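This means the probe may be sent more than once, so a counter is needed to track the effect of these retransmission probes, namely whether one of them filled a hole. If that happened even once, the window reduction should be performed.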
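The timer decision in the excerpt above can be written down literally. A minimal sketch: `next_timer` and `other_conditions_met` are hypothetical names, and since the excerpt does not say when the consecutive-PTO counter is incremented, that is left to the caller.

```python
MAX_CONSECUTIVE_PTOS = 2  # condition (2)(c) in the draft excerpt


def next_timer(consecutive_ptos: int, other_conditions_met: bool) -> str:
    """Decide which timer to arm after a PTO fires (hypothetical helper).

    `other_conditions_met` stands for the conditions of step (2) that
    the excerpt elides with '...'; they are not modeled here.
    """
    if other_conditions_met and consecutive_ptos <= MAX_CONSECUTIVE_PTOS:
        return "PTO"  # reschedule the next probe timeout
    return "RTO"      # rearm RTO to fire at now + RTO


# Usage: once the counter exceeds the limit, the sender falls back to RTO.
timers = [next_timer(n, True) for n in (1, 2, 3)]
```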
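So when should the check happen? The draft's choice, after(ack, TLPHighRxt), is reasonable: if the check ran any earlier, an ACK with ack = TLPHighRxt might arrive right afterwards, and there is no good way to know when. Waiting until the ACK passes TLPHighRxt is therefore sensible. Until then, TLPRtxOut is maintained by the following rules: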
    (3) Upon sending a TLP retransmission:

        if (TLPRtxOut == 0)
            TLPHighRxt = SND.NXT;
        TLPRtxOut++;

    (4) Upon receiving an ACK:

    (a) Tracking ACKs

    We define a "TLP dupack" as a dupack that has all the regular
    properties of a dupack that can trigger fast retransmit, plus the ACK
    acknowledges TLPHighRxt, and the ACK carries no new SACK information
    (as noted earlier, TLP requires that the receiver supports SACK).
    This is the kind of ACK we expect to see for a TLP transmission if
    there were no losses. More precisely, the TLP sender considers a TLP
    probe segment as acknowledged if all of the following conditions are
    met:

        (a) TLPRtxOut > 0
        (b) SEG.ACK == TLPHighRxt
        (c) the segment contains no SACK blocks for sequence ranges
            above TLPHighRxt
        (d) the ACK does not advance SND.UNA
        (e) the segment contains no data
        (f) the segment is not a window update

    If all of those conditions are met, then the sender executes the
    following:

        TLPRtxOut--;
Finally, once after(ack, TLPHighRxt) holds, if TLPRtxOut > 0 the sender reduces the window: ssthresh = β*cwnd and cwnd = ssthresh.
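Putting the counting rules and the final check together, the bookkeeping looks roughly like the sketch below. It is a simplified model under stated assumptions, not the draft's or Linux's implementation: the `Ack` fields and the `TlpEpisode` class are hypothetical, β defaults to 0.5 purely for illustration, and sequence-number wraparound is ignored (a plain `>` stands in for after()).

```python
from dataclasses import dataclass


@dataclass
class Ack:
    ack: int                           # SEG.ACK
    advances_una: bool = False         # does it move SND.UNA forward?
    sack_above_high_rxt: bool = False  # SACK blocks above TLPHighRxt?
    has_data: bool = False
    is_window_update: bool = False


class TlpEpisode:
    def __init__(self, cwnd: float, beta: float = 0.5):
        self.cwnd = cwnd
        self.ssthresh = float("inf")
        self.beta = beta    # assumed value; the post only writes β
        self.rtx_out = 0    # TLPRtxOut
        self.high_rxt = 0   # TLPHighRxt

    def on_tlp_retransmit(self, snd_nxt: int) -> None:
        if self.rtx_out == 0:
            self.high_rxt = snd_nxt
        self.rtx_out += 1

    def on_ack(self, seg: Ack) -> bool:
        """Process an ACK; return True if the window was reduced."""
        if self.rtx_out == 0:
            return False
        if seg.ack > self.high_rxt:  # after(ack, TLPHighRxt)
            # Episode over with a probe unaccounted for: a probe
            # repaired a loss, so apply the congestion response.
            self.rtx_out = 0
            self.ssthresh = self.beta * self.cwnd
            self.cwnd = self.ssthresh
            return True
        is_tlp_dupack = (seg.ack == self.high_rxt
                         and not seg.sack_above_high_rxt
                         and not seg.advances_una
                         and not seg.has_data
                         and not seg.is_window_update)
        if is_tlp_dupack:
            self.rtx_out -= 1  # the probe was redundant: nothing was lost
        return False


# Usage: the probe fills the hole, so the ACK for it advances SND.UNA and
# is not a TLP dupack; the next ACK beyond TLPHighRxt triggers the reduction.
ep = TlpEpisode(cwnd=10)
ep.on_tlp_retransmit(snd_nxt=1000)
ep.on_ack(Ack(ack=1000, advances_una=True))
reduced = ep.on_ack(Ack(ack=2000, advances_una=True))
```

If the tail had not been lost, the receiver would answer the probe with a plain dupack at 1000, TLPRtxOut would drop back to 0, and the later ACK would leave cwnd untouched.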
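It is a whole meal of dumplings made for the sake of one saucer of vinegar. The procedure for deciding whether the probe filled the hole is complicated enough to make you suspect some deeper meaning, but once you understand what TLP is fundamentally for, it is no big deal. In most cases the probe brings back enough SACKs to trigger FR, and loss recovery is naturally handed over to FR. Only in the rare case where the retransmission probe fills the hole exactly, and the probed segment was the only one lost, does all this machinery come into play.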
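Which brings me back to why I wrote this post in the first place: why does Linux TCP implement only a single retransmission probe rather than several (which is permitted):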
Implementations MAY use one or two consecutive PTOs.
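I think Linux got it right. First, the situation that needs this complicated check is not very likely; second, implementing it is very complex, timer management in particular. If one PTO could not resolve the tail loss, a second one most likely will not either, and it is cleaner to let RTO act as the backstop. That is why you will find Linux's TLP implementation very simple: the core is done in a dozen or so lines of code.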
Besides, TCP has a very hard time telling an original segment from its retransmission precisely, which is why TLP has to tread carefully:
    (5) Senders must only send a TLP loss probe retransmission if all the
    conditions from section 2.1 are met and the following condition also
    holds:

        (TLPRtxOut == 0) || (SND.NXT == TLPHighRxt)

    This ensures that there is at most one sequence range with
    outstanding TLP retransmissions. The sender maintains this invariant
    so that there is at most one TLP retransmission "episode" happening
    at a time, so that the sender can use the algorithm described above
    in this section to determine when the episode is over, and thus when
    it can infer whether any data segments were lost.
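The extra condition in (5) is a one-line predicate. A minimal sketch, with `may_send_tlp_retransmission` as a hypothetical name; the conditions from section 2.1 of the draft are not modeled and would have to hold as well.

```python
def may_send_tlp_retransmission(tlp_rtx_out: int,
                                snd_nxt: int,
                                tlp_high_rxt: int) -> bool:
    """Check only the additional condition from step (5).

    Either no TLP retransmission is outstanding, or SND.NXT has not
    moved since the current episode started, so a further probe still
    belongs to the same sequence range.
    """
    return tlp_rtx_out == 0 or snd_nxt == tlp_high_rxt


# Usage: a probe is outstanding and new data has been sent since
# (SND.NXT moved past TLPHighRxt), so another retransmission probe
# is not allowed.
allowed = may_send_tlp_retransmission(tlp_rtx_out=1,
                                      snd_nxt=3000,
                                      tlp_high_rxt=2000)
```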
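QUIC, by contrast, does this very easily: it numbers every packet, so it can tell without effort whether a given retransmission was wasted. Its implementation is therefore very simple, barely an extra line of code. This is yet another example of structure determining behavior.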
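To see why per-packet numbering makes this easy, here is a toy model. It is not QUIC's API or wire format: `PnSender` and its methods are hypothetical, and the only property borrowed from QUIC is that every transmission, including a retransmission of the same data, gets a fresh packet number.

```python
from typing import Dict, Tuple


class PnSender:
    """Toy sender that gives every transmission its own packet number."""

    def __init__(self) -> None:
        self.next_pn = 0
        # packet number -> (data label, was this a retransmission?)
        self.in_flight: Dict[int, Tuple[str, bool]] = {}

    def send(self, data: str, is_retransmission: bool = False) -> int:
        pn = self.next_pn
        self.next_pn += 1  # packet numbers are never reused
        self.in_flight[pn] = (data, is_retransmission)
        return pn

    def on_ack(self, pn: int) -> Tuple[str, bool]:
        # The ACK names a packet number, so the sender knows exactly
        # which transmission arrived: the original or the probe.
        return self.in_flight.pop(pn)


# Usage: the tail is sent, then probed again; the ACK for packet 1 says
# unambiguously that the retransmission was the copy that was delivered.
s = PnSender()
original = s.send("tail")
probe = s.send("tail", is_retransmission=True)
acked = s.on_ack(probe)
```

With TCP both copies carry the same sequence number, which is exactly why the TLPRtxOut and TLPHighRxt bookkeeping above is needed.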
Finally, a word on TLP's original motivation.
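Large RTOs are usually caused by variance in the measured RTT, which is especially pronounced in wireless environments and where statistical multiplexing is sparse. Large RTOs produce the long tail in the latency statistics. But simply shortening the RTO does not solve the problem. First, it increases the number of spurious retransmissions in the statistical sense; second, and more importantly, once an RTO fires it hurts performance badly. That matters a great deal for modern TCP transfers. Against this background, TLP is a fine-grained optimization of RTO: it does more work in order to keep the RTO from firing. Of course, this is yet another trade-off.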
    To get a sense of just how long the RTOs are in relation to
    connection RTTs, following is the distribution of RTO/RTT values on
    Google Web servers.

        [percentile, RTO/RTT]:
        50th percentile, 4.3
        75th percentile, 11.3
        90th percentile, 28.9
        95th percentile, 53.9
        99th percentile, 214

    Large RTOs, typically caused by variance in measured RTTs, can be a
    result of intermediate queuing, and service variability in mobile
    channels. Such large RTOs make a huge contribution to the long tail
    on the latency statistics of short flows. Note that simply reducing
    the length of RTO does not address the latency problem for two
    reasons: first, it increases the chances of spurious retransmissions.
    Second and more importantly, an RTO reduces TCP's congestion window
    to one and forces a slow start. Recovery of losses without relying
    primarily on the RTO mechanism is beneficial for short TCP transfers.
Today is Lunar New Year's Eve. Happy New Year to all the managers and workers!
Leather shoes get wet in Wenzhou, Zhejiang; the rain soaks in, but they won't get fat.