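WALT (Window-Assisted Load Tracking) is an algorithm developed by Qualcomm. It divides time into windows and tracks task runtime and CPU load per window, providing input to task scheduling, migration, load balancing, and CPU frequency scaling.
Compared with PELT, WALT reacts to load changes more quickly, which makes it better suited to interactive workloads on mobile devices.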
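1. Relative runtime across CPUs and frequency points
A task runs for different lengths of time depending on which CPU and which frequency point it runs on. So how can the size of a task, that is, its demand for CPU capability, be measured? We know the task's absolute runtime on a given CPU at a given frequency point. If the capability of that CPU during that runtime can be related to the capability of the largest-capacity CPU running at its highest frequency point, the absolute runtime can be normalized to a runtime relative to the maximum CPU capability.
WALT uses wrq->task_exec_scale to represent the capability of a CPU at its current frequency point. The capability of the largest-capacity CPU running at its highest frequency point is defined as 1024. task_exec_scale is computed as follows: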
Assignment call path: walt_update_task_ravg -> update_task_rq_cpu_cycles
When CPU cycles are not used to compute load:
task_exec_scale = (cpu_cur_freq(cpu) / wrq->cluster->max_possible_freq) * arch_scale_cpu_capacity(cpu)
When CPU cycles are used to compute load:
task_exec_scale = ((cycles_delta / time_delta) / wrq->cluster->max_possible_freq) * arch_scale_cpu_capacity(cpu)
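wrq->task_exec_scale is recomputed whenever the CPU frequency point changes. With this normalized CPU capability, the relative runtime of a task can then be computed.
WALT implements this in the scale_exec_time function:
delta * (wrq->task_exec_scale) / 1024
For example, if a task runs for 10 ms on cpu0 at the 1 GHz frequency point, where task_exec_scale = 200, its relative runtime is:
10 * 200 / 1024 = 1.953125 (ms)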
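The two steps above can be sketched in userspace C. This is a minimal illustration, not the kernel implementation: exec_scale_from_freq and scale_exec_time_sketch are hypothetical names, and the frequencies and capacity in the usage note are made-up values chosen only to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10 /* 1 << 10 == 1024: biggest CPU at its highest frequency */

/*
 * Hypothetical helper: capability of a CPU at its current frequency.
 * cur_freq and max_possible_freq use the same unit (e.g. kHz);
 * capacity is the CPU's capacity on the 0..1024 scale.
 */
static uint64_t exec_scale_from_freq(uint64_t cur_freq,
                                     uint64_t max_possible_freq,
                                     uint64_t capacity)
{
    assert(max_possible_freq > 0);
    /* Multiply before dividing so integer division does not truncate to 0. */
    return cur_freq * capacity / max_possible_freq;
}

/* Hypothetical helper mirroring delta * task_exec_scale / 1024. */
static uint64_t scale_exec_time_sketch(uint64_t delta_ns, uint64_t task_exec_scale)
{
    return (delta_ns * task_exec_scale) >> CAPACITY_SHIFT;
}
```

With task_exec_scale = 200, scale_exec_time_sketch(10000000, 200) yields 1953125 ns, which matches the 1.953125 ms worked out above. A CPU with an assumed capacity of 400 running at half of max_possible_freq would get exec_scale_from_freq(1000000, 2000000, 400) = 200.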
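2. scaled runtime
In the actual project, WALT keeps 5 history windows, each 16 ms long (the length can be set to 8 ms dynamically depending on the frame rate). Every time a task's runtime delta is updated, it is first converted to relative runtime.
To express it in util, the scheduler's default unit, one more normalization is needed. WALT defines a task running continuously for one full window (16 ms) on the largest-capacity CPU at its highest frequency point as the maximum util value, 1024. Normalizing a relative runtime over one window to a util value in the 0~1024 range therefore gives:
runtime_scaled = (delta / sched_ravg_window) * 1024
The implementing function is:
scale_time_to_util(runtime)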
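A minimal sketch of this normalization in integer arithmetic follows. The name time_to_util_sketch is hypothetical, and the window length is passed as a parameter here purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a relative runtime in ns onto the 0..1024 util scale. */
static uint64_t time_to_util_sketch(uint64_t runtime_ns, uint64_t window_ns)
{
    assert(window_ns > 0);
    /* Shift first: runtime / window would truncate to 0 in integer math. */
    return (runtime_ns << 10) / window_ns;
}
```

For a 16 ms window, a relative runtime of 16 ms maps to 1024, 8 ms maps to 512, and the 1953125 ns from the earlier example maps to 125.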
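I. Key data structures
1. Variables
sched_ravg_window (EXPORT_SYMBOL())
Size of the WALT load-tracking window.
Initialized to 20000000 where the variable is defined.
Set during WALT initialization via walt_init -> walt_tunables -> DEFAULT_SCHED_RAVG_WINDOW:
at 300 Hz it is aligned to the refresh rate, (3333333*5); for other refresh rates it is 16000000.
Unit: ns
NUM_LOAD_INDICES
#define NUM_LOAD_INDICES 1000
Divides the window size into 1000 parts (the window size used here is currently 16 ms and is not adjusted when the window size changes dynamically).
sched_load_granule
unsigned int __read_mostly sched_load_granule
sched_load_granule = DEFAULT_SCHED_RAVG_WINDOW / NUM_LOAD_INDICES
#define DEFAULT_SCHED_RAVG_WINDOW 16000000
So after the window is divided into 1000 parts, each unit is 16000 ns = 16 us.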
2. Structures
walt_task_struct
struct walt_task_struct {
    u32 flags;
    u64 mark_start;
    u64 window_start;
    u32 sum, demand;
    u32 coloc_demand;
    u32 sum_history[RAVG_HIST_SIZE];
    u16 sum_history_util[RAVG_HIST_SIZE];
    u32 curr_window_cpu[WALT_NR_CPUS];
    u32 prev_window_cpu[WALT_NR_CPUS];
    u32 curr_window, prev_window;
    u8 busy_buckets[NUM_BUSY_BUCKETS];
    u16 bucket_bitmask;
    u16 demand_scaled;
    u16 pred_demand_scaled;
    u64 active_time;
    u64 last_win_size;
    int boost;
    bool wake_up_idle;
    bool misfit;
    bool rtg_high_prio;
    u8 low_latency;
    u64 boost_period;
    u64 boost_expires;
    u64 last_sleep_ts;
    u32 init_load_pct;
    u32 unfilter;
    u64 last_wake_ts;
    u64 last_enqueued_ts;
    struct walt_related_thread_group __rcu *grp;
    struct list_head grp_list;
    u64 cpu_cycles;
    bool iowaited;
    int prev_on_rq;
    int prev_on_rq_cpu;
    struct list_head mvp_list;
    u64 sum_exec_snapshot_for_slice;
    u64 sum_exec_snapshot_for_total;
    u64 total_exec;
    int mvp_prio;
    int cidx;
    int load_boost;
    int64_t boosted_task_load;
    int prev_cpu;
    int new_cpu;
    u8 enqueue_after_migration;
    u8 hung_detect_status;
    int pipeline_cpu;
};
mark_start: synchronized with wrq->mark_start
Records the start time of certain task events; these events are defined in enum task_event in walt.h.
window_start
Identifies the per-task window. It is synchronized with wrq->window_start:
walt_update_task_ravg -> update_cpu_busy_time:
    /*
     * Handle per-task window rollover. We don't care about the
     * idle task.
     */
    if (new_window) {
        if (!is_idle_task(p))
            rollover_task_window(p, full_window);
        wts->window_start = window_start;
    }
For a new task, wts->window_start is initialized to 0 and is synchronized to wrq->window_start on the first call to walt_update_task_ravg.
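sum: wts->sum += scale_exec_time(delta, rq, wts)
Accumulates the task's total runtime in the current window. The accumulated time is the actual runtime delta normalized, by the CPU's current frequency and capacity, to runtime on the largest-capacity CPU at its highest frequency. For a task running on the largest-capacity CPU at its highest frequency point, for example, the actual runtime delta and the time added to sum are in a 1:1 ratio.
Update paths:
walt_update_task_ravg -> update_task_demand -> add_to_task_demand -> (wts->sum += scale_exec_time(delta, rq, wts))
walt_update_task_ravg -> update_task_demand -> update_history -> (wts->sum = 0); sum is reset to 0 only when an event crosses a window boundary.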
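demand: wts->demand
The relative runtime currently in effect, derived from the relative runtimes recorded in the history windows according to one of the following policies:
#define WINDOW_STATS_RECENT 0
#define WINDOW_STATS_MAX 1
#define WINDOW_STATS_MAX_RECENT_AVG 2
#define WINDOW_STATS_AVG 3
coloc_demand: wts->coloc_demand
Average relative runtime of the task over the 5 history windows:
wts->coloc_demand = div64_u64((sum += hist[i]), RAVG_HIST_SIZE)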
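sum_history: wts->sum_history[5]
Records the task's relative runtime in each of the 5 history windows. Relative runtime is the task's actual running time normalized to the largest-capacity CPU at its highest frequency point; its maximum is 16 ms.
walt_update_task_ravg() -> update_task_demand() -> update_history() ->
hist[wts->cidx] = runtime
sum_history_util: wts->sum_history_util[5]
The relative runtime accumulated by the task in each history window (0 ms~16 ms), normalized to a util value (0~1024); this document calls it scaled runtime:
walt_update_task_ravg() -> update_task_demand() -> update_history() ->
scale_time_to_util(runtime) = (wts->sum / sched_ravg_window) * 1024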
curr_window_cpu[WALT_NR_CPUS]
The task's relative runtime on each CPU in the current window.
prev_window_cpu[WALT_NR_CPUS]
The task's relative runtime on each CPU in the previous window.
curr_window, prev_window
walt_update_task_ravg() -> update_cpu_busy_time()
The task's relative runtime in the current window and in the previous window, including irqtime that occurs while the task is running.
busy_buckets, bucket_bitmask
See "How are wts->bucket_bitmask and busy_buckets[bidx] updated?"
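demand_scaled
The current relative runtime computed for wts->demand (see above), normalized to scaled runtime (0~1024).
walt_update_task_ravg() -> update_task_demand() -> update_history() ->
scale_time_to_util(demand)
active_time
is_new_task treats a task as new while this value is below 100 ms. rollover_task_window is the only path that updates it.
last_sleep_ts
Records the time at which the task went to sleep. In __schedule -> android_rvh_schedule, when prev != next and !prev->on_rq, the prev task's last_sleep_ts is updated. These conditions mean that prev is no longer being scheduled and is not on the rq wait list, i.e. prev has been scheduled out and has entered a non-running state.
cidx: wts->cidx
Current index: it indexes the current window's slot in the hist and hist_util arrays, which store the task's relative running time and util. The index is advanced in update_history when the window is updated, so the history window data is updated in a rolling fashion.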
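task_event
task_event identifies a set of task actions. The moments at which these actions occur are the key points at which changes in task demand are updated. To understand WALT, one first needs to understand what these events mean and how they drive the updates of task demand and CPU load.
enum task_event {
    PUT_PREV_TASK = 0,
    PICK_NEXT_TASK = 1,
    TASK_WAKE = 2,
    TASK_MIGRATE = 3,
    TASK_UPDATE = 4,
    IRQ_UPDATE = 5,
};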
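PUT_PREV_TASK/PICK_NEXT_TASK:
Call path: __schedule -> android_rvh_schedule() ->
walt_update_task_ravg(prev, rq, PUT_PREV_TASK/PICK_NEXT_TASK, wallclock, 0). When a task switch happens on a CPU, pick_next_task selects the next task to run on that CPU, after which the hook calls the core function to update the util of the prev and next tasks and the CPU load. PUT_PREV_TASK marks the update of the prev task's util and the CPU load at the moment the prev task is scheduled off the rq.
TASK_WAKE:
Call path: try_to_wake_up -> android_rvh_try_to_wake_up(*p) ->
walt_update_task_ravg(p, rq, TASK_WAKE, wallclock, 0)
When a task p is woken up and p is not current (on this CPU), not runnable (not on a run queue wait list), and not running (on another CPU), the hook calls the core function before select_task_rq chooses the target CPU. At that point task_cpu(p) is the CPU on which the task last ran.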
walt_rq
struct walt_rq {
    struct task_struct *push_task;
    struct walt_sched_cluster *cluster;
    struct cpumask freq_domain_cpumask;
    struct walt_sched_stats walt_stats;
    u64 window_start;
    u32 prev_window_size;
    unsigned long walt_flags;
    u64 avg_irqload;
    u64 last_irq_window;
    u64 prev_irq_time;
    struct task_struct *ed_task;
    u64 task_exec_scale;
    u64 old_busy_time;
    u64 old_estimated_time;
    u64 curr_runnable_sum;
    u64 prev_runnable_sum;
    u64 nt_curr_runnable_sum;
    u64 nt_prev_runnable_sum;
    struct group_cpu_time grp_time;
    struct load_subtractions load_subs[NUM_TRACKED_WINDOWS];
    DECLARE_BITMAP_ARRAY(top_tasks_bitmap,
            NUM_TRACKED_WINDOWS, NUM_LOAD_INDICES);
    u8 *top_tasks[NUM_TRACKED_WINDOWS];
    u8 curr_table;
    int prev_top;
    int curr_top;
    bool notif_pending;
    bool high_irqload;
    u64 last_cc_update;
    u64 cycles;
    u64 util;
    struct list_head mvp_tasks;
    int num_mvp_tasks;
    u64 latest_clock;
    u32 enqueue_counter;
};
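task_exec_scale: the current frequency point of the CPU the task is on, normalized against the maximum CPU capability of 1024; it reflects the CPU's capability at the current frequency point.
For example, when a CPU in the largest cluster runs at its highest frequency point, task_exec_scale = 1024.
Assignment call path: walt_update_task_ravg -> update_task_rq_cpu_cycles
!use_cycle_counter (CPU cycles are not used to compute load):
task_exec_scale = (cpu_cur_freq(cpu) / wrq->cluster->max_possible_freq) * arch_scale_cpu_capacity(cpu)
use_cycle_counter (CPU cycles are used to compute load):
task_exec_scale = ((cycles_delta / time_delta) / wrq->cluster->max_possible_freq) * arch_scale_cpu_capacity(cpu)
Reference for CPU capacity: https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/cpu-capacity.txt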
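curr_runnable_sum
CPU load in the current window.
prev_runnable_sum
CPU load in the previous window.
nt_curr_runnable_sum
Load contributed by new tasks on the CPU in the current window; a task whose wts->active_time is below 100 ms counts as a new task.
nt_prev_runnable_sum
Load contributed by new tasks on the CPU in the previous window.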
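top_tasks_bitmap: unsigned long top_tasks_bitmap[2][BITS_TO_LONGS(1000)]
Tracks the curr and prev windows. BITS_TO_LONGS(1000) = 16, so each bitmap is 16 * 64 = 1024 bits.
top_tasks: u8 *top_tasks[2]
An array of two pointers, each pointing to 1000 bytes (u8) of memory. They record the two most recent windows, curr and prev, on the current CPU, and appear to swap roles on window rollover (see curr_table).
curr_table (0 or 1)
Two windows are tracked, and curr_table indicates which one is curr. curr and prev form a ring that keeps flipping.
The flip happens in update_window_start -> rollover_top_tasks(rq, full_window), which updates:
wrq->curr_table
wrq->prev_top = curr_top
wrq->curr_top = 0
wrq->top_tasks[prev/curr]
wrq->top_tasks_bitmap[prev/curr]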
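prev_top
Index, among the 1000 buckets, of the bucket that the largest task runtime of the previous window falls into.
curr_top
An index value: the maximum load_to_index(wts->curr_window) in the current window, i.e. the id of the bucket that the largest task runtime of the current window falls into. For example, if the tasks that ran in the current window and their relative runtimes are:
A (2 ms), B (6 ms), C (15 ms),
then: wrq->curr_top = load_to_index(15 ms) = 15 * 1000 / 16 = 937 (integer division)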
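The bucket lookup in this example can be sketched as follows. load_to_index_sketch is a hypothetical userspace stand-in for load_to_index; clamping to the last bucket is an assumption made here so that a full-window load still maps to a valid index.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LOAD_INDICES 1000

/*
 * Hypothetical stand-in for load_to_index(): map a relative runtime in ns
 * to one of the 1000 buckets. granule_ns plays the role of sched_load_granule.
 */
static uint32_t load_to_index_sketch(uint32_t load_ns, uint32_t granule_ns)
{
    uint32_t index;

    assert(granule_ns > 0);
    index = load_ns / granule_ns;
    /* Assumption: clamp so that a full window lands in the last bucket. */
    return index < NUM_LOAD_INDICES - 1 ? index : NUM_LOAD_INDICES - 1;
}
```

With a 16000 ns granule, tasks A (2 ms), B (6 ms) and C (15 ms) land in buckets 125, 375 and 937, so curr_top ends up as 937.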
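util
The WALT governor obtains the CPU util for frequency scaling through:
waltgov_update_freq -> waltgov_get_util(wg_cpu) -> cpu_util_freq_walt() -> __cpu_util_freq_walt()
wrq->util = scale_time_to_util(freq_policy_load(rq, reason))
Field 37 in the struct listing: records the timestamp of the most recent call to update_window_start, even when that call does not meet the condition for updating wrq->window_start (delta >= sched_ravg_window).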
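II. Core functions
1. Function call relationships
walt_update_task_ravg is the entry function of the WALT algorithm. The main function calls it involves are:
1. As time passes, new windows are created; update the window start time:
update_window_start
2. Compute wrq->task_exec_scale, which reflects the CPU's capability at the current frequency point; see "1. Relative runtime across CPUs and frequency points" at the start of this document:
update_task_rq_cpu_cycles
3. Update the task runtime and its history records:
update_task_demand
4. Update the curr and prev window statistics of the task and the rq:
update_cpu_busy_time
5. Update the task's predicted demand:
update_task_pred_demand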
2. Function descriptions
a) update_window_start(rq, wallclock, event)
- Updates wrq->window_start += (u64)nr_windows * (u64)sched_ravg_window
- Updates wrq->prev_window_size = sched_ravg_window
- Updates the wrq history window statistics and initializes the new window:
update_window_start() -> rollover_cpu_window(rq, nr_windows > 1)
rollover_cpu_window(struct rq *rq, bool full_window)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    u64 curr_sum = wrq->curr_runnable_sum;
    u64 nt_curr_sum = wrq->nt_curr_runnable_sum;
    u64 grp_curr_sum = wrq->grp_time.curr_runnable_sum;
    u64 grp_nt_curr_sum = wrq->grp_time.nt_curr_runnable_sum;

    if (unlikely(full_window)) {
        curr_sum = 0;
        nt_curr_sum = 0;
        grp_curr_sum = 0;
        grp_nt_curr_sum = 0;
    }

    wrq->prev_runnable_sum = curr_sum;
    wrq->nt_prev_runnable_sum = nt_curr_sum;
    wrq->grp_time.prev_runnable_sum = grp_curr_sum;
    wrq->grp_time.nt_prev_runnable_sum = grp_nt_curr_sum;

    wrq->curr_runnable_sum = 0;
    wrq->nt_curr_runnable_sum = 0;
    wrq->grp_time.curr_runnable_sum = 0;
    wrq->grp_time.nt_curr_runnable_sum = 0;
}
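About the full_window branch in the code above:
scheduler_tick() -> trace_android_rvh_tick_entry(rq) -> update_window_start is triggered from the tick interrupt handler (250 Hz, every 4 ms) to update wrq->window_start. A case where the update condition (delta >= sched_ravg_window) is met after one window_size (8 ms/16 ms) has elapsed, yet wrq->window_start has still not been updated within the second window_size (nr_windows > 1, so full_window is true), is therefore very rare. It can happen when a misbehaving task runs for a long time with interrupts disabled, or on a tickless idle CPU (https://docs.kernel.org/timers/no_hz.html).
When it does happen, the previous window was not rolled over properly. Because the prev values should reflect only the immediately preceding window, they are set to 0 when nr_windows > 1.
- Rolls over the top-task tables:
update_window_start() -> rollover_top_tasks(struct rq *rq, bool full_window)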
static void rollover_top_tasks(struct rq *rq, bool full_window)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    u8 curr_table = wrq->curr_table;
    u8 prev_table = 1 - curr_table;
    int curr_top = wrq->curr_top;

    clear_top_tasks_table(wrq->top_tasks[prev_table]);
    clear_top_tasks_bitmap(wrq->top_tasks_bitmap[prev_table]);

    if (full_window) {
        curr_top = 0;
        clear_top_tasks_table(wrq->top_tasks[curr_table]);
        clear_top_tasks_bitmap(wrq->top_tasks_bitmap[curr_table]);
    }

    wrq->curr_table = prev_table;
    wrq->prev_top = curr_top;
    wrq->curr_top = 0;
}
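The window bookkeeping done by update_window_start can be sketched in userspace as below. The struct and function names are hypothetical, and deriving nr_windows as delta divided by the window length is an assumption consistent with the update formula listed above rather than a quotation of the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two wrq fields touched here. */
struct window_state_sketch {
    uint64_t window_start;
    uint64_t prev_window_size;
};

/*
 * Advance window_start by whole windows. Returns the number of windows
 * elapsed; *full_window is set when more than one window was crossed.
 */
static uint64_t advance_window_sketch(struct window_state_sketch *ws,
                                      uint64_t wallclock, uint64_t window_ns,
                                      bool *full_window)
{
    uint64_t delta, nr_windows;

    assert(window_ns > 0 && wallclock >= ws->window_start);
    *full_window = false;
    delta = wallclock - ws->window_start;
    if (delta < window_ns)
        return 0; /* still inside the current window, nothing to update */

    nr_windows = delta / window_ns; /* assumption: whole windows elapsed */
    ws->window_start += nr_windows * window_ns;
    ws->prev_window_size = window_ns;
    *full_window = nr_windows > 1;
    return nr_windows;
}
```

Starting from window_start = 0 with a 16 ms window, a call at 20 ms advances window_start to 16 ms with full_window false, while a later call at 60 ms skips two windows at once, moves window_start to 48 ms and reports full_window true, which is the case where the prev statistics are cleared.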
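b) update_task_rq_cpu_cycles(p, rq, event, wallclock, irqtime)
Mainly computes wrq->task_exec_scale, which reflects the CPU's capability at the current frequency point; see the description of walt_rq.task_exec_scale.
c) u64 update_task_demand(p, rq, event, wallclock)
This is the core function with which WALT computes task demand. It takes the task busy time delta that needs to be accounted when the event occurs, normalizes it to time relative to the largest capacity and highest frequency point (the relative runtime), adds it to wts->sum, and rolls the task's history windows forward. The final demand, which represents the size of the task, is derived from the 5 history windows. For details see "III. Algorithms, 1. How task runtime is updated".
d) update_cpu_busy_time(p, rq, event, wallclock, irqtime)
For details see "III. Algorithms, 2. How CPU load information is updated".
e) update_task_pred_demand(rq, p, event)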
static void update_task_pred_demand(struct rq *rq, struct task_struct *p, int event)
{
    u16 new_pred_demand_scaled;
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;
    u16 curr_window_scaled;

    if (is_idle_task(p))
        return;

    if (event != PUT_PREV_TASK && event != TASK_UPDATE &&
            (!SCHED_FREQ_ACCOUNT_WAIT_TIME ||
             (event != TASK_MIGRATE &&
              event != PICK_NEXT_TASK)))
        return;

    /*
     * TASK_UPDATE can be called on sleeping task, when its moved between
     * related groups
     */
    if (event == TASK_UPDATE) {
        if (!p->on_rq && !SCHED_FREQ_ACCOUNT_WAIT_TIME)
            return;
    }

    curr_window_scaled = scale_time_to_util(wts->curr_window);
    if (wts->pred_demand_scaled >= curr_window_scaled)
        return;

    new_pred_demand_scaled = get_pred_busy(p, busy_to_bucket(curr_window_scaled),
            curr_window_scaled, wts->bucket_bitmask);

    if (task_on_rq_queued(p) && (!task_has_dl_policy(p) ||
            !p->dl.dl_throttled))
        fixup_walt_sched_stats_common(rq, p,
                wts->demand_scaled,
                new_pred_demand_scaled);

    wts->pred_demand_scaled = new_pred_demand_scaled;
}
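Event filter: PUT_PREV_TASK and TASK_UPDATE are always processed; TASK_MIGRATE and PICK_NEXT_TASK are processed only when SCHED_FREQ_ACCOUNT_WAIT_TIME is non-zero. With the default SCHED_FREQ_ACCOUNT_WAIT_TIME = 0, only the first two events get past the filter.
TASK_UPDATE check: TASK_UPDATE is also raised when a sleeping task is moved between groups. With SCHED_FREQ_ACCOUNT_WAIT_TIME = 0 (default), a sleeping task is not updated.
curr_window_scaled: (wts->curr_window / sched_ravg_window) * 1024 normalizes wts->curr_window, a value in ns, to the 0~1024 range; the normalized value is a util.
get_pred_busy: see "III. Algorithms, 1. How task runtime is updated".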
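The event filter at the top of update_task_pred_demand is easier to read in its positive form. The sketch below restates it as a predicate; pred_demand_event_allowed is a hypothetical name, account_wait_time stands in for SCHED_FREQ_ACCOUNT_WAIT_TIME, and the additional p->on_rq check for TASK_UPDATE is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

enum task_event {
    PUT_PREV_TASK = 0,
    PICK_NEXT_TASK = 1,
    TASK_WAKE = 2,
    TASK_MIGRATE = 3,
    TASK_UPDATE = 4,
    IRQ_UPDATE = 5,
};

/* Hypothetical predicate: does this event get past the filter? */
static bool pred_demand_event_allowed(enum task_event event, bool account_wait_time)
{
    assert(event <= IRQ_UPDATE);

    /* Always processed. */
    if (event == PUT_PREV_TASK || event == TASK_UPDATE)
        return true;

    /* Processed only when wait time is accounted as busy time. */
    if (account_wait_time &&
        (event == TASK_MIGRATE || event == PICK_NEXT_TASK))
        return true;

    return false;
}
```

With account_wait_time false, PICK_NEXT_TASK and TASK_MIGRATE are filtered out; TASK_WAKE and IRQ_UPDATE are filtered out in either configuration.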
f)run_walt_irq_work_rollover(old_window_start, rq)
III. Algorithms
1. How task runtime is updated
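ms: wts->mark_start, the start time of the event
ws: wrq->window_start
wc: wallclock, the timestamp of the current event, from walt_sched_clock() -> sched_clock()
Computing the task's demand distinguishes three cases:
- the event does not cross a window boundary
- the event spans two windows
- the event spans multiple windows
update_task_demand() is the core function of the WALT algorithm. It computes the task's relative runtime and updates the history windows (5 history windows are recorded in the default configuration). Three main functions are involved:
1) account_busy_for_task_demand(struct rq *rq, struct task_struct *p, int event)
Decides whether the time ending at this event counts as task busy time, i.e. whether it should be added to the task's running time.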
static int account_busy_for_task_demand(struct rq *rq, struct task_struct *p, int event)
{
    /*
     * No need to bother updating task demand for the idle task.
     */
    if (is_idle_task(p))
        return 0;

    /*
     * When a task is waking up it is completing a segment of non-busy
     * time. Likewise, if wait time is not treated as busy time, then
     * when a task begins to run or is migrated, it is not running and
     * is completing a segment of non-busy time.
     */
    if (event == TASK_WAKE || (!SCHED_ACCOUNT_WAIT_TIME &&
            (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
        return 0;

    /*
     * The idle exit time is not accounted for the first task _picked_ up to
     * run on the idle CPU.
     */
    if (event == PICK_NEXT_TASK && rq->curr == rq->idle)
        return 0;

    /*
     * TASK_UPDATE can be called on sleeping task, when its moved between
     * related groups
     */
    if (event == TASK_UPDATE) {
        if (rq->curr == p)
            return 1;

        return p->on_rq ? SCHED_ACCOUNT_WAIT_TIME : 0;
    }

    return 1;
}
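How is it decided whether the task was busy on the rq when the event occurred, so that the time should be added to the task's running time?
- Idle task: demand is not tracked for the idle task.
- TASK_WAKE is treated as non-busy time. When SCHED_ACCOUNT_WAIT_TIME is false, the PICK_NEXT_TASK and TASK_MIGRATE events are treated as non-busy time as well, and no task demand is accounted.
- When SCHED_ACCOUNT_WAIT_TIME is set and the event is PICK_NEXT_TASK, and the task is being scheduled onto an idle CPU, the time the CPU takes to exit idle cannot be treated as task busy time and is not accounted.
- event == TASK_UPDATE:
  - task p is running: busy time.
  - task p is waiting on the run queue: SCHED_ACCOUNT_WAIT_TIME determines whether the time spent waiting counts as busy time.
  - task p is neither running nor runnable: not accounted.
- In all other cases the time is accounted by default.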
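2) update_history(rq, p, runtime, samples, event)
WALT records the task's runtime in 5 history windows. When task busy time is accounted for an event that crosses a window boundary, a new history window has been produced, so the task's history window runtimes must be updated in a rolling fashion. wts->cidx is the index of the current window in the hist[] array; advancing the index rolls the history forward. The runtime argument is the task's runtime in the window being updated, i.e. wts->sum. When full windows are crossed and updated, it is the relative runtime of one window_size (16 ms), whose size depends on the wrq->task_exec_scale computed for the current frequency point.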
static void update_history(struct rq *rq, struct task_struct *p,
        u32 runtime, int samples, int event)
{
    runtime_scaled = scale_time_to_util(runtime);
    /* Push new 'runtime' value onto stack */
    for (; samples > 0; samples--) {
        hist[wts->cidx] = runtime;
        hist_util[wts->cidx] = runtime_scaled;
        wts->cidx = ++(wts->cidx) % RAVG_HIST_SIZE;
    }
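The rolling update performed by that loop can be tried out in isolation with the sketch below. struct hist_sketch and push_history_sketch are hypothetical userspace stand-ins for the wts fields; the index is advanced with (cidx + 1) % RAVG_HIST_SIZE here so that the variable is written only once per statement.

```c
#include <assert.h>
#include <stdint.h>

#define RAVG_HIST_SIZE 5

/* Hypothetical stand-in for the history fields of walt_task_struct. */
struct hist_sketch {
    uint32_t hist[RAVG_HIST_SIZE];      /* relative runtime per window, ns */
    uint16_t hist_util[RAVG_HIST_SIZE]; /* same values on the 0..1024 scale */
    int cidx;                           /* slot that receives the next sample */
};

/* Store the same sample 'samples' times, advancing the ring index each time. */
static void push_history_sketch(struct hist_sketch *h, uint32_t runtime,
                                uint16_t runtime_scaled, int samples)
{
    assert(h->cidx >= 0 && h->cidx < RAVG_HIST_SIZE);
    for (; samples > 0; samples--) {
        h->hist[h->cidx] = runtime;
        h->hist_util[h->cidx] = runtime_scaled;
        h->cidx = (h->cidx + 1) % RAVG_HIST_SIZE;
    }
}
```

Pushing two samples while cidx is 4 overwrites slots 4 and 0 and leaves cidx at 1, so the oldest entries are replaced first and no data has to be shifted.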
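Policies for computing wts->demand
Policy | How demand is derived |
WINDOW_STATS_RECENT | The previous window's wts->sum is used as demand |
WINDOW_STATS_MAX | The maximum of the 5 history windows is used as task demand |
WINDOW_STATS_MAX_RECENT_AVG | max{average of the 5 history windows, previous window's wts->sum} is used as task demand |
WINDOW_STATS_AVG | The average of the 5 windows is used as task demand |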
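The four policies can be compared side by side with a small sketch. pick_demand_sketch is a hypothetical function written from the descriptions in the table; the history values in the usage note are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

#define RAVG_HIST_SIZE 5
#define WINDOW_STATS_RECENT 0
#define WINDOW_STATS_MAX 1
#define WINDOW_STATS_MAX_RECENT_AVG 2
#define WINDOW_STATS_AVG 3

/*
 * Hypothetical helper: derive demand from the 5 history windows.
 * 'recent' is the runtime of the window that was just closed.
 */
static uint32_t pick_demand_sketch(const uint32_t hist[RAVG_HIST_SIZE],
                                   uint32_t recent, int policy)
{
    uint64_t sum = 0;
    uint32_t max = 0, avg;
    int i;

    assert(policy >= WINDOW_STATS_RECENT && policy <= WINDOW_STATS_AVG);
    for (i = 0; i < RAVG_HIST_SIZE; i++) {
        sum += hist[i];
        if (hist[i] > max)
            max = hist[i];
    }
    avg = (uint32_t)(sum / RAVG_HIST_SIZE);

    switch (policy) {
    case WINDOW_STATS_RECENT:
        return recent;
    case WINDOW_STATS_MAX:
        return max;
    case WINDOW_STATS_AVG:
        return avg;
    default: /* WINDOW_STATS_MAX_RECENT_AVG */
        return avg > recent ? avg : recent;
    }
}
```

For a history of {3, 8, 2, 6, 11} (in ms) with the most recent window at 3, the policies give 3, 11, 6 and 6 respectively: RECENT follows the latest window, MAX is the most conservative, and the two averaging policies smooth out the spike.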
How is the predicted task demand computed?
static inline u16 predict_and_update_buckets(struct task_struct *p, u16 runtime_scaled)
{
    int bidx;
    u32 pred_demand_scaled;
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;

    bidx = busy_to_bucket(runtime_scaled);
    pred_demand_scaled = get_pred_busy(p, bidx, runtime_scaled, wts->bucket_bitmask);
    bucket_increase(wts->busy_buckets, &wts->bucket_bitmask, bidx);

    return pred_demand_scaled;
}
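After update_history has updated the task's history runtime, get_pred_busy() is called to predict the task's runtime, that is, the task's future demand for CPU capability. The 0~1024 scaled runtime range is divided into 16 buckets of 64 util each. busy_to_bucket maps the runtime_scaled being updated to its bucket id, which becomes the start bucket id. Starting there, wts->bucket_bitmask is searched upward for the first set bit n; a set bit n means that in earlier windows the task's runtime fell into busy_buckets[n], so this is a busy bucket. If any of the 5 most recent window runtimes recorded in hist_util[5] falls within the range of busy_buckets[n] (say hist_util[m]), that hist_util[m] is returned as the task's predicted demand.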
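The search described above can be sketched as follows. This is a simplified illustration written from the description, not the kernel's get_pred_busy: the function names are hypothetical, the largest matching history value is returned, and falling back to runtime_scaled when nothing matches is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define RAVG_HIST_SIZE 5
#define NUM_BUSY_BUCKETS 16
#define BUCKET_SHIFT 6 /* 1024 / 16 = 64 util per bucket */

/* Hypothetical stand-in for busy_to_bucket(): util 0..1024 -> bucket 0..15. */
static int busy_to_bucket_sketch(uint16_t util)
{
    int bidx = util >> BUCKET_SHIFT;

    return bidx < NUM_BUSY_BUCKETS ? bidx : NUM_BUSY_BUCKETS - 1;
}

/* Simplified prediction: first busy bucket at or above 'start', then the
 * largest history value that falls into that bucket. */
static uint16_t pred_busy_sketch(int start, uint16_t runtime_scaled,
                                 uint16_t bucket_bitmask,
                                 const uint16_t hist_util[RAVG_HIST_SIZE])
{
    uint16_t ret = runtime_scaled;
    int n, i;

    assert(start >= 0 && start < NUM_BUSY_BUCKETS);
    for (n = start; n < NUM_BUSY_BUCKETS; n++)
        if (bucket_bitmask & (1u << n))
            break;
    if (n == NUM_BUSY_BUCKETS)
        return ret; /* no busy bucket at or above start */

    for (i = 0; i < RAVG_HIST_SIZE; i++)
        if (busy_to_bucket_sketch(hist_util[i]) == n && hist_util[i] > ret)
            ret = hist_util[i];
    return ret;
}
```

With runtime_scaled = 130 (bucket 2), only bit 5 set in the bitmask, and a history of {100, 330, 350, 50, 700}, the values 330 and 350 fall into bucket 5 (320~383), so the sketch predicts 350. With an empty bitmask it simply returns 130.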
How are wts->bucket_bitmask and busy_buckets[bidx] updated?
#define INC_STEP 8
#define DEC_STEP 2
#define CONSISTENT_THRES 16
#define INC_STEP_BIG 16
static inline void bucket_increase(u8 *buckets, u16 *bucket_bitmask, int idx)
{
    int i, step;

    for (i = 0; i < NUM_BUSY_BUCKETS; i++) {
        if (idx != i) {
            if (buckets[i] > DEC_STEP)
                buckets[i] -= DEC_STEP;
            else {
                buckets[i] = 0;
                *bucket_bitmask &= ~BIT_MASK(i);
            }
        } else {
            step = buckets[i] >= CONSISTENT_THRES ?
                    INC_STEP_BIG : INC_STEP;
            if (buckets[i] > U8_MAX - step)
                buckets[i] = U8_MAX;
            else
                buckets[i] += step;
            *bucket_bitmask |= BIT_MASK(i);
        }
    }
}
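The 16 bits of wts->bucket_bitmask correspond to the 16 busy_buckets and indicate whether each bucket is still busy. Busy buckets are the ones that can be selected when the task's predicted demand is computed. When update_history updates the task's history load, busy_to_bucket(runtime_scaled) maps the scaled runtime being written into the history to the id of the bucket it belongs to (the idx argument), and the corresponding bit in bucket_bitmask is set. A set bit means the task's runtime in a history window has fallen into that bucket, so the bitmask characterizes how large the task has been in past windows. On every update_history call, the busy_buckets[bidx] entry of the bucket that the runtime falls into is stepped up, while all other busy_buckets entries decay. Once an entry has decayed to DEC_STEP or below, busy_buckets[i] is set to zero and its bit in the mask is cleared. busy_buckets[bidx] therefore reflects the task's recent runtime profile: the larger the value, the more often the task's load in past windows fell into that bucket. A cleared bit n in bucket_bitmask means the task's runtime has not fallen into that bucket's range for a long time. At present, the value of busy_buckets[bidx] has no use other than being stepped up or decayed in order to set or clear bucket_bitmask.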
3) add_to_task_demand(struct rq *rq, struct task_struct *p, u64 delta)
static u64 add_to_task_demand(struct rq *rq, struct task_struct *p, u64 delta)
{
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;

    delta = scale_exec_time(delta, rq, wts);
    wts->sum += delta;
    if (unlikely(wts->sum > sched_ravg_window))
        wts->sum = sched_ravg_window;

    return delta;
}
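Accounts the task's busy time in the current window: the actual runtime delta is normalized, by the CPU's current frequency and capacity, to runtime on the largest-capacity CPU at its highest frequency, and added to wts->sum. wts->sum is capped at sched_ravg_window, i.e. 16 ms. When an event crosses a window boundary, update_history is called and sum is reset to 0, so that accounting of the task's relative runtime starts afresh for the new window. See the description of wts->sum.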
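2. How CPU load information is updated
update_cpu_busy_time is the core function for updating CPU load information. It adds task running time or irqtime to:
wrq->grp_time.curr/prev_runnable_sum
wrq->grp_time.nt_curr/prev_runnable_sum
wts->curr/prev_window
wts->curr/prev_window_cpu[cpu]
The grp_time sums are updated only when the task belongs to a grp; irq time is added only to the corresponding wrq sums.
The key functions are:
1) rollover_task_window(p, full_window)
When a non-idle task crosses a window boundary, curr_window must be carried over to prev_window. rollover_task_window rolls over the task's runtime statistics:
wts->curr/prev_window
wts->curr/prev_window_cpu[i]
wts->active_time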
static void rollover_task_window(struct task_struct *p, bool full_window)
{
    u32 *curr_cpu_windows = empty_windows;
    u32 curr_window;
    int i;
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(task_rq(p)));
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;

    /* Rollover the sum */
    curr_window = 0;

    if (!full_window) {
        curr_window = wts->curr_window;
        curr_cpu_windows = wts->curr_window_cpu;
    }

    wts->prev_window = curr_window;
    wts->curr_window = 0;

    /* Roll over individual CPU contributions */
    for (i = 0; i < nr_cpu_ids; i++) {
        wts->prev_window_cpu[i] = curr_cpu_windows[i];
        wts->curr_window_cpu[i] = 0;
    }

    if (is_new_task(p))
        wts->active_time += wrq->prev_window_size;
}
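full_window is true: a complete window has elapsed, which may indicate an anomaly, so the previous-window values are cleared:
wts->prev_window = 0
wts->prev_window_cpu[i] = 0
full_window is false: only one window boundary was crossed, so:
wts->prev_window = wts->curr_window
wts->prev_window_cpu[i] = wts->curr_window_cpu[i]
In both cases wts->curr_window and wts->curr_window_cpu[i] are reset to 0 when the window is crossed.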
2) account_busy_for_cpu_time(rq, p, irqtime, event)
walt_update_task_ravg -> update_cpu_busy_time -> account_busy_for_cpu_time
static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
        u64 irqtime, int event)
{
    if (is_idle_task(p)) {
        /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
        if (event == PICK_NEXT_TASK)
            return 0;

        /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
        return irqtime || cpu_is_waiting_on_io(rq);
    }

    if (event == TASK_WAKE)
        return 0;

    if (event == PUT_PREV_TASK || event == IRQ_UPDATE)
        return 1;

    /*
     * TASK_UPDATE can be called on sleeping task, when its moved between
     * related groups
     */
    if (event == TASK_UPDATE) {
        if (rq->curr == p)
            return 1;

        return p->on_rq ? SCHED_FREQ_ACCOUNT_WAIT_TIME : 0;
    }

    /* TASK_MIGRATE, PICK_NEXT_TASK left */
    return SCHED_FREQ_ACCOUNT_WAIT_TIME;
}
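- The function first checks whether the task is the idle task; for non-idle tasks the CPU busy time is then decided per event.
- p is the idle task: PICK_NEXT_TASK is never busy time; otherwise the CPU counts as busy only when irqtime is being accounted or the CPU is waiting on I/O (rq->nr_iowait > 0).
- Non-idle task, TASK_WAKE: the CPU is not in busy time.
- Non-idle task, PUT_PREV_TASK and IRQ_UPDATE: busy time.
- Non-idle task, TASK_UPDATE:
  - p is the curr task: the CPU is in busy time.
  - p is not the curr task but is on the run queue wait list: depends on whether SCHED_FREQ_ACCOUNT_WAIT_TIME is set.
  - p is not the curr task and is not runnable: the CPU is not in busy time.
- Non-idle task, TASK_MIGRATE and PICK_NEXT_TASK: for the CPU, this depends on whether SCHED_FREQ_ACCOUNT_WAIT_TIME is set.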
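3) update_top_tasks(p, rq, old_curr_window, new_window, full_window)
Call path:
walt_update_task_ravg -> update_cpu_busy_time -> update_top_tasks
Here old_curr_window is wts->curr_window as it was before this update. The function updates:
wrq->top_tasks[curr/prev]
wrq->top_tasks_bitmap[curr/prev]
wrq->top_tasks[2] is an array of two pointers that record tasks' relative runtimes in the curr and prev windows. Each points to 1000 u8 entries: the 0~16 ms range is divided into 1000 buckets of 16 us each. When a task's relative runtime in the curr/prev window falls into a bucket, that bucket's value is incremented by 1. When a bucket's value becomes 1, the corresponding top_tasks_bitmap bit is set, marking the bucket as busy; when a bucket's value drops from 1 to 0, the bit is cleared, marking the bucket as not busy.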
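The counting scheme described above can be sketched with a single table. The names below are hypothetical, the 64-bit word layout is chosen for this userspace sketch, and the reversed bit position (NUM_LOAD_INDICES - index - 1) is taken from the kernel listing; curr_top/prev_top tracking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LOAD_INDICES 1000
#define BITMAP_WORDS ((NUM_LOAD_INDICES + 63) / 64) /* 16 words = 1024 bits */

/* Hypothetical stand-in for one top_tasks table plus its bitmap. */
struct top_table_sketch {
    uint8_t count[NUM_LOAD_INDICES]; /* number of tasks per bucket */
    uint64_t bitmap[BITMAP_WORDS];   /* one busy bit per bucket */
};

static void set_busy_sketch(struct top_table_sketch *t, int index, bool busy)
{
    int bit = NUM_LOAD_INDICES - index - 1; /* reversed position */
    uint64_t mask = 1ULL << (bit % 64);

    if (busy)
        t->bitmap[bit / 64] |= mask;
    else
        t->bitmap[bit / 64] &= ~mask;
}

static bool is_busy_sketch(const struct top_table_sketch *t, int index)
{
    int bit = NUM_LOAD_INDICES - index - 1;

    return (t->bitmap[bit / 64] >> (bit % 64)) & 1ULL;
}

/* Move one task from old_index to new_index; had_load is false when the
 * task is accounted for the first time in this window. */
static void move_task_sketch(struct top_table_sketch *t, int old_index,
                             int new_index, bool had_load)
{
    assert(old_index >= 0 && old_index < NUM_LOAD_INDICES);
    assert(new_index >= 0 && new_index < NUM_LOAD_INDICES);

    if (had_load && t->count[old_index] > 0 && --t->count[old_index] == 0)
        set_busy_sketch(t, old_index, false);
    if (++t->count[new_index] == 1)
        set_busy_sketch(t, new_index, true);
}
```

A task first accounted at 2 ms occupies bucket 125; when its runtime grows to 6 ms it moves to bucket 375, bucket 125 drops back to zero and its bit is cleared. A bucket stays busy as long as at least one task remains in it.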
static void update_top_tasks(struct task_struct *p, struct rq *rq,
        u32 old_curr_window, int new_window, bool full_window)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;
    u8 curr = wrq->curr_table;
    u8 prev = 1 - curr;
    u8 *curr_table = wrq->top_tasks[curr];
    u8 *prev_table = wrq->top_tasks[prev];
    int old_index, new_index, update_index;
    u32 curr_window = wts->curr_window;
    u32 prev_window = wts->prev_window;
    bool zero_index_update;

    if (old_curr_window == curr_window && !new_window)
        return;

    old_index = load_to_index(old_curr_window);
    new_index = load_to_index(curr_window);

    if (!new_window) {
        zero_index_update = !old_curr_window && curr_window;
        if (old_index != new_index || zero_index_update) {
            if (old_curr_window)
                curr_table[old_index] -= 1;
            if (curr_window)
                curr_table[new_index] += 1;
            if (new_index > wrq->curr_top)
                wrq->curr_top = new_index;
        }

        if (!curr_table[old_index])
            __clear_bit(NUM_LOAD_INDICES - old_index - 1,
                    wrq->top_tasks_bitmap[curr]);

        if (curr_table[new_index] == 1)
            __set_bit(NUM_LOAD_INDICES - new_index - 1,
                    wrq->top_tasks_bitmap[curr]);

        return;
    }
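Window not crossed
- zero_index_update: the window was not crossed, wts->curr_window was 0 before the update and is now non-zero, which means the task is being accounted for the first time in this window.
- The tables are updated if this update changed the bucket that the task's runtime falls into, or if busy time is being accounted for the first time.
- curr_table points to 1000 u8 entries. old_index is the id of the bucket that the task's runtime fell into before the update. Because another update happened within the same window, curr_table[old_index], where the runtime used to fall, is decremented by 1, and curr_table[new_index], where the runtime falls now, is incremented by 1.
One or more windows crossed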
    /*
     * The window has rolled over for this task. By the time we get
     * here, curr/prev swaps would has already occurred. So we need
     * to use prev_window for the new index.
     */
    update_index = load_to_index(prev_window);

    if (full_window) {
        /*
         * Two cases here. Either 'p' ran for the entire window or
         * it didn't run at all. In either case there is no entry
         * in the prev table. If 'p' ran the entire window, we just
         * need to create a new entry in the prev table. In this case
         * update_index will be correspond to sched_ravg_window
         * so we can unconditionally update the top index.
         */
        if (prev_window) {
            prev_table[update_index] += 1;
            wrq->prev_top = update_index;
        }

        if (prev_table[update_index] == 1)
            __set_bit(NUM_LOAD_INDICES - update_index - 1,
                    wrq->top_tasks_bitmap[prev]);
    } else {
        zero_index_update = !old_curr_window && prev_window;
        if (old_index != update_index || zero_index_update) {
            if (old_curr_window)
                prev_table[old_index] -= 1;

            prev_table[update_index] += 1;

            if (update_index > wrq->prev_top)
                wrq->prev_top = update_index;

            if (!prev_table[old_index])
                __clear_bit(NUM_LOAD_INDICES - old_index - 1,
                        wrq->top_tasks_bitmap[prev]);

            if (prev_table[update_index] == 1)
                __set_bit(NUM_LOAD_INDICES - update_index - 1,
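- The full_window branch handles the case where multiple windows were crossed.
- The else branch handles the case where a single window was crossed.
Use of the top task
waltgov_update_freq -> waltgov_get_util(wg_cpu) -> cpu_util_freq_walt() -> __cpu_util_freq_walt():
util = scale_time_to_util(freq_policy_load(rq, reason))
wrq->util = util
freq_policy_load(rq, reason) -> tt_load = top_task_load(rq)
When WALT obtains the CPU util for frequency scaling, it uses the top task load:
freq_policy_load -> tt_load = top_task_load(rq)
tt_load = (wrq->prev_top + 1) * sched_load_granule
Here wrq->prev_top is the index, among the 1000 buckets, of the bucket that the task with the longest relative runtime in the previous window falls into.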
/*
 * Special case the last index and provide a fast path for index = 0.
 * Note that sched_load_granule can change underneath us if we are not
 * holding any runqueue locks while calling the two functions below.
 */
static u32 top_task_load(struct rq *rq)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    int index = wrq->prev_top;
    u8 prev = 1 - wrq->curr_table;

    if (!index) {
        int msb = NUM_LOAD_INDICES - 1;

        if (!test_bit(msb, wrq->top_tasks_bitmap[prev]))
            return 0;
        else
            return sched_load_granule;
    } else if (index == NUM_LOAD_INDICES - 1) {
        return sched_ravg_window;
    } else {
        return (index + 1) * sched_load_granule;
    }
}
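To put numbers on top_task_load, the sketch below restates its three cases with plain parameters instead of the rq. top_task_load_sketch is a hypothetical name, and bucket0_busy stands in for the test_bit() check on the previous window's bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LOAD_INDICES 1000

/* Hypothetical restatement of top_task_load() with explicit inputs. */
static uint32_t top_task_load_sketch(int prev_top, bool bucket0_busy,
                                     uint32_t granule_ns, uint32_t window_ns)
{
    assert(prev_top >= 0 && prev_top < NUM_LOAD_INDICES);

    if (prev_top == 0)
        return bucket0_busy ? granule_ns : 0; /* fast path for index 0 */
    if (prev_top == NUM_LOAD_INDICES - 1)
        return window_ns;                     /* last bucket: a full window */
    return (uint32_t)(prev_top + 1) * granule_ns;
}
```

With a 16000 ns granule and a 16 ms window, prev_top = 937 (task C from the earlier example) gives (937 + 1) * 16000 = 15008000 ns. The load is rounded up to the upper edge of the bucket, so the top task is never underestimated by more than one granule.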
- packing cpu
walt_find_and_choose_cluster_packing_cpu
/* walt_find_and_choose_cluster_packing_cpu - Return a packing_cpu choice common for this cluster.
 * @start_cpu: The cpu from the cluster to choose from
 *
 * If the cluster has a 32bit capable cpu return it regardless
 * of whether it is halted or not.
 *
 * If the cluster does not have a 32 bit capable cpu, find the
 * first unhalted, active cpu in this cluster.
 *
 * Returns -1 if packing_cpu if not found or is unsuitable to be packed on to
 * Returns a valid cpu number if packing_cpu is found and is usable
 */
static inline int walt_find_and_choose_cluster_packing_cpu(int start_cpu, struct task_struct *p)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, start_cpu);
    struct walt_sched_cluster *cluster = wrq->cluster;
    cpumask_t unhalted_cpus;
    int packing_cpu;

    /* if idle_enough feature is not enabled */
    if (!sysctl_sched_idle_enough)
        return -1;
    if (!sysctl_sched_cluster_util_thres_pct)
        return -1;

    /* find all unhalted active cpus */
    cpumask_andnot(&unhalted_cpus, cpu_active_mask, cpu_halt_mask);

    /* find all unhalted active cpus in this cluster */
    cpumask_and(&unhalted_cpus, &unhalted_cpus, &cluster->cpus);

    if (is_compat_thread(task_thread_info(p)))
        /* try to find a packing cpu within 32 bit subset */
        cpumask_and(&unhalted_cpus, &unhalted_cpus, system_32bit_el0_cpumask());

    /* return the first found unhalted, active cpu, in this cluster */
    packing_cpu = cpumask_first(&unhalted_cpus);

    /* packing cpu must be a valid cpu for runqueue lookup */
    if (packing_cpu >= nr_cpu_ids)
        return -1;

    /* if cpu is not allowed for this task */
    if (!cpumask_test_cpu(packing_cpu, p->cpus_ptr))
        return -1;

    /* if cluster util is high */
    if (sched_get_cluster_util_pct(cluster) >= sysctl_sched_cluster_util_thres_pct)
        return -1;

    /* if cpu utilization is high */
    if (cpu_util(packing_cpu) >= sysctl_sched_idle_enough)
        return -1;

    /* don't pack big tasks */
    if (task_util(p) >= sysctl_sched_idle_enough)
        return -1;

    if (task_reject_partialhalt_cpu(p, packing_cpu))
        return -1;

    /* don't pack if running at a freq higher than 43.9pct of its fmax */
    if (arch_scale_freq_capacity(packing_cpu) > 450)
        return -1;

    /* the packing cpu can be used, so pack! */
    return packing_cpu;
}
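From the code, it appears that:
1. sysctl_sched_idle_enough must be set first (sysctl_sched_cluster_util_thres_pct must be non-zero as well). It expresses the intent to let CPUs enter idle as much as possible, and packing only takes effect when it is set.
2. Within the cluster of the start cpu, candidates are filtered by cpu_active_mask and cpu_halt_mask and, for a 32-bit process, by system_32bit_el0_cpumask, the mask of CPUs that support 32-bit processes. The first CPU that satisfies these conditions becomes the packing cpu candidate, and it must also be allowed by the task's CPU affinity, p->cpus_ptr.
3. Check whether the cluster's CPU utilization (previous window) has reached sysctl_sched_cluster_util_thres_pct (40); packing happens only when the cluster is lightly loaded.
4. Check whether the util of the candidate packing cpu has reached sysctl_sched_idle_enough (30); a CPU can be chosen as the packing cpu only when its util is below 30.
5. Check whether the util of the task to be packed has reached sysctl_sched_idle_enough (30); only small tasks with util below 30 can be packed.
6. Check whether the packing cpu may be a partial halt cpu (with added uifirst and frameboost checks).
7. Check that the freq_scale of the packing cpu's current frequency point does not exceed 450, i.e. the packing cpu must currently be at a relatively low frequency point. freq_scale = 1024 * (cpu_curr_freq / cpu_max_freq)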
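The threshold checks in steps 1, 3, 4, 5 and 7 can be condensed into one predicate. This is only a sketch: the struct and function names are hypothetical, the cpumask filtering, affinity and partial-halt checks are left out, and the frequencies used in the usage note are made-up values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of the values the threshold checks look at. */
struct packing_inputs_sketch {
    unsigned int idle_enough;            /* sysctl_sched_idle_enough */
    unsigned int cluster_util_thres_pct; /* sysctl_sched_cluster_util_thres_pct */
    unsigned int cluster_util_pct;       /* cluster utilization, percent */
    uint64_t cpu_util;                   /* util of the candidate packing cpu */
    uint64_t task_util;                  /* util of the task to be packed */
    uint64_t cur_freq;                   /* same unit as max_freq */
    uint64_t max_freq;
};

/* freq_scale = 1024 * cur_freq / max_freq, multiplied first to keep precision. */
static uint64_t freq_scale_sketch(uint64_t cur_freq, uint64_t max_freq)
{
    assert(max_freq > 0);
    return 1024 * cur_freq / max_freq;
}

static bool can_pack_sketch(const struct packing_inputs_sketch *in)
{
    if (!in->idle_enough || !in->cluster_util_thres_pct)
        return false; /* feature disabled */
    if (in->cluster_util_pct >= in->cluster_util_thres_pct)
        return false; /* cluster is busy */
    if (in->cpu_util >= in->idle_enough)
        return false; /* candidate cpu is busy */
    if (in->task_util >= in->idle_enough)
        return false; /* task is too big */
    if (freq_scale_sketch(in->cur_freq, in->max_freq) > 450)
        return false; /* above roughly 43.9% of fmax */
    return true;
}
```

With thresholds of 30 and 40, a cluster at 20%, cpu util 10 and task util 5, a CPU running at 800 MHz out of 2000 MHz has freq_scale 409 and can be packed; at 1000 MHz freq_scale is 512 and packing is rejected.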