0 Background
To better detect memory errors in the Linux kernel, such as out-of-bounds accesses, memory corruption, use-after-free, and invalid-free, we investigated the kfence feature (introduced in Linux kernel 5.12), which helps developers analyze and locate this class of memory bugs.
1. Introduction to kfence
1.1 What is kfence
kfence is a memory-error detector in the Linux kernel. It catches out-of-bounds accesses, memory corruption, use-after-free, and invalid-free errors, helping to surface memory bugs early in a project so that developers can locate and analyze them quickly.
1.2 kfence vs. KASAN
| | Detection scope | Detection mechanism | Overhead | Typical usage |
| --- | --- | --- | --- | --- |
| kfence | slab allocations smaller than one page (4KB) | 1) fence pages plus a canary pattern detect out-of-bounds accesses; 2) per-data-page state (e.g. a freed page is marked freed) detects use-after-free | Memory: kfence trades a relatively large memory footprint for very low performance interference; the footprint can be kept small by choosing a small num_objects, while full coverage (or enabling it dynamically for all allocations) can consume GBs of memory. Performance: negligible in sampling mode, significant in full-coverage mode. | Sampling mode is cheap enough to use in production |
| KASAN | all kernel allocations, including slab, page, stack, and global memory | shadow-memory checking | high | usually development only, due to its overhead |
2. Using kfence
kfence was only introduced in Linux kernel 5.12; to use it on an older kernel, the first step is to backport the feature (see Section 4).
2.1 Enabling kfence
CONFIG_KFENCE=y // enable kfence
CONFIG_KFENCE_SAMPLE_INTERVAL=500 // sampling interval: run a check every 500ms
CONFIG_KFENCE_NUM_OBJECTS=63 // number of objects in the kfence pool
These options can be tuned to your needs.
2.2 Debugging
Static config options are inflexible for debugging, so the kernel exposes several nodes that let user space adjust kfence at runtime:
/sys/module/kfence/parameters/check_on_panic
Y: run extra checks on panic for more debug information
N: skip them, reducing extra overhead on a crash in production
/sys/module/kfence/parameters/deferrable
Y: kfence may defer some checking work to reduce its performance impact
N: kfence performs checks immediately instead of deferring them
/sys/kernel/debug/kfence/stats // kfence detection statistics
/sys/kernel/debug/kfence/objects // information about the memory objects kfence manages
echo -1 > /sys/module/kfence/parameters/sample_interval // adjust the sampling interval at runtime; 0 disables kfence, -1 routes every eligible (slab-type-filtered) allocation through kfence
echo 100 > /sys/module/kfence/parameters/skip_covered_thresh // once an allocation site's coverage exceeds this threshold, kfence may skip checking it
2.3 Reading the logs
When kfence catches a memory error, cat /sys/kernel/debug/kfence/stats shows the total bugs counter increasing:
The report itself is printed to dmesg; query it with dmesg | grep -i kfence:
2.4 Collecting the reports independently
Hook the code path where kfence prints its report to dmesg and capture the log there. See Section 3.2.
3. How kfence Works
3.1 Detection principle
3.1.1 slub/slab hooks
kfence hooks must be inserted into the slub/slab alloc and free paths, so that allocations and frees are routed through kfence's own alloc/free logic, which implements the error monitoring.
1) kfence alloc flow
At init time, kfence creates its own dedicated pool, kfence_pool; see Section 3.3.
kmem_cache_alloc ---> __kmem_cache_alloc_lru ---> slab_alloc ---> slab_alloc_node ---> kfence_alloc; see Section 3.4 for the kfence alloc implementation.
2) kfence free flow
__kmem_cache_free ---> __do_kmem_cache_free ---> __cache_free ---> __kfence_free; see Section 3.5 for the kfence free implementation.
3.1.2 use-after-free
After an object is freed, its data page is marked inaccessible, so any subsequent access immediately triggers a fault.
3.1.3 out-of-bounds and memory corruption
Out-of-bounds accesses fall into two classes: accesses beyond the data page (out-of-bounds) and accesses within the data page but past the object (memory corruption).
Access beyond the data page:
Each object allocated from kfence_pool occupies a full data page, regardless of its actual size, with fence pages placed on both sides as electric fences. The MMU marks the fence pages inaccessible, so any access that crosses the page boundary into a fence page traps immediately; this is the out-of-page case.
Access within the data page:
Since an object is usually smaller than a page, the remainder of the data page is filled with a canary pattern. This is done to detect overflows that go past the object but stay within the data page; this is the in-page case.
An in-page overflow does not trap at access time; it is only discovered when the object is freed and the canary pattern is found corrupted. This class of error is reported as memory corruption.
3.1.4 invalid-free
When an object is freed, kfence checks the recorded allocation info to decide whether the free is bogus, e.g. a double free.
3.2 How errors trap and how reports are printed
1) use-after-free: KFENCE_ERROR_UAF
When some module's code triggers a use-after-free, the kernel's native fault path runs and calls kfence's kfence_handle_page_fault() function, which collects and prints the error report.
// kernel/arch/arm/mm/fault.c
/*
 * Oops. The kernel tried to access some page that wasn't present.
 */
static void
__do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
                  struct pt_regs *regs)
{
    const char *msg;
    /*
     * Are we prepared to handle this kernel fault?
     */
    if (fixup_exception(regs))
        return;
    /*
     * No handler, we'll have to terminate things with extreme prejudice.
     */
    if (addr < PAGE_SIZE) {
        msg = "NULL pointer dereference";
    } else {
        if (is_translation_fault(fsr) &&
            kfence_handle_page_fault(addr, is_write_fault(fsr), regs))
            return;
        msg = "paging request";
    }
    die_kernel_fault(msg, mm, addr, fsr, regs);
}
kfence_handle_page_fault() determines whether the error is of type KFENCE_ERROR_OOB or KFENCE_ERROR_UAF, then calls kfence_report_error() to print the report to dmesg.
bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs)
{
    const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
    struct kfence_metadata *to_report = NULL;
    enum kfence_error_type error_type;
    unsigned long flags;

    if (!is_kfence_address((void *)addr))
        return false;

    if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
        return kfence_unprotect(addr); /* ... unprotect and proceed. */

    atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);

    // Decide between KFENCE_ERROR_OOB (access beyond the data page) and
    // KFENCE_ERROR_UAF (use-after-free):
    // - odd page_index: a fence page was touched -> KFENCE_ERROR_OOB
    // - even page_index: a freed data page was touched -> KFENCE_ERROR_UAF
    if (page_index % 2) {
        /* This is a redzone, report a buffer overflow. */
        struct kfence_metadata *meta;
        int distance = 0;

        meta = addr_to_metadata(addr - PAGE_SIZE);
        if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
            to_report = meta;
            /* Data race ok; distance calculation approximate. */
            distance = addr - data_race(meta->addr + meta->size);
        }

        meta = addr_to_metadata(addr + PAGE_SIZE);
        if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
            /* Data race ok; distance calculation approximate. */
            if (!to_report || distance > data_race(meta->addr) - addr)
                to_report = meta;
        }

        if (!to_report)
            goto out;

        raw_spin_lock_irqsave(&to_report->lock, flags);
        to_report->unprotected_page = addr;
        error_type = KFENCE_ERROR_OOB;

        /*
         * If the object was freed before we took the lock we can still
         * report this as an OOB -- the report will simply show the
         * stacktrace of the free as well.
         */
    } else {
        to_report = addr_to_metadata(addr);
        if (!to_report)
            goto out;

        raw_spin_lock_irqsave(&to_report->lock, flags);
        error_type = KFENCE_ERROR_UAF;
        /*
         * We may race with __kfence_alloc(), and it is possible that a
         * freed object may be reallocated. We simply report this as a
         * use-after-free, with the stack trace showing the place where
         * the object was re-allocated.
         */
    }

out:
    if (to_report) {
        kfence_report_error(addr, is_write, regs, to_report, error_type);
        raw_spin_unlock_irqrestore(&to_report->lock, flags);
    } else {
        /* This may be a UAF or OOB access, but we can't be sure. */
        // The exact error type cannot be determined
        kfence_report_error(addr, is_write, regs, NULL, KFENCE_ERROR_INVALID);
    }

    return kfence_unprotect(addr); /* Unprotect and let access proceed. */
}
2) out-of-bounds (beyond the data page): KFENCE_ERROR_OOB
Same path as above.
3) out-of-bounds (within the data page): KFENCE_ERROR_CORRUPTION
The canary area is initialized during kfence alloc (see Section 3.4). During kfence free, the canary area is checked; if it has been corrupted, kfence_report_error() is called with KFENCE_ERROR_CORRUPTION to print the report.
static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
{
    ......
    /* Check canary bytes for memory corruption. */
    for_each_canary(meta, check_canary_byte);
    ......
}

/* __always_inline this to ensure we won't do an indirect call to fn. */
static __always_inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *))
{
    // pageaddr is the base address of this data page
    const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
    unsigned long addr;

    /*
     * We'll iterate over each canary byte per-side until fn() returns
     * false. However, we'll still iterate over the canary bytes to the
     * right of the object even if there was an error in the canary bytes to
     * the left of the object. Specifically, if check_canary_byte()
     * generates an error, showing both sides might give more clues as to
     * what the error is about when displaying which bytes were corrupted.
     */

    /* Apply to left of object. */
    // Check the canary area to the left of the object
    for (addr = pageaddr; addr < meta->addr; addr++) {
        if (!fn((u8 *)addr))
            break;
    }

    /* Apply to right of object. */
    // Check the canary area to the right of the object
    for (addr = meta->addr + meta->size; addr < pageaddr + PAGE_SIZE; addr++) {
        if (!fn((u8 *)addr))
            break;
    }
}

/* Check canary byte at @addr. */
static inline bool check_canary_byte(u8 *addr)
{
    struct kfence_metadata *meta;
    unsigned long flags;

    // If the canary byte is intact, return true; otherwise fall through and
    // report the error via kfence_report_error()
    if (likely(*addr == KFENCE_CANARY_PATTERN(addr)))
        return true;

    atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);

    // Look up the metadata object for this address
    meta = addr_to_metadata((unsigned long)addr);
    raw_spin_lock_irqsave(&meta->lock, flags);
    // Report the error with type KFENCE_ERROR_CORRUPTION
    kfence_report_error((unsigned long)addr, false, NULL, meta, KFENCE_ERROR_CORRUPTION);
    raw_spin_unlock_irqrestore(&meta->lock, flags);

    return false;
}

/*
 * Get the canary byte pattern for @addr. Use a pattern that varies based on the
 * lower 3 bits of the address, to detect memory corruptions with higher
 * probability, where similar constants are used.
 */
#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))
4) invalid-free: KFENCE_ERROR_INVALID_FREE
During kfence free, if the free is found to be invalid, kfence_report_error() is called with KFENCE_ERROR_INVALID_FREE to print the report.
static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
{
    ......
    // The free is invalid if the block was never allocated (which covers
    // double-free) or the freed address differs from the allocated address
    if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
        /* Invalid or double-free, bail out. */
        atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
        // Report the error with type KFENCE_ERROR_INVALID_FREE
        kfence_report_error((unsigned long)addr, false, NULL, meta,
                            KFENCE_ERROR_INVALID_FREE);
        raw_spin_unlock_irqrestore(&meta->lock, flags);
        return;
    }
    ......
}
Now let's look at how the report is printed: kfence_report_error() writes it to dmesg.
#define pr_err printk

void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs,
                         const struct kfence_metadata *meta, enum kfence_error_type type)
{
    ......
    /* Print report header. */
    switch (type) {
    // Out-of-page out-of-bounds access: print the report to dmesg
    case KFENCE_ERROR_OOB: {
        const bool left_of_object = address < meta->addr;

        pr_err("BUG: KFENCE: out-of-bounds %s in %pS\n\n", get_access_type(is_write),
               (void *)stack_entries[skipnr]);
        pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%td):\n",
               get_access_type(is_write), (void *)address,
               left_of_object ? meta->addr - address : address - meta->addr,
               left_of_object ? "left" : "right", object_index);
        break;
    }
    // Use-after-free: print the report to dmesg
    case KFENCE_ERROR_UAF:
        pr_err("BUG: KFENCE: use-after-free %s in %pS\n\n", get_access_type(is_write),
               (void *)stack_entries[skipnr]);
        pr_err("Use-after-free %s at 0x%p (in kfence-#%td):\n",
               get_access_type(is_write), (void *)address, object_index);
        break;
    // In-page overflow (corrupted canary area): print the report to dmesg
    case KFENCE_ERROR_CORRUPTION:
        pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
        pr_err("Corrupted memory at 0x%p ", (void *)address);
        print_diff_canary(address, 16, meta);
        pr_cont(" (in kfence-#%td):\n", object_index);
        break;
    case KFENCE_ERROR_INVALID:
        pr_err("BUG: KFENCE: invalid %s in %pS\n\n", get_access_type(is_write),
               (void *)stack_entries[skipnr]);
        pr_err("Invalid %s at 0x%p:\n", get_access_type(is_write),
               (void *)address);
        break;
    // Invalid free: print the report to dmesg
    case KFENCE_ERROR_INVALID_FREE:
        pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
        pr_err("Invalid free of 0x%p (in kfence-#%td):\n", (void *)address,
               object_index);
        break;
    }
    ......
}
3.3 kfence init
kfence initialization does the following:
1) Check whether the sampling interval kfence_sample_interval is 0; if so, kfence is disabled
2) Allocate the kfence pool. With the default of 255 objects it allocates (255+1)*2 = 512 pages: 255 data pages, 256 fence pages, and 1 unusable data page placed at the front (page 0)
3) Initialize the metadata array, which tracks the state of each data page
4) Initialize the freelist, which tracks which data pages are available for allocation
5) Mark all fence pages and page 0 inaccessible
// mm/kfence/core.c
void __init kfence_init(void)
{
    stack_hash_seed = get_random_u32();

    /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
    // 1. A sampling interval of 0 disables kfence
    if (!kfence_sample_interval)
        return;

    // 2. Initialize the kfence pool
    if (!kfence_init_pool_early()) {
        pr_err("%s failed\n", __func__);
        return;
    }

    kfence_init_enable();
}

static bool __init kfence_init_pool_early(void)
{
    unsigned long addr;

    if (!__kfence_pool)
        return false;

    addr = kfence_init_pool();
    ......
}

#define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE) // 256*2 pages by default

static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist); // freelist of unused blocks
struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS]; // metadata array tracking each data page's state
/*
 * Initialization of the KFENCE pool after its allocation.
 * Returns 0 on success; otherwise returns the address up to
 * which partial initialization succeeded.
 */
static unsigned long kfence_init_pool(void)
{
    unsigned long addr;
    struct page *pages;
    int i;

    if (!arch_kfence_init_pool())
        return (unsigned long)__kfence_pool;

    addr = (unsigned long)__kfence_pool;
    // Get the page structs backing the pool's virtual address
    pages = virt_to_page(__kfence_pool);

    /*
     * Set up object pages: they must have PG_slab set, to avoid freeing
     * these as real pages.
     *
     * We also want to avoid inserting kfence_free() in the kfree()
     * fast-path in SLUB, and therefore need to ensure kfree() correctly
     * enters __slab_free() slow-path.
     */
    // 512 pages by default
    for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
        struct slab *slab = page_slab(nth_page(pages, i));

        if (!i || (i % 2))
            continue;

        __folio_set_slab(slab_folio(slab));
#ifdef CONFIG_MEMCG
        slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
                           MEMCG_DATA_OBJCGS;
#endif
    }

    /*
     * Protect the first 2 pages. The first page is mostly unnecessary, and
     * merely serves as an extended guard page. However, adding one
     * additional page in the beginning gives us an even number of pages,
     * which simplifies the mapping of address to metadata index.
     */
    for (i = 0; i < 2; i++) {
        if (unlikely(!kfence_protect(addr)))
            return addr;
        addr += PAGE_SIZE;
    }

    for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
        struct kfence_metadata *meta = &kfence_metadata_init[i];

        /* Initialize metadata. */
        INIT_LIST_HEAD(&meta->list);
        raw_spin_lock_init(&meta->lock);
        // Mark the block as unused
        meta->state = KFENCE_OBJECT_UNUSED;
        // Record the block's address
        meta->addr = addr; /* Initialize for validation in metadata_to_pageaddr(). */
        // Add it to the freelist
        list_add_tail(&meta->list, &kfence_freelist);

        /* Protect the right redzone. */
        // Mark the fence page inaccessible
        if (unlikely(!kfence_protect(addr + PAGE_SIZE)))
            goto reset_slab;

        // Base address of the next data page: data pages are 8KB apart,
        // since a fence page sits between them
        addr += 2 * PAGE_SIZE;
    }

    /*
     * Make kfence_metadata visible only when initialization is successful.
     * Otherwise, if the initialization fails and kfence_metadata is freed,
     * it may cause UAF in kfence_shutdown_cache().
     */
    smp_store_release(&kfence_metadata, kfence_metadata_init);
    return 0;

reset_slab:
    for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
        struct slab *slab = page_slab(nth_page(pages, i));

        if (!i || (i % 2))
            continue;
#ifdef CONFIG_MEMCG
        slab->memcg_data = 0;
#endif
        __folio_clear_slab(slab_folio(slab));
    }
    return addr;
}
3.4 kfence alloc
kfence alloc does the following:
1) Find a free block (data page) in the kfence pool
2) Write a fixed pattern into the data page's canary area, to be checked at free time
void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
{
    unsigned long stack_entries[KFENCE_STACK_DEPTH];
    size_t num_stack_entries;
    u32 alloc_stack_hash;

    /*
     * Perform size check before switching kfence_allocation_gate, so that
     * we don't disable KFENCE without making an allocation.
     */
    // Requests larger than one page (4KB) are not handled; return NULL
    if (size > PAGE_SIZE) {
        atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
        return NULL;
    }

    /*
     * Skip allocations from non-default zones, including DMA. We cannot
     * guarantee that pages in the KFENCE pool will have the requested
     * properties (e.g. reside in DMAable memory).
     */
    if ((flags & GFP_ZONEMASK) ||
        (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32))) {
        atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
        return NULL;
    }

    /*
     * Skip allocations for this slab, if KFENCE has been disabled for
     * this slab.
     */
    // SLAB_SKIP_KFENCE means kfence is disabled for this slab; return NULL.
    /*
     * Other related slab flags:
     * SLAB_RECLAIM_ACCOUNT: the slab is reclaimable, i.e. its memory can be reused by memory reclaim.
     * SLAB_PANIC: panic on allocation failure, for troubleshooting.
     * SLAB_CONSISTENCY_CHECKS: enable consistency checks, to detect corruption and other problems.
     * SLAB_RED_ZONE: add red zones around each allocated block to detect out-of-bounds writes.
     * SLAB_STORE_USER: store user tracking information in the slab metadata.
     * SLAB_DEBUG_OBJECTS: enable additional object debugging.
     */
    if (s->flags & SLAB_SKIP_KFENCE)
        return NULL;

    // kfence_allocation_gate > 1 means the next sample point has not arrived yet
    if (atomic_inc_return(&kfence_allocation_gate) > 1)
        return NULL;
#ifdef CONFIG_KFENCE_STATIC_KEYS
    /*
     * waitqueue_active() is fully ordered after the update of
     * kfence_allocation_gate per atomic_inc_return().
     */
    if (waitqueue_active(&allocation_wait)) {
        /*
         * Calling wake_up() here may deadlock when allocations happen
         * from within timer code. Use an irq_work to defer it.
         */
        irq_work_queue(&wake_up_kfence_timer_work);
    }
#endif

    if (!READ_ONCE(kfence_enabled))
        return NULL;

    num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 0);

    /*
     * Do expensive check for coverage of allocation in slow-path after
     * allocation_gate has already become non-zero, even though it might
     * mean not making any allocation within a given sample interval.
     *
     * This ensures reasonable allocation coverage when the pool is almost
     * full, including avoiding long-lived allocations of the same source
     * filling up the pool (e.g. pagecache allocations).
     */
    alloc_stack_hash = get_alloc_stack_hash(stack_entries, num_stack_entries);
    if (should_skip_covered() && alloc_covered_contains(alloc_stack_hash)) {
        atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_COVERED]);
        return NULL;
    }

    return kfence_guarded_alloc(s, size, flags, stack_entries, num_stack_entries,
                                alloc_stack_hash);
}

static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp,
                                  unsigned long *stack_entries, size_t num_stack_entries,
                                  u32 alloc_stack_hash)
{
    // Metadata is managed via struct kfence_metadata
    struct kfence_metadata *meta = NULL;
    unsigned long flags;
    struct slab *slab;
    void *addr;
    const bool random_right_allocate = prandom_u32_max(2);
    const bool random_fault = CONFIG_KFENCE_STRESS_TEST_FAULTS &&
                              !prandom_u32_max(CONFIG_KFENCE_STRESS_TEST_FAULTS);

    /* Try to obtain a free object. */
    // Take a free block off the kfence freelist
    raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
    if (!list_empty(&kfence_freelist)) {
        meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
        list_del_init(&meta->list);
    }
    ......
    meta->addr = metadata_to_pageaddr(meta);
    /* Unprotect if we're reusing this page. */
    // If this data page was marked freed, clear that protection
    if (meta->state == KFENCE_OBJECT_FREED)
        kfence_unprotect(meta->addr);

    /*
     * Note: for allocations made before RNG initialization, will always
     * return zero. We still benefit from enabling KFENCE as early as
     * possible, even when the RNG is not yet available, as this will allow
     * KFENCE to detect bugs due to earlier allocations. The only downside
     * is that the out-of-bounds accesses detected are deterministic for
     * such allocations.
     */
    if (random_right_allocate) {
        /* Allocate on the "right" side, re-calculate address. */
        meta->addr += PAGE_SIZE - size;
        meta->addr = ALIGN_DOWN(meta->addr, cache->align);
    }
    addr = (void *)meta->addr;

    /* Update remaining metadata. */
    metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED, stack_entries, num_stack_entries);
    /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
    WRITE_ONCE(meta->cache, cache);
    meta->size = size;
    meta->alloc_stack_hash = alloc_stack_hash;
    raw_spin_unlock_irqrestore(&meta->lock, flags);

    alloc_covered_add(alloc_stack_hash, 1);

    /* Set required slab fields. */
    slab = virt_to_slab((void *)meta->addr);
    slab->slab_cache = cache;
#if defined(CONFIG_SLUB)
    slab->objects = 1;
#elif defined(CONFIG_SLAB)
    slab->s_mem = addr;
#endif

    /* Memory initialization. */
    // Initialize the canary area
    for_each_canary(meta, set_canary_byte);

    /*
     * We check slab_want_init_on_alloc() ourselves, rather than letting
     * SL*B do the initialization, as otherwise we might overwrite KFENCE's
     * redzone.
     */
    if (unlikely(slab_want_init_on_alloc(gfp, cache)))
        memzero_explicit(addr, size);
    if (cache->ctor)
        cache->ctor(addr);

    if (random_fault)
        kfence_protect(meta->addr); /* Random "faults" by protecting the object. */

    atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
    atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);

    return addr;
}
Here is how the fixed pattern is written into the data page's canary area:

/* Write canary byte to @addr. */
static inline bool set_canary_byte(u8 *addr)
{
    *addr = KFENCE_CANARY_PATTERN(addr);
    return true;
}
3.5 kfence free
kfence free does the following:
1) Mark the data page inaccessible once it is freed
2) Check whether the data page's canary area has been corrupted
3) Return the freed memory to the kfence pool's freelist
void __kfence_free(void *addr)
{
    // Convert the address to its struct kfence_metadata pointer; the
    // metadata tracks allocation/free information for the block
    struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);

#ifdef CONFIG_MEMCG
    KFENCE_WARN_ON(meta->objcg);
#endif
    /*
     * If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
     * the object, as the object page may be recycled for other-typed
     * objects once it has been freed. meta->cache may be NULL if the cache
     * was destroyed.
     */
    // If the cache exists and is flagged SLAB_TYPESAFE_BY_RCU, defer the
    // free via call_rcu: such objects may be reused immediately after being
    // freed, so the RCU mechanism is needed to release them safely
    if (unlikely(meta->cache && (meta->cache->flags & SLAB_TYPESAFE_BY_RCU)))
        call_rcu(&meta->rcu_head, rcu_guarded_free);
    else
        // Otherwise, free the block immediately
        kfence_guarded_free(addr, meta, false);
}

static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
{
    struct kcsan_scoped_access assert_page_exclusive;
    unsigned long flags;
    bool init;

    raw_spin_lock_irqsave(&meta->lock, flags);

    if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
        /* Invalid or double-free, bail out. */
        atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
        kfence_report_error((unsigned long)addr, false, NULL, meta,
                            KFENCE_ERROR_INVALID_FREE);
        raw_spin_unlock_irqrestore(&meta->lock, flags);
        return;
    }

    /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. */
    kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE,
                              KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT,
                              &assert_page_exclusive);

    if (CONFIG_KFENCE_STRESS_TEST_FAULTS)
        kfence_unprotect((unsigned long)addr); /* To check canary bytes. */

    /* Restore page protection if there was an OOB access. */
    if (meta->unprotected_page) {
        memzero_explicit((void *)ALIGN_DOWN(meta->unprotected_page, PAGE_SIZE), PAGE_SIZE);
        kfence_protect(meta->unprotected_page);
        meta->unprotected_page = 0;
    }

    /* Mark the object as freed. */
    // Once freed, the data page is marked inaccessible; any later access
    // immediately triggers a use-after-free fault
    metadata_update_state(meta, KFENCE_OBJECT_FREED, NULL, 0);
    init = slab_want_init_on_free(meta->cache);
    raw_spin_unlock_irqrestore(&meta->lock, flags);

    alloc_covered_add(meta->alloc_stack_hash, -1);

    /* Check canary bytes for memory corruption. */
    // Check whether the data page's canary area was corrupted
    for_each_canary(meta, check_canary_byte);

    /*
     * Clear memory if init-on-free is set. While we protect the page, the
     * data is still there, and after a use-after-free is detected, we
     * unprotect the page, so the data is still accessible.
     */
    if (!zombie && unlikely(init))
        memzero_explicit(addr, meta->size);

    /* Protect to detect use-after-frees. */
    kfence_protect((unsigned long)addr);

    kcsan_end_scoped_access(&assert_page_exclusive);
    // Unless this is a zombie object, return the block to the freelist
    if (!zombie) {
        /* Add it to the tail of the freelist for reuse. */
        raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
        KFENCE_WARN_ON(!list_empty(&meta->list));
        list_add_tail(&meta->list, &kfence_freelist);
        raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);

        atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]);
        atomic_long_inc(&counters[KFENCE_COUNTER_FREES]);
    } else {
        /* See kfence_shutdown_cache(). */
        atomic_long_inc(&counters[KFENCE_COUNTER_ZOMBIES]);
    }
}
3.6 metadata
metadata records the state of each memory block.
3.7 Core data structures
/* Alloc/free tracking information. */
// Tracks allocation/free information
struct kfence_track {
    pid_t pid;             // PID of the task that did the alloc/free
    int cpu;               // CPU on which the operation ran
    u64 ts_nsec;           // timestamp of the alloc/free
    int num_stack_entries; // number of stack trace entries
    unsigned long stack_entries[KFENCE_STACK_DEPTH]; // the stack trace itself
};

/* KFENCE error types for report generation. */
// Error type definitions
enum kfence_error_type {
    KFENCE_ERROR_OOB,          /* Detected an out-of-bounds access. */
    KFENCE_ERROR_UAF,          /* Detected a use-after-free access. */
    KFENCE_ERROR_CORRUPTION,   /* Detected a memory corruption on free. */
    KFENCE_ERROR_INVALID,      /* Invalid access of unknown type. */
    KFENCE_ERROR_INVALID_FREE, /* Invalid free. */
};

/* KFENCE object states. */
// States of a metadata object
enum kfence_object_state {
    KFENCE_OBJECT_UNUSED,    /* Object is unused. */
    KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */
    KFENCE_OBJECT_FREED,     /* Object was allocated, and then freed. */
};

/* KFENCE metadata per guarded allocation. */
// Records information about a data page
struct kfence_metadata {
    struct list_head list;    /* Freelist node; access under kfence_freelist_lock. */
    struct rcu_head rcu_head; /* For delayed freeing. */

    /*
     * Lock protecting below data; to ensure consistency of the below data,
     * since the following may execute concurrently: __kfence_alloc(),
     * __kfence_free(), kfence_handle_page_fault(). However, note that we
     * cannot grab the same metadata off the freelist twice, and multiple
     * __kfence_alloc() cannot run concurrently on the same metadata.
     */
    raw_spinlock_t lock;

    /* The current state of the object; see above. */
    enum kfence_object_state state; // state of the block

    /*
     * Allocated object address; cannot be calculated from size, because of
     * alignment requirements.
     *
     * Invariant: ALIGN_DOWN(addr, PAGE_SIZE) is constant.
     */
    unsigned long addr; // address of the data page block

    /*
     * The size of the original allocation.
     */
    size_t size; // original allocation size

    /*
     * The kmem_cache cache of the last allocation; NULL if never allocated
     * or the cache has already been destroyed.
     */
    struct kmem_cache *cache; // cache used for small allocations, reducing alloc/free overhead

    /*
     * In case of an invalid access, the page that was unprotected; we
     * optimistically only store one address.
     */
    unsigned long unprotected_page;

    /* Allocation and free stack information. */
    struct kfence_track alloc_track; // records allocation info
    struct kfence_track free_track;  // records free info

    /* For updating alloc_covered on frees. */
    u32 alloc_stack_hash; // hash of the allocation stack; comparing it on free improves accuracy and safety
#ifdef CONFIG_MEMCG
    struct obj_cgroup *objcg;
#endif
};
4. Porting kfence
kfence was introduced in Linux kernel 5.12; older kernels need a backport to use it. For example, Alibaba Cloud Linux 3 supports kfence as of kernel version 5.10.134-16.
The port consists of three parts:
1) Core framework code
This is the kfence feature itself; the main files are:
include/linux/kfence.h
init/main.c
lib/Kconfig.debug
lib/Kconfig.kfence
mm/Makefile
mm/kfence/Makefile
mm/kfence/core.c
mm/kfence/kfence.h
mm/kfence/report.c
2) ARM platform code
These are kfence's hooks for the arm64 platform; the main files are:
arch/arm64/Kconfig
arch/arm64/include/asm/kfence.h
arch/arm64/mm/fault.c
arch/arm64/mm/mmu.c
3) slub hooks
These are kfence's hooks in the slub allocator; the main files are:
include/linux/slub_def.h
mm/kfence/core.c
mm/slub.c