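I. Background
Linux kernel memory corruption: KASAN (Part 1) - CSDN blog
Linux kernel memory corruption: KASAN_SW_TAGS (Part 2) - CSDN blog
This final post covers KASAN_HW_TAGS, which on ARM64 means MTE. The feature was introduced in ARMv8.5, although in practice the chips on the market that support MTE are all ARMv9. Because the feature depends on hardware support, this post uses QEMU to study it.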
II. Configuration for enabling KASAN_HW_TAGS (MTE)
Kernel configuration
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_SW_TAGS=y
CONFIG_HAVE_ARCH_KASAN_HW_TAGS=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_KASAN_SW_TAGS=y
CONFIG_KASAN=y
# CONFIG_KASAN_GENERIC is not set
# CONFIG_KASAN_SW_TAGS is not set
CONFIG_KASAN_HW_TAGS=y // MTE-related
CONFIG_KASAN_VMALLOC=y
Check that the MTE-related features are enabled:
# ARMv8.5 architectural features
#
CONFIG_AS_HAS_ARMV8_5=y
......
CONFIG_ARM64_AS_HAS_MTE=y
CONFIG_ARM64_MTE=y
Confirming that MTE is actually enabled
geek@geek-virtual-machine:~/workspace/linux/qemu$ ./linux_boot.sh
qemu-system-aarch64: MTE requested, but not supported by the guest CPU
If MTE does not come up while debugging, a breakpoint in kasan_init_hw_tags is a good place to start:
void __init kasan_init_hw_tags(void)
{
	/* If hardware doesn't support MTE, don't initialize KASAN. */
	if (!system_supports_mte())
		return;
	......
	/* KASAN is now initialized, enable it. */
	static_branch_enable(&kasan_flag_enabled);

	pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, stacktrace=%s)\n",
		kasan_mode_info(),
		kasan_vmalloc_enabled() ? "on" : "off",
		kasan_stack_collection_enabled() ? "on" : "off");
}
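The error above turned out to be caused by the CPU type used previously, which does not support MTE. The modified QEMU launch script is shown below.
The main changes are adding mte=on to -machine and selecting a CPU that supports MTE, such as cortex-a710:
qemu-system-aarch64 \
    -machine virt,gic-version=3,mte=on \
    -nographic \
    -m size=2048M \
    -cpu cortex-a710 \
    -smp 8 \
    -kernel Image \
    -drive format=raw,file=rootfs.img \
    -append "root=/dev/vda rw nokaslr kasan=on kasan.mode=sync kasan.stacktrace=on kasan.fault=report " \
    -s
When MTE is enabled successfully, the kernel log (kmsg) prints: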
kasan: KernelAddressSanitizer initialized (hw-tags, mode=sync, vmalloc=on, stacktrace=on)
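III. Basic principles of KASAN_HW_TAGS (MTE)
The MTE lock-and-key model
In MTE the key is stored in the top byte of the pointer, and the lock is the tag attached to the memory. Memory can be accessed normally only when the key matches the lock.
Instructions added by MTE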
Instruction | Name |
ADDG | Add with Tag |
CMPP | Compare with Tag |
GMI | Tag Mask Insert |
IRG | Insert Random Tag |
LDG | Load Allocation Tag |
LDGV | Load Tag Vector |
ST2G | Store Allocation Tags to two granules |
STG | Store Allocation Tag |
STGP | Store Allocation Tag and Pair |
STGV | Store Tag Vector |
STZ2G | Store Allocation Tags to two granules Zeroing |
STZG | Store Allocation Tag, Zeroing |
SUBG | Subtract with Tag |
SUBP | Subtract Pointer |
SUBPS | Subtract Pointer, setting Flags |
... |
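Using MTE basically comes down to three steps:
1. memtag create (the lock)
2. address tag (the key in the pointer)
MTE relies on the ARM64 TBI (Top Byte Ignore) feature to store the tag in the top byte of the pointer. This is similar to the KASAN_SW_TAGS implementation described earlier, except that MTE needs only 4 bits.
3. tag check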
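The three steps can be modelled in plain C to make the lock-and-key idea concrete. This is only a software sketch under stated assumptions: it does not use MTE instructions, the `sim_*` names and the 128-byte simulated region are hypothetical, addresses are offsets into that region, and the caller picks the tag instead of the hardware generating a random one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SIZE 16u      /* MTE tags memory in 16-byte granules */
#define TAG_SHIFT    56       /* the tag lives in the pointer's top byte */
#define TAG_MASK     0xfull   /* MTE uses 4 bits of that byte */
#define SIM_GRANULES 8u       /* hypothetical: a 128-byte simulated region */

/* Hypothetical stand-in for hardware tag storage: one tag per granule. */
static uint8_t sim_mem_tags[SIM_GRANULES];

/* Step 1: memtag create (lock). Tag every granule in [offset, offset + size). */
static void sim_set_mem_tag(size_t offset, size_t size, uint8_t tag)
{
    assert(offset % GRANULE_SIZE == 0 && size % GRANULE_SIZE == 0);
    assert(offset + size <= SIM_GRANULES * GRANULE_SIZE);
    for (size_t g = offset / GRANULE_SIZE; g < (offset + size) / GRANULE_SIZE; g++)
        sim_mem_tags[g] = (uint8_t)(tag & TAG_MASK);
}

/* Step 2: address tag (key). Place the tag in bits 59:56 of the pointer value. */
static uint64_t sim_tag_pointer(uint64_t offset, uint8_t tag)
{
    offset &= ~(TAG_MASK << TAG_SHIFT);
    return offset | ((uint64_t)(tag & TAG_MASK) << TAG_SHIFT);
}

/* Step 3: tag check. The access is allowed only if key and lock match. */
static bool sim_access_ok(uint64_t tagged_ptr)
{
    uint8_t key = (uint8_t)((tagged_ptr >> TAG_SHIFT) & TAG_MASK);
    uint64_t offset = tagged_ptr & ((1ull << TAG_SHIFT) - 1);

    if (offset >= SIM_GRANULES * GRANULE_SIZE)
        return false;         /* outside the simulated region */
    return sim_mem_tags[offset / GRANULE_SIZE] == key;
}
```

With the first 32 bytes tagged 0x9 and a pointer carrying the same tag, `sim_access_ok` accepts offsets 0 through 31 and rejects offset 32, because the third granule still holds the default tag 0.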
IV. Key implementation of KASAN_HW_TAGS (MTE) in Linux
4.1 An example log first
This uses the same test program as before (see Linux kernel memory corruption: KASAN (Part 1) - CSDN blog):
/test # echo 0 > /dev/kasan_test
[ 156.628134] kmalloc_oob_right f9ff0000038b5000
[ 156.629125] ==================================================================
[ 156.633409] BUG: KASAN: invalid-access in kmalloc_oob_right.constprop.0+0x48/0x64 [kasan_driver]
[ 156.634892] Write at addr f9ff0000038b5081 by task sh/179
[ 156.635552] Pointer tag: [f9], memory tag: [fe]
[ 156.635990]
[ 156.636490] CPU: 4 PID: 179 Comm: sh Tainted: G N 6.6.1-gf1e080ccc5c5-dirty #19
[ 156.637310] Hardware name: linux,dummy-virt (DT)
[ 156.637771] Call trace:
[ 156.638111] dump_backtrace+0x90/0xe8
[ 156.638721] show_stack+0x18/0x24
[ 156.639046] dump_stack_lvl+0x48/0x60
[ 156.639391] print_report+0x100/0x600
[ 156.639703] kasan_report+0x84/0xac
[ 156.640034] __do_kernel_fault+0xa4/0x194
[ 156.640376] do_tag_check_fault+0x78/0x8c
[ 156.640724] do_mem_abort+0x44/0x94
[ 156.641052] el1_abort+0x40/0x60
[ 156.641367] el1h_64_sync_handler+0xa4/0xe4
[ 156.641719] el1h_64_sync+0x64/0x68
[ 156.642042] kmalloc_oob_right.constprop.0+0x48/0x64 [kasan_driver]
[ 156.642511] kasan_test_case+0x38/0xb0 [kasan_driver]
[ 156.642921] kasan_testcase_write+0x7c/0xf4 [kasan_driver]
[ 156.643350] vfs_write+0xc8/0x300
[ 156.643666] ksys_write+0x74/0x10c
[ 156.643986] __arm64_sys_write+0x1c/0x28
[ 156.644336] invoke_syscall+0x48/0x110
[ 156.644681] el0_svc_common.constprop.0+0x40/0xe0
[ 156.645082] do_el0_svc+0x1c/0x28
[ 156.645415] el0_svc+0x40/0x114
[ 156.645728] el0t_64_sync_handler+0x120/0x12c
[ 156.646092] el0t_64_sync+0x19c/0x1a0
[ 156.646528]
[ 156.646749] The buggy address belongs to the object at ffff0000038b5080
[ 156.646749] which belongs to the cache kmalloc-128 of size 128
[ 156.647547] The buggy address is located 1 bytes inside of
[ 156.647547] 128-byte region [ffff0000038b5080, ffff0000038b5100)
[ 156.648270]
[ 156.648533] The buggy address belongs to the physical page:
[ 156.649067] page:00000000ffd93f36 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x438b5
[ 156.650024] flags: 0x3fffc0000000800(slab|node=0|zone=0|lastcpupid=0xffff|kasantag=0x0)
[ 156.651089] page_type: 0xffffffff()
[ 156.651723] raw: 03fffc0000000800 f6ff000002c02600 dead000000000122 0000000000000000
[ 156.652262] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 156.652786] page dumped because: kasan: bad access detected
[ 156.653183]
[ 156.653375] Memory state around the buggy address:
[ 156.653836] ffff0000038b4e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
[ 156.654346] ffff0000038b4f00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
[ 156.654857] >ffff0000038b5000: f9 f9 f9 f9 f9 f9 f9 f9 fe fe fe fe fe fe fe fe
[ 156.655342] ^
[ 156.655870] ffff0000038b5100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 156.656351] ffff0000038b5200: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 156.656842] ==================================================================
[ 156.657836] Disabling lock debugging due to kernel taint
[ 156.659261] kasan_test_case type 0
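The example above triggers an out-of-bounds access. The key (pointer tag) is f9, but the out-of-bounds memory carries memory tag (lock) fe, so the access raises a fault.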
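The numbers in the report can be recomputed by hand from the two addresses it prints. The sketch below does that arithmetic; the helper names are hypothetical, and it assumes the pointer printed by the test (f9ff0000038b5000) is the start of the 128-byte allocation.

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_TAG_SHIFT 56    /* the tag is the top byte of the pointer */
#define GRANULE_SIZE    16u   /* bytes covered by one memory tag */

/* Top byte of a tagged kernel pointer, as shown in "Pointer tag: [..]". */
static uint8_t pointer_tag(uint64_t addr)
{
    return (uint8_t)(addr >> KASAN_TAG_SHIFT);
}

/* Restore the 0xff top byte that the report uses for untagged addresses. */
static uint64_t reset_tag(uint64_t addr)
{
    return addr | (0xffull << KASAN_TAG_SHIFT);
}

/* Index of the 16-byte granule holding addr, counted from object_start. */
static uint64_t granule_index(uint64_t addr, uint64_t object_start)
{
    assert(reset_tag(addr) >= reset_tag(object_start));
    return (reset_tag(addr) - reset_tag(object_start)) / GRANULE_SIZE;
}
```

For the faulting address f9ff0000038b5081 this gives pointer tag 0xf9 and granule index 8. A 128-byte object occupies granules 0 to 7, so granule 8 is the first one past the end, which matches the ninth entry (fe) in the ffff0000038b5000 row of the memory state dump.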
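4.2 Key code analysis
Consider kmalloc_oob_right from the test code. The disassembly shows that the MTE-based implementation no longer needs the tag load-and-compare code that generic KASAN and software tag-based KASAN insert around memory accesses. In MTE, all of this is done by hardware.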
(gdb) disassemble
Dump of assembler code for function kmalloc_oob_right:
   0xffff80007a8801b0 <+0>:     paciasp
=> 0xffff80007a8801b4 <+4>:     adrp    x0, 0xffff800081a2d000 <cpucap_ptrs+272>
   0xffff80007a8801b8 <+8>:     stp     x29, x30, [sp, #-32]!
   0xffff80007a8801bc <+12>:    mov     x2, #0x80               // #128
   0xffff80007a8801c0 <+16>:    mov     w1, #0xcc0              // #3264
   0xffff80007a8801c4 <+20>:    mov     x29, sp
   0xffff80007a8801c8 <+24>:    ldr     x0, [x0, #1752]
   0xffff80007a8801cc <+28>:    str     x19, [sp, #16]
   0xffff80007a8801d0 <+32>:    bl      0xffff80008022e498 <kmalloc_trace>
   0xffff80007a8801d4 <+36>:    mov     x19, x0
   0xffff80007a8801d8 <+40>:    adrp    x1, 0xffff80007a884000
   0xffff80007a8801dc <+44>:    add     x1, x1, #0x110
   0xffff80007a8801e0 <+48>:    mov     x2, x0
   0xffff80007a8801e4 <+52>:    add     x1, x1, #0x30
   0xffff80007a8801e8 <+56>:    adrp    x0, 0xffff80007a884000
   0xffff80007a8801ec <+60>:    add     x0, x0, #0x50
   0xffff80007a8801f0 <+64>:    bl      0xffff8000800f45a0 <_printk>
   0xffff80007a8801f4 <+68>:    mov     w1, #0x79               // #121
   0xffff80007a8801f8 <+72>:    strb    w1, [x19, #129]         // triggers the out-of-bounds write
   0xffff80007a8801fc <+76>:    mov     x0, x19
   0xffff80007a880200 <+80>:    bl      0xffff80008022f5d0 <kfree>
   0xffff80007a880204 <+84>:    ldr     x19, [sp, #16]
   0xffff80007a880208 <+88>:    ldp     x29, x30, [sp], #32
   0xffff80007a88020c <+92>:    autiasp
   0xffff80007a880210 <+96>:    ret
Setting the memory tag, again using kmalloc as the example:
kmalloc
    -->kmalloc_trace
    -->__kmem_cache_alloc_node
    -->slab_alloc_node
    -->slab_post_alloc_hook
    -->kasan_slab_alloc

void * __must_check __kasan_slab_alloc(struct kmem_cache *cache,
					void *object, gfp_t flags, bool init)
{
	....
	/*
	 * Generate and assign random tag for tag-based modes.
	 * Tag is ignored in set_tag() for the generic mode.
	 */
	tag = assign_tag(cache, object, false);	// 1. pick a random tag
	tagged_object = set_tag(object, tag);	// 2. set the tag in the pointer

	/*
	 * Unpoison the whole object.
	 * For kmalloc() allocations, kasan_kmalloc() will do precise poisoning.
	 */
	kasan_unpoison(tagged_object, cache->object_size, init);	// 3. set the memory tag

	/* Save alloc info (if possible) for non-kmalloc() allocations. */
	if (kasan_stack_collection_enabled() && !is_kmalloc_cache(cache))
		kasan_save_alloc_info(cache, tagged_object, flags);

	return tagged_object;
}

#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
#define __tag_shifted(tag)	((u64)(tag) << 56)
#define __tag_reset(addr)	__untagged_addr(addr)
#define __tag_get(addr)		(__u8)((u64)(addr) >> 56)

1. Assigning the tag
static inline u8 assign_tag(struct kmem_cache *cache,
			    const void *object, bool init)
{
	if (IS_ENABLED(CONFIG_KASAN_GENERIC))
		return 0xff;

	/*
	 * If the cache neither has a constructor nor has SLAB_TYPESAFE_BY_RCU
	 * set, assign a tag when the object is being allocated (init == false).
	 */
	if (!cache->ctor && !(cache->flags & SLAB_TYPESAFE_BY_RCU))
		return init ? KASAN_TAG_KERNEL : kasan_random_tag();

	/* For caches that either have a constructor or SLAB_TYPESAFE_BY_RCU: */
#ifdef CONFIG_SLAB
	/* For SLAB assign tags based on the object index in the freelist. */
	return (u8)obj_to_index(cache, virt_to_slab(object), (void *)object);
#else
	/*
	 * For SLUB assign a random tag during slab creation, otherwise reuse
	 * the already assigned tag.
	 */
	return init ? kasan_random_tag() : get_tag(object);
#endif
}

static inline u8 kasan_random_tag(void) { return hw_get_random_tag(); }

#ifdef CONFIG_KASAN_HW_TAGS
...
#define hw_get_random_tag()	arch_get_random_tag()
#define hw_get_mem_tag(addr)	arch_get_mem_tag(addr)
#define hw_set_mem_tag_range(addr, size, tag, init) \
	arch_set_mem_tag_range((addr), (size), (tag), (init))

#ifdef CONFIG_KASAN_HW_TAGS
...
#define arch_get_random_tag()	mte_get_random_tag()
#define arch_get_mem_tag(addr)	mte_get_mem_tag(addr)
#define arch_set_mem_tag_range(addr, size, tag, init) \
	mte_set_mem_tag_range((addr), (size), (tag), (init))
#endif /* CONFIG_KASAN_HW_TAGS */

/* Generate a random tag. */
static inline u8 mte_get_random_tag(void)
{
	void *addr;

	asm(__MTE_PREAMBLE "irg %0, %0"
		: "=r" (addr));

	return mte_get_ptr_tag(addr);
}

Setting the memory tag
static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
{
	addr = kasan_reset_tag(addr);

	/* Skip KFENCE memory if called explicitly outside of sl*b. */
	if (is_kfence_address(addr))
		return;

	if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
		return;
	if (WARN_ON(size & KASAN_GRANULE_MASK))
		return;

	hw_set_mem_tag_range((void *)addr, size, value, init);
}

Recall the definitions from earlier:
#define hw_set_mem_tag_range(addr, size, tag, init) \
	arch_set_mem_tag_range((addr), (size), (tag), (init))

#define arch_set_mem_tag_range(addr, size, tag, init) \
	mte_set_mem_tag_range((addr), (size), (tag), (init))

static inline void mte_set_mem_tag_range(void *addr, size_t size, u8 tag,
					 bool init)
{
	u64 curr, mask, dczid, dczid_bs, dczid_dzp, end1, end2, end3;

	/* Read DC G(Z)VA block size from the system register. */
	dczid = read_cpuid(DCZID_EL0);
	dczid_bs = 4ul << (dczid & 0xf);
	dczid_dzp = (dczid >> 4) & 1;

	curr = (u64)__tag_set(addr, tag);
	mask = dczid_bs - 1;
	/* STG/STZG up to the end of the first block. */
	end1 = curr | mask;
	end3 = curr + size;
	/* DC GVA / GZVA in [end1, end2) */
	end2 = end3 & ~mask;

	/*
	 * The following code uses STG on the first DC GVA block even if the
	 * start address is aligned - it appears to be faster than an alignment
	 * check + conditional branch. Also, if the range size is at least 2 DC
	 * GVA blocks, the first two loops can use post-condition to save one
	 * branch each.
	 */
#define SET_MEMTAG_RANGE(stg_post, dc_gva) \
	do { \
		if (!dczid_dzp && size >= 2 * dczid_bs) { \
			do { \
				curr = stg_post(curr); \
			} while (curr < end1); \
			\
			do { \
				dc_gva(curr); \
				curr += dczid_bs; \
			} while (curr < end2); \
		} \
		\
		while (curr < end3) \
			curr = stg_post(curr); \
	} while (0)

	if (init)
		SET_MEMTAG_RANGE(__stzg_post, __dc_gzva);
	else
		SET_MEMTAG_RANGE(__stg_post, __dc_gva);
#undef SET_MEMTAG_RANGE
}

static inline u64 __stg_post(u64 p)
{
	asm volatile(__MTE_PREAMBLE "stg %0, [%0], #16"
		     : "+r"(p)
		     :
		     : "memory");
	return p;
}
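As the core implementation above shows, two instructions do most of the work: IRG generates the random tag for the pointer (the key), and STG writes that tag to memory (the lock).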
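The way mte_set_mem_tag_range() splits a range between per-granule STG stores and whole-block DC GVA operations is easier to follow with concrete numbers. The sketch below copies only the loop structure of the kernel function and counts operations instead of issuing instructions; `count_tag_ops`, `struct tag_ops`, and the `block_size` and `dc_prohibited` parameters (standing in for dczid_bs and dczid_dzp) are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MTE_GRANULE 16u

struct tag_ops {
    unsigned stg;      /* per-granule STG/STZG stores */
    unsigned dc_gva;   /* whole-block DC GVA/GZVA operations */
};

/* Count the operations the three loops would issue for [addr, addr + size). */
static struct tag_ops count_tag_ops(uint64_t addr, uint64_t size,
                                    uint64_t block_size, int dc_prohibited)
{
    struct tag_ops ops = { 0, 0 };
    uint64_t curr = addr;
    uint64_t mask = block_size - 1;
    uint64_t end1 = curr | mask;    /* last byte of the first block */
    uint64_t end3 = curr + size;    /* end of the whole range */
    uint64_t end2 = end3 & ~mask;   /* start of the final, possibly partial, block */

    assert(addr % MTE_GRANULE == 0 && size % MTE_GRANULE == 0);
    assert(block_size >= MTE_GRANULE && (block_size & mask) == 0);

    if (!dc_prohibited && size >= 2 * block_size) {
        do {                        /* STG up to the end of the first block */
            curr += MTE_GRANULE;
            ops.stg++;
        } while (curr < end1);

        do {                        /* DC GVA over the aligned middle blocks */
            curr += block_size;
            ops.dc_gva++;
        } while (curr < end2);
    }

    while (curr < end3) {           /* STG for whatever remains */
        curr += MTE_GRANULE;
        ops.stg++;
    }
    return ops;
}
```

Assuming a 64-byte block size, a 256-byte range at 0x1000 needs 4 STG stores for the first block and 3 DC GVA operations for the rest, compared with 16 STG stores when DC GVA is prohibited. A 64-byte range is too small for the fast path and uses 4 STG stores.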
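4.3 Where are the tags stored?
MTE divides tags into two kinds:
Address Tag: the key. It is 4 bits wide and stored in the top byte of the virtual address (using the ARM64 TBI feature).
Memory Tag: also called the lock. The memory tag is also 4 bits wide, and each 4-bit tag covers 16 bytes of memory. Unlike generic KASAN and software tag-based KASAN, MTE memory tags are stored by the hardware.
As the figure above shows, MTE tags are in fact also stored in memory. At 4 bits of tag per 16 bytes of data, enabling MTE costs 1/32 of physical memory. The address of this tag storage is not visible from the kernel, however, and nothing in the kernel configures it.
According to the ARM manual, as shown in the figure above, a Memory Tag Unit (MTU) manages tag storage and keeps it separate from data storage.
The CI-700 manual describes how the base physical address of the MTE tag storage is set, and notes that this register can only be written from the secure world (EL3). That is why no such setup can be found in the kernel. (Hardware platforms with MTE enabled typically add a reserved-memory region to the device tree; this region is configured in TZ and holds the tag data.)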
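The 1/32 figure follows directly from the tag width and granule size. A small sketch of the arithmetic, with `tag_storage_bytes` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of tag (or shadow) storage needed to cover mem_bytes of data. */
static uint64_t tag_storage_bytes(uint64_t mem_bytes,
                                  unsigned tag_bits, unsigned granule_bytes)
{
    assert(tag_bits > 0 && granule_bytes > 0);
    /* One tag of tag_bits per granule, 8 bits per byte. */
    return mem_bytes / granule_bytes * tag_bits / 8;
}
```

For the 2048 MiB guest used in the QEMU script, 4-bit tags on 16-byte granules come to 64 MiB, which is 1/32 of memory. The same helper with one byte per 8-byte granule or one byte per 16-byte granule gives 256 MiB (1/8) and 128 MiB (1/16) for the same guest.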
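V. Using MTE from user space
The preceding sections covered how the kernel implements and uses MTE. User space works in much the same way, and the Arm website provides a good example: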
/*
 * Memory Tagging Extension (MTE) example for Linux
 *
 * Compile with gcc and use -march=armv8.5-a+memtag
 *    gcc mte-example.c -o mte-example -march=armv8.5-a+memtag
 *
 * Compilation should be done on a recent Arm Linux machine for the .h files to include MTE support.
 *
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/*
 * Insert a random logical tag into the given pointer.
 * IRG instruction.
 */
#define insert_random_tag(ptr) ({ \
    uint64_t __val; \
    asm("irg %0, %1" : "=r" (__val) : "r" (ptr)); \
    __val; \
})

/*
 * Set the allocation tag on the destination address.
 * STG instruction.
 */
#define set_tag(tagged_addr) do { \
    asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
} while (0)

int main(void)
{
    unsigned char *ptr;   // pointer to memory for MTE demonstration

    /*
     * Use the architecture dependent information about the processor
     * from getauxval() to check if MTE is available.
     */
    if (!((getauxval(AT_HWCAP2)) & HWCAP2_MTE))
    {
        printf("MTE is not supported\n");
        return EXIT_FAILURE;
    }
    else
    {
        printf("MTE is supported\n");
    }

    /*
     * Enable MTE with synchronous checking
     */
    if (prctl(PR_SET_TAGGED_ADDR_CTRL,
              PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
              0, 0, 0))
    {
        perror("prctl() failed");
        return EXIT_FAILURE;
    }

    /*
     * Allocate 1 page of memory with MTE protection
     */
    ptr = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE | PROT_MTE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
    {
        perror("mmap() failed");
        return EXIT_FAILURE;
    }

    /*
     * Print the pointer value with the default tag (expecting 0)
     */
    printf("pointer is %p\n", ptr);

    /*
     * Write the first 2 bytes of the memory with the default tag
     */
    ptr[0] = 0x41;
    ptr[1] = 0x42;

    /*
     * Read back to confirm the writes
     */
    printf("ptr[0] = 0x%hhx ptr[1] = 0x%hhx\n", ptr[0], ptr[1]);

    /*
     * Generate a random tag and store it for the address (IRG instruction)
     */
    ptr = (unsigned char *) insert_random_tag(ptr);

    /*
     * Set the key on the pointer to match the lock on the memory (STG instruction)
     */
    set_tag(ptr);

    /*
     * Print the pointer value with the new tag
     */
    printf("pointer is now %p\n", ptr);

    /*
     * Write the first 2 bytes of the memory again, with the new tag
     */
    ptr[0] = 0x43;
    ptr[1] = 0x44;

    /*
     * Read back to confirm the writes
     */
    printf("ptr[0] = 0x%hhx ptr[1] = 0x%hhx\n", ptr[0], ptr[1]);

    /*
     * Write to memory beyond the 16 byte granule (offset 0x10)
     * MTE should generate an exception
     * If the offset is less than 0x10 no SIGSEGV will occur.
     */
    printf("Expecting SIGSEGV...\n");
    ptr[0x10] = 0x55;

    /*
     * Program only reaches this if no SIGSEGV occurs
     */
    printf("...no SIGSEGV was received\n");

    return EXIT_FAILURE;
}
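The example above is simple: it uses the irg instruction to tag the pointer (generating the key) and the stg instruction to set the matching lock on the memory, then performs an out-of-bounds access, which raises an exception.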
Result of running it in QEMU: (screenshot not reproduced here)
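Separately from the QEMU run, one detail of the example is worth spelling out: the `0xfffe << PR_MTE_TAG_SHIFT` argument to prctl(). To my understanding of the kernel MTE documentation, this is an include mask that selects which tags IRG may generate, so clearing bit 0 excludes tag 0. Freshly mapped PROT_MTE memory carries tag 0, so the retagged pointer always differs from the untouched granules and the write to ptr[0x10] faults every time. The sketch below models that mask; the `demo_*` names are hypothetical, and `DEMO_PR_MTE_TAG_SHIFT` is assumed to mirror the value of PR_MTE_TAG_SHIFT in <linux/prctl.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors PR_MTE_TAG_SHIFT from <linux/prctl.h>; verify locally. */
#define DEMO_PR_MTE_TAG_SHIFT 3

/* Build the tag-include field the way the example does: mask << shift. */
static unsigned long demo_tag_field(unsigned long include_mask)
{
    return (include_mask & 0xfffful) << DEMO_PR_MTE_TAG_SHIFT;
}

/* True if the include mask in `field` allows IRG to produce `tag`. */
static bool demo_tag_allowed(unsigned long field, unsigned tag)
{
    assert(tag < 16);
    return ((field >> DEMO_PR_MTE_TAG_SHIFT) & (1ul << tag)) != 0;
}
```

With the example's mask of 0xfffe, tag 0 is rejected and tags 1 through 15 are allowed.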
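VI. Summary
Comparison of the kernel's memory corruption detection tools
Type | Shadow memory overhead | CPU overhead | Pros and cons |
---|---|---|---|
KASAN | 1/8 | Complex: every memory access computes and compares a shadow value | Precise: detects corruption even within an 8-byte granule; works on both 32-bit and 64-bit |
KASAN_SW_TAGS | 1/16 | Every memory access computes and compares a shadow value | Cannot distinguish corruption within a 16-byte granule; 64-bit only (depends on the arm64 TBI feature) |
KASAN_HW_TAGS (MTE) | 1/32 | About 5% overhead; tag generation and checking are done by hardware | Cannot distinguish corruption within a 16-byte granule; only usable on platforms that support MTE |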
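The granularity column can be made concrete with a simplified model. The sketch below assumes the object starts on a granule boundary, that generic KASAN tracks valid bytes exactly, and that both tag-based modes only see whole 16-byte granules. The enum and function names are hypothetical, and real behavior also depends on redzones and allocator details not modelled here.

```c
#include <assert.h>
#include <stddef.h>

enum kasan_mode { MODE_GENERIC, MODE_SW_TAGS, MODE_HW_TAGS };

/*
 * Smallest offset from the start of a size-byte object at which an
 * access is expected to be flagged, under the simplified model above.
 */
static size_t first_detected_offset(enum kasan_mode mode, size_t size)
{
    const size_t granule = 16;   /* tag granule for both tag-based modes */

    if (mode == MODE_GENERIC)
        return size;             /* byte-exact within the last 8-byte granule */
    return (size + granule - 1) / granule * granule;
}
```

For a 120-byte object, the model flags generic KASAN accesses from offset 120, while both tag-based modes first flag offset 128. The 8 bytes in between share a granule, and therefore a tag, with the end of the object.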
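Compared with KASAN_SW_TAGS, the main gain from MTE is performance; its limitations and capabilities are close to those of KASAN_SW_TAGS. MTE was not really created for debugging. Google wants to see MTE deployed in production builds, and the fundamental goal is to solve memory safety problems. For now there is still a real performance cost (at the time of writing, no vendor had enabled it on end-user devices). As MTE itself is optimized and CPU performance keeps improving, MTE may reach production builds of shipping products before long.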
References:
Memory Tagging Extension (MTE) in AArch64 Linux
Learn about the Arm Memory Tagging Extension: Build and run an example application to learn about MTE
Arm Memory Tagging Extension (MTE) | Android NDK | Android Developers
Introduction to ARM MTE - CSDN blog
https://www.qemu.org/docs/master/system/arm/virt.html
https://www.kernel.org/doc/html/v5.15/arm64/memory-tagging-extension.html
Documentation - Arm Developer