Overview
KVM is a kernel-based virtual machine monitor, developed by the Israeli startup Qumranet after CPUs gained hardware virtualization support.
KVM is an umbrella term for the whole virtualization solution: besides x86, other architectures such as ARM have their own implementations. The architecture-independent core of KVM therefore lives under virt/kvm in the kernel tree; this common code, shared by all CPU architectures, is what kvm.ko is built from.
Architecture-specific code lives under arch/; for x86, it is in arch/x86/kvm. A single architecture may also have several implementations: on x86, KVM supports both Intel and AMD CPUs, so the x86 directory contains multiple implementations, e.g. Intel's vmx.c (for Intel's VMX/VT-x scheme) and AMD's svm.c (for AMD-V), while ioapic.c and lapic.c implement the interrupt controllers. These are the sources of kvm-intel.ko and kvm-amd.ko. This source layout is also common in other Linux kernel subsystems.
Every KVM implementation (Intel and AMD) registers a kvm_x86_ops structure with the KVM module. As a result, many KVM functions are just shells: they first call a kvm_arch_xxx function, i.e. an architecture-specific function, and when that kvm_arch_xxx function needs implementation-specific code, it invokes the corresponding callback in kvm_x86_ops.
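As a simplified illustration of this layering (a sketch, not the exact kernel source; the real function contains more bookkeeping):

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
    /* ... common x86 bookkeeping ... */
    kvm_x86_ops.vcpu_load(vcpu, cpu); /* dispatches to vmx_vcpu_load or svm_vcpu_load */
}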
The relationship between kvm_intel.ko and kvm.ko:
VM Creation
VM creation on the QEMU side
The entry points for KVM support in QEMU are mostly in kvm-all.c; the initialization function is kvm_init().
When QEMU is launched with the --enable-kvm command-line option, qemu_init() handles it as follows:
case QEMU_OPTION_enable_kvm:
    olist = qemu_find_opts("machine");
    qemu_opts_parse_noisily(olist, "accel=kvm", false);
    break;
This appends an accel=kvm parameter to the machine options list (equivalent to passing -machine accel=kvm directly). Later, main() calls configure_accelerator(current_machine), which reads the accel value from the machine options, looks up the corresponding accelerator type, and calls accel_init_machine():
int accel_init_machine(AccelState *accel, MachineState *ms)
{
    AccelClass *acc = ACCEL_GET_CLASS(accel); /* get the accel class of the requested type (here: kvm) */
    int ret;
    ms->accelerator = accel;
    *(acc->allowed) = true;
    ret = acc->init_machine(ms); /* run the class's init_machine hook */
    if (ret < 0) {
        ms->accelerator = NULL;
        *(acc->allowed) = false;
        object_unref(OBJECT(accel));
    } else {
        object_set_accelerator_compat_props(acc->compat_props);
    }
    return ret;
}
So which function is the init_machine hook for accel=kvm?
#define TYPE_KVM_ACCEL ACCEL_CLASS_NAME("kvm") /* TYPE_KVM_ACCEL expands to "kvm-accel" */
Then, in kvm-all.c, the init_machine hook is set when the kvm_accel_type class is initialized:
static void kvm_accel_class_init(ObjectClass *oc, void *data)
{
    AccelClass *ac = ACCEL_CLASS(oc);
    ac->name = "KVM";
    ac->init_machine = kvm_init; /* the kvm accel's init_machine hook is kvm_init() */
    ac->has_memory = kvm_accel_has_memory;
    ac->allowed = &kvm_allowed;
    ...
}

/* the kvm_accel_type type info */
static const TypeInfo kvm_accel_type = {
    .name = TYPE_KVM_ACCEL,
    .parent = TYPE_ACCEL,
    .instance_init = kvm_accel_instance_init,
    .class_init = kvm_accel_class_init,
    .instance_size = sizeof(KVMState),
};

static void kvm_type_init(void)
{
    type_register_static(&kvm_accel_type); /* register kvm_accel_type */
}
type_init(kvm_type_init);
The kvm_init() function in kvm-all.c:
static int kvm_init(MachineState *ms)
{
    /* ... */
    s = KVM_STATE(ms->accelerator);
    /* ... */
    s->fd = qemu_open("/dev/kvm", O_RDWR); /* open /dev/kvm and keep the fd */
    /* ... */
    do {
        /* ioctl on the /dev/kvm fd; KVM_CREATE_VM tells kvm.ko to create a VM */
        ret = kvm_ioctl(s, KVM_CREATE_VM, type);
    } while (ret == -EINTR);
    /* ... */
    ret = kvm_arch_init(ms, s); /* architecture-specific initialization */
    /* ... */
    return ret;
}
The main job of kvm_init() is to use the series of ioctl interfaces provided by /dev/kvm to create a virtual machine inside the kernel's KVM module. One QEMU process corresponds to one VM.
VM creation on the KVM side
The main entry points of the kernel kvm module are in kvm_main.c. The analysis below uses the kvm + Intel combination; wherever architecture matters, Intel is assumed.
Data structures
Inside the kernel kvm module, a struct kvm object represents one virtual machine.
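An abridged excerpt of the structure (from include/linux/kvm_host.h; exact fields vary by kernel version):

struct kvm {
    struct mm_struct *mm;                   /* userspace (QEMU) address space */
    struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM]; /* guest memory slots */
    struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];  /* this VM's virtual CPUs */
    atomic_t online_vcpus;
    struct list_head vm_list;               /* links all VMs on the host */
    struct kvm_io_bus __rcu *buses[KVM_NR_BUSES]; /* emulated I/O buses */
    struct kvm_arch arch;                   /* architecture-specific part */
    /* ... */
};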
Initializing /dev/kvm
kvm_init() initializes the /dev/kvm device for QEMU to access, and initializes the corresponding ops function tables.
On x86, KVM's ops object is kvm_x86_ops. The global variable kvm_x86_ops is defined in arch/x86/kvm/x86.c:
struct kvm_x86_ops kvm_x86_ops __read_mostly;
EXPORT_SYMBOL_GPL(kvm_x86_ops);
kvm_x86_ops is a table of function pointers; the concrete functions are filled in from vmx_x86_ops.
struct kvm_x86_ops {
    int (*hardware_enable)(void);
    void (*hardware_disable)(void);
    void (*hardware_unsetup)(void);
    bool (*cpu_has_accelerated_tpr)(void);
    bool (*has_emulated_msr)(u32 index);
    void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

    unsigned int vm_size;
    int (*vm_init)(struct kvm *kvm);
    void (*vm_destroy)(struct kvm *kvm);
    /* ... a long list of further function pointers ... */
};
vmx_init() in the x86 vmx.c passes vmx_init_ops when it calls kvm_init():
r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
             __alignof__(struct vcpu_vmx), THIS_MODULE);
The part that actually does the work is vmx_x86_ops, initialized in arch/x86/kvm/vmx/vmx.c:
static struct kvm_x86_init_ops vmx_init_ops __initdata = {
    .cpu_has_kvm_support = cpu_has_kvm_support,
    .disabled_by_bios = vmx_disabled_by_bios,
    .check_processor_compatibility = vmx_check_processor_compat,
    .hardware_setup = hardware_setup,

    .runtime_ops = &vmx_x86_ops,
};
vmx_x86_ops is likewise a global static object; its contents:
static struct kvm_x86_ops vmx_x86_ops __initdata = {
    .hardware_unsetup = hardware_unsetup,
    .hardware_enable = hardware_enable,
    .hardware_disable = hardware_disable,
    .cpu_has_accelerated_tpr = report_flexpriority,
    .has_emulated_msr = vmx_has_emulated_msr,

    .vm_size = sizeof(struct kvm_vmx),
    .vm_init = vmx_vm_init,
    /* ... */
};
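During hardware setup, KVM copies these runtime ops into the global kvm_x86_ops table. Roughly (a simplified sketch of kvm_arch_hardware_setup(); details vary by kernel version):

int kvm_arch_hardware_setup(void *opaque)
{
    struct kvm_x86_init_ops *ops = opaque;
    int r;

    /* ... */
    r = ops->hardware_setup();
    if (r != 0)
        return r;

    memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); /* wires in vmx_x86_ops */
    /* ... */
    return 0;
}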
The kernel's kvm_main.c defines the kvm misc device, the char-device ioctl, the VM ioctl, and the VCPU ioctl as global file_operations so that userspace requests can be served.
static struct file_operations kvm_vcpu_fops = {
    .release = kvm_vcpu_release,
    .unlocked_ioctl = kvm_vcpu_ioctl,
    .mmap = kvm_vcpu_mmap,
    .llseek = noop_llseek,
    KVM_COMPAT(kvm_vcpu_compat_ioctl),
};

static struct file_operations kvm_vm_fops = {
    .release = kvm_vm_release,
    .unlocked_ioctl = kvm_vm_ioctl,
    .llseek = noop_llseek,
    KVM_COMPAT(kvm_vm_compat_ioctl),
};

static struct file_operations kvm_chardev_ops = {
    .unlocked_ioctl = kvm_dev_ioctl,
    .llseek = noop_llseek,
    KVM_COMPAT(kvm_dev_ioctl),
};

static struct miscdevice kvm_dev = {
    KVM_MINOR,
    "kvm",
    &kvm_chardev_ops,
};

kvm_preempt_ops.sched_in = kvm_sched_in;
kvm_preempt_ops.sched_out = kvm_sched_out;
kvm_dev_ioctl:
ioctl | handler |
---|---|
KVM_GET_API_VERSION | returns KVM_API_VERSION |
KVM_CREATE_VM | create a VM: kvm_dev_ioctl_create_vm() --> kvm_create_vm() |
KVM_CHECK_EXTENSION | check extension support: kvm_vm_ioctl_check_extension_generic() |
KVM_GET_VCPU_MMAP_SIZE | returns the size of the QEMU/KVM shared memory |
… |
kvm_vm_ioctl:
ioctl | handler |
---|---|
KVM_CREATE_VCPU | create a vcpu: kvm_vm_ioctl_create_vcpu |
KVM_ENABLE_CAP | kvm_vm_ioctl_enable_cap_generic |
KVM_SET_USER_MEMORY_REGION | kvm_vm_ioctl_set_memory_region |
KVM_GET_DIRTY_LOG | kvm_vm_ioctl_get_dirty_log |
KVM_REGISTER_COALESCED_MMIO | |
KVM_IRQFD | kvm_irqfd |
KVM_IOEVENTFD | kvm_ioeventfd |
KVM_CREATE_DEVICE | kvm_ioctl_create_device |
KVM_CHECK_EXTENSION | kvm_vm_ioctl_check_extension_generic |
… |
kvm_vcpu_ioctl:
ioctl | handler |
---|---|
KVM_RUN | run the vcpu: kvm_arch_vcpu_ioctl_run() |
KVM_GET_REGS | |
KVM_SET_REGS | |
… |
The relationship among kvm_dev_ioctl, kvm_vm_ioctl, and kvm_vcpu_ioctl:
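The three layers mirror the fd hierarchy seen from userspace. A minimal sketch (not a runnable VM: guest memory and register setup are omitted, as is error handling):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm_fd  = open("/dev/kvm", O_RDWR);         /* served by kvm_dev_ioctl */
    int vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, 0);  /* new VM fd, served by kvm_vm_ioctl */
    int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0); /* new vcpu fd, served by kvm_vcpu_ioctl */

    ioctl(vcpu_fd, KVM_RUN, 0);                     /* run the vcpu */
    return 0;
}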
CPU creation in QEMU
The CPU model inheritance hierarchy in QEMU:
All x86 CPUs supported by QEMU are defined in the builtin_x86_defs array (of type X86CPUDefinition) in target/i386/cpu.c:
/* Base definition for a CPU model */
typedef struct X86CPUDefinition {
    const char *name;
    uint32_t level;
    uint32_t xlevel;
    /* vendor is zero-terminated, 12 character ASCII string */
    char vendor[CPUID_VENDOR_SZ + 1];
    int family;
    int model;
    int stepping;
    FeatureWordArray features;
    const char *model_id;
    CPUCaches *cache_info;
    /* Use AMD EPYC encoding for apic id */
    bool use_epyc_apic_id_encoding;
    /*
     * Definitions for alternative versions of CPU model.
     * List is terminated by item with version == 0.
     * If NULL, version 1 will be registered automatically.
     */
    const X86CPUVersionDefinition *versions;
} X86CPUDefinition;
Where:
X86CPUDefinition member | meaning |
---|---|
name | CPU model name |
level | largest basic CPUID function number supported |
xlevel | largest extended CPUID function number supported |
vendor, family, model, stepping | basic CPU identification |
features | array recording the CPU feature words |
model_id | full CPU model name |
The builtin_x86_defs array:
static X86CPUDefinition builtin_x86_defs[] = {
    {
        .name = "qemu64",
        .level = 0xd,
        .vendor = CPUID_VENDOR_AMD,
        .family = 6,
        .model = 6,
        .stepping = 3,
        .features[FEAT_1_EDX] =
            PPRO_FEATURES |
            CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
            CPUID_PSE36,
        .features[FEAT_1_ECX] =
            CPUID_EXT_SSE3 | CPUID_EXT_CX16,
        .features[FEAT_8000_0001_EDX] =
            CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
        .features[FEAT_8000_0001_ECX] =
            CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM,
        .xlevel = 0x8000000A,
        .model_id = "QEMU Virtual CPU version " QEMU_HW_VERSION,
    },
    ... /* more than 2000 lines of further definitions */
};
QEMU instantiates a virtual x86 CPU through the struct X86CPU structure:
The function call path for creating a vcpu in QEMU:
The kvm_init_vcpu() code in QEMU:
int kvm_init_vcpu(CPUState *cpu)
{
    /* most code is omitted below; only the interesting parts remain */
    ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu)); /* KVM_CREATE_VCPU: create the vcpu */
    mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0); /* size of the shared memory */
    cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        cpu->kvm_fd, 0); /* mmap the vcpu fd; on the KVM side this is handled by kvm_vcpu_mmap() */
    ret = kvm_arch_init_vcpu(cpu);
    return ret;
}
CPU creation in KVM
Data shared between QEMU and KVM
QEMU and KVM frequently need to share data; for example, KVM puts VM-exit information into shared memory, from which QEMU can read it. The shared region between QEMU and KVM is set up when QEMU creates a VCPU.
In kvm_init_vcpu(), QEMU issues kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0); this interface returns the size of the memory shared between QEMU and KVM.
On the KVM side, this ioctl is handled by:
static long kvm_dev_ioctl(struct file *filp,
                          unsigned int ioctl, unsigned long arg)
{
    /* ... */
    case KVM_GET_VCPU_MMAP_SIZE:
        if (arg)
            goto out;
        r = PAGE_SIZE;     /* struct kvm_run */
#ifdef CONFIG_X86
        r += PAGE_SIZE;    /* pio data page */
#endif
#ifdef CONFIG_KVM_MMIO
        r += PAGE_SIZE;    /* coalesced mmio ring page */
#endif
        break;
    /* ... */
    return r;
}
ioctl(KVM_GET_VCPU_MMAP_SIZE) may therefore report one, two, or three pages. The first page holds struct kvm_run, the structure used for the basic data exchange between QEMU and KVM; the second page stores the data for guest I/O-port accesses; the last page is for coalesced MMIO.
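For illustration, a minimal userspace sketch of mapping the region (the helper map_vcpu_run is hypothetical; kvm_fd and vcpu_fd are open fds as in the earlier sketch, and error handling is omitted):

#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static struct kvm_run *map_vcpu_run(int kvm_fd, int vcpu_fd)
{
    long mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);

    /* page 0 backs struct kvm_run; on x86 the PIO data page is reached
     * through run->io.data_offset after a KVM_EXIT_IO */
    return mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, vcpu_fd, 0);
}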
QEMU then mmaps the shared memory; on the KVM side the mapping is handled by:
static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
{
    struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
    struct page *page;

    if (vmf->pgoff == 0)
        page = virt_to_page(vcpu->run);
#ifdef CONFIG_X86
    else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
        page = virt_to_page(vcpu->arch.pio_data);
#endif
#ifdef CONFIG_KVM_MMIO
    else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
        page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
#endif
    else
        return kvm_arch_vcpu_fault(vcpu, vmf);
    get_page(page);
    vmf->page = page;
    return 0;
}

static const struct vm_operations_struct kvm_vcpu_vm_ops = {
    .fault = kvm_vcpu_fault,
};

static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
{
    vma->vm_ops = &kvm_vcpu_vm_ops;
    return 0;
}
When QEMU mmaps the VCPU fd (an anonymous file), only virtual address space is allocated; its operations are set to kvm_vcpu_vm_ops, which contains a single fault callback, kvm_vcpu_fault. kvm_vcpu_fault is invoked when QEMU's access to the shared memory raises a page fault; as the code shows, that is the point at which the kernel wires the corresponding data pages into QEMU's virtual address space.
shared memory page accessed | actually backed by |
---|---|
page1 | kvm_vcpu->run |
page2 | kvm_vcpu->arch.pio_data |
page3 | kvm->coalesced_mmio_ring |
Running the VCPU
QEMU runs the VCPU
Every VCPU has a corresponding VMCS (Virtual Machine Control Structure), the key data structure with which Intel x86 processors record vCPU state for CPU virtualization. The physical address of a VMCS is used as the operand of the VMX instructions. A VMCS is in one of four states:
- Inactive: the state right after the VMCS area has merely been allocated and initialized, or after a VMCLEAR instruction.
- Working: the state after the CPU executes VMPTRLD on a VMCS, or after a VM exit; the CPU is still in VMX root mode.
- Active: the state of a VMCS that was loaded with VMPTRLD, after the same CPU then executed VMPTRLD for another VCPU's VMCS.
- Controlling: the VMX non-root state the CPU enters after executing VMLAUNCH on a VMCS.
Intel SDM section 31.6 describes the steps required to get a virtual machine running:
- Allocate a 4KB-aligned VMCS region in non-paged memory; its size is obtained from the IA32_VMX_BASIC MSR. In KVM this is done mainly by vmx_create_vcpu calling alloc_vmcs.
- Initialize the version identifier in the VMCS region (bits 30:0 of the first 4 bytes), also obtained from the IA32_VMX_BASIC MSR, and clear bit 31 of those first 4 bytes. In KVM this is done in alloc_vmcs_cpu.
- Execute VMCLEAR with the VMCS physical address as the operand; this sets the current CPU's working-VMCS pointer to FFFFFFFF_FFFFFFFFH. After the instruction completes, check that RFLAGS.CF = 0 and RFLAGS.ZF = 0. In KVM this is done mainly by loaded_vmcs_clear, which ultimately calls vmcs_clear.
- Execute VMPTRLD with the VMCS physical address; the CPU's working-VMCS pointer then points to the VMCS region. In KVM this is done by vmx_vcpu_load calling vmcs_load.
- Execute VMWRITE instructions to initialize the host-state area of the VMCS; after a VM exit, this area is used to re-establish the host CPU state and context. The host-state area includes the control registers (CR0, CR3 and CR4), the segment registers (CS, SS, DS, ES, FS, GS, TR), plus RSP, RIP and some MSRs. In KVM this is done mainly in vmx_vcpu_setup.
- Execute VMWRITE instructions to initialize the VM-exit control, VM-entry control, and VM-execution control areas of the VMCS. Some of these fields must be set according to what the VMX capability MSRs report, e.g. an MSR may report that certain bits can only be set to 0 on the current CPU. In KVM this is done mainly in vmx_vcpu_setup.
- Execute VMWRITE instructions to initialize the guest-state area; when the CPU enters VMX non-root mode, the guest context is created from this data. In KVM this is done mainly in vmx_vcpu_reset.
- The guest-state setup must satisfy the following:
- ① If the VM is to emulate a complete OS booting from the BIOS, the guest state must be set to the state of a physical CPU at power-on.
- ② Guest-state data that the VMM cannot intercept must be set up correctly, e.g. the general-purpose registers, the CR2 control register, the debug registers, the floating-point registers, and so on.
- Execute VMLAUNCH to put the CPU into VMX non-root mode; if this step fails, RFLAGS.CF or RFLAGS.ZF is set. In KVM this is done in vmx_vcpu_run.
The routine of the vcpu thread in QEMU is:
static void *qemu_kvm_cpu_thread_fn(void *arg)
{
    /* ... */
    r = kvm_init_vcpu(cpu);
    kvm_init_cpu_signals(cpu);

    /* signal CPU creation */
    cpu->created = true;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_guest_random_seed_thread_part2(cpu->random_seed);

    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu); /* the core of vcpu execution */
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu); /* when the vcpu cannot run, wait on cpu->halt_cond */
    } while (!cpu->unplug || cpu_can_run(cpu));
    /* ... */
    return NULL;
}
kvm_cpu_exec() is the core vcpu-running function in QEMU; its heart is also a do {} while () loop.
int kvm_cpu_exec(CPUState *cpu)
{
    /* ... */
    do {
        /* ... */
        kvm_arch_pre_run(cpu, run);
        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
        attrs = kvm_arch_post_run(cpu, run);

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            DPRINTF("handle_io\n");
            /* Called outside BQL */
            kvm_handle_io(run->io.port, attrs,
                          (uint8_t *)run + run->io.data_offset,
                          run->io.direction,
                          run->io.size,
                          run->io.count);
            ret = 0;
            break;
        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            /* Called outside BQL */
            address_space_rw(&address_space_memory,
                             run->mmio.phys_addr, attrs,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);
            ret = 0;
            break;
        /* ... */
        case KVM_EXIT_SYSTEM_EVENT:
        default:
            DPRINTF("kvm_arch_handle_exit\n");
            ret = kvm_arch_handle_exit(cpu, run);
            break;
        }
    } while (ret == 0);
    /* ... */
    return ret;
}
kvm_arch_pre_run first performs some pre-run work, such as injecting NMIs and SMIs, and then the VCPU's ioctl(KVM_RUN) is issued to set the CPU running. While handling this ioctl, the KVM module executes the appropriate VMX instructions to switch the physical CPU hosting this VCPU from VMX root mode to VMX non-root mode, and the CPU begins executing the guest's code. If some event inside the guest produces a VM exit, control returns to KVM; if KVM cannot handle the event, it is forwarded to QEMU. That is, when ioctl(KVM_RUN) returns, kvm_arch_post_run does some preliminary processing, after which the exit reason is read out of the kvm_run memory shared between QEMU and KVM and handled accordingly. An I/O exit, for instance, is dispatched via kvm_handle_io, which eventually invokes the callback of the device that registered that I/O port; note how much of the data used here comes from kvm_run. If the exit was caused by an MMIO access, address_space_rw is called instead; it finds which device registered the MMIO region and invokes its callbacks.
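For a feel of what this dispatch boils down to, here is a hedged sketch of what a bare-bones VMM (not QEMU itself) might do with a KVM_EXIT_IO from a guest writing to the COM1 port; the port number and the helper name are illustrative:

#include <stdio.h>
#include <linux/kvm.h>

/* 'run' is the mmap'ed struct kvm_run of the vcpu that just exited */
static void handle_exit_sketch(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
            /* the PIO data lives in the shared region at io.data_offset */
            putchar(*((char *)run + run->io.data_offset));
        break;
    default:
        break;
    }
}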
The relationship among QEMU, KVM, and the VM:
KVM runs the VCPU
kvm_vcpu_ioctl
KVM_RUN is dispatched by kvm_vcpu_ioctl() (registered as .unlocked_ioctl in the kvm_vcpu_fops shown earlier); the main work is ultimately done by the vcpu_run() function in arch/x86/kvm/x86.c.
How does kvm_vcpu_ioctl() ensure that it is the vcpu's own thread doing the processing? The function starts with the following check:
if (vcpu->kvm->mm != current->mm)
    return -EIO;

switch (ioctl) {
case KVM_RUN: {
    struct pid *oldpid;
    r = -EINVAL;
    if (arg)
        goto out;
    oldpid = rcu_access_pointer(vcpu->pid); /* the thread running this vcpu may have changed */
    if (unlikely(oldpid != task_pid(current))) {
        /* The thread running this VCPU changed. */
        struct pid *newpid;

        r = kvm_arch_vcpu_run_pid_change(vcpu);
        if (r)
            break;
        newpid = get_task_pid(current, PIDTYPE_PID);
        rcu_assign_pointer(vcpu->pid, newpid); /* if the thread changed, point vcpu->pid at the current thread */
        if (oldpid)
            synchronize_rcu();
        put_pid(oldpid);
    }
    /*
     * Here one could gather per-vcpu statistics or tag the thread running
     * the vcpu; but if vcpu statistics are gathered, is tagging the thread
     * still necessary?
     */
    r = kvm_arch_vcpu_ioctl_run(vcpu); /* enter the architecture-specific vcpu run code */
    trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
    break;
}
kvm_arch_vcpu_ioctl_run
Now enter kvm_arch_vcpu_ioctl_run(); the analysis here is for x86:
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
    struct kvm_run *kvm_run = vcpu->run;
    int r;

    vcpu_load(vcpu);
    /* ... */
    if (kvm_run->immediate_exit)
        r = -EINTR;
    else
        r = vcpu_run(vcpu); /* the main work happens in vcpu_run() */

out:
    kvm_put_guest_fpu(vcpu);
    if (kvm_run->kvm_valid_regs)
        store_regs(vcpu);
    post_kvm_run_save(vcpu);
    kvm_sigset_deactivate(vcpu);

    vcpu_put(vcpu);
    return r;
}
vcpu_load and vcpu_put
vcpu_load binds a vcpu to the physical cpu it will run on; vcpu_put does the opposite.
KVM defines a per-cpu variable, kvm_running_vcpu, which records the vcpu (if any) currently running on each physical cpu.
static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
vcpu_load() mainly assigns kvm_running_vcpu:
/*
 * Switches to specified vcpu, until a matching vcpu_put()
 */
void vcpu_load(struct kvm_vcpu *vcpu)
{
    int cpu = get_cpu(); /* disable preemption and return the cpu id */

    __this_cpu_write(kvm_running_vcpu, vcpu); /* point this cpu's kvm_running_vcpu at the current vcpu */
    preempt_notifier_register(&vcpu->preempt_notifier);
    kvm_arch_vcpu_load(vcpu, cpu);
    put_cpu(); /* re-enable preemption */
}
EXPORT_SYMBOL_GPL(vcpu_load);
vcpu_put() is used as the counterpart of vcpu_load():
void vcpu_put(struct kvm_vcpu *vcpu)
{
    preempt_disable();
    kvm_arch_vcpu_put(vcpu);
    preempt_notifier_unregister(&vcpu->preempt_notifier);
    __this_cpu_write(kvm_running_vcpu, NULL);
    preempt_enable();
}
EXPORT_SYMBOL_GPL(vcpu_put);
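The per-cpu variable is read back through kvm_get_running_vcpu(); roughly (a sketch of the helper in kvm_main.c, modulo version differences):

struct kvm_vcpu *kvm_get_running_vcpu(void)
{
    struct kvm_vcpu *vcpu;

    preempt_disable();                        /* stay on this cpu while reading */
    vcpu = __this_cpu_read(kvm_running_vcpu); /* the vcpu set by vcpu_load() */
    preempt_enable();

    return vcpu;
}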
vcpu_run
static int vcpu_run(struct kvm_vcpu *vcpu)
{
    /* ... */
    for (;;) {
        if (kvm_vcpu_running(vcpu)) {
            r = vcpu_enter_guest(vcpu); /* the vcpu may run: enter the guest */
        } else {
            r = vcpu_block(kvm, vcpu); /* the vcpu may not run: leaving polling aside, this ends in schedule(), yielding the CPU */
        }
        if (r <= 0)
            break;
        /* ... */
    }
    /* ... */
    return r;
}

/*
 * Two things are checked:
 * 1. whether vcpu->arch.mp_state is KVM_MP_STATE_RUNNABLE
 * 2. whether vcpu->arch.apf.halted indicates the guest touched memory pages
 *    that the host has swapped out (async page fault); a vcpu halted because
 *    of apf cannot run either
 */
static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
{
    if (is_guest_mode(vcpu))
        kvm_x86_ops.nested_ops->check_events(vcpu);

    return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
            !vcpu->arch.apf.halted);
}
If vcpu_run decides that the VCPU cannot run at the moment, it calls vcpu_block, which in turn calls kvm_vcpu_block; leaving the polling mechanism aside, kvm_vcpu_block calls schedule() to yield the CPU.
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
    /* ... */
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (kvm_vcpu_check_block(vcpu) < 0)
            break;
        waited = true;
        schedule();
    }
    /* ... */
}
vcpu_enter_guest
If it returns 1, vcpu_run() keeps going around its for loop; otherwise the value is returned to userspace.
/*
 * Returns 1 to let vcpu_run() continue the guest execution loop without
 * exiting to the userspace. Otherwise, the value will be returned to the
 * userspace.
 */
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
    /* ... */
    r = kvm_mmu_reload(vcpu);
    if (unlikely(r)) {
        goto cancel_injection;
    }

    preempt_disable(); /* disable preemption */

    kvm_x86_ops.prepare_guest_switch(vcpu); /* save host state so the host can resume normally after a VM exit */

    /*
     * Disable IRQs before setting IN_GUEST_MODE.  Posted interrupt
     * IPI are then delayed after guest entry, which ensures that they
     * result in virtual interrupt delivery.
     * External interrupt requests are masked on this CPU here.
     */
    local_irq_disable();
    vcpu->mode = IN_GUEST_MODE; /* enter guest mode */
    /* ... */
    trace_kvm_entry(vcpu->vcpu_id); /* kvm entry is traced here; kvm exit is traced in vmx_vcpu_run() */
    /* ... */
    exit_fastpath = kvm_x86_ops.run(vcpu); /* this enters vmx_vcpu_run() */
    /* ... */
    vcpu->arch.last_vmentry_cpu = vcpu->cpu;
    vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

    vcpu->mode = OUTSIDE_GUEST_MODE; /* leave guest mode */
    smp_wmb();

    kvm_x86_ops.handle_exit_irqoff(vcpu); /* handle external interrupts right after leaving the guest */

    /*
     * Consume any pending interrupts, including the possible source of
     * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
     * An instruction is required after local_irq_enable() to fully unblock
     * interrupts on processors that implement an interrupt shadow, the
     * stat.exits increment will do nicely.
     */
    kvm_before_interrupt(vcpu);
    local_irq_enable();
    ++vcpu->stat.exits; /* account this exit */
    local_irq_disable();
    kvm_after_interrupt(vcpu);

    if (lapic_in_kernel(vcpu)) {
        s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
        if (delta != S64_MIN) {
            trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
            vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
        }
    }

    local_irq_enable();
    preempt_enable();
    /* ... */
    r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath); /* little external-interrupt work remains by now; mostly accounts the exit reasons */
    return r;

cancel_injection:
    if (req_immediate_exit)
        kvm_make_request(KVM_REQ_EVENT, vcpu);
    kvm_x86_ops.cancel_injection(vcpu);
    if (unlikely(vcpu->arch.apic_attention))
        kvm_lapic_sync_from_vapic(vcpu);
out:
    return r;
}
This function drops into the vcpu's vmx_vcpu_run; by the time vmx_vcpu_run returns, a full VM entry / VM exit round trip has already completed.
vcpu->mode takes one of the following values:
enum {
    OUTSIDE_GUEST_MODE,
    IN_GUEST_MODE,
    EXITING_GUEST_MODE,
    READING_SHADOW_PAGE_TABLES,
};
While the CPU runs in guest mode, interrupts are masked: the CPU executing the virtual machine's code does not receive external interrupts, but an external interrupt does force the CPU to leave guest mode and enter VMX root mode. External interrupts are handled before handle_exit, so by the time handle_exit deals with an external-interrupt exit there is nothing substantive left to do beyond updating statistics.
vmx_vcpu_run
static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
    fastpath_t exit_fastpath;
    struct vcpu_vmx *vmx = to_vmx(vcpu);
    unsigned long cr3, cr4;

reenter_guest:
    /* Record the guest's net vcpu time for enforced NMI injections. */
    if (unlikely(!enable_vnmi &&
                 vmx->loaded_vmcs->soft_vnmi_blocked))
        vmx->loaded_vmcs->entry_time = ktime_get();

    /* Don't enter VMX if guest state is invalid, let the exit handler
       start emulation until we arrive back to a valid state */
    if (vmx->emulation_required)
        return EXIT_FASTPATH_NONE;

    if (vmx->ple_window_dirty) {
        vmx->ple_window_dirty = false;
        vmcs_write32(PLE_WINDOW, vmx->ple_window);
    }

    /*
     * We did this in prepare_switch_to_guest, because it needs to
     * be within srcu_read_lock.
     */
    WARN_ON_ONCE(vmx->nested.need_vmcs12_to_shadow_sync);

    if (kvm_register_is_dirty(vcpu, VCPU_REGS_RSP))
        vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
    if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
        vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);

    cr3 = __get_current_cr3_fast();
    if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
        vmcs_writel(HOST_CR3, cr3);
        vmx->loaded_vmcs->host_state.cr3 = cr3;
    }

    cr4 = cr4_read_shadow();
    if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
        vmcs_writel(HOST_CR4, cr4);
        vmx->loaded_vmcs->host_state.cr4 = cr4;
    }

    /* When single-stepping over STI and MOV SS, we must clear the
     * corresponding interruptibility bits in the guest state. Otherwise
     * vmentry fails as it then expects bit 14 (BS) in pending debug
     * exceptions being set, but that's not correct for the guest debugging
     * case. */
    if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
        vmx_set_interrupt_shadow(vcpu, 0);

    kvm_load_guest_xsave_state(vcpu);

    pt_guest_enter(vmx);

    atomic_switch_perf_msrs(vmx);

    if (enable_preemption_timer)
        vmx_update_hv_timer(vcpu);

    if (lapic_in_kernel(vcpu) &&
        vcpu->arch.apic->lapic_timer.timer_advance_ns)
        kvm_wait_lapic_expire(vcpu);

    /*
     * If this vCPU has touched SPEC_CTRL, restore the guest's value if
     * it's non-zero. Since vmentry is serialising on affected CPUs, there
     * is no need to worry about the conditional branch over the wrmsr
     * being speculatively taken.
     */
    x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);

    /* The actual VMENTER/EXIT is in the .noinstr.text section. */
    vmx_vcpu_enter_exit(vcpu, vmx);

    /*
     * We do not use IBRS in the kernel. If this vCPU has used the
     * SPEC_CTRL MSR it may have left it on; save the value and
     * turn it off. This is much more efficient than blindly adding
     * it to the atomic save/restore list. Especially as the former
     * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
     *
     * For non-nested case:
     * If the L01 MSR bitmap does not intercept the MSR, then we need to
     * save it.
     *
     * For nested case:
     * If the L02 MSR bitmap does not intercept the MSR, then we need to
     * save it.
     */
    if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
        vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);

    x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0);

    /* All fields are clean at this point */
    if (static_branch_unlikely(&enable_evmcs))
        current_evmcs->hv_clean_fields |=
            HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;

    if (static_branch_unlikely(&enable_evmcs))
        current_evmcs->hv_vp_id = vcpu->arch.hyperv.vp_index;

    /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
    if (vmx->host_debugctlmsr)
        update_debugctlmsr(vmx->host_debugctlmsr);

#ifndef CONFIG_X86_64
    /*
     * The sysexit path does not restore ds/es, so we must set them to
     * a reasonable value ourselves.
     *
     * We can't defer this to vmx_prepare_switch_to_host() since that
     * function may be executed in interrupt context, which saves and
     * restore segments around it, nullifying its effect.
     */
    loadsegment(ds, __USER_DS);
    loadsegment(es, __USER_DS);
#endif

    vmx_register_cache_reset(vcpu);

    pt_guest_exit(vmx);

    kvm_load_host_xsave_state(vcpu);

    vmx->nested.nested_run_pending = 0;
    vmx->idt_vectoring_info = 0;

    if (unlikely(vmx->fail)) {
        vmx->exit_reason = 0xdead;
        return EXIT_FASTPATH_NONE;
    }

    vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
    if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
        kvm_machine_check();

    trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);

    if (unlikely(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
        return EXIT_FASTPATH_NONE;

    vmx->loaded_vmcs->launched = 1;
    vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);

    vmx_recover_nmi_blocking(vmx);
    vmx_complete_interrupts(vmx);

    if (is_guest_mode(vcpu))
        return EXIT_FASTPATH_NONE;

    exit_fastpath = vmx_exit_handlers_fastpath(vcpu);
    if (exit_fastpath == EXIT_FASTPATH_REENTER_GUEST) {
        if (!kvm_vcpu_exit_request(vcpu)) {
            /*
             * FIXME: this goto should be a loop in vcpu_enter_guest,
             * but it would incur the cost of a retpoline for now.
             * Revisit once static calls are available.
             */
            if (vcpu->arch.apicv_active)
                vmx_sync_pir_to_irr(vcpu);
            goto reenter_guest;
        }
        exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED;
    }

    return exit_fastpath;
}
The function first writes a number of VMCS fields according to the VCPU's state and then executes the VMLAUNCH assembly (reached through vmx_vcpu_enter_exit in the listing above), placing the CPU in guest mode; the CPU then begins executing the virtual machine's code. When a VM exit occurs, execution resumes at the host-side return address (vmx_return).
VCPU Exit
x86 architecture
The VCPU's exit events are handled by kvm_x86_ops.handle_exit(), invoked in arch/x86/kvm/x86.c:
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
    /* ... */
    r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);
}
Exit reasons
#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000

#define EXIT_REASON_EXCEPTION_NMI 0
#define EXIT_REASON_EXTERNAL_INTERRUPT 1
#define EXIT_REASON_TRIPLE_FAULT 2
#define EXIT_REASON_INIT_SIGNAL 3

#define EXIT_REASON_INTERRUPT_WINDOW 7
#define EXIT_REASON_NMI_WINDOW 8
#define EXIT_REASON_TASK_SWITCH 9
#define EXIT_REASON_CPUID 10
#define EXIT_REASON_HLT 12
#define EXIT_REASON_INVD 13
#define EXIT_REASON_INVLPG 14
#define EXIT_REASON_RDPMC 15
#define EXIT_REASON_RDTSC 16
#define EXIT_REASON_VMCALL 18
#define EXIT_REASON_VMCLEAR 19
#define EXIT_REASON_VMLAUNCH 20
#define EXIT_REASON_VMPTRLD 21
#define EXIT_REASON_VMPTRST 22
#define EXIT_REASON_VMREAD 23
#define EXIT_REASON_VMRESUME 24
#define EXIT_REASON_VMWRITE 25
#define EXIT_REASON_VMOFF 26
#define EXIT_REASON_VMON 27
#define EXIT_REASON_CR_ACCESS 28
#define EXIT_REASON_DR_ACCESS 29
#define EXIT_REASON_IO_INSTRUCTION 30
#define EXIT_REASON_MSR_READ 31
#define EXIT_REASON_MSR_WRITE 32
#define EXIT_REASON_INVALID_STATE 33
#define EXIT_REASON_MSR_LOAD_FAIL 34
#define EXIT_REASON_MWAIT_INSTRUCTION 36
#define EXIT_REASON_MONITOR_TRAP_FLAG 37
#define EXIT_REASON_MONITOR_INSTRUCTION 39
#define EXIT_REASON_PAUSE_INSTRUCTION 40
#define EXIT_REASON_MCE_DURING_VMENTRY 41
#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
#define EXIT_REASON_APIC_ACCESS 44
#define EXIT_REASON_EOI_INDUCED 45
#define EXIT_REASON_GDTR_IDTR 46
#define EXIT_REASON_LDTR_TR 47
#define EXIT_REASON_EPT_VIOLATION 48
#define EXIT_REASON_EPT_MISCONFIG 49
#define EXIT_REASON_INVEPT 50
#define EXIT_REASON_RDTSCP 51
#define EXIT_REASON_PREEMPTION_TIMER 52
#define EXIT_REASON_INVVPID 53
#define EXIT_REASON_WBINVD 54
#define EXIT_REASON_XSETBV 55
#define EXIT_REASON_APIC_WRITE 56
#define EXIT_REASON_RDRAND 57
#define EXIT_REASON_INVPCID 58
#define EXIT_REASON_VMFUNC 59
#define EXIT_REASON_ENCLS 60
#define EXIT_REASON_RDSEED 61
#define EXIT_REASON_PML_FULL 62
#define EXIT_REASON_XSAVES 63
#define EXIT_REASON_XRSTORS 64
#define EXIT_REASON_UMWAIT 67
#define EXIT_REASON_TPAUSE 68
vmx_handle_exit()
The exit ultimately reaches vmx_handle_exit(), which dispatches the event to the matching handler:
/*
 * The exit handlers return 1 if the exit was handled fully and guest execution
 * may resume. Otherwise they set the kvm_run parameter to indicate what needs
 * to be done to userspace and return 0.
 */
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
    [EXIT_REASON_EXCEPTION_NMI]       = handle_exception_nmi, /* exceptions / non-maskable interrupts */
    [EXIT_REASON_EXTERNAL_INTERRUPT]  = handle_external_interrupt, /* always returns 1; no real work, can be ignored */
    [EXIT_REASON_TRIPLE_FAULT]        = handle_triple_fault, /* always returns 0: KVM exit shutdown */
    [EXIT_REASON_NMI_WINDOW]          = handle_nmi_window, /* always returns 1 */
    [EXIT_REASON_IO_INSTRUCTION]      = handle_io, /* as the name says: I/O access */
    [EXIT_REASON_CR_ACCESS]           = handle_cr, /* control-register access */
    [EXIT_REASON_DR_ACCESS]           = handle_dr, /* debug-register access */
    [EXIT_REASON_CPUID]               = kvm_emulate_cpuid, /* emulate cpuid; again just manipulates eax etc. */
    [EXIT_REASON_MSR_READ]            = kvm_emulate_rdmsr, /* emulate rdmsr; essentially manipulates EAX */
    [EXIT_REASON_MSR_WRITE]           = kvm_emulate_wrmsr, /* emulate wrmsr; manipulates the MSRs */
    [EXIT_REASON_INTERRUPT_WINDOW]    = handle_interrupt_window, /* always returns 1 */
    [EXIT_REASON_HLT]                 = kvm_emulate_halt, /* HLT instruction: halt the cpu */
    [EXIT_REASON_INVD]                = handle_invd, /* calls kvm_emulate_instruction */
    [EXIT_REASON_INVLPG]              = handle_invlpg, /* calls kvm_skip_emulated_instruction */
    [EXIT_REASON_RDPMC]               = handle_rdpmc, /* x86 rdpmc: read the PMU registers */
    [EXIT_REASON_VMCALL]              = handle_vmcall, /* vmcall instruction: calls kvm_emulate_hypercall */
    [EXIT_REASON_VMCLEAR]             = handle_vmx_instruction,
    [EXIT_REASON_VMLAUNCH]            = handle_vmx_instruction,
    [EXIT_REASON_VMPTRLD]             = handle_vmx_instruction,
    [EXIT_REASON_VMPTRST]             = handle_vmx_instruction,
    [EXIT_REASON_VMREAD]              = handle_vmx_instruction,
    [EXIT_REASON_VMRESUME]            = handle_vmx_instruction,
    [EXIT_REASON_VMWRITE]             = handle_vmx_instruction,
    [EXIT_REASON_VMOFF]               = handle_vmx_instruction,
    [EXIT_REASON_VMON]                = handle_vmx_instruction, /* handle_vmx_instruction always returns 1 */
    [EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold, /* register work; returns 1 */
    [EXIT_REASON_APIC_ACCESS]         = handle_apic_access, /* APIC controller */
    [EXIT_REASON_APIC_WRITE]          = handle_apic_write, /* returns 1 */
    [EXIT_REASON_EOI_INDUCED]         = handle_apic_eoi_induced, /* always returns 1 */
    [EXIT_REASON_WBINVD]              = handle_wbinvd, /* register work */
    [EXIT_REASON_XSETBV]              = handle_xsetbv, /* register work */
    [EXIT_REASON_TASK_SWITCH]         = handle_task_switch, /* emulates a task switch */
    [EXIT_REASON_MCE_DURING_VMENTRY]  = handle_machine_check, /* always returns 1 */
    [EXIT_REASON_GDTR_IDTR]           = handle_desc,
    [EXIT_REASON_LDTR_TR]             = handle_desc,
    [EXIT_REASON_EPT_VIOLATION]       = handle_ept_violation, /* EPT violation: the EPT analogue of a page fault */
    [EXIT_REASON_EPT_MISCONFIG]       = handle_ept_misconfig, /* handles EPT misconfiguration */
    [EXIT_REASON_PAUSE_INSTRUCTION]   = handle_pause, /* PAUSE */
    [EXIT_REASON_MWAIT_INSTRUCTION]   = handle_mwait, /* emulates MWAIT as a NOP */
    [EXIT_REASON_MONITOR_TRAP_FLAG]   = handle_monitor_trap, /* returns 1 */
    [EXIT_REASON_MONITOR_INSTRUCTION] = handle_monitor, /* emulates MONITOR as a NOP */
    [EXIT_REASON_INVEPT]              = handle_vmx_instruction,
    [EXIT_REASON_INVVPID]             = handle_vmx_instruction,
    [EXIT_REASON_RDRAND]              = handle_invalid_op, /* returns 1 */
    [EXIT_REASON_RDSEED]              = handle_invalid_op,
    [EXIT_REASON_PML_FULL]            = handle_pml_full, /* returns 1 */
    [EXIT_REASON_INVPCID]             = handle_invpcid, /* memory-related: PCIDs */
    [EXIT_REASON_VMFUNC]              = handle_vmx_instruction, /* returns 1 */
    [EXIT_REASON_PREEMPTION_TIMER]    = handle_preemption_timer, /* returns 1 */
    [EXIT_REASON_ENCLS]               = handle_encls, /* returns 1 */
};
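The dispatch at the end of vmx_handle_exit() is essentially an array lookup; roughly (a simplified sketch of the actual tail of the function):

/* simplified sketch of the tail of vmx_handle_exit() */
if (exit_reason >= kvm_vmx_max_exit_handlers ||
    !kvm_vmx_exit_handlers[exit_reason])
    goto unexpected_vmexit; /* unknown reason: report an internal error to userspace */

return kvm_vmx_exit_handlers[exit_reason](vcpu);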
Reasons for VM exit
Many events and instructions cause a VM exit. Some of them are permanently enabled, while others can be switched on and off through the VMCS control fields.
Unconditional reasons for VM exit include:
- CPUID
- RDMSR and WRMSR unless MSR bitmap is used
- most VMX instructions
- INIT signal
- SIPI signal - does not result in exit if the processor is not in wait-for-SIPI state
- triple fault
- task switches (hardware)
- VM entry failure
There are too many controllable exit reasons to describe each one separately, but most of them can be classified as one of:
- interrupts or interrupt windows
- I/O ports access
- memory access - controlled by EPT
- HLT/PAUSE and pre-emption timer - useful for multiple VMs running on one physical CPU
- changes to descriptor tables and control registers
- APIC access
kvm_userspace_exit
In kvm_vcpu_ioctl() in virt/kvm/kvm_main.c, while handling KVM_RUN, returning from the architecture-specific kvm_arch_vcpu_ioctl_run() means the kernel KVM layer can no longer handle the vcpu's exit on its own; handling must continue in QEMU, i.e. control must return from kernel space to user space.
r = kvm_arch_vcpu_ioctl_run(vcpu);
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
This return from kvm_arch_vcpu_ioctl_run() is referred to as a userspace exit; the exit reasons are:
#define KVM_EXIT_UNKNOWN 0
#define KVM_EXIT_EXCEPTION 1
#define KVM_EXIT_IO 2
#define KVM_EXIT_HYPERCALL 3
#define KVM_EXIT_DEBUG 4
#define KVM_EXIT_HLT 5
#define KVM_EXIT_MMIO 6
#define KVM_EXIT_IRQ_WINDOW_OPEN 7
#define KVM_EXIT_SHUTDOWN 8
#define KVM_EXIT_FAIL_ENTRY 9
#define KVM_EXIT_INTR 10
#define KVM_EXIT_SET_TPR 11
#define KVM_EXIT_TPR_ACCESS 12
#define KVM_EXIT_S390_SIEIC 13
#define KVM_EXIT_S390_RESET 14
#define KVM_EXIT_DCR 15 /* deprecated */
#define KVM_EXIT_NMI 16
#define KVM_EXIT_INTERNAL_ERROR 17
#define KVM_EXIT_OSI 18
#define KVM_EXIT_PAPR_HCALL 19
#define KVM_EXIT_S390_UCONTROL 20
#define KVM_EXIT_WATCHDOG 21
#define KVM_EXIT_S390_TSCH 22
#define KVM_EXIT_EPR 23
#define KVM_EXIT_SYSTEM_EVENT 24
#define KVM_EXIT_S390_STSI 25
#define KVM_EXIT_IOAPIC_EOI 26
#define KVM_EXIT_HYPERV 27
#define KVM_EXIT_ARM_NISV 28
VCPU Scheduling
Modern processors are usually symmetric multiprocessors, and the operating system is generally free to schedule a VCPU onto any physical CPU. Running a VCPU on varying physical CPUs affects VM performance: continuing a VCPU on the same physical CPU only requires executing VMRESUME, whereas switching to a different physical CPU requires VMCLEAR, VMPTRLD, and VMLAUNCH.
Simplified steps for scheduling a VCPU onto a different physical CPU (the actual KVM handling is more involved):
- Execute VMCLEAR on the source physical CPU; this guarantees that the VMCS cache data associated with that CPU is flushed to memory.
- On the destination physical CPU, execute VMPTRLD with the VCPU's VMCS physical address as the operand.
- On the destination physical CPU, execute VMLAUNCH.
Each physical CPU has a per-cpu pointer to a VMCS structure, current_vmcs, defined in vmx.c:
DEFINE_PER_CPU(struct vmcs *, current_vmcs);
Each VCPU is also allocated a VMCS, created in vmx_create_vcpu and stored in the vmcs member of vcpu_vmx's loaded_vmcs. VCPU scheduling is essentially the multiplexing of each physical CPU's per-cpu current_vmcs among all VCPUs: at any given moment it points to one of them.
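The switch can be seen in vmx_vcpu_load_vmcs(); a trimmed sketch (field names follow the 5.x sources, details vary by version):

/* trimmed from vmx_vcpu_load_vmcs(): migrating a vcpu to a new pcpu */
if (!already_loaded)
    loaded_vmcs_clear(vmx->loaded_vmcs);      /* VMCLEAR on the old cpu (via IPI) */

if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
    per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
    vmcs_load(vmx->loaded_vmcs->vmcs);        /* VMPTRLD on the new cpu */
}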
- The kernel calls vcpu_load to associate VCPU1 with PCPU1. On the first ioctl(KVM_RUN), vcpu_load is called at the beginning of kvm_vcpu_ioctl; if the vcpu is instead being scheduled back in, the association is made in kvm_sched_in, which goes through kvm_arch_vcpu_load to the concrete implementation (e.g. vmx_vcpu_load), as sketched after this list.
- While PCPU1 executes guest code, the current thread is protected from preemption and interrupts; an interrupt can nonetheless trigger a VM exit, bringing the guest back into the host. After the exit and the necessary processing, interrupts and preemption are re-enabled, so PCPU1 may then schedule another thread or another VCPU.
- When VCPU1's thread is preempted, kvm_sched_out is called. When VCPU1 is due to run again, the system may schedule it onto physical CPU2, in which case VCPU1's state must be associated with PCPU2.
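The hand-off is driven by the preempt notifiers registered in vcpu_load (kvm_preempt_ops.sched_in/sched_out, shown earlier). kvm_sched_in looks roughly like this (a sketch from kvm_main.c, modulo version differences):

static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
    struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

    WRITE_ONCE(vcpu->preempted, false);
    WRITE_ONCE(vcpu->ready, false);

    __this_cpu_write(kvm_running_vcpu, vcpu);
    kvm_arch_sched_in(vcpu, cpu);
    kvm_arch_vcpu_load(vcpu, cpu); /* re-associates the vcpu with this pcpu */
}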