C++ Application Network Programming, Part 12: An Analysis of the epoll Model on Linux

I. How epoll works

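The previous article covered the basics of epoll. This one looks at how it works internally, so that its runtime behavior is properly understood. To analyze epoll, it helps to first establish where it sits in the overall network communication path, which gives a clearer picture of the whole.
As the earlier series on network communication (DPDK and others) showed, network communication breaks down into three parts: hardware, drivers, and software. Drivers are separated from software here because, as later developments show, this layer matters a great deal in its own right. Software in turn splits into two layers, kernel and application.
So where does epoll sit? Strictly speaking, it belongs to the kernel. The interfaces developers normally use are the calls the kernel exposes to applications. This article analyzes only that part and skips the rest for now.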
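As earlier articles showed, I/O multiplexing has two main concerns:
1. The effect of network fds on communication
This has two aspects: how a rapidly growing number of fds affects efficiency, and the limit on the number of fds. epoll uses a red-black tree plus a doubly linked list that buffers ready items. Why is epoll described as O(1) when red-black tree operations are O(log n)? The O(1) claim refers mainly to epoll_wait: it only looks at the ready list rdllist, and everything on that list is ready, so no time is spent on idle fds and the cost does not depend on how many fds are being monitored (copying out k ready events still costs O(k)). To make this possible, a callback is registered on the fd's wait queue when the fd is added. When the fd becomes ready, the callback appends its entry to rdllist, and appending to a linked list is O(1). The O(log n) cost of the red-black tree is paid only in epoll_ctl, when fds are added, modified, or removed. The data definition makes this clear: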

struct eventpoll {
    ......
    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    struct list_head rdllist;
    ......
    /* RB tree root used to store monitored fd structs */
    struct rb_root_cached rbr;
    struct epitem *ovflist;
    .....
};
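
To make the division of labor between the tree and the ready list concrete, here is a minimal user-space sketch. It is a toy model under stated assumptions, not kernel code: ToyEpoll, Item, add, on_ready and wait are hypothetical names, std::map stands in for the red-black tree (it is typically implemented as one), and std::list stands in for rdllist.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <vector>

// Toy user-space model of the two structures. Every name is hypothetical.
class ToyEpoll {
public:
    struct Item {      // loosely plays the role of struct epitem
        int fd;
        bool queued;   // true while the item sits on the ready list
    };

    // Like epoll_ctl(EPOLL_CTL_ADD): O(log n) insert into the ordered map.
    // Returns a stable handle, or nullptr if fd is already registered.
    Item* add(int fd) {
        auto res = items_.emplace(fd, Item{fd, false});
        return res.second ? &res.first->second : nullptr;
    }

    // Like the readiness callback: it receives the item directly, so no tree
    // lookup is needed, and appending to the list is O(1).
    void on_ready(Item* item) {
        assert(item != nullptr);
        if (!item->queued) {   // never queue the same item twice
            item->queued = true;
            ready_.push_back(item);
        }
    }

    // Like epoll_wait with a zero timeout: touches only the ready list, so
    // the cost depends on the number of ready items, not on items_.size().
    std::vector<int> wait(std::size_t max_events) {
        std::vector<int> out;
        while (!ready_.empty() && out.size() < max_events) {
            Item* item = ready_.front();
            ready_.pop_front();
            item->queued = false;
            out.push_back(item->fd);
        }
        return out;
    }

private:
    std::map<int, Item> items_;   // interest set, ordered by fd
    std::list<Item*> ready_;      // ready list
};
```

With 1000 registered fds and two of them marked ready, wait(16) visits exactly two list nodes. The map is touched only by add, which mirrors the split between epoll_ctl and epoll_wait described above.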

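2. Copying data between the kernel and user space
If the first point is about controlling the management process, this one optimizes the actual work. A common claim is that epoll shares fd data between the kernel and user space through something like mmap. In fact epoll_wait still copies the ready events into the buffer supplied by the caller. The saving comes from elsewhere: the set of monitored fds is handed to the kernel once through epoll_ctl and stays there, so it is not passed in again on every wait, and only the ready events are copied out. That removes a very expensive repeated step. It is like a checkpoint at a mountain pass: inspecting every traveler every time is slow and laborious, while waving through those already cleared is much faster.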
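
The following sketch shows this from the application side: fds are registered once with epoll_ctl, and a later epoll_wait reports only the ones that are ready. It assumes Linux and uses pipes in place of sockets to stay self-contained. The function name ready_indices and its parameters are hypothetical.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Registers the read ends of n pipes once, makes only pipe number `hot`
// readable (pass -1 for none), and returns the indices reported by a single
// non-blocking epoll_wait call. Hypothetical helper for illustration.
std::vector<int> ready_indices(int n, int hot) {
    std::vector<int> result;
    int epfd = epoll_create1(0);
    if (epfd < 0) return result;

    std::vector<int> rd, wr;
    for (int i = 0; i < n; ++i) {
        int p[2];
        if (pipe(p) != 0) break;
        rd.push_back(p[0]);
        wr.push_back(p[1]);
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.u32 = static_cast<std::uint32_t>(i);  // remember the index
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);    // registered once
    }

    bool wrote = true;
    if (hot >= 0 && hot < static_cast<int>(wr.size())) {
        char c = 'x';
        wrote = (write(wr[hot], &c, 1) == 1);  // make exactly one fd ready
    }

    if (wrote) {
        epoll_event out[8];
        // Timeout 0: return immediately with whatever is ready right now.
        int got = epoll_wait(epfd, out, 8, 0);
        for (int i = 0; i < got; ++i)
            result.push_back(static_cast<int>(out[i].data.u32));
    }

    for (std::size_t i = 0; i < rd.size(); ++i) {
        close(rd[i]);
        close(wr[i]);
    }
    close(epfd);
    return result;
}
```

For example, ready_indices(64, 17) registers 64 fds but the wait call reports a single event, the one for index 17. The interest set is never passed to epoll_wait.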

II. Under the hood

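When epoll_create is called, the kernel creates an eventpoll structure, the one shown in the code above. Two of its members matter most: rdllist and rbr. In the same call the kernel also creates a file node and returns an fd for it (remember that on Linux everything is a file). The red-black tree rooted at rbr stores the sockets, that is, the network fds, passed in through epoll_ctl. Its type, rb_root_cached, is a red-black tree root that additionally caches the leftmost node.
To get at ready fds (sockets) quickly, the kernel also maintains rdllist. When epoll_wait is called, it only has to check whether rdllist is empty. If it is not, the corresponding events are returned to the application. If it is, the caller sleeps until an event arrives or the timeout expires.
Likewise, for each monitored I/O event, epoll sets up an event-trigger relationship between the kernel and the NIC driver, usually through a callback. The callback is defined in fs/eventpoll.c: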

static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;
    __poll_t pollflags = key_to_poll(key);
    unsigned long flags;
    int ewake = 0;
    ...
    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happen during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
        if (chain_epi_lockless(epi))
            ep_pm_stay_awake_rcu(epi);
    } else if (!ep_is_linked(epi)) {
        /* In the usual case, add event to ready list. */
        if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
            ep_pm_stay_awake_rcu(epi);
    }
    ...
}

In the kernel, each fd registered for events has one corresponding data structure:

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 * Avoid increasing the size of this struct, there can be many thousands
 * of these on a server and we do not want this to take another cache line.
 */
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };

    /* List header used to link this structure to the eventpoll ready list */
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    struct epoll_filefd ffd;

    /* List containing poll wait queues */
    struct eppoll_entry *pwqlist;

    /* The "container" of this item */
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct hlist_node fllink;

    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;

    /* The structure that describe the interested events and the source fd */
    struct epoll_event event;
};

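This makes the underlying logic fairly clear.
With these data structures and the event handling in place, the kernel's support for epoll comes down to three parts: managing the underlying data structures (inserting into and deleting from the red-black tree, maintaining the doubly linked list, and so on), monitoring I/O events (event listening, the event queue used for asynchronous wakeup, and so on), and copying the communication data.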

III. The epoll workflow

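The overall flow of communication under the epoll model looks like this (server side):
1. Create a socket, that is, the file descriptor (or file handle) used for network operations
2. bind it to the chosen address
3. Start listening with listen
4. Call epoll_create to create the epoll file descriptor used by the later steps. Its size parameter is now effectively ignored (it only has to be > 0)
5. Call epoll_ctl to register epoll events (add, delete, modify). These are what get stored in the red-black tree described earlier
6. Call epoll_wait to wait for epoll events, which wake up the working task (for reading, writing, and so on)
7. When an event fires, call the matching handler and process the event or data
8. Pass the event or data on to the upper application layer
9. Go back to step 6 and keep waiting
10. Exit on receiving an exit signal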
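This is only the main flow. Settings such as blocking versus non-blocking mode and the trigger mode (LT or ET) are not listed, but they need attention in real code.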
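
Steps 1 through 9 can be sketched as a single function. This is a minimal illustration, not a production server: echo_once is a hypothetical name, the test client lives in the same thread and connects over loopback, the default LT mode is used, and the message is assumed short enough to arrive in one read. Step 10 (signal handling) is left out.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <string>

// Echoes one short message over loopback and returns what the client got
// back ("" on any failure). Hypothetical helper for illustration only.
std::string echo_once(const std::string& msg) {
    // Steps 1-3: socket, bind (port 0 lets the kernel pick a port), listen.
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t len = sizeof(addr);
    if (lfd < 0 ||
        bind(lfd, reinterpret_cast<sockaddr*>(&addr), len) != 0 ||
        listen(lfd, 16) != 0 ||
        getsockname(lfd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
        if (lfd >= 0) close(lfd);
        return "";
    }
    // Step 4: create the epoll instance. Step 5: register the listener.
    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = lfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    // In-process test client: the kernel completes the handshake in the
    // listen backlog, so connect returns before accept is called.
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    connect(cfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    send(cfd, msg.data(), msg.size(), 0);

    bool echoed = false;
    while (!echoed) {
        epoll_event events[8];
        int n = epoll_wait(epfd, events, 8, 1000);  // step 6
        if (n <= 0) break;                          // timeout or error
        for (int i = 0; i < n && !echoed; ++i) {
            int fd = events[i].data.fd;
            if (fd == lfd) {                        // step 7: new connection
                int conn = accept(lfd, nullptr, nullptr);
                if (conn < 0) continue;
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {                                // steps 7-8: handle data
                char buf[256];
                ssize_t r = recv(fd, buf, sizeof(buf), 0);
                if (r > 0) {
                    send(fd, buf, static_cast<std::size_t>(r), 0);
                    echoed = true;
                }
                epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr);
                close(fd);
            }
        }                                           // step 9: loop again
    }
    std::string reply;
    if (echoed) {
        char buf[256];
        ssize_t r = recv(cfd, buf, sizeof(buf), 0);
        if (r > 0) reply.assign(buf, static_cast<std::size_t>(r));
    }
    close(cfd);
    close(epfd);
    close(lfd);
    return reply;
}
```

Calling echo_once("ping") walks the whole path once: the listener event leads to accept, the connection is registered, and the next epoll_wait reports it readable. A real server would use non-blocking sockets and handle partial reads and writes.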
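The flow above looks simple enough, as if there were nothing to it. Once you actually start writing the code, though, details turn up everywhere. How do you respond quickly to a triggered event, copy the data out, and process it? How do you make sure no data is lost? How do you handle stream data (the familiar "sticky packet" problem, that is, message framing over TCP)? In ET mode, how do you know the data has been fully read? How do you keep write events from firing continuously? On multi-core servers in particular, how do you use every core well without creating hot spots? These are all important considerations. They all rest on one premise, though: your server has to handle a high level of concurrency. Otherwise none of this matters in practice.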
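
For the ET question, the usual approach is to make the fd non-blocking and keep reading until read fails with EAGAIN, because no further notification arrives for data that was already pending. The sketch below assumes Linux and uses a pipe in place of a socket to stay self-contained. drain, et_demo and EtResult are hypothetical names.

```cpp
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Reads from a non-blocking fd until EAGAIN/EWOULDBLOCK or EOF.
// Returns the total number of bytes read, or -1 on a real error.
long drain(int fd) {
    long total = 0;
    char buf[512];   // deliberately small, so several reads are needed
    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));
        if (r > 0) { total += r; continue; }
        if (r == 0) return total;             // writer closed: EOF
        if (errno == EINTR) continue;         // interrupted, retry
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;  // empty
        return -1;
    }
}

struct EtResult {
    long drained;      // bytes read after the single ET notification
    int events_after;  // epoll_wait result once the fd has been drained
};

// Hypothetical demo: one write, one edge-triggered event, one full drain.
EtResult et_demo(std::size_t nbytes) {
    EtResult res{-1, -1};
    int p[2];
    if (pipe(p) != 0) return res;
    fcntl(p[0], F_SETFL, fcntl(p[0], F_GETFL) | O_NONBLOCK);

    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET;   // edge-triggered read interest
    ev.data.fd = p[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);

    std::string payload(nbytes, 'x');
    ssize_t w = write(p[1], payload.data(), payload.size());

    epoll_event out;
    if (w == static_cast<ssize_t>(nbytes) &&
        epoll_wait(epfd, &out, 1, 1000) == 1) {
        res.drained = drain(out.data.fd);    // read everything now
    }
    // Nothing is pending any more, so a non-blocking wait reports no event.
    res.events_after = epoll_wait(epfd, &out, 1, 0);

    close(p[0]);
    close(p[1]);
    close(epfd);
    return res;
}
```

With et_demo(10000), the single event is followed by about twenty reads of up to 512 bytes each, and the loop stops only at EAGAIN. Stopping after the first read would leave data in the buffer with no new event to signal it.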