Linux: Analysis of the sys_read System Call Path

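Note:

This article uses a call to read() as an example to describe some of the processing the system performs on the way down to the block driver layer. If anything here is incorrect or incomplete, feel free to leave a critical comment or email guopeixin@126.com to discuss it. Thanks in advance.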

I. The layered model a read call passes through

First, let's look at the layered model that a read call passes through:

[Figure: the layers traversed by a read request]

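As the figure shows, a read request to a disk first passes through the virtual file system layer (VFS layer), then the concrete filesystem layer (for example ext2), then the cache layer (page cache layer), the generic block layer, the I/O scheduler layer, and the block device driver layer, and finally reaches the physical block device layer.

The following summary, excerpted from another document, briefly describes the role of each layer: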

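• Virtual file system layer: hides the differences between the operations of the concrete filesystems below it and offers a single, uniform interface to the layers above. Because this layer exists, a device can be abstracted as a file, so operating on a device is as simple as operating on a file.

• Concrete filesystem layer: different filesystems (for example ext2 and NTFS) carry out their operations differently, and each filesystem defines its own set of operations. For more on filesystems, see the references.

• Cache layer: introduced to improve the performance of disk access on Linux. The cache layer keeps part of the on-disk data in memory. When a request arrives and the data is present in the cache and up to date, it is handed straight to the user program, which avoids touching the underlying disk and improves performance.

• Generic block layer: its main job is to receive the disk requests issued by the upper layers and eventually issue I/O requests. This layer hides the characteristics of the underlying hardware block devices and provides a generic, abstract view of block devices.

• I/O scheduler layer: receives the I/O requests issued by the generic block layer, queues them, and tries to merge adjacent requests (when the data of two requests is adjacent on disk). Following the configured scheduling algorithm, it then calls back the request handling function supplied by the driver layer to process each I/O request.

• Driver layer: each driver corresponds to a concrete physical block device. It takes I/O requests from the layer above and, according to the information in each request, drives the data transfer by sending commands to the controller of the block device.

• Device layer: consists of the concrete physical devices, which define the specifications for operating them.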
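The merging step described for the I/O scheduler layer can be illustrated with a small user-space sketch. This is not kernel code: struct io_req and merge_adjacent() are hypothetical names made up for this example, and the requests are assumed to be already sorted by start sector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified I/O request: a run of sectors on disk. */
struct io_req {
    unsigned long sector;   /* first sector of the request */
    unsigned long nsect;    /* number of sectors */
};

/*
 * Merge requests that are contiguous on disk, in place.
 * Assumes reqs[] is sorted by start sector. Returns the new count.
 */
size_t merge_adjacent(struct io_req *reqs, size_t n)
{
    size_t out = 0;

    assert(reqs != NULL || n == 0);
    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (reqs[out].sector + reqs[out].nsect == reqs[i].sector)
            reqs[out].nsect += reqs[i].nsect;   /* extend the previous request */
        else
            reqs[++out] = reqs[i];              /* not adjacent: keep separate */
    }
    return out + 1;
}
```

With the requests {0,8}, {8,8}, {32,8}, {40,4} the sketch leaves two requests, {0,16} and {32,12}. A real scheduler also handles front merges, size limits, and sorting, which are left out here.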

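II. sys_read, the starting point of the system call

1. Analysis of the sys_read code

sys_read is registered in the system call table as the kernel entry point of the read() system call, and calls to it can also be seen in many system modules.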
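From user space, sys_read is reached through the read() library wrapper. As a minimal sketch of the caller's side, the helper below (read_full() is a hypothetical name, not part of any library) loops over read() to cope with short reads and EINTR:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Read up to count bytes from fd, retrying on EINTR and on short reads.
 * Returns the number of bytes read (less than count only at end of file),
 * or -1 on error with errno set by read().
 */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < count) {
        /* Each call traps into the kernel and lands in sys_read(). */
        ssize_t n = read(fd, p + done, count - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The fd, buf and count arguments passed here are exactly the three parameters that sys_read() receives on the kernel side.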

The code of sys_read(), and of vfs_read() which it calls, is as follows:

asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        loff_t pos = file_pos_read(file);
        ret = vfs_read(file, buf, count, &pos);
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }

    return ret;
}

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;

    if (!(file->f_mode & FMODE_READ))
        return -EBADF;
    if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
        return -EINVAL;
    if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
        return -EFAULT;

    ret = rw_verify_area(READ, file, pos, count);
    if (ret >= 0) {
        count = ret;
        if (file->f_op->read)
            ret = file->f_op->read(file, buf, count, pos);
        else
            ret = do_sync_read(file, buf, count, pos);
        if (ret > 0) {
            fsnotify_access(file->f_path.dentry);
            add_rchar(current, ret);
        }
        inc_syscr(current);
    }

    return ret;
}

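As the code shows, the call stack is sys_read() → vfs_read() → file->f_op->read(). file->f_op->read is a function pointer that the concrete filesystem supplies to the VFS through its file_operations table; for the ext2 filesystem discussed in this article it is do_sync_read, which in turn calls file->f_op->aio_read.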
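The dispatch through file->f_op can be mimicked in user space. The sketch below is a toy model only: every toy_* name is invented for this example, the "file" is an in-memory string, and error handling is reduced to a single -1. It keeps the shape of the kernel path: load the position, call the read hook (falling back to a synchronous wrapper around aio_read), store the position back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Toy counterpart of struct file_operations: only the two read hooks. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t count, long *pos);
    long (*aio_read)(struct toy_file *f, char *buf, size_t count, long *pos);
};

struct toy_file {
    const struct toy_file_ops *f_op;
    const char *data;   /* file contents */
    size_t size;
    long pos;           /* current file position */
};

/* "Filesystem" read method: copy from the in-memory contents. */
static long toy_fs_aio_read(struct toy_file *f, char *buf, size_t count, long *pos)
{
    size_t left = (size_t)*pos < f->size ? f->size - (size_t)*pos : 0;

    if (count > left)
        count = left;
    memcpy(buf, f->data + *pos, count);
    *pos += (long)count;
    return (long)count;
}

/* Synchronous wrapper, playing the role do_sync_read plays for ext2. */
static long toy_sync_read(struct toy_file *f, char *buf, size_t count, long *pos)
{
    return f->f_op->aio_read(f, buf, count, pos);
}

static const struct toy_file_ops toy_fs_ops = {
    .read = toy_sync_read,
    .aio_read = toy_fs_aio_read,
};

/* Same shape as sys_read() + vfs_read(): load pos, dispatch, store pos. */
long toy_read(struct toy_file *f, char *buf, size_t count)
{
    long pos = f->pos;
    long ret;

    assert(f->f_op != NULL);
    if (!f->f_op->read && !f->f_op->aio_read)
        return -1;
    ret = f->f_op->read ? f->f_op->read(f, buf, count, &pos)
                        : toy_sync_read(f, buf, count, &pos);
    f->pos = pos;
    return ret;
}
```

The caller never names the filesystem's function directly; swapping in a different toy_file_ops table changes the behaviour without touching toy_read(), which is the point of the f_op indirection.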

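III. The role of the ext2 filesystem in the sys_read call path

1. How ext2's file_operations interface is registered

The ext2 module initialization function init_ext2_fs() registers the filesystem type with register_filesystem(). When an ext2 filesystem is mounted, the call stack is ext2_get_sb() → ext2_fill_super() → ext2_iget(), and ext2_iget() sets the inode's file_operations pointer (inode->i_fop) to ext2_file_operations. The interfaces are defined as follows: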

/*
 * We have mostly NULL's here: the current defaults are ok for
 * the ext2 filesystem.
 */
const struct file_operations ext2_file_operations = {
    .llseek = generic_file_llseek,
    .read = do_sync_read,
    .write = do_sync_write,
    .aio_read = generic_file_aio_read,
    .aio_write = generic_file_aio_write,
    .unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = ext2_compat_ioctl,
#endif
    .mmap = generic_file_mmap,
    .open = generic_file_open,
    .release = ext2_release_file,
    .fsync = ext2_sync_file,
    .splice_read = generic_file_splice_read,
    .splice_write = generic_file_splice_write,
};

const struct address_space_operations ext2_aops = {
    .readpage = ext2_readpage,
    .readpages = ext2_readpages,
    .writepage = ext2_writepage,
    .sync_page = block_sync_page,
    .write_begin = ext2_write_begin,
    .write_end = generic_write_end,
    .bmap = ext2_bmap,
    .direct_IO = ext2_direct_IO,
    .writepages = ext2_writepages,
    .migratepage = buffer_migrate_page,
    .is_partially_uptodate = block_is_partially_uptodate,
};

const struct address_space_operations ext2_nobh_aops = {
    .readpage = ext2_readpage,
    .readpages = ext2_readpages,
    .writepage = ext2_nobh_writepage,
    .sync_page = block_sync_page,
    .write_begin = ext2_nobh_write_begin,
    .write_end = nobh_write_end,
    .bmap = ext2_bmap,
    .direct_IO = ext2_direct_IO,
    .writepages = ext2_writepages,
    .migratepage = buffer_migrate_page,
};

The relevant code in ext2_iget() is as follows:

struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
{
    ...
    if (S_ISREG(inode->i_mode)) {
        inode->i_op = &ext2_file_inode_operations;
        if (ext2_use_xip(inode->i_sb)) {
            inode->i_mapping->a_ops = &ext2_aops_xip;
            inode->i_fop = &ext2_xip_file_operations;
        } else if (test_opt(inode->i_sb, NOBH)) {
            inode->i_mapping->a_ops = &ext2_nobh_aops;
            inode->i_fop = &ext2_file_operations;
        } else {
            inode->i_mapping->a_ops = &ext2_aops;
            inode->i_fop = &ext2_file_operations;
        }
    } else if (S_ISDIR(inode->i_mode)) {
        inode->i_op = &ext2_dir_inode_operations;
        inode->i_fop = &ext2_dir_operations;
        if (test_opt(inode->i_sb, NOBH))
            inode->i_mapping->a_ops = &ext2_nobh_aops;
        else
            inode->i_mapping->a_ops = &ext2_aops;
    } else if (S_ISLNK(inode->i_mode)) {
        if (ext2_inode_is_fast_symlink(inode))
            inode->i_op = &ext2_fast_symlink_inode_operations;
        else {
            inode->i_op = &ext2_symlink_inode_operations;
            if (test_opt(inode->i_sb, NOBH))
                inode->i_mapping->a_ops = &ext2_nobh_aops;
            else
                inode->i_mapping->a_ops = &ext2_aops;
        }
    } else {
        inode->i_op = &ext2_special_inode_operations;
        if (raw_inode->i_block[0])
            init_special_inode(inode, inode->i_mode,
                old_decode_dev(le32_to_cpu(raw_inode->i_block[0])));
        else
            init_special_inode(inode, inode->i_mode,
                new_decode_dev(le32_to_cpu(raw_inode->i_block[1])));
    }
    ...
}
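The branching in ext2_iget() is a dispatch on the file type stored in the inode mode. A user-space sketch of the same test, using the S_ISREG/S_ISDIR/S_ISLNK macros from <sys/stat.h>, might look like the following. The enum and toy_pick_ops() are hypothetical names for this illustration, and the xip and NOBH mount-option branches are deliberately ignored.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

/* Which family of operations ext2_iget() attaches (simplified). */
enum toy_ops {
    TOY_FILE_OPS,       /* regular file: ext2_file_operations */
    TOY_DIR_OPS,        /* directory: ext2_dir_operations */
    TOY_SYMLINK_OPS,    /* symlink: one of the symlink inode operations */
    TOY_SPECIAL_OPS     /* device, FIFO, socket: init_special_inode() */
};

/* Classify a mode the same way the if/else chain in ext2_iget() does. */
enum toy_ops toy_pick_ops(mode_t mode)
{
    if (S_ISREG(mode))
        return TOY_FILE_OPS;
    if (S_ISDIR(mode))
        return TOY_DIR_OPS;
    if (S_ISLNK(mode))
        return TOY_SYMLINK_OPS;
    return TOY_SPECIAL_OPS;
}
```

Only the regular-file branch matters for the read path followed in this article, since that is where inode->i_fop ends up pointing at ext2_file_operations.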

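2. The call stack of the read path in this layer

IV. What the page cache does in the sys_read call path

1. What the page cache does in the sys_read call path

In the ext2_iget() code quoted above we can see the assignment inode->i_mapping->a_ops = &ext2_aops; this is in fact where the page cache interfaces are registered.

As mentioned in the previous part, the ext2 portion of the call ends at mapping->a_ops->readpage(file, page), which actually executes ext2_aops.readpage(file, page), that is, ext2_readpage.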
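The relationship between the cache and a_ops->readpage can be sketched in user space as follows. This is a toy model under loud assumptions: all toy_* names are invented, pages are 16 bytes, the "disk" is a memset, and nothing here reflects the real page cache data structures. It only shows that readpage runs on a miss and is skipped on a hit.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_NR_PAGES  4

struct toy_mapping;

/* Toy counterpart of address_space_operations: only readpage. */
struct toy_aops {
    int (*readpage)(struct toy_mapping *m, unsigned long index, char *page);
};

struct toy_mapping {
    const struct toy_aops *a_ops;
    char pages[TOY_NR_PAGES][TOY_PAGE_SIZE];  /* cached page contents */
    int uptodate[TOY_NR_PAGES];               /* 1 if pages[i] is valid */
    int disk_reads;                           /* how often readpage ran */
};

/* Stand-in for ext2_readpage(): fill the page from a fake "disk". */
static int toy_readpage(struct toy_mapping *m, unsigned long index, char *page)
{
    memset(page, 'A' + (int)index, TOY_PAGE_SIZE);
    m->disk_reads++;
    return 0;
}

static const struct toy_aops toy_fs_aops = { .readpage = toy_readpage };

/* Return the cached page, calling a_ops->readpage only on a miss. */
const char *toy_get_page(struct toy_mapping *m, unsigned long index)
{
    assert(index < TOY_NR_PAGES);
    if (!m->uptodate[index]) {
        if (m->a_ops->readpage(m, index, m->pages[index]) != 0)
            return NULL;
        m->uptodate[index] = 1;
    }
    return m->pages[index];
}
```

Asking for the same page twice increments disk_reads only once, which is the saving the cache layer is meant to provide.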

The call stack for ext2_readpage() is as follows:

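V. The roles of the generic block layer and the I/O scheduler layer

This part is relatively simple and can be found directly by following the call to submit_bio(). The relevant call stack is as follows: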

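VI. What the driver does

After all this analysis we still haven't seen the block device driver take part. Don't worry, this is where it comes in.

A block device driver typically calls blk_init_queue(), a function exported by the block layer, to register the function that performs the actual I/O, in the form q->request_fn = rfn.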
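The registration pattern q->request_fn = rfn is an ordinary callback registration, and it can be sketched outside the kernel. In the toy model below every toy_* name is hypothetical; toy_init_queue() only plays the role that blk_init_queue() plays and is not its real signature.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 8

struct toy_queue;
typedef void (*toy_request_fn)(struct toy_queue *q);

struct toy_queue {
    toy_request_fn request_fn;                /* handler set by the "driver" */
    unsigned long pending[TOY_QUEUE_DEPTH];   /* pending sector numbers */
    size_t count;                             /* number of pending requests */
    unsigned long completed;                  /* requests handled so far */
};

/* Plays the role of blk_init_queue(): remember the driver's handler. */
void toy_init_queue(struct toy_queue *q, toy_request_fn rfn)
{
    assert(rfn != NULL);
    q->request_fn = rfn;    /* the q->request_fn = rfn step */
    q->count = 0;
    q->completed = 0;
}

/* Block-layer side: queue a request, then call back into the driver. */
int toy_submit(struct toy_queue *q, unsigned long sector)
{
    if (q->count == TOY_QUEUE_DEPTH)
        return -1;
    q->pending[q->count++] = sector;
    q->request_fn(q);
    return 0;
}

/* Driver side: drain every pending request. */
void toy_driver_request(struct toy_queue *q)
{
    while (q->count > 0) {
        q->count--;         /* a real driver would program the hardware here */
        q->completed++;
    }
}
```

A driver in this model calls toy_init_queue(&q, toy_driver_request) once at setup; from then on the upper layer reaches the driver only through q->request_fn.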