文章目录
- 前言
- 一、自旋锁
- 1.1 简介
- 1.2 API
- 1.2.1 spin_lock/spin_unlock
- 1.2.2 spin_lock_irq/spin_unlock_irq
- 1.2.3 spin_lock_irqsave/spin_unlock_irqrestore
- 1.2.4 spin_lock_bh/spin_unlock_bh
- 1.2.5 补充
- 二、自旋锁原理
- 三、自旋锁在内核的使用
- 3.1 struct file
- 3.2 struct dentry
- 3.3 struct inode
- 3.4 struct super_block
- 参考资料
前言
在软件工程中,自旋锁是一种用于多线程应用程序中的同步原语,用于保护共享资源。当一个线程尝试获取自旋锁但发现它已被另一个线程持有时,该线程不会立即进入睡眠状态(如传统互斥锁的情况),而是会在一个循环中持续“自旋”,不断检查锁是否可用。这种行为被称为忙等待。
自旋锁能够避免操作系统进程重新调度或上下文切换的开销,因此在线程可能仅会短暂阻塞的情况下,自旋锁非常高效。出于这个原因,操作系统内核经常使用自旋锁。
对于操作系统内核:
在SMP系统上,各个CPU可能同时处于核心态,在理论上可以操作所有现存的数据结构。为阻止CPU彼此干扰,需要通过锁保护内核的某些范围。锁可以确保每次只能有一个CPU访问被保护的范围。
内核可以不受限制地访问整个地址空间。在多处理器系统上(或类似地,在启用了内核抢占的单处理器系统上),这会引起一些问题。如果几个处理器同时处于核心态,则理论上它们可以同时访问同一个数据结构。
一、自旋锁
1.1 简介
当内核中发生资源访问冲突时,有两种加锁方案可以选择:
(1)一个是原地等待
(2)一个是挂起当前进程,调度其他进程执行(睡眠)
自旋锁就是原地等待,直到锁释放。
中断上下文要用锁,首选 spinlock。
自旋锁:这些是最常用的锁选项。它们用于短期保护某段代码,以防止其他处理器的访问。在内核等待自旋锁释放时,会重复检查是否能获取锁,而不会进入睡眠状态(忙等待)。当然,如果等待时间较长,则效率显然不高。
自旋锁用于保护短的代码段,其中只包含少量C语句,因此会很快执行完毕。大多数内核数据结构都有自身的自旋锁,在处理结构中的关键成员时,必须获得相应的自旋锁。自旋锁在内核源代码中普遍存在。
当运行任务 B 的 CPU(B) 想要通过自旋锁的加锁函数获取锁,而这个自旋锁已经被另一个 CPU(例如运行任务 A 的 CPU(A))持有时,CPU(B) 会简单地在一个循环中自旋,从而阻塞任务 B,直到持锁的 CPU 释放该锁(即任务 A 调用自旋锁的解锁函数)。这种自旋只会发生在多核机器上,这也解释了前面描述的使用场景:因为涉及多个 CPU,在单核机器上不会出现这种情况——任务要么持有自旋锁继续执行,要么等到锁被释放后才会运行。自旋锁是由 CPU 持有的锁,这与互斥锁相反,互斥锁是由任务持有的锁。自旋锁的工作方式是禁用本地 CPU(即运行调用自旋锁加锁 API 的任务的那个 CPU)上的调度器,这也意味着当前在该 CPU 上运行的任务不能被其他任务抢占,但仍可能被中断打断(除非本地中断已被禁用,稍后详述)。换句话说,自旋锁保护的是一次只能被一个 CPU 获取/访问的资源,这使得自旋锁适用于 SMP 安全性和执行原子任务。
自旋锁并不是唯一利用硬件原子能力的机制。例如,Linux 内核中的抢占状态取决于一个每 CPU(per-CPU)变量 preempt_count:等于 0 表示抢占已启用;大于 0 则表示抢占已禁用(schedule() 不再生效)。因此,禁用抢占(preempt_disable())就是把当前 CPU 的 preempt_count 加 1;而 preempt_enable() 则把它减 1,检查新值是否为 0,并在需要时调用 schedule()。这些加/减操作应当是原子的,因此依赖于 CPU 提供的原子加减能力。
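下面用一段极简的示意代码帮助理解 preempt_disable()/preempt_enable() 对计数器的操作方式。这只是按上文描述写的草图,变量名 my_preempt_count 以及简化后的逻辑均为假设,并非内核真实实现:
#include <linux/percpu.h>
#include <linux/sched.h>

/* 仅为示意:每 CPU 的抢占计数,0 表示允许抢占 */
static DEFINE_PER_CPU(int, my_preempt_count);

static inline void my_preempt_disable(void)
{
	this_cpu_inc(my_preempt_count);	/* 计数加 1,当前 CPU 上禁止抢占 */
	barrier();			/* 防止编译器把临界区代码重排到计数操作之前 */
}

static inline void my_preempt_enable(void)
{
	barrier();
	/* 计数减 1;只有减到 0 且确实需要重新调度时,才可能触发调度 */
	if (this_cpu_dec_return(my_preempt_count) == 0 && need_resched())
		schedule();
}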
1.2 API
1.2.1 spin_lock/spin_unlock
最常用的就是 spin_lock/spin_unlock 这一对API了,其使用:
动态定义:
spinlock_t lock; //定义一个自旋锁变量
spin_lock_init(&lock); //初始化自旋锁
静态定义:
DEFINE_SPINLOCK(lock);
使用:
spin_lock(&lock); //加锁
//临界区
spin_unlock(&lock); //解锁
// include/linux/spinlock_types.h
#define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)

// include/linux/spinlock.h
# define spin_lock_init(_lock)				\
do {							\
	spinlock_check(_lock);				\
	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);		\
} while (0)
// include/linux/spinlock.h
#define raw_spin_lock(lock)	_raw_spin_lock(lock)

static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}
// kernel/locking/spinlock.c
void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
	__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
// include/linux/spinlock_api_smp.h
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
preempt_disable(): 禁用抢占,这将增加抢占计数并阻止当前 CPU 上的任务被抢占。
spin_acquire(): 这个函数调用用于表示获取自旋锁。
LOCK_CONTENDED(): 这个宏用于处理自旋锁在被占用时的情况。根据情况,可能会调用 do_raw_spin_trylock 或者 do_raw_spin_lock 函数来尝试获取自旋锁或者直接获取自旋锁。
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
	mmiowb_spin_lock();
}
// include/asm-generic/qspinlock.h
#define arch_spin_lock(l)		queued_spin_lock(l)
加锁的调用流程为:spin_lock() → raw_spin_lock() → _raw_spin_lock() → __raw_spin_lock() → do_raw_spin_lock() → arch_spin_lock()(即 queued_spin_lock())。
这种方式使用自旋锁有一定局限。自旋锁虽然可以防止本地 CPU 上的抢占,但无法防止该 CPU 被中断“霸占”(即去执行中断处理程序)。设想这样一种情况:CPU 持有一把自旋锁来保护某个资源,此时发生了一个中断,CPU 会暂停当前任务并跳转到中断处理程序。到目前为止一切正常。现在假设这个中断处理程序也需要获取同一把自旋锁(可以想见,这个资源是与中断处理程序共享的),它就会原地无限自旋,试图获取一把被它刚刚打断的任务所持有的锁。这种情况就是死锁。
为了解决这个问题,Linux 内核提供了针对自旋锁的 _irq 变种函数,除了禁用/启用抢占外,还会在本地 CPU 上禁用/启用中断。这些函数包括 spin_lock_irq() 和 spin_unlock_irq()。通过使用这些函数,可以在保护共享资源时,同时考虑到中断处理程序对于自旋锁的需求,从而避免了死锁的发生。
1.2.2 spin_lock_irq/spin_unlock_irq
#define raw_spin_lock_irq(lock)		_raw_spin_lock_irq(lock)

static __always_inline void spin_lock_irq(spinlock_t *lock)
{
	raw_spin_lock_irq(&lock->rlock);
}
void __lockfunc _raw_spin_lock_irq(raw_spinlock_t *lock)
{
	__raw_spin_lock_irq(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_irq);
static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
	local_irq_disable();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
local_irq_disable(): 这个函数用于在本地 CPU 上禁用中断,这可以防止当前 CPU 上的中断处理程序运行,从而确保在获取自旋锁期间不会被中断。
preempt_disable(): 禁用抢占,这将增加抢占计数并阻止当前 CPU 上的任务被抢占。
spin_acquire(): 这个函数调用用于表示获取自旋锁。
LOCK_CONTENDED(): 这个宏用于处理自旋锁在被占用时的情况。根据情况,可能会调用 do_raw_spin_trylock 或者 do_raw_spin_lock 函数来尝试获取自旋锁或者直接获取自旋锁。
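下面是一段按上述思路写的使用示意:进程上下文与中断处理程序共享同一数据时,进程上下文用 spin_lock_irq() 关中断并加锁,中断处理程序内部则直接用 spin_lock()(此时本地中断已经处于关闭状态)。示例中的 my_dev、my_irq_handler 等名字均为假设,仅作说明用途:
#include <linux/spinlock.h>
#include <linux/interrupt.h>

struct my_dev {
	spinlock_t lock;
	unsigned long counter;	/* 进程上下文与中断处理程序共享的数据 */
};

/* 中断处理程序:本地中断已关闭,直接 spin_lock() 即可 */
static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct my_dev *dev = data;

	spin_lock(&dev->lock);
	dev->counter++;
	spin_unlock(&dev->lock);
	return IRQ_HANDLED;
}

/* 进程上下文:用 _irq 变体,防止持锁期间被使用同一把锁的中断处理程序打断 */
static void my_reset_counter(struct my_dev *dev)
{
	spin_lock_irq(&dev->lock);
	dev->counter = 0;
	spin_unlock_irq(&dev->lock);
}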
_irq 变种函数只能部分解决这个问题。想象一下,在您的代码开始锁定之前,处理器上的中断已经被禁用。因此,当您调用 spin_unlock_irq() 时,不仅会释放锁,还会启用中断。然而,这可能会以错误的方式发生,因为 spin_unlock_irq() 无法知道在锁定之前哪些中断是已启用的,哪些是未启用的。
以下是一个简短的示例:
(1) 假设在获取自旋锁之前,中断 x 和 y 已被禁用,而 z 没有被禁用。
(2)spin_lock_irq() 会禁用本地 CPU 上的全部中断(现在 x、y 和 z 都被禁用)并获取锁。
(3)spin_unlock_irq() 会重新启用本地 CPU 上的全部中断。现在 x、y 和 z 都处于启用状态,这与获取锁之前的状态不一致,这就是问题所在。
这使得在中断已被禁用的上下文中调用 spin_lock_irq() 是不安全的,因为与之配对的 spin_unlock_irq() 会不加区分地开启 IRQ,有可能把那些在调用 spin_lock_irq() 时本来就处于禁用状态的 IRQ 也一并打开。因此,只有当你确定当前中断是开启的(也就是说,确定没有其他代码在本地 CPU 上禁用了中断)时,使用 spin_lock_irq() 才是合理的。
在这种情况下,想象一下在获取锁之前将中断的状态保存在一个变量中,然后在释放锁时将中断恢复到获取锁时的状态。这样一来,就不会再有问题了。为了实现这一点,内核提供了 _irqsave 变种函数。这些函数的行为类似于 _irq 函数,同时还保存和恢复中断状态的特性。这些函数包括 spin_lock_irqsave() 和 spin_unlock_irqrestore()。
1.2.3 spin_lock_irqsave/spin_unlock_irqrestore
#define spin_lock_irqsave(lock, flags)				\
do {								\
	raw_spin_lock_irqsave(spinlock_check(lock), flags);	\
} while (0)
#define raw_spin_lock_irqsave(lock, flags)		\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = _raw_spin_lock_irqsave(lock);	\
	} while (0)
unsigned long __lockfunc _raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
	return __raw_spin_lock_irqsave(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_irqsave);
static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
	unsigned long flags;

	local_irq_save(flags);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	/*
	 * On lockdep we dont want the hand-coded irq-enable of
	 * do_raw_spin_lock_flags() code, because lockdep assumes
	 * that interrupts are not re-enabled during lock-acquire:
	 */
#ifdef CONFIG_LOCKDEP
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
	do_raw_spin_lock_flags(lock, &flags);
#endif
	return flags;
}
local_irq_save(): 这个函数用于在本地 CPU 上保存当前中断状态到 flags 变量中,然后禁用中断。
preempt_disable(): 禁用抢占,这会增加抢占计数并阻止当前 CPU 上的任务被抢占。
spin_acquire(): 这个函数调用用于表示获取自旋锁。
LOCK_CONTENDED(): 这个宏用于处理自旋锁在被占用时的情况。根据情况,可能会调用 do_raw_spin_trylock 或者 do_raw_spin_lock 函数来尝试获取自旋锁或者直接获取自旋锁。
备注:
凡是用到 spin_lock_irq()/spin_unlock_irq() 的地方,都可以改用 spin_lock_irqsave()/spin_unlock_irqrestore(),具体选哪一种取决于需求:如果希望临界区执行完后本地中断全部处于开启状态,就用 spin_lock_irq()/spin_unlock_irq();如果只希望把中断恢复到加锁之前的开关状态,就用 spin_lock_irqsave()/spin_unlock_irqrestore()——例如加锁前 A 中断本来就是关闭的,那么解锁之后仍然希望 A 中断保持关闭。
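对应的使用方式大致如下(示意代码,my_dev 等名字为假设)。flags 必须是 unsigned long 类型的局部变量,由宏负责保存和恢复中断状态:
#include <linux/spinlock.h>

struct my_dev {
	spinlock_t lock;
	unsigned long counter;
};

static void my_update_counter(struct my_dev *dev, unsigned long val)
{
	unsigned long flags;

	/* 保存当前中断状态并关中断、加锁 */
	spin_lock_irqsave(&dev->lock, flags);
	dev->counter = val;
	/* 解锁并把中断恢复到加锁前的状态(原来关着的仍然保持关闭) */
	spin_unlock_irqrestore(&dev->lock, flags);
}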
1.2.4 spin_lock_bh/spin_unlock_bh
#define raw_spin_lock_bh(lock)		_raw_spin_lock_bh(lock)

static __always_inline void spin_lock_bh(spinlock_t *lock)
{
	raw_spin_lock_bh(&lock->rlock);
}
void __lockfunc _raw_spin_lock_bh(raw_spinlock_t *lock)
{
	__raw_spin_lock_bh(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_bh);
static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
{
	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
__local_bh_disable_ip(): 这个函数用于在本地 CPU 上禁用软中断。
spin_acquire(): 这个函数调用用于表示获取自旋锁。
LOCK_CONTENDED(): 这个宏用于处理自旋锁在被占用时的情况。根据情况,可能会调用 do_raw_spin_trylock 或者 do_raw_spin_lock 函数来尝试获取自旋锁或者直接获取自旋锁。
spin_lock_bh()/spin_unlock_bh() 这对 API 在加锁的同时会关闭本地 CPU 的中断下半部(即软中断),适用于进程上下文与软中断(如 tasklet、定时器)共享数据的场景。
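下面是一段示意代码(my_stats_lock、my_tasklet_func 等名字均为假设):进程上下文用 spin_lock_bh() 关掉本地软中断再加锁,软中断上下文内部用 spin_lock() 即可:
#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(my_stats_lock);
static unsigned long my_rx_packets;	/* 进程上下文与软中断共享的统计值 */

/* 软中断(tasklet)上下文:本地软中断不会被同类软中断打断,用 spin_lock() 即可 */
static void my_tasklet_func(unsigned long data)
{
	spin_lock(&my_stats_lock);
	my_rx_packets++;
	spin_unlock(&my_stats_lock);
}

/* 进程上下文:关闭本地软中断,防止持锁期间被软中断打断造成死锁 */
static unsigned long my_read_and_clear(void)
{
	unsigned long val;

	spin_lock_bh(&my_stats_lock);
	val = my_rx_packets;
	my_rx_packets = 0;
	spin_unlock_bh(&my_stats_lock);
	return val;
}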
1.2.5 补充
spin_lock() 及其所有变体会自动调用 preempt_disable(),在本地 CPU 上禁用抢占;spin_unlock() 及其变体则会调用 preempt_enable(),尝试重新启用抢占(之所以说“尝试”,是因为这还取决于是否仍持有其他自旋锁,它们都会影响抢占计数器的值),只有当计数器减到 0 时才真正允许抢占,并可能在内部调用 schedule()。因此,spin_unlock() 是一个抢占点,可能会重新启用抢占。
禁用中断可以防止内核抢占(因为调度器依赖的定时器中断被关掉了),但这并不能阻止被保护的代码自己去调用调度器(schedule() 函数)。许多内核函数会间接调用调度器,比如处理自旋锁的函数;即使一个简单的 printk() 也可能触发调度,因为它内部用自旋锁保护内核消息缓冲区。内核通过增减一个每 CPU 的计数变量 preempt_count(默认为 0,表示“允许抢占”)来关闭或打开调度器(即抢占):当该值大于 0 时(由 schedule() 检查),调度器直接返回、什么也不做。每次调用 spin_lock* 系列辅助函数,这个变量就会加 1;相应地,任何 spin_unlock* 函数都会把它减 1,当减到 0 时调度器才可能被调用——这也意味着你的临界区在那之后就不再是原子的了。
因此,只有当被保护的代码本身不会触发抢占时,禁用中断才能保护它不被抢占。也就是说,持有自旋锁的代码不允许休眠,因为没有任何办法把它唤醒(请记住,此时本地 CPU 上的定时器中断和调度器都已失效)。
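换句话说,持有自旋锁期间不能调用任何可能睡眠的函数(如 mutex_lock()、msleep()、GFP_KERNEL 方式的内存分配等)。下面的示意代码说明了这一规则(仅为说明用的草图,my_lock、my_alloc_in_atomic 等名字为假设):
#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(my_lock);

static void *my_alloc_in_atomic(size_t size)
{
	void *p;

	spin_lock(&my_lock);
	/*
	 * 错误做法:kmalloc(size, GFP_KERNEL) 可能睡眠,
	 * 在持有自旋锁(抢占已禁用)时调用会出问题。
	 * 正确做法:在原子上下文中使用 GFP_ATOMIC。
	 */
	p = kmalloc(size, GFP_ATOMIC);
	spin_unlock(&my_lock);
	return p;
}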
二、自旋锁原理
自旋锁是一种基于硬件的锁原语。它依赖于当前硬件的能力来提供原子操作(比如test_and_set,在非原子实现中会涉及读取、修改和写入操作)。
通常,自旋锁只能借助特殊的汇编指令来实现,比如原子的(即不可中断的)test-and-set 操作;在不支持真正原子操作的编程语言中无法轻松实现。
Linux 对自旋锁的实现可以分为三个阶段:
(1)老版本(2.6.25之前)的Linux内核的自旋锁
typedef struct {
	volatile unsigned int slock;
} raw_spinlock_t;
raw_spinlock_t 结构体只用一个 unsigned int 类型的 slock 来表示锁的状态:slock 为 1 表示 unlocked,为 0(或负数)表示 locked。
在未加锁的情况下,线程可以获得该锁,并把标志改为“已加锁”;其它线程发现锁已被持有,就只能原地等待。当持锁线程解锁时,再把标志恢复为“未加锁”,此时其它线程就可以加锁了。至于用 0 还是 1 表示“已加锁”,只是状态标识的约定问题,并不影响原理。
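按照这个思路,用 C11 的原子操作可以在用户态写出一个最简单的 test-and-set 自旋锁草图(仅用于说明原理,与内核实现无关,名字均为假设):
#include <stdatomic.h>

typedef struct {
	atomic_int locked;	/* 0 表示未加锁,1 表示已加锁 */
} my_spinlock_t;

static void my_spin_lock(my_spinlock_t *lock)
{
	/* 原子地把 locked 置为 1;若旧值已是 1,说明锁被别人持有,继续忙等 */
	while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire))
		;	/* 自旋(忙等待) */
}

static void my_spin_unlock(my_spinlock_t *lock)
{
	atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
v2.6.20 内核的实际实现如下: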
// v2.6.20/source/include/linux/spinlock.h
#define spin_lock(lock)		_spin_lock(lock)

// v2.6.20/source/kernel/spinlock.c
void __lockfunc _spin_lock(spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
}
// v2.6.20/source/include/linux/spinlock.h
# define _raw_spin_lock(lock)		__raw_spin_lock(&(lock)->raw_lock)

// v2.6.20/source/include/asm-x86_64/spinlock.h
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	asm volatile(
		"\n1:\t"
		LOCK_PREFIX " ; decl %0\n\t"
		"jns 2f\n"
		"3:\n"
		"rep;nop\n\t"
		"cmpl $0,%0\n\t"
		"jle 3b\n\t"
		"jmp 1b\n"
		"2:\t" : "=m" (lock->slock) : : "memory");
}
(1)LOCK_PREFIX 是一个用于添加 x86 平台上的锁定指令的宏。这个宏会根据不同的体系结构提供相应的锁定指令。
(2)decl %0 会将 lock 参数指向的变量减一。如果结果不为负数(jns 指令用于判断),则自旋锁已成功获取。
(3)如果减一操作后结果为负数,代码会循环等待(自旋),在 3: 标签处执行 rep;nop 等待指令。
(4)在等待循环中,代码不断检查自旋锁的值,如果值小于等于0,则继续等待。
(5)如果自旋锁的值大于0,则通过 jmp 1b 返回到第1步继续尝试获取自旋锁。
// v2.6.20/source/include/asm-x86_64/spinlock.h
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
}
(1)movl $1, %0 会将值1移动到 lock->slock 所指向的变量,这个操作将自旋锁的值设置为1,表示自旋锁已经被释放。
(2)“=m” (lock->slock) 表示将 lock->slock 的值输出到内存中,以更新自旋锁的状态。
(3)memory 标记告诉编译器不要对内存操作进行任何优化,以确保在释放自旋锁时对内存的操作不会被重新排序或优化掉。
这个版本的 spin lock 当然能实现功能,而且在没有冲突时表现出不错的性能,但它存在一个问题:不公平。所有线程都在无序地争抢 spin lock,谁先抢到谁先得,不管线程是等了很久还是刚刚开始自旋。在冲突较少的情况下,这种不公平体现得并不明显;然而随着硬件的发展,多核处理器的核数越来越多,核间冲突越来越剧烈,无序竞争的 spinlock 就带来了性能问题。
(2) 2.6.25以后的Linux内核的自旋锁
2.6.25 以后的 Linux 内核采用 ticket spinlock,即基于 FIFO 排队算法的自旋锁,原理示意如下。
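ticket spinlock 的思路类似银行取号:加锁时先原子地领一个号(next),然后等叫号(owner)轮到自己;解锁时把 owner 加 1,叫下一个号,这样就保证了先到先得的公平性。下面是一段用户态 C 的示意实现(仅为原理草图,并非内核源码):
#include <stdatomic.h>

typedef struct {
	atomic_uint next;	/* 下一个待发放的票号 */
	atomic_uint owner;	/* 当前允许进入临界区的票号 */
} my_ticket_lock_t;

static void my_ticket_lock(my_ticket_lock_t *lock)
{
	/* 原子取号:每个竞争者拿到唯一且递增的票号 */
	unsigned int ticket = atomic_fetch_add_explicit(&lock->next, 1,
							memory_order_relaxed);
	/* 自旋等待叫号轮到自己 */
	while (atomic_load_explicit(&lock->owner, memory_order_acquire) != ticket)
		;
}

static void my_ticket_unlock(my_ticket_lock_t *lock)
{
	/* 叫下一个号:严格按到达顺序放行,保证公平 */
	atomic_fetch_add_explicit(&lock->owner, 1, memory_order_release);
}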
(3) 4.2.0 以后的Linux内核的自旋锁
4.2.0 以后的 Linux 内核采用 queued spinlock,即基于 MCS 算法的排队自旋锁。
目前高版本内核的自旋锁实现都是 queued spinlock,其核心思想见下面的 MCS 锁示意。
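MCS 锁让每个等待者只在自己的节点(内核中通常是每 CPU 变量)上自旋,而不是所有 CPU 都去轮询同一个共享变量,从而减少缓存行在多核之间来回弹跳;内核的 queued spinlock 在此基础上又把节点信息压缩进 32 位的锁字中。下面是一段用户态 C 的 MCS 锁示意实现(仅为原理草图,省略了内核中的各种优化,名字均为假设):
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;		/* true 表示仍需等待 */
};

typedef struct {
	_Atomic(struct mcs_node *) tail;	/* 等待队列的队尾 */
} my_mcs_lock_t;

static void my_mcs_lock(my_mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, true, memory_order_relaxed);

	/* 原子地把自己挂到队尾 */
	prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL)
		return;		/* 队列原来为空,直接获得锁 */

	/* 挂到前驱后面,然后只在自己的节点上自旋 */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void my_mcs_unlock(my_mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *next = atomic_load_explicit(&node->next, memory_order_acquire);

	if (next == NULL) {
		/* 没有后继:尝试把 tail 从自己换成 NULL,成功则直接释放 */
		struct mcs_node *expected = node;
		if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected, NULL,
							    memory_order_acq_rel,
							    memory_order_acquire))
			return;
		/* 有竞争者正在入队,等它把自己挂到 next 上 */
		while ((next = atomic_load_explicit(&node->next,
						    memory_order_acquire)) == NULL)
			;
	}
	/* 把锁交给后继,让它结束自旋 */
	atomic_store_explicit(&next->locked, false, memory_order_release);
}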
三、自旋锁在内核的使用
以下内核源码来自于 Linux 5.15.0,即queued spinlock。
3.1 struct file
struct file {
	/*
	 * Protects f_ep, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	......
	unsigned int		f_flags;
	fmode_t			f_mode;
	......
}
spinlock_t f_lock 初始化:
/**
 * alloc_file - allocate and initialize a 'struct file'
 *
 * @path: the (dentry, vfsmount) pair for the new file
 * @flags: O_... flags with which the new file will be opened
 * @fop: the 'struct file_operations' for the new file
 */
static struct file *alloc_file(const struct path *path, int flags,
				const struct file_operations *fop)
{
	struct file *file;

	file = alloc_empty_file(flags, current_cred());
	file->f_path = *path;
	file->f_inode = path->dentry->d_inode;
	file->f_mapping = path->dentry->d_inode->i_mapping;
	......
	file->f_mode |= FMODE_OPENED;
	file->f_op = fop;
	......
	return file;
}
alloc_file()-->alloc_empty_file()-->__alloc_file()
static struct file *__alloc_file(int flags, const struct cred *cred)
{
	struct file *f;

	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
	spin_lock_init(&f->f_lock);
	mutex_init(&f->f_pos_lock);
	f->f_flags = flags;
	f->f_mode = OPEN_FMODE(flags);

	return f;
}
spinlock_t f_lock的使用:
static int setfl(int fd, struct file * filp, unsigned long arg)
{
	......
	spin_lock(&filp->f_lock);
	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
	spin_unlock(&filp->f_lock);
	......
}
保护 file 的 f_flags 成员变量。
int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
{
	switch (advice) {
	case POSIX_FADV_NORMAL:
		file->f_ra.ra_pages = bdi->ra_pages;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	case POSIX_FADV_RANDOM:
		spin_lock(&file->f_lock);
		file->f_mode |= FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	case POSIX_FADV_SEQUENTIAL:
		file->f_ra.ra_pages = bdi->ra_pages * 2;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	......
	}
}
保护 file 的 f_mode 成员变量。
3.2 struct dentry
#define USE_CMPXCHG_LOCKREF \
	(IS_ENABLED(CONFIG_ARCH_USE_CMPXCHG_LOCKREF) && \
	 IS_ENABLED(CONFIG_SMP) && SPINLOCK_SIZE <= 4)

struct lockref {
	union {
#if USE_CMPXCHG_LOCKREF
		aligned_u64 lock_count;
#endif
		struct {
			spinlock_t lock;
			int count;
		};
	};
};

#define d_lock	d_lockref.lock

struct dentry {
	/* Ref lookup also touches following */
	struct lockref d_lockref;	/* per-dentry lock and refcount */
}

/*
 * dentry->d_lock spinlock nesting subclasses:
 *
 * 0: normal
 * 1: nested
 */
enum dentry_d_lock_class
{
	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
	DENTRY_D_LOCK_NESTED
};
使用例程1:
/**
 * d_alloc	-	allocate a dcache entry
 * @parent: parent of entry to allocate
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
{
	struct dentry *dentry = __d_alloc(parent->d_sb, name);
	if (!dentry)
		return NULL;
	spin_lock(&parent->d_lock);
	/*
	 * don't need child lock because it is not subject
	 * to concurrency here
	 */
	__dget_dlock(parent);
	dentry->d_parent = parent;
	list_add(&dentry->d_child, &parent->d_subdirs);
	spin_unlock(&parent->d_lock);

	return dentry;
}
EXPORT_SYMBOL(d_alloc);
分配dentry时初始化spinlock_t:
/**
 * __d_alloc	-	allocate a dcache entry
 * @sb: filesystem it will belong to
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
{
	struct dentry *dentry;

	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
	......
	dentry->d_lockref.count = 1;
	spin_lock_init(&dentry->d_lock);
	seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock);
	......
}
例程2:
/**
 * __d_lookup - search for a dentry (racy)
 * @parent: parent dentry
 * @name: qstr of name we wish to find
 * Returns: dentry, or NULL
 *
 * __d_lookup is like d_lookup, however it may (rarely) return a
 * false-negative result due to unrelated rename activity.
 *
 * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
 * however it must be used carefully, eg. with a following d_lookup in
 * the case of failure.
 *
 * __d_lookup callers must be commented.
 */
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
	unsigned int hash = name->hash;
	struct hlist_bl_head *b = d_hash(hash);
	struct hlist_bl_node *node;
	struct dentry *found = NULL;
	struct dentry *dentry;

	/*
	 * Note: There is significant duplication with __d_lookup_rcu which is
	 * required to prevent single threaded performance regressions
	 * especially on architectures where smp_rmb (in seqcounts) are costly.
	 * Keep the two functions in sync.
	 */

	/*
	 * The hash list is protected using RCU.
	 *
	 * Take d_lock when comparing a candidate dentry, to avoid races
	 * with d_move().
	 *
	 * It is possible that concurrent renames can mess up our list
	 * walk here and result in missing our dentry, resulting in the
	 * false-negative result. d_lookup() protects against concurrent
	 * renames using rename_lock seqlock.
	 *
	 * See Documentation/filesystems/path-lookup.txt for more details.
	 */
	rcu_read_lock();

	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

		if (dentry->d_name.hash != hash)
			continue;

		spin_lock(&dentry->d_lock);
		if (dentry->d_parent != parent)
			goto next;
		if (d_unhashed(dentry))
			goto next;

		if (!d_same_name(dentry, parent, name))
			goto next;

		dentry->d_lockref.count++;
		found = dentry;
		spin_unlock(&dentry->d_lock);
		break;
next:
		spin_unlock(&dentry->d_lock);
	}
	rcu_read_unlock();

	return found;
}
该函数用于在父目录项中搜索指定名称的目录项。
3.3 struct inode
struct inode {
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
}
使用例程1:
/**
 * d_find_alias - grab a hashed alias of inode
 * @inode: inode in question
 *
 * If inode has a hashed alias, or is a directory and has any alias,
 * acquire the reference to alias and return it. Otherwise return NULL.
 * Notice that if inode is a directory there can be only one alias and
 * it can be unhashed only if it has no children, or if it is the root
 * of a filesystem, or if the directory was renamed and d_revalidate
 * was the first vfs operation to notice.
 *
 * If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
 * any other hashed alias over that one.
 */
struct dentry *d_find_alias(struct inode *inode)
{
	struct dentry *de = NULL;

	if (!hlist_empty(&inode->i_dentry)) {
		spin_lock(&inode->i_lock);
		de = __d_find_alias(inode);
		spin_unlock(&inode->i_lock);
	}
	return de;
}
EXPORT_SYMBOL(d_find_alias);
使用例程2:
/*
 * When a file is deleted, we have two options:
 * - turn this dentry into a negative dentry
 * - unhash this dentry and free it.
 *
 * Usually, we want to just turn this into
 * a negative dentry, but if anybody else is
 * currently using the dentry or the inode
 * we can't do that and we fall back on removing
 * it from the hash queues and waiting for
 * it to be deleted later when it has no users
 */

/**
 * d_delete - delete a dentry
 * @dentry: The dentry to delete
 *
 * Turn the dentry into a negative dentry if possible, otherwise
 * remove it from the hash queues so it can be deleted later
 */
void d_delete(struct dentry * dentry)
{
	struct inode *inode = dentry->d_inode;

	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
	/*
	 * Are we the only user?
	 */
	if (dentry->d_lockref.count == 1) {
		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
		dentry_unlink_inode(dentry);
	} else {
		__d_drop(dentry);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&inode->i_lock);
	}
}
EXPORT_SYMBOL(d_delete);
使用例程3:
static struct inode *alloc_inode(struct super_block *sb)
{
	const struct super_operations *ops = sb->s_op;
	struct inode *inode;

	if (ops->alloc_inode)
		inode = ops->alloc_inode(sb);
	else
		inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
	......
	inode_init_always(sb, inode);
	......
	return inode;
}
/**
 * inode_init_always - perform inode structure initialisation
 * @sb: superblock inode belongs to
 * @inode: inode to initialise
 *
 * These are initializations that need to be done on every inode
 * allocation as the fields are not initialised by slab allocation.
 */
int inode_init_always(struct super_block *sb, struct inode *inode)
{
	spin_lock_init(&inode->i_lock);
}
使用例程4:
/**
 * do_inode_permission - UNIX permission checking
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * We _really_ want to just do "generic_permission()" without
 * even looking at the inode->i_op values. So we keep a cache
 * flag in inode->i_opflags, that says "this has not special
 * permission function, use the fast case".
 */
static inline int do_inode_permission(struct user_namespace *mnt_userns,
				      struct inode *inode, int mask)
{
	if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
		if (likely(inode->i_op->permission))
			return inode->i_op->permission(mnt_userns, inode, mask);

		/* This gets set once for the inode lifetime */
		spin_lock(&inode->i_lock);
		inode->i_opflags |= IOP_FASTPERM;
		spin_unlock(&inode->i_lock);
	}
	return generic_permission(mnt_userns, inode, mask);
}
3.4 struct super_block
struct super_block {
	......
	/* s_inode_list_lock protects s_inodes */
	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
	struct list_head	s_inodes;	/* all inodes */

	spinlock_t		s_inode_wblist_lock;
	struct list_head	s_inodes_wb;	/* writeback inodes */
}
spinlock_t s_inode_list_lock/s_inode_wblist_lock初始化:
/**
 * alloc_super - create new superblock
 * @type: filesystem type superblock should belong to
 * @flags: the mount flags
 * @user_ns: User namespace for the super_block
 *
 * Allocates and initializes a new &struct super_block. alloc_super()
 * returns a pointer new superblock or %NULL if allocation had failed.
 */
static struct super_block *alloc_super(struct file_system_type *type, int flags,
				       struct user_namespace *user_ns)
{
	struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
	......
	INIT_LIST_HEAD(&s->s_inodes);
	spin_lock_init(&s->s_inode_list_lock);
	INIT_LIST_HEAD(&s->s_inodes_wb);
	spin_lock_init(&s->s_inode_wblist_lock);
	......
}
s_inode_list_lock的使用:
/**
 * inode_sb_list_add - add inode to the superblock list of inodes
 * @inode: inode to add
 */
void inode_sb_list_add(struct inode *inode)
{
	spin_lock(&inode->i_sb->s_inode_list_lock);
	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
	spin_unlock(&inode->i_sb->s_inode_list_lock);
}
EXPORT_SYMBOL_GPL(inode_sb_list_add);

static inline void inode_sb_list_del(struct inode *inode)
{
	if (!list_empty(&inode->i_sb_list)) {
		spin_lock(&inode->i_sb->s_inode_list_lock);
		list_del_init(&inode->i_sb_list);
		spin_unlock(&inode->i_sb->s_inode_list_lock);
	}
}

/**
 * invalidate_inodes	- attempt to free all inodes on a superblock
 * @sb:		superblock to operate on
 * @kill_dirty: flag to guide handling of dirty inodes
 *
 * Attempts to free all inodes for a given superblock. If there were any
 * busy inodes return a non-zero value, else zero.
 * If @kill_dirty is set, discard dirty inodes too, otherwise treat
 * them as busy.
 */
int invalidate_inodes(struct super_block *sb, bool kill_dirty)
{
	int busy = 0;
	struct inode *inode, *next;
	LIST_HEAD(dispose);

again:
	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
			spin_unlock(&inode->i_lock);
			busy = 1;
			continue;
		}
		if (atomic_read(&inode->i_count)) {
			spin_unlock(&inode->i_lock);
			busy = 1;
			continue;
		}

		inode->i_state |= I_FREEING;
		inode_lru_list_del(inode);
		spin_unlock(&inode->i_lock);
		list_add(&inode->i_lru, &dispose);
		if (need_resched()) {
			spin_unlock(&sb->s_inode_list_lock);
			cond_resched();
			dispose_list(&dispose);
			goto again;
		}
	}
	spin_unlock(&sb->s_inode_list_lock);

	dispose_list(&dispose);

	return busy;
}
/**
 * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
 * @sb: superblock being unmounted.
 *
 * Called during unmount with no locks held, so needs to be safe against
 * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
 */
static void fsnotify_unmount_inodes(struct super_block *sb)
{
	struct inode *inode, *iput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/*
		 * We cannot __iget() an inode in state I_FREEING,
		 * I_WILL_FREE, or I_NEW which is fine because by that point
		 * the inode cannot have any associated watches.
		 */
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}

		/*
		 * If i_count is zero, the inode cannot have any watches and
		 * doing an __iget/iput with SB_ACTIVE clear would actually
		 * evict all inodes with zero i_count from icache which is
		 * unnecessarily violent and may in fact be illegal to do.
		 * However, we should have been called /after/ evict_inodes
		 * removed all zero refcount inodes, in any case.  Test to
		 * be sure.
		 */
		if (!atomic_read(&inode->i_count)) {
			spin_unlock(&inode->i_lock);
			continue;
		}

		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		if (iput_inode)
			iput(iput_inode);

		/* for each watch, send FS_UNMOUNT and then remove it */
		fsnotify_inode(inode, FS_UNMOUNT);

		fsnotify_inode_delete(inode);

		iput_inode = inode;

		cond_resched();
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);

	if (iput_inode)
		iput(iput_inode);
}
s_inode_wblist_lock的使用:
// linux-5.15/fs/fs-writeback.c

/*
 * mark an inode as under writeback on the sb
 */
void sb_mark_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	if (list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		if (list_empty(&inode->i_wb_list)) {
			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
			trace_sb_mark_inode_writeback(inode);
		}
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}
这个函数用于将给定的索引节点标记为正在写回的状态,并将其添加到超级块的写回列表中。函数首先检查索引节点的写回列表是否为空,如果是空的,则获取超级块的写回列表锁,然后再次检查是否为空以避免重复添加。如果索引节点的写回列表为空,则将该索引节点添加到超级块的写回列表中,并记录写回操作的跟踪信息。
// linux-5.15/fs/fs-writeback.c

/*
 * clear an inode as under writeback on the sb
 */
void sb_clear_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	if (!list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		if (!list_empty(&inode->i_wb_list)) {
			list_del_init(&inode->i_wb_list);
			trace_sb_clear_inode_writeback(inode);
		}
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}
这个函数用于将给定的索引节点从超级块的写回列表中清除,表示该索引节点不再处于写回状态。函数首先检查索引节点的写回列表是否非空,如果非空,则获取超级块的写回列表锁,然后再次检查是否非空以确保正确性。如果索引节点的写回列表非空,则将该索引节点从超级块的写回列表中删除,并记录清除写回状态的跟踪信息。
// linux-5.15/fs/fs-writeback.c

/*
 * The @s_sync_lock is used to serialise concurrent sync operations
 * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
 * Concurrent callers will block on the s_sync_lock rather than doing contending
 * walks. The queueing maintains sync(2) required behaviour as all the IO that
 * has been issued up to the time this function is enter is guaranteed to be
 * completed by the time we have gained the lock and waited for all IO that is
 * in progress regardless of the order callers are granted the lock.
 */
static void wait_sb_inodes(struct super_block *sb)
{
	LIST_HEAD(sync_list);

	/*
	 * We need to be protected against the filesystem going from
	 * r/o to r/w or vice versa.
	 */
	WARN_ON(!rwsem_is_locked(&sb->s_umount));

	mutex_lock(&sb->s_sync_lock);

	/*
	 * Splice the writeback list onto a temporary list to avoid waiting on
	 * inodes that have started writeback after this point.
	 *
	 * Use rcu_read_lock() to keep the inodes around until we have a
	 * reference. s_inode_wblist_lock protects sb->s_inodes_wb as well as
	 * the local list because inodes can be dropped from either by writeback
	 * completion.
	 */
	rcu_read_lock();
	spin_lock_irq(&sb->s_inode_wblist_lock);
	list_splice_init(&sb->s_inodes_wb, &sync_list);

	/*
	 * Data integrity sync. Must wait for all pages under writeback, because
	 * there may have been pages dirtied before our sync call, but which had
	 * writeout started before we write it out.  In which case, the inode
	 * may not be on the dirty list, but we still have to wait for that
	 * writeout.
	 */
	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list, struct inode,
						       i_wb_list);
		struct address_space *mapping = inode->i_mapping;

		/*
		 * Move each inode back to the wb list before we drop the lock
		 * to preserve consistency between i_wb_list and the mapping
		 * writeback tag. Writeback completion is responsible to remove
		 * the inode from either list once the writeback tag is cleared.
		 */
		list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

		/*
		 * The mapping can appear untagged while still on-list since we
		 * do not have the mapping lock. Skip it here, wb completion
		 * will remove it.
		 */
		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
			continue;

		spin_unlock_irq(&sb->s_inode_wblist_lock);

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
			spin_unlock(&inode->i_lock);

			spin_lock_irq(&sb->s_inode_wblist_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();

		/*
		 * We keep the error status of individual mapping so that
		 * applications can catch the writeback error using fsync(2).
		 * See filemap_fdatawait_keep_errors() for details.
		 */
		filemap_fdatawait_keep_errors(mapping);

		cond_resched();

		iput(inode);

		rcu_read_lock();
		spin_lock_irq(&sb->s_inode_wblist_lock);
	}
	spin_unlock_irq(&sb->s_inode_wblist_lock);
	rcu_read_unlock();
	mutex_unlock(&sb->s_sync_lock);
}
这个函数的作用是等待特定超级块上的所有索引节点的写回操作完成。函数会遍历超级块中的写回列表,并等待所有正在写回的页面写入完成。这确保了在函数退出时,已经发出的所有 I/O 操作都已完成。
参考资料
Linux 2.6.20
Linux 5.15.0
https://medium.com/geekculture/the-linux-kernel-locking-api-and-shared-objects-1169c2ae88ff