YOLO11 Improvements | Backbone | Replacing the Backbone with a Swin-Transformer Structure [A Must-Have for Papers]

Recommended column for autumn recruitment interviews: A Summary of Interview Questions for Deep Learning Algorithm Engineers [100 Questions for Algorithm Engineers] (click to jump)


💡💡💡 All programs in this column have been tested and run successfully 💡💡💡


This tutorial shows how to replace the YOLO11 backbone with a Swin-Transformer structure for feature extraction. After introducing the main ideas, it walks you step by step through adding and modifying the module code, and the complete modified code is provided at the end of the article so you can run it in one go; even beginners can follow along. The goal is to help you better tackle the YOLO family of deep-learning object detectors.

Column link: YOLO11 Basics + Improvements for Higher Accuracy (click to jump). Subscriptions welcome.

Table of Contents

1. The Paper

2. Adding Swin-Transformer to the YOLO11 Network

2.1 Swin-Transformer Code Implementation

2.2 Walkthrough of the Swin-Transformer Module Code

2.3 Modifying the __init__.py File

2.4 Adding the yaml File

2.5 Registering in tasks.py

2.6 Running the Program

3. Network Structure Diagram after Modification

4. Full Code

5. GFLOPs

6. Going Further

7. Summary


1. The Paper

Paper: Swin-Transformer (click to jump)

Official code: Swin-Transformer official repository (click to jump)

2. Adding Swin-Transformer to the YOLO11 Network

2.1 Swin-Transformer Code Implementation

Key step 1: paste the following code into /ultralytics/ultralytics/nn/modules/block.py.


#swin transformer
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
import numpy as np
from typing import Optional


def drop_path_f(x, drop_prob: float = 0., training: bool = False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path_f(x, self.drop_prob, self.training)


def window_partition(x, window_size: int):
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # permute: [B, H//Mh, Mh, W//Mw, Mw, C] -> [B, H//Mh, W//Mw, Mh, Mw, C]
    # view: [B, H//Mh, W//Mw, Mh, Mw, C] -> [B*num_windows, Mh, Mw, C]
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows


def window_reverse(windows, window_size: int, H: int, W: int):
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    # view: [B*num_windows, Mh, Mw, C] -> [B, H//Mh, W//Mw, Mh, Mw, C]
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    # permute: [B, H//Mh, W//Mw, Mh, Mw, C] -> [B, H//Mh, Mh, W//Mw, Mw, C]
    # view: [B, H//Mh, Mh, W//Mw, Mw, C] -> [B, H, W, C]
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x


class WindowAttention(nn.Module):
    r""" Window based multi-head self attention (W-MSA) module with relative position bias.
    It supports both of shifted and non-shifted window.

    Args:
        dim (int): Number of input channels.
        window_size (tuple[int]): The height and width of the window.
        num_heads (int): Number of attention heads.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0
        proj_drop (float, optional): Dropout ratio of output. Default: 0.0
    """

    def __init__(self, dim, window_size, num_heads, qkv_bias=True, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.dim = dim
        self.window_size = window_size  # [Mh, Mw]
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        # define a parameter table of relative position bias
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads))  # [2*Mh-1 * 2*Mw-1, nH]

        # get pair-wise relative position index for each token inside the window
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w], indexing="ij"))  # [2, Mh, Mw]
        coords_flatten = torch.flatten(coords, 1)  # [2, Mh*Mw]
        # [2, Mh*Mw, 1] - [2, 1, Mh*Mw]
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # [2, Mh*Mw, Mh*Mw]
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # [Mh*Mw, Mh*Mw, 2]
        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)  # [Mh*Mw, Mh*Mw]
        self.register_buffer("relative_position_index", relative_position_index)

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        nn.init.trunc_normal_(self.relative_position_bias_table, std=.02)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, mask: Optional[torch.Tensor] = None):
        """
        Args:
            x: input features with shape of (num_windows*B, Mh*Mw, C)
            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None
        """
        # [batch_size*num_windows, Mh*Mw, total_embed_dim]
        B_, N, C = x.shape
        # qkv(): -> [batch_size*num_windows, Mh*Mw, 3 * total_embed_dim]
        # reshape: -> [batch_size*num_windows, Mh*Mw, 3, num_heads, embed_dim_per_head]
        # permute: -> [3, batch_size*num_windows, num_heads, Mh*Mw, embed_dim_per_head]
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4).contiguous()
        # [batch_size*num_windows, num_heads, Mh*Mw, embed_dim_per_head]
        q, k, v = qkv.unbind(0)  # make torchscript happy (cannot use tensor as tuple)

        # transpose: -> [batch_size*num_windows, num_heads, embed_dim_per_head, Mh*Mw]
        # @: multiply -> [batch_size*num_windows, num_heads, Mh*Mw, Mh*Mw]
        q = q * self.scale
        attn = (q @ k.transpose(-2, -1))
        # relative_position_bias_table.view: [Mh*Mw*Mh*Mw,nH] -> [Mh*Mw,Mh*Mw,nH]
        relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view(
            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()  # [nH, Mh*Mw, Mh*Mw]
        attn = attn + relative_position_bias.unsqueeze(0)

        if mask is not None:
            # mask: [nW, Mh*Mw, Mh*Mw]
            nW = mask.shape[0]  # num_windows
            # attn.view: [batch_size, num_windows, num_heads, Mh*Mw, Mh*Mw]
            # mask.unsqueeze: [1, nW, 1, Mh*Mw, Mh*Mw]
            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, N, N)
            attn = self.softmax(attn)
        else:
            attn = self.softmax(attn)

        attn = self.attn_drop(attn)

        # @: multiply -> [batch_size*num_windows, num_heads, Mh*Mw, embed_dim_per_head]
        # transpose: -> [batch_size*num_windows, Mh*Mw, num_heads, embed_dim_per_head]
        # reshape: -> [batch_size*num_windows, Mh*Mw, total_embed_dim]
        # x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        x = (attn.to(v.dtype) @ v).transpose(1, 2).reshape(B_, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


class SwinTransformerBlock(nn.Module):
    r""" Swin Transformer Block.

    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        window_size (int): Window size.
        shift_size (int): Shift size for SW-MSA.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        drop (float, optional): Dropout rate. Default: 0.0
        attn_drop (float, optional): Attention dropout rate. Default: 0.0
        drop_path (float, optional): Stochastic depth rate. Default: 0.0
        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
    """

    def __init__(self, dim, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio
        assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size"

        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention(
            dim, window_size=(self.window_size, self.window_size), num_heads=num_heads, qkv_bias=qkv_bias,
            attn_drop=attn_drop, proj_drop=drop)

        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x, attn_mask):
        H, W = self.H, self.W
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"

        shortcut = x
        x = self.norm1(x)
        x = x.view(B, H, W, C)

        # pad feature maps to multiples of window size
        pad_l = pad_t = 0
        pad_r = (self.window_size - W % self.window_size) % self.window_size
        pad_b = (self.window_size - H % self.window_size) % self.window_size
        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))
        _, Hp, Wp, _ = x.shape

        # cyclic shift
        if self.shift_size > 0:
            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        else:
            shifted_x = x
            attn_mask = None

        # partition windows
        x_windows = window_partition(shifted_x, self.window_size)  # [nW*B, Mh, Mw, C]
        x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # [nW*B, Mh*Mw, C]

        # W-MSA/SW-MSA
        attn_windows = self.attn(x_windows, mask=attn_mask)  # [nW*B, Mh*Mw, C]

        # merge windows
        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)  # [nW*B, Mh, Mw, C]
        shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)  # [B, H', W', C]

        # reverse cyclic shift
        if self.shift_size > 0:
            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        else:
            x = shifted_x

        if pad_r > 0 or pad_b > 0:
            # remove the padding added above
            x = x[:, :H, :W, :].contiguous()

        x = x.view(B, H * W, C)

        # FFN
        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))

        return x


class SwinStage(nn.Module):
    """
    A basic Swin Transformer layer for one stage.

    Args:
        dim (int): Number of input channels.
        depth (int): Number of blocks.
        num_heads (int): Number of attention heads.
        window_size (int): Local window size.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        drop (float, optional): Dropout rate. Default: 0.0
        attn_drop (float, optional): Attention dropout rate. Default: 0.0
        drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
        downsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: None
        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False.
    """

    def __init__(self, dim, c2, depth, num_heads, window_size,
                 mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0.,
                 drop_path=0., norm_layer=nn.LayerNorm, use_checkpoint=False):
        super().__init__()
        assert dim == c2, r"no. in/out channel should be same"
        self.dim = dim
        self.depth = depth
        self.window_size = window_size
        self.use_checkpoint = use_checkpoint
        self.shift_size = window_size // 2

        # build blocks
        self.blocks = nn.ModuleList([
            SwinTransformerBlock(
                dim=dim,
                num_heads=num_heads,
                window_size=window_size,
                shift_size=0 if (i % 2 == 0) else self.shift_size,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                drop=drop,
                attn_drop=attn_drop,
                drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,
                norm_layer=norm_layer)
            for i in range(depth)])

    def create_mask(self, x, H, W):
        # calculate attention mask for SW-MSA
        Hp = int(np.ceil(H / self.window_size)) * self.window_size
        Wp = int(np.ceil(W / self.window_size)) * self.window_size
        img_mask = torch.zeros((1, Hp, Wp, 1), device=x.device)  # [1, Hp, Wp, 1]
        h_slices = (slice(0, -self.window_size),
                    slice(-self.window_size, -self.shift_size),
                    slice(-self.shift_size, None))
        w_slices = (slice(0, -self.window_size),
                    slice(-self.window_size, -self.shift_size),
                    slice(-self.shift_size, None))
        cnt = 0
        for h in h_slices:
            for w in w_slices:
                img_mask[:, h, w, :] = cnt
                cnt += 1

        mask_windows = window_partition(img_mask, self.window_size)  # [nW, Mh, Mw, 1]
        mask_windows = mask_windows.view(-1, self.window_size * self.window_size)  # [nW, Mh*Mw]
        attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)  # [nW, 1, Mh*Mw] - [nW, Mh*Mw, 1]
        # [nW, Mh*Mw, Mh*Mw]
        attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
        return attn_mask

    def forward(self, x):
        B, C, H, W = x.shape
        x = x.permute(0, 2, 3, 1).contiguous().view(B, H * W, C)
        attn_mask = self.create_mask(x, H, W)  # [nW, Mh*Mw, Mh*Mw]
        for blk in self.blocks:
            blk.H, blk.W = H, W
            if not torch.jit.is_scripting() and self.use_checkpoint:
                x = checkpoint.checkpoint(blk, x, attn_mask)
            else:
                x = blk(x, attn_mask)

        x = x.view(B, H, W, C)
        x = x.permute(0, 3, 1, 2).contiguous()
        return x


class PatchEmbed(nn.Module):
    def __init__(self, in_c=3, embed_dim=96, patch_size=4, norm_layer=None):
        super().__init__()
        patch_size = (patch_size, patch_size)
        self.patch_size = patch_size
        self.in_chans = in_c
        self.embed_dim = embed_dim
        self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        _, _, H, W = x.shape

        # padding
        pad_input = (H % self.patch_size[0] != 0) or (W % self.patch_size[1] != 0)
        if pad_input:
            # to pad the last 3 dimensions,
            # (W_left, W_right, H_top, H_bottom, C_front, C_back)
            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1],
                          0, self.patch_size[0] - H % self.patch_size[0],
                          0, 0))

        x = self.proj(x)
        B, C, H, W = x.shape
        # flatten: [B, C, H, W] -> [B, C, HW]
        # transpose: [B, C, HW] -> [B, HW, C]
        x = x.flatten(2).transpose(1, 2)
        x = self.norm(x)
        # view: [B, HW, C] -> [B, H, W, C]
        # permute: [B, H, W, C] -> [B, C, H, W]
        x = x.view(B, H, W, C)
        x = x.permute(0, 3, 1, 2).contiguous()
        return x


class PatchMerging(nn.Module):
    r""" Patch Merging Layer.

    Args:
        dim (int): Number of input channels.
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
    """

    def __init__(self, dim, c2, norm_layer=nn.LayerNorm):
        super().__init__()
        assert c2 == (2 * dim), r"no. out channel should be 2 * no. in channel "
        self.dim = dim
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
        self.norm = norm_layer(4 * dim)

    def forward(self, x):
        """
        x: B, C, H, W
        """
        B, C, H, W = x.shape
        # assert L == H * W, "input feature has wrong size"
        x = x.permute(0, 2, 3, 1).contiguous()
        # x = x.view(B, H*W, C)

        # padding
        pad_input = (H % 2 == 1) or (W % 2 == 1)
        if pad_input:
            # to pad the last 3 dimensions, starting from the last dimension and moving forward.
            # (C_front, C_back, W_left, W_right, H_top, H_bottom)
            x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))

        x0 = x[:, 0::2, 0::2, :]  # [B, H/2, W/2, C]
        x1 = x[:, 1::2, 0::2, :]  # [B, H/2, W/2, C]
        x2 = x[:, 0::2, 1::2, :]  # [B, H/2, W/2, C]
        x3 = x[:, 1::2, 1::2, :]  # [B, H/2, W/2, C]
        x = torch.cat([x0, x1, x2, x3], -1)  # [B, H/2, W/2, 4*C]
        x = x.view(B, -1, 4 * C)  # [B, H/2*W/2, 4*C]

        x = self.norm(x)
        x = self.reduction(x)  # [B, H/2*W/2, 2*C]

        x = x.view(B, int(H / 2), int(W / 2), C * 2)
        x = x.permute(0, 3, 1, 2).contiguous()
        return x

2.2 Walkthrough of the Swin-Transformer Module Code

  1. DropPath: implements stochastic depth regularization, a technique that randomly drops some residual branches during training to improve generalization.

  2. WindowAttention: implements window-based multi-head self-attention (W-MSA) with relative position bias; it operates on small windows of the input image to capture local dependencies efficiently.

  3. SwinTransformerBlock: a Swin Transformer block consisting of window-based multi-head self-attention and a feed-forward network (FFN) with GELU activation. It supports shifted windows, which improve the model's ability to capture cross-window connections.

  4. SwinStage: one stage of the Swin Transformer, composed of multiple Swin Transformer blocks. It contains the logic for cyclic shifting, window partitioning, and applying attention.

  5. PatchEmbed: converts the input image into patches and linearly projects them into a higher-dimensional embedding space, essentially preparing the input for the transformer layers.

  6. PatchMerging: merges neighboring patches to reduce the spatial dimensions, progressively downsampling the feature maps.

Key features:

  • The window partition and reverse functions process the input in non-overlapping windows, which reduces computational complexity compared with standard self-attention.

  • Relative position encoding injects spatial information into the attention mechanism without requiring a fixed positional grid.

  • The shifted-window mechanism allows cross-window connections and strengthens the model's ability to capture long-range dependencies.

  • Patch embedding and patch merging transform the input image and progressively reduce the spatial resolution, making the model more efficient on large inputs.

A minimal shape check of these modules is sketched right after this list.
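If you want to sanity-check the modules before wiring them into YOLO11, the following is a minimal sketch: run it in a scratch script where the classes above are in scope (block.py itself uses relative imports and is not meant to be run directly). The channel widths mirror the n-scale configuration used later in this article.

import torch

# dummy 640x640 image batch -> PatchEmbed (stride 4) -> one Swin stage -> PatchMerging
x = torch.randn(1, 3, 640, 640)
stem = PatchEmbed(in_c=3, embed_dim=24, patch_size=4)                  # -> [1, 24, 160, 160]
stage = SwinStage(dim=24, c2=24, depth=2, num_heads=3, window_size=7)  # shape preserved
merge = PatchMerging(dim=24, c2=48)                                    # halves H/W, doubles C
y = merge(stage(stem(x)))
print(y.shape)  # expected: torch.Size([1, 48, 80, 80])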

2.3 Modifying the __init__.py File

Key step 2: edit the __init__.py file in the modules folder and first import the new classes.

Then declare them in the __all__ list. A sketch of both edits follows.
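The real file already contains long import and __all__ lists, so you only extend them with the three classes defined in block.py; a minimal sketch, not a full replacement of the file:

# ultralytics/nn/modules/__init__.py (sketch: extend the existing lists, do not replace them)
from .block import PatchEmbed, SwinStage, PatchMerging

__all__ = (
    # ... keep all existing entries ...
    "PatchEmbed",
    "SwinStage",
    "PatchMerging",
)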

2.4 Adding the yaml File

Key step 3: create a new file named yolo11_Swin-Transformer.yaml under /ultralytics/ultralytics/cfg/models/11 and paste in the content below.

  • Object detection
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, PatchEmbed, [96, 4]] # 0 [b, 96, 160, 160]
  - [-1, 1, SwinStage, [96, 2, 3, 7]] # 1 [b, 96, 160, 160]
  - [-1, 1, PatchMerging, [192]] # 2 [b, 192, 80, 80]
  - [-1, 1, SwinStage, [192, 2, 6, 7]] # 3 --F0-- [b, 192, 80, 80] P3
  - [-1, 1, PatchMerging, [384]] # 4 [b, 384, 40, 40]
  - [-1, 1, SwinStage, [384, 6, 12, 7]] # 5 --F1-- [b, 384, 40, 40] P4
  - [-1, 1, PatchMerging, [768]] # 6 [b, 768, 20, 20]
  - [-1, 1, SwinStage, [768, 2, 24, 7]] # 7 --F2-- [b, 768, 20, 20]
  - [-1, 1, SPPF, [768, 5]] # 8

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 9
  - [[-1, 5], 1, Concat, [1]] # 10 cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 11
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 12
  - [[-1, 3], 1, Concat, [1]] # 13 cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 14 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]] # 15
  - [[-1, 11], 1, Concat, [1]] # 16 cat head P4
  - [-1, 2, C3k2, [512, False]] # 17 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]] # 18
  - [[-1, 8], 1, Concat, [1]] # 19 cat head P5
  - [-1, 2, C3k2, [1024, True]] # 20 (P5/32-large)
  - [[14, 17, 20], 1, Detect, [nc]] # 21 Detect(P3, P4, P5)
  • Instance segmentation
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 instance segmentation model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/segment

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, PatchEmbed, [96, 4]] # 0 [b, 96, 160, 160]
  - [-1, 1, SwinStage, [96, 2, 3, 7]] # 1 [b, 96, 160, 160]
  - [-1, 1, PatchMerging, [192]] # 2 [b, 192, 80, 80]
  - [-1, 1, SwinStage, [192, 2, 6, 7]] # 3 --F0-- [b, 192, 80, 80] P3
  - [-1, 1, PatchMerging, [384]] # 4 [b, 384, 40, 40]
  - [-1, 1, SwinStage, [384, 6, 12, 7]] # 5 --F1-- [b, 384, 40, 40] P4
  - [-1, 1, PatchMerging, [768]] # 6 [b, 768, 20, 20]
  - [-1, 1, SwinStage, [768, 2, 24, 7]] # 7 --F2-- [b, 768, 20, 20]
  - [-1, 1, SPPF, [768, 5]] # 8

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 9
  - [[-1, 5], 1, Concat, [1]] # 10 cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 11
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 12
  - [[-1, 3], 1, Concat, [1]] # 13 cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 14 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]] # 15
  - [[-1, 11], 1, Concat, [1]] # 16 cat head P4
  - [-1, 2, C3k2, [512, False]] # 17 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]] # 18
  - [[-1, 8], 1, Concat, [1]] # 19 cat head P5
  - [-1, 2, C3k2, [1024, True]] # 20 (P5/32-large)
  - [[14, 17, 20], 1, Segment, [nc, 32, 256]] # 21 Segment(P3, P4, P5)
  • Oriented object detection (OBB)
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 Oriented Bounding Boxes (OBB) model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/obb

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, PatchEmbed, [96, 4]] # 0 [b, 96, 160, 160]
  - [-1, 1, SwinStage, [96, 2, 3, 7]] # 1 [b, 96, 160, 160]
  - [-1, 1, PatchMerging, [192]] # 2 [b, 192, 80, 80]
  - [-1, 1, SwinStage, [192, 2, 6, 7]] # 3 --F0-- [b, 192, 80, 80] P3
  - [-1, 1, PatchMerging, [384]] # 4 [b, 384, 40, 40]
  - [-1, 1, SwinStage, [384, 6, 12, 7]] # 5 --F1-- [b, 384, 40, 40] P4
  - [-1, 1, PatchMerging, [768]] # 6 [b, 768, 20, 20]
  - [-1, 1, SwinStage, [768, 2, 24, 7]] # 7 --F2-- [b, 768, 20, 20]
  - [-1, 1, SPPF, [768, 5]] # 8

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 9
  - [[-1, 5], 1, Concat, [1]] # 10 cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 11
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 12
  - [[-1, 3], 1, Concat, [1]] # 13 cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 14 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]] # 15
  - [[-1, 11], 1, Concat, [1]] # 16 cat head P4
  - [-1, 2, C3k2, [512, False]] # 17 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]] # 18
  - [[-1, 8], 1, Concat, [1]] # 19 cat head P5
  - [-1, 2, C3k2, [1024, True]] # 20 (P5/32-large)
  - [[14, 17, 20], 1, OBB, [nc, 1]] # 21 OBB(P3, P4, P5)

Note: this article adds the module on top of the base yolo11 configuration. To build the yolo11n/s/m/l/x variants you only need to specify the corresponding depth_multiple and width_multiple, listed below; see the sketch after this block for a simpler way to pick a scale.


# YOLO11n
depth_multiple: 0.50  # model depth multiple
width_multiple: 0.25  # layer channel multiple
max_channels: 1024

# YOLO11s
depth_multiple: 0.50  # model depth multiple
width_multiple: 0.50  # layer channel multiple
max_channels: 1024

# YOLO11m
depth_multiple: 0.50  # model depth multiple
width_multiple: 1.00  # layer channel multiple
max_channels: 512

# YOLO11l
depth_multiple: 1.00  # model depth multiple
width_multiple: 1.00  # layer channel multiple
max_channels: 512

# YOLO11x
depth_multiple: 1.00  # model depth multiple
width_multiple: 1.50  # layer channel multiple
max_channels: 512
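In practice, if you keep the scales: block from the yaml above, you usually do not need to edit these multiples by hand: as the comment in the yaml itself notes, Ultralytics resolves the scale from the model name. A minimal sketch (this relies on the usual name-based scale resolution, so the scale letter in the file name is the only thing that changes):

from ultralytics import YOLO

# 'yolo11s_Swin-Transformer.yaml' resolves to yolo11_Swin-Transformer.yaml and
# applies the 's' row of the scales table; use n/s/m/l/x as needed.
model_s = YOLO("ultralytics/cfg/models/11/yolo11s_Swin-Transformer.yaml")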

2.5 Registering in tasks.py

Key step 4: register the new modules in the parse_model function (add PatchEmbed, SwinStage, and PatchMerging).

First import the classes in tasks.py.
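A sketch, assuming the default package layout (the actual file groups many module imports into a single statement, so simply append the new names there):

# ultralytics/nn/tasks.py (sketch: add the three classes to the existing
# 'from ultralytics.nn.modules import (...)' statement near the top of the file)
from ultralytics.nn.modules import PatchEmbed, SwinStage, PatchMerging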

Then locate the parse_model function in tasks.py and add the following branch to its module-dispatch elif chain:

        elif m in (PatchMerging, PatchEmbed, SwinStage):
            c1, c2 = ch[f], args[0]
            if c2 != nc:
                c2 = make_divisible(min(c2, max_channels) * width, 8)
            args = [c1, c2, *args[1:]]
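This branch mirrors how other backbone modules are handled in parse_model: ch[f] is the channel count flowing in from the previous layer, args[0] is the output width requested in the yaml, and make_divisible(min(c2, max_channels) * width, 8) applies the chosen scale's width multiple before the arguments are passed on to PatchEmbed, SwinStage, or PatchMerging.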

2.6 Running the Program

Key step 5: create a train.py in the ultralytics directory and set the model argument to the path of yolo11_Swin-Transformer.yaml.

from ultralytics import YOLO
import warnings
warnings.filterwarnings('ignore')
from pathlib import Path

if __name__ == '__main__':
    # Load the model from the modified yaml
    model = YOLO("ultralytics/cfg/models/11/yolo11_Swin-Transformer.yaml")  # path of the model yaml you want to use
    # Train the model
    results = model.train(data=r"path/to/your/dataset.yaml",  # path of your dataset yaml
                          epochs=100, batch=16, imgsz=640, workers=4,
                          name=Path(model.cfg).stem)

🚀 Run the program. If you see the output below, the modules were added successfully. 🚀

                   from  n    params  module                                       arguments
  0                  -1  1      1176  ultralytics.nn.modules.block.PatchEmbed      [3, 24, 4]
  1                  -1  1     15462  ultralytics.nn.modules.block.SwinStage       [24, 24, 2, 3, 7]
  2                  -1  1      4800  ultralytics.nn.modules.block.PatchMerging    [24, 48]
  3                  -1  1     58572  ultralytics.nn.modules.block.SwinStage       [48, 48, 2, 6, 7]
  4                  -1  1     18816  ultralytics.nn.modules.block.PatchMerging    [48, 96]
  5                  -1  1    683208  ultralytics.nn.modules.block.SwinStage       [96, 96, 6, 12, 7]
  6                  -1  1     74496  ultralytics.nn.modules.block.PatchMerging    [96, 192]
  7                  -1  1    897840  ultralytics.nn.modules.block.SwinStage       [192, 192, 2, 24, 7]
  8                  -1  1     92736  ultralytics.nn.modules.block.SPPF            [192, 192, 5]
  9                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 10             [-1, 5]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 11                  -1  1     99008  ultralytics.nn.modules.block.C3k2            [288, 128, 1, False]
 12                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 13             [-1, 3]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 14                  -1  1     26976  ultralytics.nn.modules.block.C3k2            [176, 64, 1, False]
 15                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]
 16            [-1, 11]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 17                  -1  1     86720  ultralytics.nn.modules.block.C3k2            [192, 128, 1, False]
 18                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]
 19             [-1, 8]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 20                  -1  1    362496  ultralytics.nn.modules.block.C3k2            [320, 256, 1, True]
 21        [14, 17, 20]  1    464912  ultralytics.nn.modules.head.Detect           [80, [64, 128, 256]]
YOLO11_swin-Transformer summary: 390 layers, 3,071,922 parameters, 3,071,906 gradients, 29.1 GFLOPs

3. Network Structure Diagram after Modification

4. Full Code

The full code will be shared later; for now, just follow the steps above.

5. GFLOPs

For how GFLOPs are calculated, see 100 Questions for Algorithm Engineers | Convolution Basics: Convolution.

GFLOPs of the unmodified YOLO11n:

GFLOPs after the modification:
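If you prefer to reproduce the two numbers yourself rather than read them off screenshots, the following sketch builds both configurations and prints their summaries (paths assume the default repository layout; model.info() reports layers, parameters, and GFLOPs):

from ultralytics import YOLO

baseline = YOLO("ultralytics/cfg/models/11/yolo11.yaml")                   # stock YOLO11
improved = YOLO("ultralytics/cfg/models/11/yolo11_Swin-Transformer.yaml")  # Swin-Transformer backbone
baseline.info()
improved.info()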

6. Going Further

This backbone change can be combined with other attention mechanisms, loss functions, and similar improvements to further boost detection performance.

7. Summary

With the improvements above, we have successfully boosted the model's performance. This is only a beginning; there is still plenty of room for further optimization and deeper technical exploration. Here I would like to warmly recommend my column <Column link: YOLO11 Basics + Improvements for Higher Accuracy (click to jump), subscriptions welcome>. The column focuses on cutting-edge deep learning techniques, especially the latest progress in object detection. It not only contains in-depth analyses of YOLO11 and improvement strategies, but is also regularly updated with paper reproductions and hands-on write-ups from major conferences such as CVPR and NeurIPS.

Why subscribe to my column? (Column link: YOLO11 Basics + Improvements for Higher Accuracy, click to jump; subscriptions welcome)

  1. Cutting-edge technology explained: the column is not limited to improvements of the YOLO series; it also covers the latest research on mainstream and emerging networks, helping you keep up with the state of the art.

  2. Detailed hands-on guides: all content is highly practical. Every update comes with code and concrete modification steps, so every reader can get started quickly.

  3. Q&A and interaction: after subscribing, you can ask me questions at any time and receive timely answers.

  4. Real-time updates: posts on the latest research directions and reproduction reports from top conferences worldwide keep you at the technical frontier.

Who this column is for:

  • Readers with a strong interest in object detection and the YOLO family of networks

  • Readers who want to use YOLO algorithms in their papers

  • Anyone interested in YOLO algorithms
