YOLOv11 Improvements | Backbone | Integrating MobileNetV4 into YOLOv11

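1. Introduction to MobileNetV4

1.1 Abstract: We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced, which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA, and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, and specialized accelerators such as the Apple Neural Engine and Google Pixel EdgeTPU, a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8 ms.

Official paper: https://arxiv.org/pdf/2404.10518

Official code: code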

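1.2 Brief overview:

MobileNetV4 (MNv4) is the newest generation of MobileNet models for mobile devices, built around a universally efficient architecture design. Its core components are the Universal Inverted Bottleneck (UIB) search block, the Mobile MQA attention block tailored for mobile accelerators, and an optimized neural architecture search (NAS) recipe. Together, these components make MNv4 models mostly Pareto optimal across a range of hardware platforms.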

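The UIB block is the heart of MNv4. It is a unified, flexible structure that merges the Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a new Extra Depthwise (ExtraDW) variant. This design simplifies model construction and improves compatibility and efficiency across different hardware.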
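To make the "unified" part concrete: a UIB has two optional depthwise convolutions, one before the expansion and one after it, and which of them are present decides the variant. The sketch below maps the two kernel sizes to a variant name, following the description in the MobileNetV4 paper. The function name `uib_variant` is made up for illustration; a kernel size of 0 means "no depthwise conv here", which is the convention the block specs in section 2 use.

```python
def uib_variant(start_dw_kernel_size: int, middle_dw_kernel_size: int) -> str:
    """Name the UIB instantiation selected by the two optional depthwise convs.

    A kernel size of 0 means the corresponding depthwise conv is absent.
    """
    has_start = start_dw_kernel_size > 0
    has_middle = middle_dw_kernel_size > 0
    if has_start and has_middle:
        return "ExtraDW"   # depthwise before and after the expansion
    if has_middle:
        return "IB"        # classic inverted bottleneck: depthwise on expanded features
    if has_start:
        return "ConvNext"  # depthwise before the expansion
    return "FFN"           # two pointwise convs only


# (start_dw_kernel_size, middle_dw_kernel_size) pairs taken from spec rows in section 2.
for kernels in [(5, 5), (0, 3), (3, 0), (0, 0)]:
    print(kernels, uib_variant(*kernels))
```

Reading the spec tables with this mapping in mind shows how the searched networks mix all four variants within a single stage.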

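MNv4 also includes an optimized NAS recipe that markedly improves search effectiveness and makes it practical to create larger models. The recipe introduces an offline distillation dataset to reduce noise in the NAS reward measurements, which in turn improves model quality.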

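Mobile MQA is the other key component of MNv4: an attention block tailored for mobile accelerators that delivers over 39% inference speedup. MQA simplifies multi-head self-attention (MHSA) by sharing a single set of keys and values across all heads. This reduces memory access, raises operational intensity, and performs well on mobile devices.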
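A rough way to see what sharing keys and values buys is to count elements. The sketch below compares standard multi-head attention with multi-query attention for one block, using `inp=160`, `num_heads=4`, `key_dim=64` (the values passed to `mhsa(4, 64, 64, 24)` in the hybrid spec of section 2). It assumes bias-free 1x1 projections, `value_dim == key_dim`, and a 24 x 24 feature map with no key/value downsampling; both helper names are hypothetical, and element counts are only a proxy for memory traffic, not a latency measurement.

```python
def projection_weights(inp: int, num_heads: int, key_dim: int, shared_kv: bool) -> dict:
    """Count the weights of the four 1x1 projections (no biases) of one attention block."""
    kv_heads = 1 if shared_kv else num_heads
    return {
        "query": inp * num_heads * key_dim,
        "key": inp * kv_heads * key_dim,
        "value": inp * kv_heads * key_dim,
        "output": num_heads * key_dim * inp,
    }


def kv_elements(tokens: int, num_heads: int, key_dim: int, shared_kv: bool) -> int:
    """Count the key and value activations that attention has to read."""
    kv_heads = 1 if shared_kv else num_heads
    return 2 * tokens * kv_heads * key_dim


mhsa_weights = sum(projection_weights(160, 4, 64, shared_kv=False).values())
mqa_weights = sum(projection_weights(160, 4, 64, shared_kv=True).values())
print("projection weights:", mhsa_weights, "vs", mqa_weights)

tokens = 24 * 24
print("K/V activations:", kv_elements(tokens, 4, 64, False), "vs", kv_elements(tokens, 4, 64, True))
```

The key/value tensors shrink by a factor of `num_heads`, while the query and output projections are unchanged; that is the part of the computation MQA targets.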

1.3 MobileNetV4 block structure diagram

2. Core code

from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F

__all__ = ['MobileNetV4ConvLarge', 'MobileNetV4ConvSmall', 'MobileNetV4ConvMedium', 'MobileNetV4HybridMedium', 'MobileNetV4HybridLarge']

# Row layout of "block_specs":
#   convbn:   [inp, oup, kernel_size, stride]
#   fused_ib: [inp, oup, stride, expand_ratio, act]
#   uib:      [inp, oup, start_dw_kernel_size, middle_dw_kernel_size, middle_dw_downsample, stride, expand_ratio(, mhsa)]
MNV4ConvSmall_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [32, 32, 3, 2],
            [32, 32, 1, 1]
        ]
    },
    "layer2": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [32, 96, 3, 2],
            [96, 64, 1, 1]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 6,
        "block_specs": [
            [64, 96, 5, 5, True, 2, 3],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 3, 0, True, 1, 4],
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 6,
        "block_specs": [
            [96, 128, 3, 3, True, 2, 6],
            [128, 128, 5, 5, True, 1, 4],
            [128, 128, 0, 5, True, 1, 4],
            [128, 128, 0, 5, True, 1, 3],
            [128, 128, 0, 3, True, 1, 4],
            [128, 128, 0, 3, True, 1, 4],
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [128, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MNV4ConvMedium_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [32, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 80, 3, 5, True, 2, 4],
            [80, 80, 3, 3, True, 1, 2]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 8,
        "block_specs": [
            [80, 160, 3, 5, True, 2, 6],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 5, True, 1, 4],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 0, True, 1, 4],
            [160, 160, 0, 0, True, 1, 2],
            [160, 160, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [160, 256, 5, 5, True, 2, 6],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 3, 0, True, 1, 4],
            [256, 256, 3, 5, True, 1, 2],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 5, 0, True, 1, 2]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [256, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MNV4ConvLarge_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 24, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [24, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 96, 3, 5, True, 2, 4],
            [96, 96, 3, 3, True, 1, 4]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [96, 192, 3, 5, True, 2, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 5, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 13,
        "block_specs": [
            [192, 512, 5, 5, True, 2, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [512, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}


def mhsa(num_heads, key_dim, value_dim, px):
    if px == 24:
        kv_strides = 2
    elif px == 12:
        kv_strides = 1
    query_h_strides = 1
    query_w_strides = 1
    use_layer_scale = True
    use_multi_query = True
    use_residual = True
    return [
        num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
        use_layer_scale, use_multi_query, use_residual
    ]


MNV4HybridConvMedium_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [32, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 80, 3, 5, True, 2, 4],
            [80, 80, 3, 3, True, 1, 2]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 8,
        "block_specs": [
            [80, 160, 3, 5, True, 2, 6],
            [160, 160, 0, 0, True, 1, 2],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 5, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 0, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 12,
        "block_specs": [
            [160, 256, 5, 5, True, 2, 6],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 2],
            [256, 256, 3, 5, True, 1, 2],
            [256, 256, 0, 0, True, 1, 2],
            [256, 256, 0, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 3, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 5, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [256, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MNV4HybridConvLarge_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 24, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [24, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 96, 3, 5, True, 2, 4],
            [96, 96, 3, 3, True, 1, 4]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [96, 192, 3, 5, True, 2, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 5, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 14,
        "block_specs": [
            [192, 512, 5, 5, True, 2, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [512, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MODEL_SPECS = {
    "MobileNetV4ConvSmall": MNV4ConvSmall_BLOCK_SPECS,
    "MobileNetV4ConvMedium": MNV4ConvMedium_BLOCK_SPECS,
    "MobileNetV4ConvLarge": MNV4ConvLarge_BLOCK_SPECS,
    "MobileNetV4HybridMedium": MNV4HybridConvMedium_BLOCK_SPECS,
    "MobileNetV4HybridLarge": MNV4HybridConvLarge_BLOCK_SPECS
}


def make_divisible(
        value: float,
        divisor: int,
        min_value: Optional[float] = None,
        round_down_protect: bool = True,
) -> int:
    """
    This function is copied from here
    "https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_layers.py"

    This is to ensure that all layers have channels that are divisible by 8.

    Args:
        value: A `float` of original value.
        divisor: An `int` of the divisor that need to be checked upon.
        min_value: A `float` of minimum value threshold.
        round_down_protect: A `bool` indicating whether round down more than 10%
            will be allowed.

    Returns:
        The adjusted value in `int` that is divisible against divisor.
    """
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if round_down_protect and new_value < 0.9 * value:
        new_value += divisor
    return int(new_value)


def conv_2d(inp, oup, kernel_size=3, stride=1, groups=1, bias=False, norm=True, act=True):
    conv = nn.Sequential()
    padding = (kernel_size - 1) // 2
    conv.add_module('conv', nn.Conv2d(inp, oup, kernel_size, stride, padding, bias=bias, groups=groups))
    if norm:
        conv.add_module('BatchNorm2d', nn.BatchNorm2d(oup))
    if act:
        conv.add_module('Activation', nn.ReLU6())
    return conv


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio, act=False, squeeze_excitation=False):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]
        hidden_dim = int(round(inp * expand_ratio))
        self.block = nn.Sequential()
        if expand_ratio != 1:
            self.block.add_module('exp_1x1', conv_2d(inp, hidden_dim, kernel_size=3, stride=stride))
        if squeeze_excitation:
            self.block.add_module('conv_3x3',
                                  conv_2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, groups=hidden_dim))
        self.block.add_module('red_1x1', conv_2d(hidden_dim, oup, kernel_size=1, stride=1, act=act))
        self.use_res_connect = self.stride == 1 and inp == oup

    def forward(self, x):
        if self.use_res_connect:
            return x + self.block(x)
        else:
            return self.block(x)


class UniversalInvertedBottleneckBlock(nn.Module):
    def __init__(self,
                 inp,
                 oup,
                 start_dw_kernel_size,
                 middle_dw_kernel_size,
                 middle_dw_downsample,
                 stride,
                 expand_ratio
                 ):
        """
        An inverted bottleneck block with optional depthwises.
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
        """
        super().__init__()
        # Starting depthwise conv.
        self.start_dw_kernel_size = start_dw_kernel_size
        if self.start_dw_kernel_size:
            stride_ = stride if not middle_dw_downsample else 1
            self._start_dw_ = conv_2d(inp, inp, kernel_size=start_dw_kernel_size, stride=stride_, groups=inp, act=False)
        # Expansion with 1x1 convs.
        expand_filters = make_divisible(inp * expand_ratio, 8)
        self._expand_conv = conv_2d(inp, expand_filters, kernel_size=1)
        # Middle depthwise conv.
        self.middle_dw_kernel_size = middle_dw_kernel_size
        if self.middle_dw_kernel_size:
            stride_ = stride if middle_dw_downsample else 1
            self._middle_dw = conv_2d(expand_filters, expand_filters, kernel_size=middle_dw_kernel_size, stride=stride_,
                                      groups=expand_filters)
        # Projection with 1x1 convs.
        self._proj_conv = conv_2d(expand_filters, oup, kernel_size=1, stride=1, act=False)

        # Ending depthwise conv.
        # this not used
        # _end_dw_kernel_size = 0
        # self._end_dw = conv_2d(oup, oup, kernel_size=_end_dw_kernel_size, stride=stride, groups=inp, act=False)

    def forward(self, x):
        if self.start_dw_kernel_size:
            x = self._start_dw_(x)
            # print("_start_dw_", x.shape)
        x = self._expand_conv(x)
        # print("_expand_conv", x.shape)
        if self.middle_dw_kernel_size:
            x = self._middle_dw(x)
            # print("_middle_dw", x.shape)
        x = self._proj_conv(x)
        # print("_proj_conv", x.shape)
        return x


class MultiQueryAttentionLayerWithDownSampling(nn.Module):
    def __init__(self, inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
                 dw_kernel_size=3, dropout=0.0):
        """
        Multi Query Attention with spatial downsampling.
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py

        3 parameters are introduced for the spatial downsampling:
        1. kv_strides: downsampling factor on Key and Values only.
        2. query_h_strides: vertical strides on Query only.
        3. query_w_strides: horizontal strides on Query only.

        This is an optimized version.
        1. Projections in Attention are explicitly written out as 1x1 Conv2D.
        2. Additional reshapes are introduced to bring up to a 3x speed up.
        """
        super().__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.value_dim = value_dim
        self.query_h_strides = query_h_strides
        self.query_w_strides = query_w_strides
        self.kv_strides = kv_strides
        self.dw_kernel_size = dw_kernel_size
        self.head_dim = key_dim // num_heads

        if self.query_h_strides > 1 or self.query_w_strides > 1:
            self._query_downsampling_norm = nn.BatchNorm2d(inp)
        self._query_proj = conv_2d(inp, num_heads * key_dim, 1, 1, norm=False, act=False)

        if self.kv_strides > 1:
            self._key_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
            self._value_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
        self._key_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
        self._value_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)

        self._output_proj = conv_2d(num_heads * key_dim, inp, 1, 1, norm=False, act=False)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        batch_size = x.size(0)  # x is [batch_size, channels, height, width]
        if self.query_h_strides > 1 or self.query_w_strides > 1:
            # Fixed: the original call omitted the input tensor and used misspelled attribute names.
            q = F.avg_pool2d(x, kernel_size=(self.query_h_strides, self.query_w_strides))
            q = self._query_downsampling_norm(q)
            q = self._query_proj(q)
        else:
            q = self._query_proj(x)
        px = q.size(2)
        q = q.view(batch_size, self.num_heads, -1, self.key_dim)  # [batch_size, num_heads, seq_length, key_dim]

        if self.kv_strides > 1:
            k = self._key_dw_conv(x)
            k = self._key_proj(k)
            v = self._value_dw_conv(x)
            v = self._value_proj(v)
        else:
            k = self._key_proj(x)
            v = self._value_proj(x)
        # Fixed: keep a singleton head axis so the shared K/V broadcast over heads.
        # The original 3-D views only broadcast correctly when batch_size == 1.
        k = k.view(batch_size, 1, self.key_dim, -1)  # [batch_size, 1, key_dim, seq_length]
        v = v.view(batch_size, 1, -1, self.key_dim)  # [batch_size, 1, seq_length, key_dim]

        # calculate attn score
        attn_score = torch.matmul(q, k) / (self.head_dim ** 0.5)
        attn_score = self.dropout(attn_score)
        attn_score = F.softmax(attn_score, dim=-1)

        context = torch.matmul(attn_score, v)
        context = context.view(batch_size, self.num_heads * self.key_dim, px, px)
        output = self._output_proj(context)
        return output


class MNV4LayerScale(nn.Module):
    def __init__(self, init_value):
        """
        LayerScale as introduced in CaiT: https://arxiv.org/abs/2103.17239
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py

        As used in MobileNetV4.

        Attributes:
            init_value (float): value to initialize the diagonal matrix of LayerScale.
        """
        super().__init__()
        self.init_value = init_value

    def forward(self, x):
        gamma = self.init_value * torch.ones(x.size(-1), dtype=x.dtype, device=x.device)
        return x * gamma


class MultiHeadSelfAttentionBlock(nn.Module):
    def __init__(
            self,
            inp,
            num_heads,
            key_dim,
            value_dim,
            query_h_strides,
            query_w_strides,
            kv_strides,
            use_layer_scale,
            use_multi_query,
            use_residual=True
    ):
        super().__init__()
        self.query_h_strides = query_h_strides
        self.query_w_strides = query_w_strides
        self.kv_strides = kv_strides
        self.use_layer_scale = use_layer_scale
        self.use_multi_query = use_multi_query
        self.use_residual = use_residual

        self._input_norm = nn.BatchNorm2d(inp)
        if self.use_multi_query:
            self.multi_query_attention = MultiQueryAttentionLayerWithDownSampling(
                inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides
            )
        else:
            self.multi_head_attention = nn.MultiheadAttention(inp, num_heads, kdim=key_dim)

        if self.use_layer_scale:
            self.layer_scale_init_value = 1e-5
            self.layer_scale = MNV4LayerScale(self.layer_scale_init_value)

    def forward(self, x):
        # Not using CPE, skipped
        # input norm
        shortcut = x
        x = self._input_norm(x)
        # multi query
        if self.use_multi_query:
            x = self.multi_query_attention(x)
        else:
            x = self.multi_head_attention(x, x)
        # layer scale
        if self.use_layer_scale:
            x = self.layer_scale(x)
        # use residual
        if self.use_residual:
            x = x + shortcut
        return x


def build_blocks(layer_spec):
    if not layer_spec.get('block_name'):
        return nn.Sequential()
    block_names = layer_spec['block_name']
    layers = nn.Sequential()
    if block_names == "convbn":
        schema_ = ['inp', 'oup', 'kernel_size', 'stride']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            layers.add_module(f"convbn_{i}", conv_2d(**args))
    elif block_names == "uib":
        schema_ = ['inp', 'oup', 'start_dw_kernel_size', 'middle_dw_kernel_size', 'middle_dw_downsample', 'stride',
                   'expand_ratio', 'msha']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            # Rows without an attention entry have only 7 fields, so 'msha' is absent.
            msha = args.pop("msha") if "msha" in args else 0
            layers.add_module(f"uib_{i}", UniversalInvertedBottleneckBlock(**args))
            if msha:
                msha_schema_ = [
                    "inp", "num_heads", "key_dim", "value_dim", "query_h_strides", "query_w_strides", "kv_strides",
                    "use_layer_scale", "use_multi_query", "use_residual"
                ]
                args = dict(zip(msha_schema_, [args['oup']] + (msha)))
                layers.add_module(f"msha_{i}", MultiHeadSelfAttentionBlock(**args))
    elif block_names == "fused_ib":
        schema_ = ['inp', 'oup', 'stride', 'expand_ratio', 'act']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            layers.add_module(f"fused_ib_{i}", InvertedResidual(**args))
    else:
        raise NotImplementedError
    return layers


class MobileNetV4(nn.Module):
    def __init__(self, model):
        # MobileNetV4ConvSmall  MobileNetV4ConvMedium  MobileNetV4ConvLarge
        # MobileNetV4HybridMedium  MobileNetV4HybridLarge
        """Params to initiate MobileNetV4
        Args:
            model : support 5 types of models as indicated in
            "https://github.com/tensorflow/models/blob/master/official/vision/modeling/backbones/mobilenet.py"
        """
        super().__init__()
        assert model in MODEL_SPECS.keys()
        self.model = model
        self.spec = MODEL_SPECS[self.model]

        # conv0
        self.conv0 = build_blocks(self.spec['conv0'])
        # layer1
        self.layer1 = build_blocks(self.spec['layer1'])
        # layer2
        self.layer2 = build_blocks(self.spec['layer2'])
        # layer3
        self.layer3 = build_blocks(self.spec['layer3'])
        # layer4
        self.layer4 = build_blocks(self.spec['layer4'])
        # layer5 (classification head; built but not used in forward)
        self.layer5 = build_blocks(self.spec['layer5'])
        # Channel count of each returned feature map, measured with a dummy forward pass.
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def forward(self, x):
        x0 = self.conv0(x)
        x1 = self.layer1(x0)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        x4 = self.layer4(x3)
        # x5 = self.layer5(x4)
        # x5 = nn.functional.adaptive_avg_pool2d(x5, 1)
        return [x1, x2, x3, x4]


def MobileNetV4ConvSmall():
    model = MobileNetV4('MobileNetV4ConvSmall')
    return model


def MobileNetV4ConvMedium():
    model = MobileNetV4('MobileNetV4ConvMedium')
    return model


def MobileNetV4ConvLarge():
    model = MobileNetV4('MobileNetV4ConvLarge')
    return model


def MobileNetV4HybridMedium():
    model = MobileNetV4('MobileNetV4HybridMedium')
    return model


def MobileNetV4HybridLarge():
    model = MobileNetV4('MobileNetV4HybridLarge')
    return model


if __name__ == "__main__":
    # Generating Sample image
    image_size = (1, 3, 640, 640)
    image = torch.rand(*image_size)

    # Model
    model = MobileNetV4HybridLarge()

    out = model(image)
    for i in range(len(out)):
        print(out[i].shape)
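The backbone returns four feature maps (`x1` to `x4`), and `width_list` records their channel counts by pushing a dummy tensor through the network. The same numbers can be read off the block specs by hand. The sketch below walks a condensed copy of `MNV4ConvSmall_BLOCK_SPECS`, keeping only the output channels and stride of each block, and accumulates the stride per stage. The condensed dictionary and the helper name `stage_summary` are illustrative and not part of the code above.

```python
# (out_channels, stride) of every block, condensed from MNV4ConvSmall_BLOCK_SPECS.
CONV_SMALL = {
    "conv0": [(32, 2)],
    "layer1": [(32, 2), (32, 1)],
    "layer2": [(96, 2), (64, 1)],
    "layer3": [(96, 2)] + [(96, 1)] * 5,
    "layer4": [(128, 2)] + [(128, 1)] * 5,
}


def stage_summary(spec):
    """Return {stage: (output channels, cumulative stride)} for a condensed spec."""
    total_stride = 1
    summary = {}
    for name, blocks in spec.items():
        for _, stride in blocks:
            total_stride *= stride
        summary[name] = (blocks[-1][0], total_stride)
    return summary


summary = stage_summary(CONV_SMALL)
returned = ["layer1", "layer2", "layer3", "layer4"]  # the stages forward() returns
width_list = [summary[name][0] for name in returned]
strides = [summary[name][1] for name in returned]
print(width_list)  # [32, 64, 96, 128]
print(strides)     # [4, 8, 16, 32]
```

With a 640 x 640 input these strides correspond to 160, 80, 40, and 20 pixel feature maps, so the last three outputs line up with the P3, P4, and P5 levels a YOLO head expects.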

3. Adding MobileNetV4 to YOLOv11

3.1 Create an Extramodule directory under ultralytics/nn

3.2 Create MobileNetV4.py inside Extramodule

Paste the MobileNetV4 code given above into MobileNetV4.py.

Once the code is in place, import it in ultralytics/nn/Extramodule/__init__.py.
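The article does not show the import line itself. One common choice, assumed here, is a wildcard import in `__init__.py`, which re-exports exactly the names listed in the module's `__all__`. The sketch below demonstrates that mechanism on a throwaway package built in a temporary directory; the stub function body is a placeholder and stands in for the real model code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package that mirrors the layout described above.
root = Path(tempfile.mkdtemp())
pkg = root / "Extramodule"
pkg.mkdir()

# Stand-in for MobileNetV4.py: only the exported names matter here, not the model code.
(pkg / "MobileNetV4.py").write_text(
    "__all__ = ['MobileNetV4ConvSmall']\n"
    "def MobileNetV4ConvSmall():\n"
    "    return 'stub backbone'\n"
)

# The line that would go into Extramodule/__init__.py (an assumption, not shown in the article).
(pkg / "__init__.py").write_text("from .MobileNetV4 import *\n")

sys.path.insert(0, str(root))
extramodule = importlib.import_module("Extramodule")
print(extramodule.MobileNetV4ConvSmall())
```

Under the same assumption, the import in tasks.py (step 3.3) could be written as `from ultralytics.nn.Extramodule import *`, which makes the five factory functions visible to `parse_model` by name.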

3.3 Import it in tasks.py

Import Extramodule in ultralytics/nn/tasks.py.

(1) Find parse_model in tasks.py (Ctrl+F for parse_model jumps straight to it).

(2) Add the following branch to the module-type elif chain:

        elif m in {MobileNetV4ConvLarge, MobileNetV4ConvSmall, MobileNetV4ConvMedium, MobileNetV4HybridMedium, MobileNetV4HybridLarge}:
            m = m(*args)
            c2 = m.width_list
            backbone = True

(3) Replace everything from elif m is AIFI: to the end of parse_model with the following code:

        elif m is AIFI:
            args = [ch[f], *args]
        elif m in {HGStem, HGBlock}:
            c1, cm, c2 = ch[f], args[0], args[1]
            args = [c1, cm, c2, *args[2:]]
            if m is HGBlock:
                args.insert(4, n)  # number of repeats
                n = 1
        elif m is ResNetLayer:
            c2 = args[1] if args[3] else args[1] * 4
        elif m is nn.BatchNorm2d:
            args = [ch[f]]
        elif m is Concat:
            c2 = sum(ch[x] for x in f)
        elif m in {Detect, WorldDetect, Segment, Pose, OBB, ImagePoolingAttn, v10Detect}:
            args.append([ch[x] for x in f])
            if m is Segment:
                args[2] = make_divisible(min(args[2], max_channels) * width, 8)
        elif m is RTDETRDecoder:  # special case, channels arg must be passed in index 1
            args.insert(1, [ch[x] for x in f])
        elif m is CBLinear:
            c2 = args[0]
            c1 = ch[f]
            args = [c1, c2, *args[1:]]
        elif m is CBFuse:
            c2 = ch[f[-1]]
        else:
            c2 = ch[f]

        if isinstance(c2, list):
            m_ = m
            m_.backbone = True
        else:
            m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
        t = str(m)[8:-2].replace('__main__.', '')  # module type
        m.np = sum(x.numel() for x in m_.parameters())  # number params
        m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
        if verbose:
            LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
        save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        layers.append(m_)
        if i == 0:
            ch = []
        if isinstance(c2, list):
            ch.extend(c2)
            if len(c2) != 5:
                ch.insert(0, 0)
        else:
            ch.append(c2)
    return nn.Sequential(*layers), sorted(save)
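The channel bookkeeping at the end of that loop is the part that makes a multi-output backbone fit into a YAML that addresses layers by index. The sketch below isolates it as a plain function: a list-valued `c2` (the backbone's `width_list`) expands into several entries of `ch`, padded to five so the backbone always occupies indices 0 to 4, which is also why later layers are numbered `i + 4`. The helper names are hypothetical, and the 256 for the second layer is just an example channel count.

```python
def update_channels(ch, c2, layer_index):
    """Mimic the tail of the modified parse_model loop for one layer."""
    if layer_index == 0:
        ch = []              # drop the initial input-channel entry
    if isinstance(c2, list):
        ch.extend(c2)        # one entry per feature map returned by the backbone
        if len(c2) != 5:
            ch.insert(0, 0)  # pad so the backbone fills indices 0-4
    else:
        ch.append(c2)
    return ch


def model_index(i, backbone):
    """Index attached to a layer once a multi-output backbone has been seen."""
    return i + 4 if backbone else i


ch = [3]  # parse_model starts from the number of input channels
ch = update_channels(ch, [32, 64, 96, 128], layer_index=0)  # MobileNetV4ConvSmall
ch = update_channels(ch, 256, layer_index=1)                # the layer after the backbone
print(ch)                             # [0, 32, 64, 96, 128, 256]
print(model_index(1, backbone=True))  # 5
```

This is why the YAML in section 4 can write `[[-1, 3], 1, Concat, [1]]`: index 3 is the third entry of the padded backbone list, not the fourth line of the `backbone:` section.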

(4) This change is outside parse_model but still in tasks.py, near the top of the file:

    # Backbone modification
    def _predict_once(self, x, profile=False, visualize=False, embed=None):
        """
        Perform a forward pass through the network.

        Args:
            x (torch.Tensor): The input tensor to the model.
            profile (bool):  Print the computation time of each layer if True, defaults to False.
            visualize (bool): Save the feature maps of the model if True, defaults to False.
            embed (list, optional): A list of feature vectors/embeddings to return.

        Returns:
            (torch.Tensor): The last output of the model.
        """
        y, dt, embeddings = [], [], []  # outputs
        for m in self.model:
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            if hasattr(m, 'backbone'):
                x = m(x)
                if len(x) != 5:  # 0 - 5
                    x.insert(0, None)
                for index, i in enumerate(x):
                    if index in self.save:
                        y.append(i)
                    else:
                        y.append(None)
                x = x[-1]  # pass the last output on to the next layer
            else:
                x = m(x)  # run
                y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
            if embed and m.i in embed:
                embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
                if m.i == max(embed):
                    return torch.unbind(torch.cat(embeddings, 1), dim=0)
        return x
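The `hasattr(m, 'backbone')` branch is easier to follow with the tensors replaced by labels. The sketch below reproduces just that branch as a standalone function; the strings stand in for feature maps, the `save` set is an example, and the function name is hypothetical.

```python
def collect_backbone_outputs(features, save):
    """Mimic the backbone branch of _predict_once with placeholder features."""
    features = list(features)
    if len(features) != 5:
        features.insert(0, None)  # pad so positions line up with layer indices 0-4
    # Keep only the outputs that a later layer refers to by index.
    y = [feat if index in save else None for index, feat in enumerate(features)]
    x = features[-1]              # the deepest map is passed on to the next layer
    return x, y


# Four backbone outputs; the head concatenates with indices 2 and 3.
x, y = collect_backbone_outputs(["P2", "P3", "P4", "P5"], save={2, 3})
print(x)  # P5
print(y)  # [None, None, 'P3', 'P4', None]
```

The deepest map does not need to be saved here because the layer that follows the backbone takes its input from `-1`, the running value of `x`.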

(5) In ultralytics/models/yolo/detect/train.py, find

That completes all the modifications.

4. Create a new yolo11MobileNetV4.yaml file

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, MobileNetV4ConvSmall, []]  # 4
  - [-1, 1, SPPF, [1024, 5]]  # 5
  - [-1, 2, C2PSA, [1024]] # 6

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 7
  - [[-1, 3], 1, Concat, [1]]  # 8 cat backbone P4
  - [-1, 3, C3k2, [512, False]]  # 9

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 10
  - [[-1, 2], 1, Concat, [1]]  # 11 cat backbone P3
  - [-1, 3, C3k2, [256, False]]  # 12 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]] # 13
  - [[-1, 9], 1, Concat, [1]]  # 14 cat head P4
  - [-1, 3, C3k2, [512, False]]  # 15 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]] # 16
  - [[-1, 6], 1, Concat, [1]]  # 17 cat head P5
  - [-1, 3, C3k2, [1024, False]]  # 18 (P5/32-large)

  - [[12, 15, 18], 1, Detect, [nc]]  # Detect(P3, P4, P5)

Set nc to the number of classes in your own dataset.

5. Model training
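As a sanity check on the layer indices in this YAML, the channel count at each Concat can be worked out by hand. The sketch below does so under two assumptions: the `n` scale is in effect (width 0.25, max_channels 1024), and the backbone channels are `[32, 64, 96, 128]` as derived from the ConvSmall block specs. The helper `scaled` is a hypothetical stand-in for the width scaling and round-up-to-8 that `parse_model` applies to most layers; Upsample and Concat are handled directly because they do not go through that scaling.

```python
import math


def scaled(c2, width=0.25, max_channels=1024):
    """Width-scale an output channel count and round up to a multiple of 8."""
    return math.ceil(min(c2, max_channels) * width / 8) * 8


# Indices 0-4 belong to the backbone: a padding slot plus its four feature maps.
ch = [0, 32, 64, 96, 128]
ch.append(scaled(1024))    # 5  SPPF
ch.append(scaled(1024))    # 6  C2PSA
ch.append(ch[-1])          # 7  Upsample keeps the channel count
ch.append(ch[-1] + ch[3])  # 8  Concat with backbone P4
ch.append(scaled(512))     # 9  C3k2
ch.append(ch[-1])          # 10 Upsample
ch.append(ch[-1] + ch[2])  # 11 Concat with backbone P3
ch.append(scaled(256))     # 12 C3k2
ch.append(scaled(256))     # 13 Conv, stride 2
ch.append(ch[-1] + ch[9])  # 14 Concat with head P4
ch.append(scaled(512))     # 15 C3k2
ch.append(scaled(512))     # 16 Conv, stride 2
ch.append(ch[-1] + ch[6])  # 17 Concat with head P5
ch.append(scaled(1024))    # 18 C3k2

for index in (8, 11, 14, 17):
    print(index, ch[index])
```

If a different backbone variant is swapped in, only the first five entries change; the `from` indices 2 and 3 in the head still point at the stride-8 and stride-16 maps.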

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO(r'D:\yolo\yolov11\ultralytics-main\datasets\yolo11MobileNetV4.yaml')
    model.train(
        data=r'D:\yolo\yolov11\ultralytics-main\datasets\data.yaml',
        cache=False,
        imgsz=640,
        epochs=100,
        single_cls=False,  # whether to train as single-class detection
        batch=4,
        close_mosaic=10,
        workers=0,
        device='0',
        optimizer='SGD',
        amp=True,
        project='runs/train',
        name='exp',
    )

The model structure is printed and training runs successfully: