A Detailed Guide to Reproducing BEVFormer

I'm Srlua (小谢), and this is where I share what I know and what I have learned.

Column: 传知代码论文复现 (paper reproduction)

All resources for this article are available at this address.

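Paper Overview

This paper introduces BEVFormer, a new framework that learns unified BEV representations with spatiotemporal Transformers to support multiple autonomous driving perception tasks. BEVFormer exploits both spatial and temporal information by letting predefined, grid-shaped BEV queries interact with the spatial and temporal domains. To aggregate spatial information, the authors design a spatial cross-attention in which each BEV query extracts spatial features from regions of interest across camera views. For temporal information, they propose a temporal self-attention that recurrently fuses historical BEV information. On the nuScenes test set, BEVFormer reaches 56.9% NDS, a new state of the art at the time of publication, which is 9.0 points higher than the previous best method and on par with LiDAR-based baselines. BEVFormer also markedly improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions.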
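Since NDS is the headline metric here, it helps to see how it is put together. The sketch below follows the nuScenes definition, NDS = (5 * mAP + sum of (1 - min(1, error)) over the five true-positive errors) / 10. The input numbers are made up for illustration and are not results from the paper.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err))) / 10 over the five TP errors."""
    if len(tp_errors) != 5:
        raise ValueError("expected five TP errors: mATE, mASE, mAOE, mAVE, mAAE")
    # Each error is clipped at 1 so that a very bad error cannot push the score below 0.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Illustrative values only (assumed for this example, not taken from the paper).
example_errors = {"mATE": 0.70, "mASE": 0.30, "mAOE": 0.40, "mAVE": 0.40, "mAAE": 0.20}
example_nds = nuscenes_detection_score(0.40, example_errors)
print(f"NDS = {example_nds:.3f}")
```

Because mAP carries half of the total weight, a model can gain NDS either by detecting more objects or by estimating their attributes (including velocity, via mAVE) more accurately.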

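Method

Method Description

The paper proposes BEVFormer, a new framework for generating BEV features that uses attention to aggregate spatiotemporal features from multi-view cameras together with historical BEV features. BEVFormer consists mainly of six encoder layers, which contain three custom designs: BEV queries, spatial cross-attention, and temporal self-attention. The BEV queries are a set of grid-shaped learnable parameters used to query features in BEV space from the multi-camera views. Spatial cross-attention and temporal self-attention then use these queries to look up and aggregate spatial features from the multi-camera images and temporal features from the historical BEV. For downstream tasks, BEVFormer adds an end-to-end 3D detection head based on deformable attention and a map segmentation head based on the 2D segmentation method Panoptic SegFormer.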
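To make "querying camera views from a BEV grid cell" more concrete, here is a minimal geometric sketch. It assumes a toy setup that is not from the paper or the repository: six identical pinhole cameras placed at the ego origin, 60 degrees apart, with a hypothetical focal length and image size. Each BEV location is lifted to a few heights (a pillar), projected into every camera, and only the cameras where a point lands inside the image are kept as candidates for feature sampling.

```python
import math


def project_to_camera(point, cam_yaw, focal=800.0, width=1600, height=900):
    """Project an ego-frame 3D point into a toy pinhole camera at the ego origin.

    Returns pixel (u, v), or None if the point is behind the camera or off-image.
    """
    x, y, z = point
    # Rotate the point into the camera frame; the camera looks along its own +x axis.
    cos_y, sin_y = math.cos(cam_yaw), math.sin(cam_yaw)
    depth = cos_y * x + sin_y * y
    left = -sin_y * x + cos_y * y
    if depth <= 1e-6:
        return None
    u = width / 2 - focal * left / depth
    v = height / 2 - focal * z / depth
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def hit_views(bev_xy, cam_yaws, pillar_heights=(-1.0, 0.0, 1.0, 2.0)):
    """Return {camera name: projected pixels} for cameras that see the pillar."""
    hits = {}
    for name, yaw in cam_yaws.items():
        pixels = [project_to_camera((bev_xy[0], bev_xy[1], z), yaw) for z in pillar_heights]
        pixels = [p for p in pixels if p is not None]
        if pixels:
            hits[name] = pixels
    return hits


# Hypothetical camera rig: yaw angles in radians, measured counter-clockwise from "front".
CAMERAS = {
    "front": 0.0,
    "front_left": math.radians(60),
    "back_left": math.radians(120),
    "back": math.pi,
    "back_right": math.radians(-120),
    "front_right": math.radians(-60),
}

print(sorted(hit_views((10.0, 8.0), CAMERAS)))
```

In this toy rig a point ahead and to the left of the ego vehicle falls into two overlapping views, so a BEV query at that location would only need to sample image features from those two cameras rather than from all six.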

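Method Improvements

BEVFormer relies on two attention mechanisms, deformable attention (used for the spatial cross-attention) and temporal self-attention, to aggregate spatiotemporal features from multi-view cameras and historical BEV features without a large increase in computational cost. It also pairs an end-to-end 3D detection head based on the Deformable DETR detector with a map segmentation head based on the 2D segmentation method Panoptic SegFormer, so the same BEV features can serve several autonomous driving perception tasks, including 3D object detection and map segmentation.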
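The recurrent use of history BEV features can be sketched in a few lines. This is a deliberately simplified stand-in: it uses plain dot-product softmax attention over two candidates per cell (the current query and the previous BEV feature at the same cell), not the deformable temporal self-attention of the paper, and all function names are hypothetical.

```python
import math


def fuse_cell(query, prev_bev):
    """Attend over {current query, previous BEV feature} for one BEV cell."""
    candidates = [query] if prev_bev is None else [query, prev_bev]
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, cand)) / scale for cand in candidates]
    # Numerically stable softmax over the candidate scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * cand[i] for w, cand in zip(weights, candidates))
            for i in range(len(query))]


def run_sequence(frames):
    """frames: list of per-frame BEV queries, each a list of cell feature vectors.

    The fused BEV of frame t-1 is fed back in when processing frame t.
    """
    prev = None
    for queries in frames:
        prev = [fuse_cell(q, None if prev is None else prev[i])
                for i, q in enumerate(queries)]
    return prev


# One BEV cell with 2-dimensional features, observed over two frames.
fused = run_sequence([[[1.0, 0.0]], [[0.0, 1.0]]])
print(fused)
```

The point of the sketch is the data flow: only the previous fused BEV is kept, so the cost per frame stays constant no matter how long the sequence is, while information from earlier frames still propagates forward.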

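Problem Addressed

BEVFormer targets 3D perception from multiple cameras: how to aggregate spatiotemporal features from multi-camera views and historical BEV features in order to achieve accurate 3D object detection and map segmentation. Traditional approaches usually handle 3D object detection or map segmentation as separate tasks. By introducing attention, BEVFormer aggregates the multi-view spatiotemporal features and the historical BEV features effectively, which improves the accuracy of multi-camera 3D perception.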

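Experiments

The paper reports results for BEVFormer on two public autonomous driving datasets, nuScenes and Waymo. The experiments cover:

Datasets: The authors use nuScenes and Waymo. nuScenes contains 1,000 scenes of roughly 20 s each, with key samples annotated at 2 Hz. The Waymo Open Dataset is a large-scale autonomous driving dataset with 798 training sequences and 202 validation sequences.
Experimental settings: Two backbones are used, ResNet101-DCN and VoVNet-99, with multi-scale features output by an FPN. On nuScenes, the default BEV query size is 200×200, the perception range is [-51.2 m, 51.2 m] along both the X and Y axes, and the BEV grid resolution s is 0.512 m. On Waymo, the default spatial shape of the BEV queries is 300×220, the perception range is [-35.0 m, 75.0 m] along the X axis and [-75.0 m, 75.0 m] along the Y axis, and each grid cell has a resolution s of 0.5 m.
3D object detection: The model is trained with the detection head and compared with other state-of-the-art 3D object detection methods. On the nuScenes val set, BEVFormer outperforms DETR3D by 9.2 points (51.7% NDS vs. 42.5% NDS). On the test set, BEVFormer reaches 56.9% NDS without bells and whistles, 9.0 points higher than DETR3D (47.9% NDS). Temporal information also plays a crucial role in velocity estimation for multi-camera detection: the mean average velocity error (mAVE) is 0.378 m/s, close to the performance of the LiDAR-based method [43].
Multi-task perception: The model is trained with the detection head and the segmentation head together to verify its ability to learn multiple tasks. On the nuScenes validation set, BEVFormer achieves higher performance on all tasks; for example, it is 11.0 points higher on the detection task (52.0% NDS vs. 41.0% NDS). Compared with training each task separately, multi-task learning shares more modules, which saves computation and reduces inference time.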
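The BEV grid sizes in the experimental settings follow directly from the perception range and the cell resolution. The short check below reproduces that arithmetic; the helper names are hypothetical, and the cell-center convention (index + 0.5) is an assumption made for illustration.

```python
def cells_along(axis_range, resolution):
    """Number of BEV cells needed to cover [lo, hi] at the given resolution (meters)."""
    lo, hi = axis_range
    return round((hi - lo) / resolution)


def cell_center(index, axis_range, resolution):
    """Metric coordinate of the center of cell `index` along one axis."""
    return axis_range[0] + (index + 0.5) * resolution


# nuScenes: 102.4 m along each axis at 0.512 m per cell.
nuscenes_cells = cells_along((-51.2, 51.2), 0.512)

# Waymo: 110 m along X and 150 m along Y at 0.5 m per cell.
waymo_cells_x = cells_along((-35.0, 75.0), 0.5)
waymo_cells_y = cells_along((-75.0, 75.0), 0.5)

print(nuscenes_cells, waymo_cells_x, waymo_cells_y)
print(cell_center(0, (-51.2, 51.2), 0.512), cell_center(199, (-51.2, 51.2), 0.512))
```

The counts come out to 200 cells per axis for nuScenes and 220 by 300 cells for Waymo, matching the 200×200 and 300×220 query shapes quoted above.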

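Summary

Strengths

The BEVFormer framework aggregates spatiotemporal information efficiently and produces strong BEV features, which improves the performance of vision-based perception models.
The experiments show that exploiting spatiotemporal information from multi-camera inputs significantly improves visual perception performance, which matters for building better and safer autonomous driving systems.
BEVFormer's design builds on the Transformer's use of attention to dynamically aggregate valuable features, which lets it adapt to different task requirements.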

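Key Innovations

The paper proposes a Transformer-based bird's-eye-view (BEV) encoder, BEVFormer, that effectively aggregates spatiotemporal features from surround-view cameras and historical BEV features.
The authors design learnable BEV queries together with a spatial cross-attention layer and a temporal self-attention layer, which look up spatial features across cameras and temporal features from the historical BEV, respectively, and aggregate them into a unified BEV feature.
BEVFormer supports multiple autonomous driving perception tasks at once, including 3D detection and map segmentation, which makes it a more general visual perception method.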

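Future Work

BEVFormer is only a baseline for the more powerful visual perception methods that will follow; vision-based perception systems still have great untapped potential.
Future research could explore how to further improve BEVFormer, for example by incorporating more spatiotemporal feature information or by refining the attention mechanisms.
BEVFormer could also be considered for other domains, such as medical image analysis.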

Reproduction

Installation

Follow the mmdetection3d installation guide: https://mmdetection3d.readthedocs.io/en/latest/getting_started.html#installation

a. Create a conda virtual environment and activate it.

conda create -n open-mmlab python=3.8 -y
conda activate open-mmlab

b. Install PyTorch and torchvision following the official instructions.

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
# Recommended torch>=1.9

c. Install gcc>=5 in conda env (optional).

conda install -c omgarcia gcc-6 # gcc-6.2

c. Install mmcv-full.

pip install mmcv-full==1.4.0
#  pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html

d. Install mmdet and mmseg.

pip install mmdet==2.14.0
pip install mmsegmentation==0.14.1

e. Install mmdet3d from source code.

git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout v0.17.1 # Other versions may not be compatible.
python setup.py install

f. Install Detectron2 and Timm.

pip install einops fvcore seaborn iopath==0.1.9 timm==0.6.13  typing-extensions==4.5.0 pylint ipython==8.12  numpy==1.19.5 matplotlib==3.5.2 numba==0.48.0 pandas==1.4.4 scikit-image==0.19.3 setuptools==59.5.0
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

g. Clone BEVFormer.

git clone https://github.com/fundamentalvision/BEVFormer.git

h. Prepare pretrained models.

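# git clone creates the directory "BEVFormer" (case-sensitive on Linux)
cd BEVFormer
mkdir ckpts
cd ckpts && wget https://github.com/zhiqi-li/storage/releases/download/v1.0/r101_dcn_fcos3d_pretrain.pth
cd ..  # back to the repository root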
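After the installation steps, a quick check of the installed versions against the pins used above can catch mismatches before training starts. The script below is a small sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8); the `PINNED` table simply restates versions from the commands above, and the assumption is that each package reports exactly that version string.

```python
from importlib import metadata

# Versions restated from the install commands above.
PINNED = {
    "torch": "1.9.1+cu111",
    "mmcv-full": "1.4.0",
    "mmdet": "2.14.0",
    "mmsegmentation": "0.14.1",
    "mmdet3d": "0.17.1",
    "timm": "0.6.13",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_with_pins(pins, lookup=installed_version):
    """Map each package to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found is None:
            report[package] = "missing"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for package, status in compare_with_pins(PINNED).items():
        print(f"{package:16s} {status}")
```

Run it inside the activated `open-mmlab` environment; any line that is not `ok` points at the install step to repeat.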

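Data Preparation (using the mini dataset as an example)

NuScenes

Download the nuScenes v1.0 dataset (this walkthrough uses the v1.0-mini split) and the CAN bus expansion data from the official nuScenes download page, then prepare the data with the steps below.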

Download CAN bus expansion

# download 'can_bus.zip'
unzip can_bus.zip 
# move can_bus to data dir

Prepare nuScenes data

We generate custom annotation files, which differ from mmdet3d's.

python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes --version v1.0-mini --canbus ./data

Using the above code will generate nuscenes_infos_temporal_{train,val}.pkl.
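To confirm that the generated annotation files look sane, they can be opened with `pickle`. The sketch below assumes the file is a dict with an `infos` list and a `metadata` dict, and that each info entry carries a `scene_token` key; these field names are assumptions about the converter's output, so adjust them if your files differ. Only unpickle files you generated yourself.

```python
import pickle
from pathlib import Path


def summarize_infos(pkl_path):
    """Summarize an annotation pickle (assumed layout: {'infos': [...], 'metadata': {...}})."""
    with open(pkl_path, "rb") as handle:
        data = pickle.load(handle)
    infos = data["infos"]
    scenes = {info.get("scene_token") for info in infos}
    return {
        "version": data.get("metadata", {}).get("version"),
        "num_samples": len(infos),
        "num_scenes": len(scenes),
        "first_keys": sorted(infos[0].keys()) if infos else [],
    }


if __name__ == "__main__":
    path = Path("data/nuscenes/nuscenes_infos_temporal_val.pkl")
    if path.exists():
        print(summarize_infos(path))
    else:
        print(f"{path} not found; run tools/create_data.py first")
```

A sample count of zero, or an unexpected version string, usually means `--root-path` or `--version` did not match the data on disk.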

Folder structure

bevformer
├── projects/
├── tools/
├── configs/
├── ckpts/
│   ├── r101_dcn_fcos3d_pretrain.pth
├── data/
│   ├── can_bus/
│   ├── nuscenes/
│   │   ├── maps/
│   │   ├── samples/
│   │   ├── sweeps/
│   │   ├── v1.0-test/
│   │   ├── v1.0-trainval/
│   │   ├── nuscenes_infos_temporal_train.pkl
│   │   ├── nuscenes_infos_temporal_val.pkl
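Before launching training, the layout above can be checked with a few lines of Python. The `EXPECTED` list below is a subset of the tree that does not depend on the dataset split (the `v1.0-*` metadata folder is left out because the mini split uses a different folder name than the full dataset); treat the list as an assumption to adapt to your setup.

```python
from pathlib import Path

# Split-independent entries taken from the folder structure above.
EXPECTED = [
    "projects",
    "tools",
    "ckpts/r101_dcn_fcos3d_pretrain.pth",
    "data/can_bus",
    "data/nuscenes/maps",
    "data/nuscenes/samples",
    "data/nuscenes/sweeps",
    "data/nuscenes/nuscenes_infos_temporal_train.pkl",
    "data/nuscenes/nuscenes_infos_temporal_val.pkl",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Layout looks complete.")
```

Run it from the repository root; an empty result means the files the training and evaluation scripts read are in place.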

Usage

Training and Testing

Train BEVFormer with 8 GPUs

./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base.py 8

Eval BEVFormer with 8 GPUs

./tools/dist_test.sh ./projects/configs/bevformer/bevformer_base.py ./path/to/ckpts.pth 8

Note: evaluating with 1 GPU can give slightly higher scores, because continuous video sequences may be truncated when they are split across multiple GPUs. By default, the reported scores are evaluated with 8 GPUs.

Training with FP16

The training script above does not support FP16 training, so a separate script is provided for training BEVFormer with FP16.

./tools/fp16/dist_train.sh ./projects/configs/bevformer_fp16/bevformer_tiny_fp16.py 8

Visualization

see visual.py

Model Weights

Backbone	Method	Lr Schd	NDS	mAP	Memory	Config	Download
R50	BEVFormer-tiny_fp16	24ep	35.9	25.7	-	config	model/log
R50	BEVFormer-tiny	24ep	35.4	25.2	6500M	config	model/log
R101-DCN	BEVFormer-small	24ep	47.9	37.0	10500M	config	model/log
R101-DCN	BEVFormer-base	24ep	51.7	41.6	28500M	config	model/log
R50	BEVformerV2-t1-base	24ep	42.6	35.1	23952M	config	model/log
R50	BEVformerV2-t1-base	48ep	43.9	35.9	23952M	config	model/log
R50	BEVformerV2-t1	24ep	45.3	38.1	37579M	config	model/log
R50	BEVformerV2-t1	48ep	46.5	39.5	37579M	config	model/log
R50	BEVformerV2-t2	24ep	51.8	42.0	38954M	config	model/log
R50	BEVformerV2-t2	48ep	52.6	43.1	38954M	config	model/log
R50	BEVformerV2-t8	24ep	55.3	46.0	40392M	config	model/log
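When GPU memory is the constraint, the table can be read as a lookup: pick the highest-NDS entry whose memory figure fits the budget. The sketch below restates the table rows and assumes the memory column is the GPU memory needed per card; the fp16 row is skipped because the table lists no memory figure for it.

```python
# (method, schedule, NDS, mAP, memory in MB or None when the table shows "-")
MODELS = [
    ("BEVFormer-tiny_fp16", "24ep", 35.9, 25.7, None),
    ("BEVFormer-tiny", "24ep", 35.4, 25.2, 6500),
    ("BEVFormer-small", "24ep", 47.9, 37.0, 10500),
    ("BEVFormer-base", "24ep", 51.7, 41.6, 28500),
    ("BEVformerV2-t1-base", "24ep", 42.6, 35.1, 23952),
    ("BEVformerV2-t1-base", "48ep", 43.9, 35.9, 23952),
    ("BEVformerV2-t1", "24ep", 45.3, 38.1, 37579),
    ("BEVformerV2-t1", "48ep", 46.5, 39.5, 37579),
    ("BEVformerV2-t2", "24ep", 51.8, 42.0, 38954),
    ("BEVformerV2-t2", "48ep", 52.6, 43.1, 38954),
    ("BEVformerV2-t8", "24ep", 55.3, 46.0, 40392),
]


def best_under_budget(budget_mb, models=MODELS):
    """Return the highest-NDS row whose memory fits the budget, or None."""
    candidates = [m for m in models if m[4] is not None and m[4] <= budget_mb]
    return max(candidates, key=lambda m: m[2]) if candidates else None


for budget in (8000, 12000, 32000, 41000):
    print(budget, best_under_budget(budget))
```

For example, with roughly 12 GB available this selects BEVFormer-small, while BEVFormer-base needs a card in the 32 GB class.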

Problems Encountered During Reproduction

# Error 1
...from numba.np.ufunc import _internal
SystemError: initialization of _internal failed without raising an exception
# Fix: downgrade numpy
pip install numpy==1.23.4

# Error 2
ModuleNotFoundError: No module named 'spconv'
# Fix: install the build that matches your CUDA version; with CUDA 11.3 the package is
pip install spconv-cu113

# Error 3
ModuleNotFoundError: No module named 'IPython'
# Fix:
pip install IPython

# Error 4
# Case 1: No module named 'projects.mmdet3d_plugin'
# Case 2: ModuleNotFoundError: No module named 'tools'
# Case 3: ModuleNotFoundError: No module named 'tools.data_converter'
# Both tools and projects.mmdet3d_plugin are imported as local modules,
# so a failed import means either that PYTHONPATH has not taken effect or that the module path is wrong.
# Fix: update PYTHONPATH by running the following in the terminal of the current Python virtual environment
export PYTHONPATH=$PYTHONPATH:"./"
# If the error persists, check that the path in this command is correct; an absolute path can be used instead.

# Error 5
TypeError: FormatCode() got an unexpected keyword argument 'verify'
# Fix: downgrade yapf
pip install yapf==0.40.1

# Error 6
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory
# Cause: the installed mmcv build does not match the CUDA version; downloading the matching wheel from the official wheel index and installing it offline is recommended
# Fix: follow the mmcv-full installation tutorial (referred to as section 1.4.1)

# Error 7
AttributeError: module 'distutils' has no attribute 'version'
# Fix: change the setuptools version
pip install setuptools==58.4.0

# Error 8: inside Docker, libGL.so.1 is reported missing
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
# Fix: install ffmpeg
apt-get install ffmpeg -y

# Error 9: pip fails while installing mmcv-full
subprocess.CalledProcessError: Command '['which', 'g++']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mmcv-full
# Fix: g++ and gcc are not installed; install build-essential
sudo apt-get install build-essential

# Error 10: out of GPU memory during training (RuntimeError: CUDA out of memory)
# Fix: first set samples_per_gpu to 1 and workers_per_gpu to 0 in the config file to test the environment,
# then raise both values gradually for real training until GPU memory is fully used.
# If memory is still insufficient with 1 and 0, a GPU with more memory is needed.
samples_per_gpu=1, workers_per_gpu=0
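Several of the errors above come down to a missing module or a missing compiler, so they can be caught in one pass before a long job is started. The preflight sketch below uses only the standard library; the default module and tool names are taken from the error messages above, and the function name is hypothetical.

```python
import importlib.util
import shutil


def preflight(modules=("numba", "spconv", "IPython", "yapf", "mmcv"),
              tools=("gcc", "g++")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name in modules:
        # find_spec returns None for a top-level module that cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not importable: {name}")
    for tool in tools:
        # shutil.which mirrors the `which g++` lookup that fails in Error 9.
        if shutil.which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) if issues else "All preflight checks passed.")
```

This only checks that modules can be located, not that their versions are compatible, so a version mismatch such as Error 1 or Error 5 still has to be fixed by pinning the package as described above.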

References

https://zhuanlan.zhihu.com/p/658855697

@article{li2022bevformer,
  title={BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers},
  author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2203.17270},
  year={2022}
}

@article{Yang2022BEVFormerVA,
  title={BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision},
  author={Chenyu Yang and Yuntao Chen and Haofei Tian and Chenxin Tao and Xizhou Zhu and Zhaoxiang Zhang and Gao Huang and Hongyang Li and Y. Qiao and Lewei Lu and Jie Zhou and Jifeng Dai},
  journal={ArXiv},
  year={2022}
}

[1] Li Z, Wang W, Li H, et al. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers[C]//European conference on computer vision. Cham: Springer Nature Switzerland, 2022: 1-18.