BLIP Reading Notes: A Basic Building Block of Many Multimodal Large Language Models

Paper link: 😈

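Contents

I. Environment Setup and Dataset Preparation

Dataset preparation

Dataset format

Environment setup: follow the official instructions

II. Some Adjustments

Pretrained vit_base model

Remote debugging setup

Tokenizer initialization failure

Adjustments for reading images from the web

III. Training Process

Image Encoder Layer

Text Encoder

Text Decoder

Loss functions

ITC (Image-Text Contrastive)

ITM (Image-Text Matching)

LM (Language Modeling)

Network structure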


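I. Environment Setup and Dataset Preparation

Dataset preparation

The official repository provides links for downloading the dataset JSON files. They may well fail to open in a browser, however, because the files are hosted on Google Cloud Storage:

https://storage.googleapis.com/

Not to worry: if the page will not open, the file can still be fetched with Python.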

# Read the data with urlopen
from __future__ import (absolute_import, division, print_function, unicode_literals)
from urllib.request import urlopen
import json

json_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_filtered.json'
response = urlopen(json_url)
# Read the data
req = response.read()
# Write it to a file (set the save path as needed); the with block closes the file
with open('ccs_filtered.json', 'wb') as f:
    f.write(req)
# Load the JSON
# file_urllid = json.loads(req)
# print(file_urllid)
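The JSON file is large, so holding the whole response in memory with response.read() may be wasteful. The following is a minimal sketch of a streamed download using only the standard library; the helper name download_file and the chunk size are illustrative assumptions, not part of the BLIP repository.

```python
import shutil
from urllib.request import urlopen


def download_file(url, dest_path, chunk_size=1 << 20):
    """Stream url to dest_path in chunks instead of loading it all into memory."""
    with urlopen(url) as response, open(dest_path, 'wb') as f:
        shutil.copyfileobj(response, f, chunk_size)
    return dest_path
```

It would be called as download_file(json_url, 'ccs_filtered.json'), with json_url as defined above.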

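Dataset format

According to the official repository, each entry in the JSON file has the form {'image': path_of_image, 'caption': text_of_image}. An excerpt of the downloaded file follows; note that its entries carry a 'url' key rather than 'image', which is dealt with in Part II: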

[{"caption": "a wooden chair in the living room", "url": "http://static.flickr.com/2723/4385058960_b0f291553e.jpg"},
 {"caption": "the plate near the door almost makes it look like the cctv camera is the logo of the hotel", "url": "http://static.flickr.com/3348/3183472382_156bcbc461.jpg"},
 {"caption": "a close look at winter stoneflies reveals mottled wings and black or brown bodies photo by jason du pont", "url": "http://static.flickr.com/114/291586170_82b2c80750.jpg"},
 {"caption": "karipol leipzig, germany, abandoned factory for house und car cleaning supplies in leipzig founded in 1897, closed in 1995", "url": "http://static.flickr.com/5170/5208095391_f030f31dd1.jpg"},
 {"caption": "this train car is parked permanently in a yard in royal, nebraska", "url": "http://static.flickr.com/3067/4554530891_f0d45b0967.jpg"},
 {"caption": "another shot of the building next to the sixth floor museum in dallas", "url": "http://static.flickr.com/3145/2858049135_0944b40a34.jpg"},
 {"caption": "a tropical forest in the train station how cool is that", "url": "http://static.flickr.com/4113/5212209590_6c7d40a7f6.jpg"},
 {"caption": "le hast le hill in le pink le shirt", "url": "http://static.flickr.com/1338/4597496270_7d91f01f88.jpg"},
 {"caption": "daniel up by the red bag tackling", "url": "http://static.flickr.com/3129/2783268248_4930260b9d.jpg"},
 {"caption": "a window on a service building is seen at allerton park tuesday, oct 27, 2009, in monticello, ill", "url": "http://static.flickr.com/2516/4054292066_3b74bfcf89.jpg"},
 {"caption": "girls jumping off of the 50ft cliff at abique lake in santa fe new mexico", "url": "http://static.flickr.com/2637/4371962912_77b4c74878.jpg"},
 {"caption": "a young chhantyal girl in traditional dress", "url": "http://static.flickr.com/3075/3240630067_772ebf5504.jpg"},
 {"caption": "view from the train to snowdon hidden in the cloud", "url": "http://static.flickr.com/4114/4803115610_711924e2ab.jpg"},
 {"caption": "the only non white tiger in the habitat when i went there", "url": "http://static.flickr.com/2481/3873285144_25aa3013ca.jpg"},
 {"caption": "the contrast of the flowering pear tree against the bare branches of the other trees caught my eye", "url": "http://static.flickr.com/3198/3124550410_b70442da56.jpg"},
 {"caption": "this is my friend taking a nap in my sleeping bag with our friend's dog for company", "url": "http://static.flickr.com/1167/1466307446_c1a332c5ec.jpg"},
 {"caption": "view of old castle in field ii", "url": "http://static.flickr.com/4101/4758652582_8a1d44a1a0.jpg"},
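Before pointing the training code at a downloaded file, it can help to check which keys the entries actually carry. A small sketch, assuming the file is a JSON array of dicts as in the excerpt; the helper name count_keys is hypothetical.

```python
import json
from collections import Counter


def count_keys(entries):
    """Count how often each key appears across a list of annotation dicts."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.keys())
    return counts


# One entry copied from the excerpt of the downloaded file
sample = json.loads(
    '[{"caption": "a wooden chair in the living room", '
    '"url": "http://static.flickr.com/2723/4385058960_b0f291553e.jpg"}]'
)
print(count_keys(sample))
```

If 'image' never shows up in the counts, the dataset class will fail on ann['image'] unless it is adapted.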

a wooden chair in the living room

 this is my friend taking a nap in my sleeping bag with our friend's dog for company

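Environment setup: follow the official instructions

# Not meant to be run verbatim in this order; this is the general idea
conda create -n BLIP python=3.8
conda activate BLIP
pip install -r requirements.txt
pip install opencv-python
pip install opencv-python-headless

Following the official instructions, edit configs/pretrain.yaml so that it points to the JSON files, then create the output directory yourself:

mkdir -p output/Pretrain  (the general idea; adapt the path as needed)

Run a test (only one GPU is used here):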

python -m torch.distributed.run --nproc_per_node=1 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain

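Problem

ModuleNotFoundError: No module named 'ruamel_yaml'

Solution

# Change
import ruamel_yaml as yaml
# to
import yaml

Problem

FileNotFoundError: [Errno 2] No such file or directory: './configs/Pretrain.yaml'

Solution

Use an absolute path instead. It is also worth checking the capitalization: the file is referred to as configs/pretrain.yaml above, while the command passes ./configs/Pretrain.yaml.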

Problem

RuntimeError: The NVIDIA driver on your system is too old (found version 11060). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

Solution

See here 😼 (as the message itself says, either update the GPU driver or install a PyTorch build compiled for a CUDA version that the driver supports).

Problem

FileNotFoundError: [Errno 2] No such file or directory: 'configs/bert_config.json'

Solution

Use an absolute path instead.

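Problem

TypeError: '<=' not supported between instances of 'float' and 'str'

Solution

The YAML loader parses values such as 3e-4 as str instead of float. With PyYAML (the yaml module imported above), exponent notation is only recognized as a float when the mantissa contains a decimal point, e.g. 3.0e-4.

Either convert explicitly, e.g. lr=float(config['init_lr']), or edit the YAML file and replace

3e-4

1e-6

and similar values with

0.0003

0.000001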
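If several learning-rate fields are affected, converting them in one place right after the config is loaded avoids scattering float(...) calls. A minimal sketch on a plain dict; the helper name coerce_floats is hypothetical, and the key min_lr and the batch_size value are assumed here only for illustration (init_lr is the key used above).

```python
def coerce_floats(config, keys):
    """Return a copy of config with the given keys converted to float."""
    fixed = dict(config)
    for key in keys:
        if key in fixed:
            fixed[key] = float(fixed[key])
    return fixed


# Values as a YAML loader may return them when "3e-4" is not recognized as a float
config = {'init_lr': '3e-4', 'min_lr': '1e-6', 'batch_size': 8}
config = coerce_floats(config, ['init_lr', 'min_lr'])
print(config['init_lr'] <= 1.0)
```

Keys that are already numeric pass through float() unchanged in value, so the helper is safe to apply after the YAML file has been fixed as well.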

Problem

image = Image.open(ann['image']).convert('RGB')
KeyError: 'image'

Solution

See "Adjustments for reading images from the web" in Part II of this post.

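II. Some Adjustments

Pretrained vit_base model

Training downloads this pretrained model from the internet. To avoid downloading it repeatedly, save the downloaded checkpoint to a folder and change the code to load it from there:

checkpoint = torch.load('******/deit_base_patch16_224-b5f2ef4d.pth', map_location="cpu")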

Remote debugging setup

Problem encountered

error:unrecognized argument: --local-rank=0

For the solution, see here 🐰
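The linked fix is not reproduced here. The message suggests that the launcher passes --local-rank (hyphenated) to a script whose argument parser does not define it. One common workaround, offered as an assumption rather than as the linked solution, is to let the parser accept both spellings; the function name build_parser is hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Accept both spellings; the value is stored as args.local_rank either way
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int, default=0)
    return parser


args = build_parser().parse_args(['--local-rank=0'])
print(args.local_rank)  # 0
```

An alternative that needs no code change is the --use_env flag, which appears in the IDE debug configuration later in this post.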

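Tokenizer initialization failure

Initialization needs to download model files from Hugging Face, so it may fail because of network problems:

TimeoutError: [Errno 110] Connection timed out

For the solution, see here 😼

self.text_encoder and self.text_decoder need to be changed in the same way.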

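Adjustments for reading images from the web

The original dataset JSON file is too large, so I wrote a script to cut out a part of it: only about 4 MB, covering 20,000 images.

I am not sure why, but the official repository describes the JSON format as {'image': path_of_image, 'caption': text_of_image}, whereas the file I downloaded has the format {'url': url_of_image, 'caption': text_of_image}. This is probably because I downloaded CC3M+CC12M+SBU, whose images come from the web; for COCO the format may well match the one described.

So, to fetch and read images from the web, the source code needs a small change (with reference to this 🐰).

In pretrain_dataset.py: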

# Import the library
from skimage import io

# Insert this attribute (e.g. in __init__)
self.use_url = True

# And add this method
def read_image_from_url(self, url):
    # Download the image from the URL; convert to RGB to match the local-file branch
    image = io.imread(url)
    image = Image.fromarray(image).convert('RGB')
    return image

# Change the original code to
def __getitem__(self, index):
    ann = self.annotation[index]
    if self.use_url:
        image = self.read_image_from_url(ann['url'])
    else:
        image = Image.open(ann['image']).convert('RGB')
    image = self.transform(image)
    caption = pre_caption(ann['caption'], 30)
    return image, caption

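However, network problems (e.g. HTTP 403) may prevent the images from being read. Since this run is only a test, I wrote a script that downloads some of the images and rewrites the JSON into the {'image': path_of_image, 'caption': text_of_image} format. Specifically, I took the first 300 entries of the original dataset JSON file: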

import json

path = '/root/data/zjx/Code-subject/BLIP/ccs_filtered.json'
save_path = '/root/data/zjx/Code-subject/BLIP/cut_css_filtered.json'

with open(path, 'r') as f:
    data = json.load(f)

# Write the first save_len entries as a JSON array
with open(save_path, 'w') as f2:
    save_len = 300
    for i in range(save_len):
        js = json.dumps(data[i])
        if i == 0:
            f2.write('[' + js + ',')
        elif i == save_len - 1:
            f2.write(js + ']')
        else:
            f2.write(js + ',')
print('finish')
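The same cut can be expressed with a list slice and a single json.dump, which also stays valid when the file has fewer entries than requested or when only one entry is kept (the manual bracket handling above would leave the array unterminated in the one-entry case). A minimal sketch; the function name cut_json is hypothetical.

```python
import json


def cut_json(src_path, dst_path, n=300):
    """Write the first n entries of a JSON array file to dst_path."""
    with open(src_path, 'r') as f:
        data = json.load(f)
    with open(dst_path, 'w') as f:
        json.dump(data[:n], f)
    return min(n, len(data))
```

It would be called as cut_json(path, save_path, 300) with the paths defined above, and returns the number of entries actually written.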

Download the images into a folder and write the matching JSON file:

import json
from skimage import io
from tqdm import tqdm


def download_from_image(url, save_path, ind):
    # Download one image and save it under a zero-padded file name
    image = io.imread(url)
    if ind < 10:
        index_ = '000' + str(ind)
    elif 9 < ind < 100:
        index_ = '00' + str(ind)
    else:
        index_ = '0' + str(ind)
    io.imsave(save_path + '/' + index_ + '.jpg', image)
    return index_ + '.jpg'


def do_this_work(ori_json_path, new_json_path, yunduan_path, save_image_path, save_image_path_name):
    # yunduan_path is the root directory on the remote server
    dict_data = json.load(open(ori_json_path, 'r'))
    num = len(dict_data)
    with open(new_json_path, 'w') as f2:
        for i, dict_info in enumerate(tqdm(dict_data)):
            new_dict = {}
            new_dict["caption"] = dict_info["caption"]
            image_path = download_from_image(dict_info['url'], save_image_path, i)
            new_dict["image"] = yunduan_path + '/' + save_image_path_name + '/' + image_path
            new_dict = json.dumps(new_dict)
            if i == 0:
                f2.write('[' + new_dict + ',')
            elif i == num - 1:
                f2.write(new_dict + ']')
            else:
                f2.write(new_dict + ',')


if __name__ == '__main__':
    path = './cut_css_filtered.json'
    save_image_path = './save_imgae'
    save_image_path_name = 'save_imgae'
    save_new_json_path = './new_ccs_filtered.json'
    yunduan_path = '/root/data/zjx/Code-subject/BLIP'
    do_this_work(path, save_new_json_path, yunduan_path, save_image_path, save_image_path_name)
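Since some URLs may fail (the 403 case mentioned earlier), a single bad link would abort the whole script. The sketch below is one way to skip failed downloads and still end up with a valid annotation list. The function name build_local_annotations and the injected fetch callable are hypothetical; fetch stands in for whatever actually downloads an image, for example a wrapper around io.imread and io.imsave.

```python
import os


def build_local_annotations(entries, fetch, image_dir, path_prefix):
    """Save each entry's image locally and return {'image', 'caption'} dicts.

    fetch(url, dest_path) is a caller-supplied function that downloads one
    image; entries whose download raises an exception are skipped.
    """
    os.makedirs(image_dir, exist_ok=True)
    annotations = []
    for i, entry in enumerate(entries):
        file_name = f"{i:04d}.jpg"  # same zero padding as above for i < 1000
        try:
            fetch(entry['url'], os.path.join(image_dir, file_name))
        except Exception as err:  # e.g. HTTP 403 or a timeout
            print(f"skipping {entry['url']}: {err}")
            continue
        annotations.append({'image': f"{path_prefix}/{file_name}",
                            'caption': entry['caption']})
    return annotations
```

The returned list can then be written in one step with json.dump, so no manual bracket and comma handling is needed.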

Next, I reduced the batch size in the YAML file.

Local IDE debug configuration (script parameters)

--nproc_per_node=1
--use_env
/root/data/zjx/Code-subject/BLIP/BLIP-main/pretrain.py
--config
/root/data/zjx/Code-subject/BLIP/BLIP-main/configs/pretrain.yaml
--output_dir
output/Pretrain

Path mapping (remote path, then local path)

/root/anaconda3/envs/BLIP/lib/python3.8/site-packages/torch/distributed/launch.py

C:\Users\Lenovo\AppData\Local\JetBrains\PyCharm2021.2\remote_sources\2036786058\597056065\torch\distributed\launch.py

III. Training Process

Image Encoder Layer

The image encoder uses the ViT architecture.

Its output is then mapped by a linear layer and L2-normalized.
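To make the shapes concrete, the sketch below works out how many tokens the ViT produces and what L2 normalization does to a feature vector, in plain Python. The patch size of 16 matches the Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16)) in the network structure printed at the end of this post; the image size of 224 is an assumption taken from the checkpoint name deit_base_patch16_224. Both function names are hypothetical.

```python
import math


def num_vit_tokens(image_size, patch_size):
    """Number of tokens a ViT produces: one per patch plus the [CLS] token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


print(num_vit_tokens(224, 16))   # 197: 14 * 14 patches + 1 [CLS] token
print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

After normalization, the dot product of two feature vectors is their cosine similarity, which is what the contrastive loss later operates on.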

Text Encoder

First, the caption is tokenized. The result is a BatchEncoding whose fields look like the example below; the maximum encoded length chosen here is 30.

{'input_ids': tensor([[  101,  1037,  2485,  2298,  2012,  3467,  2962, 24019,  7657,  9587,
         26328,  2094,  4777,  1998,  2304,  2030,  2829,  4230,  6302,  2011,
          4463,  4241, 21179,   102,     0,     0,     0,     0,     0,     0],
        [  101,  1037,  4799,  3242,  1999,  1996,  2542,  2282,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],
       device='cuda:0'),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]], device='cuda:0')}

The number of 1s in attention_mask equals the number of real (non-padding) tokens in the tokenized sentence.
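The relationship between padding and the mask can be reproduced in a few lines of plain Python. This is only an illustration of the idea, not the tokenizer's implementation (a real tokenizer, for instance, keeps the closing [SEP] token when it truncates); the helper name pad_and_mask is hypothetical, and the token ids are the second caption from the example above.

```python
def pad_and_mask(token_ids, max_length=30, pad_id=0):
    """Truncate or pad token_ids to max_length and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


ids, mask = pad_and_mask([101, 1037, 4799, 3242, 1999, 1996, 2542, 2282, 102])
print(len(ids), sum(mask))  # 30 9
```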

The text encoder uses the BertModel architecture.

Mask handling

1. get_extended_attention_mask

        The final extended_attention_mask is None

2. get_head_mask

        The result is [None, None, None, None, None, None, None, None, None, None, None, None]

Its length equals the number of hidden layers, 12.

Text Decoder 

Loss functions

ITC (Image-Text Contrastive)

This is implemented in the same way as in ALBEF.
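As a rough illustration of what a contrastive image-text loss computes, the sketch below implements the plain symmetric form in pure Python: similarities between every image and every text are scaled by a temperature, and each row and column is scored with cross-entropy against the matching index. This is a simplification under stated assumptions. ALBEF-style training also involves a momentum encoder (visible as visual_encoder_m in the network structure at the end of this post), which is left out here, and the temperature value 0.07 and both function names are assumptions.

```python
import math


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def itc_loss(image_feats, text_feats, temp=0.07):
    """Symmetric image-text contrastive loss for unit-length features."""
    n = len(image_feats)
    sim = [[sum(a * b for a, b in zip(img, txt)) / temp for txt in text_feats]
           for img in image_feats]
    loss_i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
                   for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

When every image is most similar to its own caption the loss is close to zero; when the pairs are swapped it is large.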

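ITM (Image-Text Matching)

1. The image tokens and text tokens are fed into the text encoder to produce the hidden states of the positive pairs.

2. Build the negative samples. Negatives come from the other items in the batch: for a given sample, every item other than its own match is a candidate, and one of them is drawn. The weights are built with

        with torch.no_grad():
            weights_t2i = F.softmax(sim_t2i[:,:bs],dim=1)+1e-4  # (2,2) for batch size 2
            weights_t2i.fill_diagonal_(0)   # (2,2), main diagonal set to 0
            weights_i2t = F.softmax(sim_i2t[:,:bs],dim=1)+1e-4  # (2,2)
            weights_i2t.fill_diagonal_(0)  # (2,2)

so that only the main diagonal is 0 and every other entry carries some weight (the weight matrix is square, batch size by batch size). Combined with

torch.multinomial

this draws from the weighted multinomial distribution, which guarantees that a sample's own match is never drawn as its negative.

Once the negatives are sampled, they are aligned by concatenation:

TextTokenPos -- (cat) -- TextTokenNeg
     |                        |
ImagTokenNeg -- (cat) -- ImagTokenPos

This completes the negative samples (the attention masks are aligned in the same way along the way).

3. The aligned negative pairs built above are fed into the text encoder to obtain the hidden states of the negatives.

4. The intermediate features of the positives and negatives are concatenated and passed through a linear head that outputs the class prediction.

5. Labels: label = [1, 1, ..., 0, 0, ..., 0], with batch size ones followed by 2 * batch size zeros (because of how the negatives are built).

        The loss is a two-class cross-entropy.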
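Steps 2 and 5 above can be sketched in plain Python, with random.choices standing in for torch.multinomial. The similarity matrix below is made up for illustration, and the function names are hypothetical.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_hard_negatives(sim, rng):
    """For each row i, draw one index j != i with weight softmax(sim[i])[j] + 1e-4."""
    negatives = []
    for i, row in enumerate(sim):
        weights = [w + 1e-4 for w in softmax(row)]
        weights[i] = 0.0  # a sample can never be drawn as its own negative
        negatives.append(rng.choices(range(len(row)), weights=weights, k=1)[0])
    return negatives


def itm_labels(batch_size):
    """Ones for the positive pairs, then zeros for the two groups of negatives."""
    return [1] * batch_size + [0] * (2 * batch_size)


rng = random.Random(0)
sim_i2t = [[5.0, 1.0, 0.5], [0.2, 4.0, 3.0], [1.0, 0.1, 6.0]]
print(sample_hard_negatives(sim_i2t, rng))
print(itm_labels(3))  # [1, 1, 1, 0, 0, 0, 0, 0, 0]
```

Because the weights follow the similarity scores, the negatives that get drawn most often are the ones the model currently finds hardest to tell apart from the true match; the small constant 1e-4 keeps every other item drawable.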

LM (Language Modeling)

An example of the input and the labels is shown below. The input is the tokenized text with the first position set to BOS; the labels are the same ids with the padding positions set to -100.

label: tensor([[30522,  1037,  2485,  2298,  2012,  3467,  2962, 24019,  7657,  9587,
         26328,  2094,  4777,  1998,  2304,  2030,  2829,  4230,  6302,  2011,
          4463,  4241, 21179,   102,  -100,  -100,  -100,  -100,  -100,  -100],
        [30522,  1037,  4799,  3242,  1999,  1996,  2542,  2282,   102,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100]],
       device='cuda:0')
input: tensor([[30522,  1037,  2485,  2298,  2012,  3467,  2962, 24019,  7657,  9587,
         26328,  2094,  4777,  1998,  2304,  2030,  2829,  4230,  6302,  2011,
          4463,  4241, 21179,   102,     0,     0,     0,     0,     0,     0],
        [30522,  1037,  4799,  3242,  1999,  1996,  2542,  2282,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],
       device='cuda:0')

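1. The input passes through the text decoder, giving an output of shape (2, 30, 768), and then through BertOnlyMLMHead, giving predictions of shape (2, 30, 30524), where 30524 is the vocabulary size.

2. Compute the loss, i.e. the negative log-likelihood.

The decoder's predictions at the first 29 positions are scored against the labels at the last 29 positions.

The paper describes this as autoregressive. Thanks to the transformer's parallelism, training computes all positions in a single pass; at test or demo time, sentences have to be generated autoregressively, one token at a time.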
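The label construction and the one-position shift can be sketched in plain Python on a toy vocabulary. This is an illustration of the idea only: the function names are hypothetical, the toy ids do not correspond to the real vocabulary, and any extras the real implementation may apply (such as label smoothing) are left out.

```python
import math

IGNORE_INDEX = -100


def make_lm_labels(input_ids, pad_id=0):
    """Copy input_ids and replace padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_id else t for t in input_ids]


def shifted_lm_loss(logits, labels):
    """Mean cross-entropy of the prediction at position t against the label at t + 1."""
    total, count = 0.0, 0
    for row, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padding positions do not contribute
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[target]
        count += 1
    return total / max(count, 1)


# Toy example: vocabulary of 4 ids, where 3 plays the role of BOS and 0 is padding
input_ids = [3, 1, 2, 0]
labels = make_lm_labels(input_ids)
uniform_logits = [[0.0] * 4 for _ in input_ids]
print(labels)
print(shifted_lm_loss(uniform_logits, labels))
```

With uniform logits every valid position costs log(4), so the printed loss is log(4); the padded position is skipped because its label is -100.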

The total loss is the sum of the three.

Network structure

BLIP_Pretrain((visual_encoder): VisionTransformer((patch_embed): PatchEmbed((proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))(norm): Identity())(pos_drop): Dropout(p=0.0, inplace=False)(blocks): ModuleList((0): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(1): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(2): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(3): Block((norm1): 
LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(4): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(5): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(6): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): 
Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(7): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(8): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(9): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, 
out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(10): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(11): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False))))(norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True))(text_encoder): BertModel((embeddings): BertEmbeddings((word_embeddings): Embedding(30524, 768)(position_embeddings): Embedding(512, 768)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))(encoder): BertEncoder((layer): ModuleList((0): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), 
eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(1): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): 
LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(2): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(3): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, 
bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(4): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(5): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, 
out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(6): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, 
elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(7): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(8): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): 
LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(9): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, 
bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(10): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(11): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, 
out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))))))
(vision_proj): Linear(in_features=768, out_features=256, bias=True)
(text_proj): Linear(in_features=768, out_features=256, bias=True)
(itm_head): Linear(in_features=768, out_features=2, bias=True)
(visual_encoder_m): VisionTransformer((patch_embed): PatchEmbed((proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))(norm): Identity())(pos_drop): Dropout(p=0.0, inplace=False)(blocks): ModuleList((0): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(1): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): 
Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(2): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(3): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(4): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(5): Block((norm1): LayerNorm((768,), eps=1e-06, 
elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(6): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(7): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(8): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, 
inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(9): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(10): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, bias=True)(drop): Dropout(p=0.0, inplace=False)))(11): Block((norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(attn): Attention((qkv): Linear(in_features=768, out_features=2304, bias=True)(attn_drop): Dropout(p=0.0, inplace=False)(proj): Linear(in_features=768, out_features=768, bias=True)(proj_drop): Dropout(p=0.0, inplace=False))(drop_path): Identity()(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)(mlp): Mlp((fc1): Linear(in_features=768, out_features=3072, bias=True)(act): GELU(approximate='none')(fc2): Linear(in_features=3072, out_features=768, 
bias=True)(drop): Dropout(p=0.0, inplace=False))))(norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True))
(vision_proj_m): Linear(in_features=768, out_features=256, bias=True)
(text_encoder_m): BertModel((embeddings): BertEmbeddings((word_embeddings): Embedding(30524, 768, padding_idx=0)(position_embeddings): Embedding(512, 768)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))(encoder): BertEncoder((layer): ModuleList((0): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(1): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): 
Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(2): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): 
Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(3): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(4): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): 
BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(5): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(6): 
BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(7): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, 
inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(8): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(9): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, 
bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(10): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): 
BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(11): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))))))
(text_proj_m): Linear(in_features=768, out_features=256, bias=True)
(text_decoder): BertLMHeadModel((bert): BertModel((embeddings): BertEmbeddings((word_embeddings): Embedding(30524, 768)(position_embeddings): Embedding(512, 768)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))(encoder): BertEncoder((layer): ModuleList((0): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, 
out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(1): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, 
bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(2): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(3): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, 
out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(4): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): 
Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(5): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(6): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, 
bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(7): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(8): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, 
out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(9): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, 
bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(10): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(11): BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, 
out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(crossattention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True))(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False))))))(cls): BertOnlyMLMHead((predictions): BertLMPredictionHead((transform): BertPredictionHeadTransform((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True))(decoder): Linear(in_features=768, out_features=30524, bias=True))))
)
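
As a sanity check on the printout above, the parameter count of one decoder `BertLayer` can be derived from the printed shapes alone. The sketch below is plain Python arithmetic, not code from the BLIP repository. The helper names are hypothetical, and it assumes what the dump shows: every `Linear` has a bias, every `LayerNorm` is affine, and `Dropout` has no parameters.

```python
def linear_params(in_features, out_features):
    """Weights plus bias of a Linear(in_features, out_features, bias=True)."""
    return in_features * out_features + out_features


def layernorm_params(dim):
    """Scale and shift vectors of an affine LayerNorm((dim,))."""
    return 2 * dim


def attention_params(hidden):
    """One BertAttention block: query, key, value and output dense, plus LayerNorm."""
    projections = 4 * linear_params(hidden, hidden)
    return projections + layernorm_params(hidden)


def bert_layer_params(hidden=768, intermediate=3072):
    """One BertLayer as printed: self-attention, cross-attention, feed-forward."""
    self_attn = attention_params(hidden)
    cross_attn = attention_params(hidden)  # key/value inputs are also 768 wide in the dump
    ffn_up = linear_params(hidden, intermediate)       # BertIntermediate.dense
    ffn_down = linear_params(intermediate, hidden)     # BertOutput.dense
    ffn_down += layernorm_params(hidden)               # BertOutput.LayerNorm
    return self_attn + cross_attn + ffn_up + ffn_down


if __name__ == "__main__":
    print(bert_layer_params())  # 9451776
```

With the shapes printed above, each layer holds 9,451,776 parameters, and the cross-attention block alone accounts for 2,363,904 of them, about a quarter of the layer.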
