[CLIP-VIT-L + Qwen] Reading the Source Code of a Multimodal Large Model: Language Model (Part 2)

Multimodal study notes: language model (part 2)

Reference repo: WatchTower-Liu/VLM-learning; url: vlm-learning

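Rant

Today's source code was a real slog. NTK (neural tangent kernel) and rotary_position_embedding had never come up in anything I studied before, so I was completely lost while reading. I guess that's Qwen for you: every SOTA technique gets used. It just takes some effort to read TAT~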

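Notes

This is another source-reading session, picking up from the previous notes (Multimodal Source Reading 1). Last time covered the first pass over the input sequence: removing the image_token entries from input_ids (the positions that should be replaced by image data) and setting device to wherever input_ids or inputs_embeds lives (GPU or CPU). Note that the forward pass must not receive both input_ids and inputs_embeds; pass exactly one of them.
Now for the next part of the source, still in the forward pass (note that this forward code is not Qwen's original; it was rewritten to support multimodal input).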

    output_attentions = (
        output_attentions
        if output_attentions is not None
        else self.config.output_attentions
    )
    output_hidden_states = (
        output_hidden_states
        if output_hidden_states is not None
        else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = (
        return_dict if return_dict is not None else self.config.use_return_dict
    )

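This code resolves four parameters, output_attentions, output_hidden_states, use_cache and return_dict, falling back to the model config whenever the caller passes None. Their meanings are:
output_attentions: whether to return the attention weights.
output_hidden_states: whether to return the hidden states of every layer, rather than only the output of the last layer.
use_cache: whether to use the caching mechanism, which speeds up inference by caching key/value pairs that have already been computed. During inference the model generates tokens one at a time and caches the K and V computed for each token. When generating the next token, it reuses the cached K and V and only runs attention for the new token instead of recomputing attention over the whole sequence, which cuts the amount of computation significantly.
return_dict: straightforward; whether to return the result as a dict-like output object instead of a plain tuple.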
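
The `is not None` check matters because an explicit False from the caller must still override a config default of True. Here is a minimal sketch in plain Python; `Config` and `resolve_flag` are hypothetical names used only for illustration and are not part of Qwen or the repo:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    # Hypothetical stand-in for self.config; the default values are arbitrary.
    output_attentions: bool = False
    output_hidden_states: bool = False
    use_cache: bool = True
    use_return_dict: bool = True


def resolve_flag(value: Optional[bool], default: bool) -> bool:
    """Return the caller's value if one was given, else the config default."""
    return value if value is not None else default


config = Config()
# The caller did not pass use_cache, so the config default applies.
use_cache_default = resolve_flag(None, config.use_cache)
# The caller explicitly disabled the cache; `value or default` would ignore that.
use_cache_off = resolve_flag(False, config.use_cache)
print(use_cache_default, use_cache_off)
```

Writing `use_cache or self.config.use_cache` instead would silently turn an explicit False back into the default, which is presumably why the source spells out the None check.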

    if input_ids is not None and inputs_embeds is not None:
        raise ValueError(
            "You cannot specify both input_ids and inputs_embeds at the same time"
        )
    elif input_ids is not None:
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1]).contiguous()
        batch_size = input_ids.shape[0]
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.size()[:-1]
        batch_size = inputs_embeds.shape[0]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

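If both input_ids and inputs_embeds are passed in, an error is raised saying that the two cannot be specified at the same time. If input_ids is not None (so inputs_embeds is None), we read its shape with size(), usually (batch_size, seq_len), and use view to turn input_ids into a 2-D tensor. The contiguous() call makes sure the result is laid out contiguously in memory. view only works when the tensor's memory layout is compatible with the new shape (reshape has no such restriction because it copies when it has to, but view is also available in older PyTorch versions); note that here contiguous() is applied to the result of view, so it helps later operations rather than this particular view.
If inputs_embeds is passed in, its shape is normally (batch_size, seq_len, embed_size). We don't need the embed_size dimension, so the last dimension is dropped when computing input_shape, and batch_size is the first dimension.
If neither input_ids nor inputs_embeds is passed in, an error is raised: one sequence input must be provided.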
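
The branching above can be traced on shapes alone. Below is a rough sketch in plain Python (no PyTorch, so tensors are represented only by their shape tuples); `resolve_input_shape` is a hypothetical helper written for this note, not a function from the repo:

```python
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def resolve_input_shape(
    input_ids_shape: Optional[Shape] = None,
    inputs_embeds_shape: Optional[Shape] = None,
) -> Tuple[Shape, int]:
    """Mirror the branching on shapes only; returns (input_shape, batch_size)."""
    if input_ids_shape is not None and inputs_embeds_shape is not None:
        raise ValueError("specify input_ids or inputs_embeds, not both")
    if input_ids_shape is not None:
        seq_len = input_ids_shape[-1]
        # view(-1, seq_len) folds every leading dimension into the batch dimension.
        total = 1
        for dim in input_ids_shape:
            total *= dim
        return input_ids_shape, total // seq_len
    if inputs_embeds_shape is not None:
        # Drop the trailing embedding dimension; batch size is the first dimension.
        return inputs_embeds_shape[:-1], inputs_embeds_shape[0]
    raise ValueError("specify either input_ids or inputs_embeds")


print(resolve_input_shape(input_ids_shape=(4, 16)))
print(resolve_input_shape(inputs_embeds_shape=(4, 16, 768)))
```

The embedding size 768 is an arbitrary value chosen for the example.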

    if images is not None and first_step:
        input_shape = input_shape[0], input_shape[-1] + self.otherConfig["image_context_length"]  # ############

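On the first step of inference or training, the input shape has to be adjusted. Normally input_shape[0] is batch_size and input_shape[-1] is seq_len; we add the image context length to the seq_len dimension so that the image features and the text input can be fused later.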
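
As a quick sanity check of that shape arithmetic, here is a small sketch in plain Python. `extend_for_image` is a hypothetical helper, and the value 256 for the image context length is an assumption for the example; the real value comes from otherConfig["image_context_length"]:

```python
from typing import Tuple


def extend_for_image(
    input_shape: Tuple[int, ...],
    image_context_length: int,
    has_images: bool,
    first_step: bool,
) -> Tuple[int, ...]:
    """Grow the sequence dimension by the image context length on the first step."""
    if has_images and first_step:
        return (input_shape[0], input_shape[-1] + image_context_length)
    # Later steps (or text-only inputs) keep the shape unchanged.
    return tuple(input_shape)


first = extend_for_image((2, 10), 256, has_images=True, first_step=True)
later = extend_for_image((2, 1), 256, has_images=True, first_step=False)
print(first, later)
```

Only the first step grows the sequence; on later steps the image tokens are already part of the cached context.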

    if token_type_ids is not None:
        token_type_ids = token_type_ids.view(-1, input_shape[-1])
    if position_ids is not None:
        position_ids = position_ids.view(-1, input_shape[-1])

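Here token_type_ids and position_ids are reshaped. token_type_ids marks which type each token belongs to (used here to tell tokens of different modalities apart), while position_ids is fundamental to the transformer: self-attention processes every element of the sequence in parallel and so has no built-in notion of position, whereas an RNN processes the data recurrently and naturally remembers earlier time steps. That is why position_ids is needed. The reshape makes sure the last dimension of both token_type_ids and position_ids is seq_len + image_context_len, ready for the steps that follow.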
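
To make the point about position information concrete, the sketch below shows that self-attention without positions gives each token the same output no matter where it sits in the sequence. It is a toy single-head attention in plain Python with identity projections (an assumption for brevity; real models use learned Q/K/V projections), not Qwen's attention:

```python
import math
from typing import List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head self-attention with identity projections (Q = K = V = tokens)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
order = [2, 0, 1]  # an arbitrary reordering of the sequence
shuffled = [tokens[i] for i in order]

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)
# Compare each shuffled output with the output of the same token in the original order.
same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for i, src in enumerate(order)
    for a, b in zip(shuffled_out[i], original_out[src])
)
print(same)
```

Since shuffling the input only shuffles the output, the order has to be injected separately, which is the job of position_ids.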

    if past_key_values is None:
        past_length = 0
        past_key_values = tuple([None] * len(self.h))
    else:
        if self.use_cache_quantization:
            past_length = past_key_values[0][0][0].size(2)
        else:
            past_length = past_key_values[0][0].size(-2)
    if position_ids is None:
        position_ids = torch.arange(
            past_length,
            input_shape[-1] + past_length,
            dtype=torch.long,
            device=device,
        )
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])

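If there are no cached key/value pairs, past_length is set to 0, meaning no part of the sequence has been processed yet (assuming use_cache), and past_key_values is initialized as a tuple of None whose length is the number of transformer layers (len(self.h), where self.h is the list of blocks), with one slot for each layer's key/value cache.
If there are cached key/value pairs and cache quantization is enabled, past_key_values[0][0][0] looks rather abstract: the first two indices select the first layer's key, and with quantization the key appears to be stored as a tuple itself, hence the extra index; the code then takes its third dimension. Without cache quantization, it takes the second-to-last dimension of the first layer's key tensor. All we need to know is that the code treats these dimensions as past_length; digging into the lower-level details gets messy.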
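
How past_length grows during generation can be sketched with a toy cache. The layout below (one (keys, values) pair per layer, each a plain list with one entry per cached token) is an assumption made for illustration and is much simpler than the tensors Qwen actually stores; `get_past_length` and `append_step` are hypothetical helpers:

```python
from typing import List, Optional, Tuple

# Toy cache layout: one (keys, values) pair per layer,
# each a list holding one entry per cached token.
LayerCache = Tuple[List[float], List[float]]
Cache = Tuple[LayerCache, ...]


def get_past_length(past_key_values: Optional[Cache]) -> int:
    """Number of tokens already cached, read from the first layer's keys."""
    if past_key_values is None:
        return 0
    return len(past_key_values[0][0])


def append_step(past: Optional[Cache], new_tokens: int, num_layers: int) -> Cache:
    """Return a new cache with `new_tokens` extra dummy entries in every layer."""
    if past is None:
        past = tuple(([], []) for _ in range(num_layers))
    extra = [0.0] * new_tokens
    return tuple((keys + extra, values + extra) for keys, values in past)


cache = None
lengths = [get_past_length(cache)]
cache = append_step(cache, new_tokens=5, num_layers=2)  # prompt of 5 tokens
lengths.append(get_past_length(cache))
cache = append_step(cache, new_tokens=1, num_layers=2)  # one generated token
lengths.append(get_past_length(cache))
print(lengths)
```

Reading the length from the first layer is enough because every layer caches the same number of tokens.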
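If position_ids was not passed in, we build one: it starts at past_length and ends (exclusive) at past_length + input_shape[-1], where input_shape[-1] is seq_len + image_context_len on the first step with images. Adding past_length to both ends shifts the positions past the tokens already cached without changing how many positions are generated. The dtype is torch.long and the tensor is created on device.
Finally position_ids is reshaped. Its original size is (seq_len + image_context_len,); unsqueeze(0) adds a leading dimension of size 1, and view then fixes the last dimension to seq_len + image_context_len. The view is arguably redundant here, but it does no harm and makes the code more robust.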
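
The position arithmetic can be checked with a plain-Python analogue of the torch.arange call. `build_position_ids` is a hypothetical helper, and the lengths used (4 text tokens plus an image context of 3) are made-up numbers for the example:

```python
from typing import List


def build_position_ids(past_length: int, seq_len: int) -> List[List[int]]:
    """Analogue of torch.arange(past_length, seq_len + past_length).unsqueeze(0)."""
    return [list(range(past_length, seq_len + past_length))]


# First step: nothing cached, 4 text tokens plus 3 image-context positions.
first_step = build_position_ids(past_length=0, seq_len=4 + 3)
# Next step: 7 tokens are cached and a single new token is fed in.
next_step = build_position_ids(past_length=7, seq_len=1)
print(first_step, next_step)
```

The new token continues from position 7, right after the cached ones, instead of restarting at 0.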
————————————————————————————————————————————
Out of time for today, so I'll stop here. More tomorrow, fighting~
