BiMPM for Text Matching in Practice (Part 2)

Introduction

This is the second article in the BiMPM text-matching series. It continues the walkthrough of the matching layer, starting from the attentive-matching strategy.

Attentive Matching

[Figure: illustration of the attentive-matching strategy]

As shown in the figure above, we first compute the cosine similarity between every forward (or backward) contextual embedding $\overrightarrow{h}_i^p$ (or $\overleftarrow{h}_i^p$) and every forward (or backward) contextual embedding of the other sentence $\overrightarrow{h}_j^q$ (or $\overleftarrow{h}_j^q$):

$$
\overrightarrow{\alpha}_{i,j} = \text{cosine}(\overrightarrow{h}_i^p, \overrightarrow{h}_j^q) \qquad
\overleftarrow{\alpha}_{i,j} = \text{cosine}(\overleftarrow{h}_i^p, \overleftarrow{h}_j^q) \qquad j = 1,\dots,N \tag{7}
$$

Then, $\overrightarrow{\alpha}_{i,j}$ (or $\overleftarrow{\alpha}_{i,j}$) is treated as the weight of $\overrightarrow{h}_j^q$ (or $\overleftarrow{h}_j^q$) — note that there is no softmax normalization here — and an attentive vector is computed as the weighted sum of all contextual embeddings of sentence $Q$:

$$
\overrightarrow{h}_i^{\,mean} = \frac{\sum_{j=1}^{N} \overrightarrow{\alpha}_{i,j} \cdot \overrightarrow{h}_j^q}{\sum_{j=1}^{N} \overrightarrow{\alpha}_{i,j}} \qquad
\overleftarrow{h}_i^{\,mean} = \frac{\sum_{j=1}^{N} \overleftarrow{\alpha}_{i,j} \cdot \overleftarrow{h}_j^q}{\sum_{j=1}^{N} \overleftarrow{\alpha}_{i,j}} \tag{8}
$$

Finally, each forward (or backward) contextual embedding $\overrightarrow{h}_i^p$ (or $\overleftarrow{h}_i^p$) is matched against its corresponding attentive vector:

$$
\overrightarrow{m}_i^{att} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_i^{\,mean}; W^5) \qquad
\overleftarrow{m}_i^{att} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_i^{\,mean}; W^6) \tag{9}
$$

This matching strategy is a bit more involved again, but it can be broken down into steps.

Step 1: as shown in Equation (7), the cosine similarity between the two sequences is computed directly, yielding for every time step of the first sequence a cosine-similarity value against every time step of the other sequence, i.e. a cosine-similarity matrix.

Step 2: these cosine-similarity values are used as weights to compute a weighted sum over the second sequence, which is then divided by the sum of the cosine-similarity values over all time steps of the second sequence.

Step 3: the original vectors of the first sequence and the weighted vectors from step 2 are used to compute the matching.
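Before looking at the implementation, here is a minimal sketch of the three steps with toy tensors (shapes and values are made up, and the real code additionally adds an epsilon to the denominator):

import torch
import torch.nn.functional as F

v1 = torch.randn(1, 3, 4)          # sentence P: (batch_size=1, seq_len1=3, hidden_size=4)
v2 = torch.randn(1, 2, 4)          # sentence Q: (batch_size=1, seq_len2=2, hidden_size=4)

# step 1: cosine similarity between every pair of time steps -> (1, 3, 2)
cosine = F.cosine_similarity(v1.unsqueeze(2), v2.unsqueeze(1), dim=-1)

# step 2: use the (un-normalized) cosine values as weights, take a weighted
# average of v2 over its time-step dimension -> (1, 3, 4)
attentive = (cosine.unsqueeze(-1) * v2.unsqueeze(1)).sum(2) / cosine.sum(-1, keepdim=True)

# step 3 would multiply v1 and `attentive` by the perspective weights and
# compute the cosine similarity again (see _attentive_matching below)
print(cosine.shape, attentive.shape)  # torch.Size([1, 3, 2]) torch.Size([1, 3, 4])

The actual implementation is as follows: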

    def _attentive_matching(
        self, v1: Tensor, v2: Tensor, cosine_matrix: Tensor, w: Tensor
    ) -> Tensor:
        """
        Args:
            v1 (Tensor): (batch_size, seq_len1, hidden_size)
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)
            w (Tensor): (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, num_perspective)
        """
        # (batch_size, seq_len1, hidden_size)
        attentive_vec = self._mean_attentive_vectors(v2, cosine_matrix)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        attentive_vec = self._time_distributed_multiply(attentive_vec, w)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        v1 = self._time_distributed_multiply(v1, w)
        # (batch_size, seq_len1, num_perspective)
        return self._cosine_similarity(v1, attentive_vec)

The result of step 1 is passed in as the cosine_matrix parameter here, which its shape (batch_size, seq_len1, seq_len2) also makes clear.

Step 2 is computed by attentive_vec = self._mean_attentive_vectors(v2, cosine_matrix). Note that v2 is what gets passed in, i.e. the matching is centered on v1.

Step 3 is the remaining code: the perspective-weighted vectors are computed for both sides, and then the matching is computed.

Let's continue with how the result of step 1 is computed; it is actually quite simple:

# (batch_size, seq_len1, hidden_size) -> (batch_size, seq_len1, 1, hidden_size)
v1 = v1.unsqueeze(2)
# (batch_size, seq_len2, hidden_size) -> (batch_size, 1, seq_len2, hidden_size)
v2 = v2.unsqueeze(1)
# (batch_size, seq_len1, seq_len2)
cosine_matrix = self._cosine_similarity(v1, v2)

Again this relies on broadcasting; the cosine similarity is then computed along the hidden_size dimension.
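As a sanity check, the same (batch_size, seq_len1, seq_len2) matrix can also be obtained by L2-normalizing both inputs and using a batch matrix multiplication; this is only an illustrative sketch, not part of the model code:

import torch
import torch.nn.functional as F

v1 = torch.randn(2, 5, 8)   # (batch_size, seq_len1, hidden_size)
v2 = torch.randn(2, 7, 8)   # (batch_size, seq_len2, hidden_size)
# cosine(a, b) = (a / |a|) . (b / |b|), so normalize then batch-matmul
cosine_matrix = torch.bmm(F.normalize(v1, dim=-1), F.normalize(v2, dim=-1).transpose(1, 2))
print(cosine_matrix.shape)  # torch.Size([2, 5, 7])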

Step 2 is implemented as follows:

    def _mean_attentive_vectors(self, v2: Tensor, cosine_matrix: Tensor) -> Tensor:
        """Calculate the mean attentive vector for the entire sentence by weighted
        summing all the contextual embeddings of the entire sentence.

        Args:
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)

        Returns:
            Tensor: (batch_size, seq_len1, hidden_size)
        """
        # (batch_size, seq_len1, seq_len2, 1)
        expanded_cosine_matrix = cosine_matrix.unsqueeze(-1)
        # (batch_size, 1, seq_len2, hidden_size)
        v2 = v2.unsqueeze(1)
        # (batch_size, seq_len1, hidden_size)
        weighted_sum = (expanded_cosine_matrix * v2).sum(2)
        # (batch_size, seq_len1, 1)
        sum_cosine = (cosine_matrix.sum(-1) + self.epsilon).unsqueeze(-1)
        # (batch_size, seq_len1, hidden_size)
        return weighted_sum / sum_cosine

It receives the second sequence and the cosine-similarity matrix.

To treat cosine_matrix as weights it has to be multiplied with v2, so an extra dimension is first added to each of them.

expanded_cosine_matrix is broadcast to (batch_size, seq_len1, seq_len2, hidden_size), and v2.unsqueeze(1) is likewise broadcast to (batch_size, seq_len1, seq_len2, hidden_size).

They can then be multiplied element-wise, giving a result of shape (batch_size, seq_len1, seq_len2, hidden_size). The weighted sum has to be taken over the time-step dimension of the second sequence, hence sum(2), which reduces the result to (batch_size, seq_len1, hidden_size).

At this point we only have the numerator of Equation (8).

Next we compute the denominator, which is even simpler: cosine_matrix is likewise summed over the time-step dimension of the second sequence:

sum_cosine = (cosine_matrix.sum(-1) + self.epsilon).unsqueeze(-1), which keeps the last dimension — equivalent to using keepdim=True. A tiny epsilon is also added to keep the denominator from being zero.

Finally, dividing the numerator by the denominator gives the result of Equation (8).
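For reference, Equation (8) can also be written compactly with torch.einsum; this is only a sketch assuming the same epsilon trick, not the implementation used above:

import torch

def mean_attentive_vectors_einsum(v2, cosine_matrix, epsilon=1e-8):
    # v2: (batch_size, seq_len2, hidden_size)
    # cosine_matrix: (batch_size, seq_len1, seq_len2)
    # sum over j of cosine[b, i, j] * v2[b, j, h] -> (batch_size, seq_len1, hidden_size)
    weighted_sum = torch.einsum("bij,bjh->bih", cosine_matrix, v2)
    return weighted_sum / (cosine_matrix.sum(-1, keepdim=True) + epsilon)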

Back in _attentive_matching, the remaining code implements Equation (9): both sides are multiplied by the perspective weights and then matched.

Note that this is relative: we need not only the attentive matching of $P$ against $Q$, but also the attentive matching of $Q$ against $P$.

Max-Attentive Matching

[Figure: illustration of the max-attentive-matching strategy]

The figure above illustrates this matching strategy. It is similar to the attentive-matching strategy; however, instead of taking the weighted sum of all contextual embeddings as the attentive vector, it picks the contextual embedding with the highest cosine similarity as the attentive vector. Each contextual embedding of sentence $P$ is then matched against its new attentive vector.

    def _max_attentive_matching(
        self, v1: Tensor, v2: Tensor, cosine_matrix: Tensor, w: Tensor
    ) -> Tensor:
        """
        Args:
            v1 (Tensor): (batch_size, seq_len1, hidden_size)
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)
            w (Tensor): (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, num_perspective)
        """
        # (batch_size, seq_len1, hidden_size)
        max_attentive_vec = self._max_attentive_vectors(v2, cosine_matrix)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        max_attentive_vec = self._time_distributed_multiply(max_attentive_vec, w)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        v1 = self._time_distributed_multiply(v1, w)
        # (batch_size, seq_len1, num_perspective)
        return self._cosine_similarity(v1, max_attentive_vec)

First the cosine-similarity matrix between v1 and v2 is computed, exactly as in attentive matching. Then, for each time step of v1, the time step of v2 with the highest cosine similarity is selected as the attentive vector to match against. Finally, v1 and this attentive vector are used to compute the matching.

Here _max_attentive_vectors is implemented as:

    def _max_attentive_vectors(self, v2: Tensor, cosine_matrix: Tensor) -> Tensor:
        """Calculate the max attentive vector for the entire sentence by picking
        the contextual embedding with the highest cosine similarity.

        Args:
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)

        Returns:
            Tensor: (batch_size, seq_len1, hidden_size)
        """
        # (batch_size, seq_len1) index values in [0, seq_len2)
        _, max_v2_step_idx = cosine_matrix.max(-1)

        batch_size, seq_len2, hidden_size = v2.size()
        seq_len1 = max_v2_step_idx.size(-1)

        # (batch_size * seq_len2, hidden_size)
        v2 = v2.contiguous().view(-1, hidden_size)
        # row k of the flattened v2 corresponds to (b, j) with k = b * seq_len2 + j,
        # so add a per-sample offset before flattening the indices
        offsets = (torch.arange(batch_size, device=max_v2_step_idx.device) * seq_len2).unsqueeze(-1)
        # (batch_size * seq_len1, )
        max_v2_step_idx = (max_v2_step_idx + offsets).contiguous().view(-1)
        # (batch_size * seq_len1, hidden_size)
        max_v2 = v2[max_v2_step_idx]
        # (batch_size, seq_len1, hidden_size)
        return max_v2.view(batch_size, seq_len1, hidden_size)

The cosine-similarity matrix cosine_matrix has shape (batch_size, seq_len1, seq_len2). To pick the maximum along the time-step dimension of v2, we first take from cosine_matrix the index of the maximum cosine similarity; max_v2_step_idx holds these indices, whose values lie in [0, seq_len2) and may repeat.

v2 is then reshaped to (batch_size * seq_len2, hidden_size). Because these indices are per-sample positions, a per-sample offset of b * seq_len2 is added before they are flattened to (batch_size * seq_len1,), so that each sample indexes its own rows of the flattened v2.

Indexing with v2[max_v2_step_idx] then gives a result of shape (batch_size * seq_len1, hidden_size).

Finally it is reshaped back to the shape of v1, i.e. (batch_size, seq_len1, hidden_size).
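The same selection can also be written with torch.gather, which avoids flattening the tensors and the per-sample offset bookkeeping; again, only a sketch for illustration:

import torch

def max_attentive_vectors_gather(v2, cosine_matrix):
    # v2: (batch_size, seq_len2, hidden_size)
    # cosine_matrix: (batch_size, seq_len1, seq_len2)
    max_idx = cosine_matrix.argmax(-1)                          # (batch_size, seq_len1)
    index = max_idx.unsqueeze(-1).expand(-1, -1, v2.size(-1))   # (batch_size, seq_len1, hidden_size)
    # output[b, i, h] = v2[b, index[b, i, h], h]
    return v2.gather(1, index)                                  # (batch_size, seq_len1, hidden_size)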

What remains is the multi-perspective matching against v1.

Complete Code of the Matching Layer

class MatchingLayer(nn.Module):
    def __init__(self, args: Namespace) -> None:
        super().__init__()
        self.args = args
        self.l = args.num_perspective
        self.epsilon = args.epsilon  # prevent dividing by zero
        for i in range(1, 9):
            self.register_parameter(
                f"mp_w{i}",
                nn.Parameter(torch.rand(self.l, args.hidden_size)),
            )
        self.reset_parameters()

    def reset_parameters(self):
        for _, parameter in self.named_parameters():
            nn.init.kaiming_normal_(parameter)

    def extra_repr(self) -> str:
        return ",".join([p[0] for p in self.named_parameters()])

    def forward(self, p: Tensor, q: Tensor) -> Tensor:
        """
        p: (batch_size, seq_len_p, 2 * hidden_size)
        q: (batch_size, seq_len_q, 2 * hidden_size)
        """
        # both p_fw and p_bw are (batch_size, seq_len_p, hidden_size)
        p_fw, p_bw = torch.split(p, self.args.hidden_size, -1)
        # both q_fw and q_bw are (batch_size, seq_len_q, hidden_size)
        q_fw, q_bw = torch.split(q, self.args.hidden_size, -1)

        # 1. Full Matching
        # (batch_size, seq_len1, 2 * l)
        m1 = torch.cat(
            [
                self._full_matching(p_fw, q_fw[:, -1, :], self.mp_w1),
                self._full_matching(p_bw, q_bw[:, 0, :], self.mp_w2),
            ],
            dim=-1,
        )

        # 2. Maxpooling Matching
        # (batch_size, seq_len1, 2 * l)
        m2 = torch.cat(
            [
                self._max_pooling_matching(p_fw, q_fw, self.mp_w3),
                self._max_pooling_matching(p_bw, q_bw, self.mp_w4),
            ],
            dim=-1,
        )

        # 3. Attentive Matching
        # (batch_size, seq_len1, seq_len2)
        consine_matrix_fw = self._consine_matrix(p_fw, q_fw)
        # (batch_size, seq_len1, seq_len2)
        consine_matrix_bw = self._consine_matrix(p_bw, q_bw)
        # (batch_size, seq_len1, 2 * l)
        m3 = torch.cat(
            [
                self._attentive_matching(p_fw, q_fw, consine_matrix_fw, self.mp_w5),
                self._attentive_matching(p_bw, q_bw, consine_matrix_bw, self.mp_w6),
            ],
            dim=-1,
        )

        # 4. Max Attentive Matching
        # (batch_size, seq_len1, 2 * l)
        m4 = torch.cat(
            [
                self._max_attentive_matching(p_fw, q_fw, consine_matrix_fw, self.mp_w7),
                self._max_attentive_matching(p_bw, q_bw, consine_matrix_bw, self.mp_w8),
            ],
            dim=-1,
        )

        # (batch_size, seq_len1, 8 * l)
        return torch.cat([m1, m2, m3, m4], dim=-1)

    def _cosine_similarity(self, v1: Tensor, v2: Tensor) -> Tensor:
        """Compute the cosine similarity between v1 and v2.

        Args:
            v1 (Tensor): (..., hidden_size)
            v2 (Tensor): (..., hidden_size)

        Returns:
            Tensor: (..., l)
        """
        # element-wise multiply
        cosine = v1 * v2
        # sum over the hidden_size dimension
        cosine = cosine.sum(-1)
        # norms over the hidden_size dimension
        v1_norm = torch.sqrt(torch.sum(v1**2, -1).clamp(min=self.epsilon))
        v2_norm = torch.sqrt(torch.sum(v2**2, -1).clamp(min=self.epsilon))
        # (batch_size, seq_len, l) or (batch_size, seq_len1, seq_len2, l)
        return cosine / (v1_norm * v2_norm)

    def _time_distributed_multiply(self, x: Tensor, w: Tensor) -> Tensor:
        """Element-wise multiply vector and weights.

        Args:
            x (Tensor): sequence vector (batch_size, seq_len, hidden_size) or single vector (batch_size, hidden_size)
            w (Tensor): weights (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len, num_perspective, hidden_size) or (batch_size, num_perspective, hidden_size)
        """
        # dimension of x
        n_dim = x.dim()
        hidden_size = x.size(-1)
        # only used if n_dim == 3
        seq_len = x.size(1)
        # (batch_size * seq_len, hidden_size) for n_dim == 3
        # (batch_size, hidden_size) for n_dim == 2
        x = x.contiguous().view(-1, hidden_size)
        # (batch_size * seq_len, 1, hidden_size) for n_dim == 3
        # (batch_size, 1, hidden_size) for n_dim == 2
        x = x.unsqueeze(1)
        # (1, num_perspective, hidden_size)
        w = w.unsqueeze(0)
        # (batch_size * seq_len, num_perspective, hidden_size) for n_dim == 3
        # (batch_size, num_perspective, hidden_size) for n_dim == 2
        x = x * w
        # reshape to original shape
        if n_dim == 3:
            # (batch_size, seq_len, num_perspective, hidden_size)
            x = x.view(-1, seq_len, self.l, hidden_size)
        elif n_dim == 2:
            # (batch_size, num_perspective, hidden_size)
            x = x.view(-1, self.l, hidden_size)
        # (batch_size, seq_len, num_perspective, hidden_size) for n_dim == 3
        # (batch_size, num_perspective, hidden_size) for n_dim == 2
        return x

    def _full_matching(self, v1: Tensor, v2_last: Tensor, w: Tensor) -> Tensor:
        """Full matching operation.

        Args:
            v1 (Tensor): the full embedding vector sequence (batch_size, seq_len1, hidden_size)
            v2_last (Tensor): single embedding vector (batch_size, hidden_size)
            w (Tensor): weights of one direction (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, num_perspective)
        """
        # (batch_size, seq_len1, num_perspective, hidden_size)
        v1 = self._time_distributed_multiply(v1, w)
        # (batch_size, num_perspective, hidden_size)
        v2 = self._time_distributed_multiply(v2_last, w)
        # (batch_size, 1, num_perspective, hidden_size)
        v2 = v2.unsqueeze(1)
        # (batch_size, seq_len1, num_perspective)
        return self._cosine_similarity(v1, v2)

    def _max_pooling_matching(self, v1: Tensor, v2: Tensor, w: Tensor) -> Tensor:
        """Max pooling matching operation.

        Args:
            v1 (Tensor): (batch_size, seq_len1, hidden_size)
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            w (Tensor): (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, num_perspective)
        """
        # (batch_size, seq_len1, num_perspective, hidden_size)
        v1 = self._time_distributed_multiply(v1, w)
        # (batch_size, seq_len2, num_perspective, hidden_size)
        v2 = self._time_distributed_multiply(v2, w)
        # (batch_size, seq_len1, 1, num_perspective, hidden_size)
        v1 = v1.unsqueeze(2)
        # (batch_size, 1, seq_len2, num_perspective, hidden_size)
        v2 = v2.unsqueeze(1)
        # (batch_size, seq_len1, seq_len2, num_perspective)
        cosine = self._cosine_similarity(v1, v2)
        # (batch_size, seq_len1, num_perspective)
        return cosine.max(2)[0]

    def _consine_matrix(self, v1: Tensor, v2: Tensor) -> Tensor:
        """
        Args:
            v1 (Tensor): (batch_size, seq_len1, hidden_size)
            v2 (Tensor): (batch_size, seq_len2, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, seq_len2)
        """
        # (batch_size, seq_len1, 1, hidden_size)
        v1 = v1.unsqueeze(2)
        # (batch_size, 1, seq_len2, hidden_size)
        v2 = v2.unsqueeze(1)
        # (batch_size, seq_len1, seq_len2)
        return self._cosine_similarity(v1, v2)

    def _mean_attentive_vectors(self, v2: Tensor, cosine_matrix: Tensor) -> Tensor:
        """Calculate the mean attentive vector for the entire sentence by weighted
        summing all the contextual embeddings of the entire sentence.

        Args:
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)

        Returns:
            Tensor: (batch_size, seq_len1, hidden_size)
        """
        # (batch_size, seq_len1, seq_len2, 1)
        expanded_cosine_matrix = cosine_matrix.unsqueeze(-1)
        # (batch_size, 1, seq_len2, hidden_size)
        v2 = v2.unsqueeze(1)
        # (batch_size, seq_len1, hidden_size)
        weighted_sum = (expanded_cosine_matrix * v2).sum(2)
        # (batch_size, seq_len1, 1)
        sum_cosine = (cosine_matrix.sum(-1) + self.epsilon).unsqueeze(-1)
        # (batch_size, seq_len1, hidden_size)
        return weighted_sum / sum_cosine

    def _max_attentive_vectors(self, v2: Tensor, cosine_matrix: Tensor) -> Tensor:
        """Calculate the max attentive vector for the entire sentence by picking
        the contextual embedding with the highest cosine similarity.

        Args:
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)

        Returns:
            Tensor: (batch_size, seq_len1, hidden_size)
        """
        # (batch_size, seq_len1) index values in [0, seq_len2)
        _, max_v2_step_idx = cosine_matrix.max(-1)

        batch_size, seq_len2, hidden_size = v2.size()
        seq_len1 = max_v2_step_idx.size(-1)

        # (batch_size * seq_len2, hidden_size)
        v2 = v2.contiguous().view(-1, hidden_size)
        # add a per-sample offset so each sample indexes its own rows of the flattened v2
        offsets = (torch.arange(batch_size, device=max_v2_step_idx.device) * seq_len2).unsqueeze(-1)
        # (batch_size * seq_len1, )
        max_v2_step_idx = (max_v2_step_idx + offsets).contiguous().view(-1)
        # (batch_size * seq_len1, hidden_size)
        max_v2 = v2[max_v2_step_idx]
        # (batch_size, seq_len1, hidden_size)
        return max_v2.view(batch_size, seq_len1, hidden_size)

    def _attentive_matching(
        self, v1: Tensor, v2: Tensor, cosine_matrix: Tensor, w: Tensor
    ) -> Tensor:
        """
        Args:
            v1 (Tensor): (batch_size, seq_len1, hidden_size)
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)
            w (Tensor): (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, num_perspective)
        """
        # (batch_size, seq_len1, hidden_size)
        attentive_vec = self._mean_attentive_vectors(v2, cosine_matrix)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        attentive_vec = self._time_distributed_multiply(attentive_vec, w)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        v1 = self._time_distributed_multiply(v1, w)
        # (batch_size, seq_len1, num_perspective)
        return self._cosine_similarity(v1, attentive_vec)

    def _max_attentive_matching(
        self, v1: Tensor, v2: Tensor, cosine_matrix: Tensor, w: Tensor
    ) -> Tensor:
        """
        Args:
            v1 (Tensor): (batch_size, seq_len1, hidden_size)
            v2 (Tensor): (batch_size, seq_len2, hidden_size)
            cosine_matrix (Tensor): (batch_size, seq_len1, seq_len2)
            w (Tensor): (num_perspective, hidden_size)

        Returns:
            Tensor: (batch_size, seq_len1, num_perspective)
        """
        # (batch_size, seq_len1, num_perspective, hidden_size)
        v1 = self._time_distributed_multiply(v1, w)
        # (batch_size, seq_len1, hidden_size)
        max_attentive_vec = self._max_attentive_vectors(v2, cosine_matrix)
        # (batch_size, seq_len1, num_perspective, hidden_size)
        max_attentive_vec = self._time_distributed_multiply(max_attentive_vec, w)
        # (batch_size, seq_len1, num_perspective)
        return self._cosine_similarity(v1, max_attentive_vec)

The forward method applies all four matching strategies to every time step of sentence $P$ and concatenates the resulting eight vectors as the matching vector for that time step of $P$.

The same process is also performed in the other matching direction (from $Q$ to $P$), which finally gives two sequences of matching vectors.

Aggregation Layer

This layer aggregates the two matching-vector sequences above into a single fixed-length matching vector. Another BiLSTM is used here, applied separately to each of the two matching-vector sequences; a new fixed-length vector is then built by concatenating the BiLSTM's last time-step vectors (the four green vectors in Figure 1). Running a matching-vector sequence through the BiLSTM yields the last time-step outputs of both the forward and the backward direction, so the two sequences give four vectors in total.

class AggregationLayer(nn.Module):
    def __init__(self, args: Namespace):
        super().__init__()
        self.agg_lstm = nn.LSTM(
            input_size=args.num_perspective * 8,
            hidden_size=args.hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.kaiming_normal_(self.agg_lstm.weight_ih_l0)
        nn.init.constant_(self.agg_lstm.bias_ih_l0, val=0)
        nn.init.orthogonal_(self.agg_lstm.weight_hh_l0)
        nn.init.constant_(self.agg_lstm.bias_hh_l0, val=0)

        nn.init.kaiming_normal_(self.agg_lstm.weight_ih_l0_reverse)
        nn.init.constant_(self.agg_lstm.bias_ih_l0_reverse, val=0)
        nn.init.orthogonal_(self.agg_lstm.weight_hh_l0_reverse)
        nn.init.constant_(self.agg_lstm.bias_hh_l0_reverse, val=0)

    def forward(self, v1: Tensor, v2: Tensor) -> Tensor:
        """
        Args:
            v1 (Tensor): (batch_size, seq_len1, l * 8)
            v2 (Tensor): (batch_size, seq_len2, l * 8)

        Returns:
            Tensor: (batch_size, 4 * hidden_size)
        """
        batch_size = v1.size(0)
        # v1_last (2, batch_size, hidden_size)
        _, (v1_last, _) = self.agg_lstm(v1)
        # v2_last (2, batch_size, hidden_size)
        _, (v2_last, _) = self.agg_lstm(v2)
        # v1_last (batch_size, 2, hidden_size)
        v1_last = v1_last.transpose(1, 0)
        v2_last = v2_last.transpose(1, 0)
        # v1_last (batch_size, 2 * hidden_size)
        v1_last = v1_last.contiguous().view(batch_size, -1)
        # v2_last (batch_size, 2 * hidden_size)
        v2_last = v2_last.contiguous().view(batch_size, -1)
        # (batch_size, 4 * hidden_size)
        return torch.cat([v1_last, v2_last], dim=-1)

The aggregation layer is comparatively very simple.

The point to note is how the last time-step states are extracted here and finally concatenated into an output of shape (batch_size, 4 * hidden_size).
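A small standalone sketch (with made-up sizes) of why h_n has shape (num_directions, batch_size, hidden_size) for a single-layer BiLSTM, and how the transpose plus view yields the concatenated last states:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=8, batch_first=True, bidirectional=True)
x = torch.randn(4, 10, 16)                            # (batch_size, seq_len, input_size)
_, (h_n, _) = lstm(x)                                 # h_n: (2, 4, 8) = (num_directions, batch_size, hidden_size)
last = h_n.transpose(0, 1).contiguous().view(4, -1)   # (4, 16) = (batch_size, 2 * hidden_size)
print(h_n.shape, last.shape)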

Prediction Layer

The goal of this layer is to estimate the probability distribution $\text{Pr}(y \mid P, Q)$ that we need. The authors apply a two-layer feed-forward network whose input is the fixed-length vector from the previous layer.

class Prediction(nn.Module):
    def __init__(self, args: Namespace) -> None:
        super().__init__()
        self.predict = nn.Sequential(
            nn.Linear(args.hidden_size * 4, args.hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(args.dropout),
            nn.Linear(args.hidden_size * 2, args.num_classes),
        )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        nn.init.uniform_(self.predict[0].weight, -0.005, 0.005)
        nn.init.constant_(self.predict[0].bias, val=0)
        nn.init.uniform_(self.predict[-1].weight, -0.005, 0.005)
        nn.init.constant_(self.predict[-1].bias, val=0)

    def forward(self, x: Tensor) -> Tensor:
        return self.predict(x)

Similar to the implementations of the previous models in this series, this is a two-layer FFN.
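Note that the network outputs raw logits rather than probabilities: the nn.CrossEntropyLoss used during training applies log-softmax internally, and $\text{Pr}(y \mid P, Q)$ can be recovered with a softmax at inference time. A minimal sketch:

import torch
import torch.nn.functional as F

logits = torch.randn(3, 2)                        # (batch_size, num_classes), as produced by Prediction
probs = F.softmax(logits, dim=-1)                 # Pr(y | P, Q)
loss = F.cross_entropy(logits, torch.tensor([0, 1, 1]))
print(probs.sum(-1), loss.item())                 # each row of probs sums to 1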

Last of all, these layers are combined to form the BiMPM implementation.

BiMPM Implementation

class BiMPM(nn.Module):
    def __init__(self, args: Namespace) -> None:
        super().__init__()
        self.args = args
        # the concatenated embedding size of a word
        self.d = args.word_embedding_dim + args.char_hidden_size
        self.l = args.num_perspective

        self.word_rep = WordRepresentation(args)
        self.context_rep = ContextRepresentation(args)
        self.matching_layer = MatchingLayer(args)
        self.aggregation = AggregationLayer(args)
        self.prediction = Prediction(args)

    def dropout(self, x: Tensor) -> Tensor:
        return F.dropout(input=x, p=self.args.dropout, training=self.training)

    def forward(self, p: Tensor, q: Tensor, char_p: Tensor, char_q: Tensor) -> Tensor:
        """
        Args:
            p (Tensor): word input sequence (batch_size, seq_len1)
            q (Tensor): word input sequence (batch_size, seq_len2)
            char_p (Tensor): character input sequence (batch_size, seq_len1, word_len)
            char_q (Tensor): character input sequence (batch_size, seq_len2, word_len)

        Returns:
            Tensor: (batch_size, 2)
        """
        # (batch_size, seq_len1, word_embedding_dim + char_hidden_size)
        p_rep = self.dropout(self.word_rep(p, char_p))
        # (batch_size, seq_len2, word_embedding_dim + char_hidden_size)
        q_rep = self.dropout(self.word_rep(q, char_q))
        # (batch_size, seq_len1, 2 * hidden_size)
        p_context = self.dropout(self.context_rep(p_rep))
        # (batch_size, seq_len2, 2 * hidden_size)
        q_context = self.dropout(self.context_rep(q_rep))
        # (batch_size, seq_len1, 8 * l)
        p_match = self.dropout(self.matching_layer(p_context, q_context))
        # (batch_size, seq_len2, 8 * l)
        q_match = self.dropout(self.matching_layer(q_context, p_context))
        # (batch_size, 4 * hidden_size)
        aggregation = self.dropout(self.aggregation(p_match, q_match))
        # (batch_size, 2)
        return self.prediction(aggregation)
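Since WordRepresentation and ContextRepresentation were covered in the previous article, here is a quick shape check of just the layers implemented in this one, wired together on random context representations; the hyperparameter values below are made up and only meant to exercise the shapes:

import torch
from argparse import Namespace

args = Namespace(hidden_size=100, num_perspective=20, epsilon=1e-8, dropout=0.1, num_classes=2)
matching, aggregation, prediction = MatchingLayer(args), AggregationLayer(args), Prediction(args)

p_context = torch.randn(2, 15, 2 * args.hidden_size)   # (batch_size, seq_len1, 2 * hidden_size)
q_context = torch.randn(2, 12, 2 * args.hidden_size)   # (batch_size, seq_len2, 2 * hidden_size)

p_match = matching(p_context, q_context)                # (2, 15, 8 * num_perspective)
q_match = matching(q_context, p_context)                # (2, 12, 8 * num_perspective)
logits = prediction(aggregation(p_match, q_match))      # (2, num_classes)
print(p_match.shape, q_match.shape, logits.shape)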

Model Training

Define the metric function:

def metrics(y: torch.Tensor, y_pred: torch.Tensor) -> Tuple[float, float, float, float]:
    TP = ((y_pred == 1) & (y == 1)).sum().float()  # True Positive
    TN = ((y_pred == 0) & (y == 0)).sum().float()  # True Negative
    FN = ((y_pred == 0) & (y == 1)).sum().float()  # False Negative
    FP = ((y_pred == 1) & (y == 0)).sum().float()  # False Positive
    p = TP / (TP + FP).clamp(min=1e-8)  # Precision
    r = TP / (TP + FN).clamp(min=1e-8)  # Recall
    F1 = 2 * r * p / (r + p).clamp(min=1e-8)  # F1 score
    acc = (TP + TN) / (TP + TN + FP + FN).clamp(min=1e-8)  # Accuracy
    return acc, p, r, F1
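A quick sanity check of metrics on a made-up prediction (expected output: accuracy 0.60, precision/recall/F1 about 0.67):

import torch

y      = torch.tensor([1, 0, 1, 1, 0])
y_pred = torch.tensor([1, 0, 0, 1, 1])
acc, p, r, f1 = metrics(y, y_pred)
print(f"acc={acc:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")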

Define the evaluation and training functions:

def evaluate(
    data_iter: DataLoader, model: nn.Module
) -> Tuple[float, float, float, float]:
    y_list, y_pred_list = [], []
    model.eval()
    for x1, x2, c1, c2, y in tqdm(data_iter):
        x1 = x1.to(device).long()
        x2 = x2.to(device).long()
        c1 = c1.to(device).long()
        c2 = c2.to(device).long()
        y = torch.LongTensor(y).to(device)

        output = model(x1, x2, c1, c2)
        pred = torch.argmax(output, dim=1).long()

        y_pred_list.append(pred)
        y_list.append(y)

    y_pred = torch.cat(y_pred_list, 0)
    y = torch.cat(y_list, 0)
    acc, p, r, f1 = metrics(y, y_pred)
    return acc, p, r, f1


def train(
    data_iter: DataLoader,
    model: nn.Module,
    criterion: nn.CrossEntropyLoss,
    optimizer: torch.optim.Optimizer,
    print_every: int = 500,
    verbose=True,
) -> None:
    model.train()

    for step, (x1, x2, c1, c2, y) in enumerate(tqdm(data_iter)):
        x1 = x1.to(device).long()
        x2 = x2.to(device).long()
        c1 = c1.to(device).long()
        c2 = c2.to(device).long()
        y = torch.LongTensor(y).to(device)

        output = model(x1, x2, c1, c2)

        loss = criterion(output, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if verbose and (step + 1) % print_every == 0:
            pred = torch.argmax(output, dim=1).long()
            acc, p, r, f1 = metrics(y, pred)
            print(
                f" TRAIN iter={step+1} loss={loss.item():.6f} accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}"
            )

Note that there are now four inputs to the model.

Start training:

make_dirs(args.save_dir)

if args.cuda:
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
else:
    device = torch.device("cpu")

print(f"Using device: {device}.")

vectorizer_path = os.path.join(args.save_dir, args.vectorizer_file)

train_df = build_dataframe_from_csv(args.dataset_csv.format("train"))
test_df = build_dataframe_from_csv(args.dataset_csv.format("test"))
dev_df = build_dataframe_from_csv(args.dataset_csv.format("dev"))

print("Creating a new Vectorizer.")

train_chars = train_df.sentence1.to_list() + train_df.sentence2.to_list()

char_vocab = Vocabulary.build(train_chars, args.min_char_freq)
args.char_vocab_size = len(char_vocab)

train_word_df = tokenize_df(train_df)
test_word_df = tokenize_df(test_df)
dev_word_df = tokenize_df(dev_df)

train_sentences = train_df.sentence1.to_list() + train_df.sentence2.to_list()

word_vocab = Vocabulary.build(train_sentences, args.min_word_freq)
args.word_vocab_size = len(word_vocab)

words = [word_vocab.lookup_token(idx) for idx in range(args.word_vocab_size)]

longest_word = ""
for word in words:
    if len(word) > len(longest_word):
        longest_word = word

args.max_word_len = len(longest_word)

char_vectorizer = TMVectorizer(char_vocab, len(longest_word))
word_vectorizer = TMVectorizer(word_vocab, args.max_len)

train_dataset = TMDataset(train_df, char_vectorizer, word_vectorizer)
test_dataset = TMDataset(test_df, char_vectorizer, word_vectorizer)
dev_dataset = TMDataset(dev_df, char_vectorizer, word_vectorizer)

train_data_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True)
dev_data_loader = DataLoader(dev_dataset, batch_size=args.batch_size)
test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size)

print(f"Arguments : {args}")
model = BiMPM(args)
print(f"Model: {model}")
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
criterion = nn.CrossEntropyLoss()

for epoch in range(args.num_epochs):
    train(
        train_data_loader,
        model,
        criterion,
        optimizer,
        print_every=args.print_every,
        verbose=args.verbose,
    )
    print("Begin evalute on dev set.")
    with torch.no_grad():
        acc, p, r, f1 = evaluate(dev_data_loader, model)
        print(
            f"EVALUATE [{epoch+1}/{args.num_epochs}]  accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}"
        )

model.eval()

acc, p, r, f1 = evaluate(test_data_loader, model)
print(f"TEST accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}")
BiMPM(
  (word_rep): WordRepresentation(
    (char_embed): Embedding(4699, 20, padding_idx=0)
    (word_embed): Embedding(35092, 300)
    (char_lstm): LSTM(20, 50, batch_first=True)
  )
  (context_rep): ContextRepresentation(
    (context_lstm): LSTM(350, 100, batch_first=True, bidirectional=True)
  )
  (matching_layer): MatchingLayer(mp_w1,mp_w2,mp_w3,mp_w4,mp_w5,mp_w6,mp_w7,mp_w8)
  (aggregation): AggregationLayer(
    (agg_lstm): LSTM(160, 100, batch_first=True, bidirectional=True)
  )
  (prediction): Prediction(
    (predict): Sequential(
      (0): Linear(in_features=400, out_features=200, bias=True)
      (1): ReLU()
      (2): Dropout(p=0.1, inplace=False)
      (3): Linear(in_features=200, out_features=2, bias=True)
    )
  )
)
...
TRAIN iter=1000 loss=0.042604 accuracy=0.984 precision=0.974 recal=1.000 f1 score=0.9868
 80%|████████  | 1500/1866 [11:37<02:49,  2.16it/s]
TRAIN iter=1500 loss=0.057931 accuracy=0.969 precision=0.973 recal=0.973 f1 score=0.9730
100%|██████████| 1866/1866 [14:26<00:00,  2.15it/s]
Begin evalute on dev set.
100%|██████████| 69/69 [00:11<00:00,  6.12it/s]
EVALUATE [10/10]  accuracy=0.836 precision=0.815 recal=0.870 f1 score=0.8418
100%|██████████| 98/98 [00:16<00:00,  6.11it/s]
TEST accuracy=0.825 precision=0.766 recal=0.936 f1 score=0.8424
100%|██████████| 98/98 [00:16<00:00,  6.11it/s]
TEST[best f1] accuracy=0.825 precision=0.766 recal=0.936 f1 score=0.8424

The result on the test set is nearly one point better than ESIM's, but the model is considerably more complex.

References

  1. [论文笔记] BiMPM (paper notes on BiMPM)
  2. https://github.com/JeremyHide/BiMPM_Pytorch
