BERT and LSTM: Performance in Sentiment Classification

1. Overview

The purpose of this article is to evaluate and compare the performance of two deep learning algorithms, BERT and LSTM, on binary sentiment classification. The evaluation focuses on two key metrics: accuracy (which measures overall classification performance) and training time (which assesses the efficiency of each algorithm).

2. Data

To this end, I used the IMDB dataset, which contains 50,000 movie reviews. The dataset is evenly split into 25,000 positive and 25,000 negative reviews, which makes it well suited for training and testing binary sentiment analysis models. The dataset can be obtained from the following link:

IMDB Dataset of 50K Movie Reviews (Large Movie Review Dataset) on www.kaggle.com

The figure below shows the first five rows of the dataset. I assigned the label 1 to positive sentiment and 0 to negative sentiment.

[Figure: the first five rows of the dataset, with sentiment labels 1 (positive) and 0 (negative)]
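For reference, here is a minimal sketch of that labeling step (not from the original post). It assumes the Kaggle download has 'review' and 'sentiment' columns with the string values 'positive'/'negative', and it writes the result to movie_data.csv, the file name that the BERT code below reads; the input file name is an assumption.

import pandas as pd

# Hypothetical preprocessing: map the string sentiment labels to 1 (positive) and 0 (negative).
# The input file name and its column values are assumptions about the raw Kaggle download.
df = pd.read_csv('IMDB Dataset.csv')
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df = df.sample(frac=1.0, random_state=123).reset_index(drop=True)  # shuffle before splitting
df.to_csv('movie_data.csv', index=False)
print(df.head())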

3. Algorithms

1. Long Short-Term Memory (LSTM): a type of recurrent neural network (RNN) designed to process sequential data. It captures long-term dependencies by using memory cells and gates.

2. BERT (Bidirectional Encoder Representations from Transformers): a pre-trained Transformer-based model trained with a self-supervised learning approach. It uses bidirectional context to understand the meaning of each word in a sentence.
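To make the bidirectional-context idea concrete, here is a small optional sketch (not part of the experiment below) using the Hugging Face fill-mask pipeline with DistilBERT: the masked word is predicted from the words on both its left and its right.

from transformers import pipeline

# The masked token is predicted from both sides of its context,
# which is the bidirectional behavior described above.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
for pred in fill_mask("The movie was absolutely [MASK], I loved every minute of it."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")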

Configuration

For the LSTM, the model takes text sequences together with the corresponding length of each sequence as input. It embeds the text (embedding dimension = 20), processes it with an LSTM layer (hidden size = 64), passes the last hidden state through a fully connected layer with ReLU activation, and finally applies a sigmoid activation to produce a single output value between 0 and 1. (Epochs: 10, learning rate: 0.001, optimizer: Adam.)
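As a quick sanity check of these dimensions, here is a standalone sketch with random token ids (the vocabulary size and sequence length are made up; the real values come from the code in section 4):

import torch
import torch.nn as nn

# Toy batch: 4 sequences of 50 token ids from a made-up vocabulary of 1,000 words.
vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size = 1000, 20, 64, 64
tokens = torch.randint(0, vocab_size, (4, 50))

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
lstm = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
head = nn.Sequential(nn.Linear(rnn_hidden_size, fc_hidden_size), nn.ReLU(),
                     nn.Linear(fc_hidden_size, 1), nn.Sigmoid())

embedded = embedding(tokens)        # (4, 50, 20)
_, (hidden, _) = lstm(embedded)     # hidden: (1, 4, 64), the last hidden state of each sequence
prob = head(hidden[-1])             # (4, 1), probability of positive sentiment
print(embedded.shape, hidden.shape, prob.shape)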

For BERT, I used DistilBertForSequenceClassification, which is based on the DistilBERT architecture. DistilBERT is a smaller, distilled version of the original BERT model. It is designed to have fewer parameters and lower computational complexity while maintaining a similar level of performance. (Epochs: 3, learning rate: 5e-5, optimizer: Adam.)
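To illustrate the "fewer parameters" claim, the following optional sketch compares the parameter counts of the two checkpoints (it downloads both models, which the article itself does not do; the approximate counts in the comment are from the published model cards):

from transformers import AutoModel

# DistilBERT has roughly 40% fewer parameters than BERT-base
# (about 66M vs. about 110M), which is why it fine-tunes faster.
for name in ["distilbert-base-uncased", "bert-base-uncased"]:
    encoder = AutoModel.from_pretrained(name)
    print(f"{name}: {encoder.num_parameters()/1e6:.1f}M parameters")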

4. LSTM Code

!pip install torchtext
!pip install "portalocker>=2.0.0"

import torch
import torch.nn as nn
from torchtext.datasets import IMDB
from torch.utils.data.dataset import random_split

# Step 1: load and create the datasets
train_dataset = IMDB(split='train')
test_dataset = IMDB(split='test')

# Set the random seed to 123 to compare with the BERT model
torch.manual_seed(123)
train_dataset, valid_dataset = random_split(list(train_dataset), [20000, 5000])

## Step 2: find unique tokens (words)
import re
from collections import Counter, OrderedDict

token_counts = Counter()

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    tokenized = text.split()
    return tokenized

for label, line in train_dataset:
    tokens = tokenizer(line)
    token_counts.update(tokens)

print('Vocab-size:', len(token_counts))

## Step 3: encoding each unique token into integers
from torchtext.vocab import vocab

sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vocab = vocab(ordered_dict)

'''
The special tokens "<pad>" and "<unk>" are inserted into the vocabulary with
vocab.insert_token("<pad>", 0) and vocab.insert_token("<unk>", 1), respectively.
Index 0 is assigned to the "<pad>" token, which is typically used for padding sequences.
Index 1 is assigned to the "<unk>" token, which represents unknown or out-of-vocabulary tokens.
vocab.set_default_index(1) sets the default index of the vocabulary to 1, which corresponds
to the "<unk>" token, so any token not found in the vocabulary is mapped to index 1.
'''
vocab.insert_token("<pad>", 0)
vocab.insert_token("<unk>", 1)
vocab.set_default_index(1)

print([vocab[token] for token in ['this', 'is', 'an', 'example']])
'''
The IMDB class in torchtext encodes labels as 1 = negative and 2 = positive.

The label_pipeline lambda function takes a label value x as input.
It checks whether the label value x equals 2 using the comparison x == 2.
If the condition is true, it returns the float value 1.0, meaning the label is positive.
If the condition is false (the label value is not 2), it returns 0.0, meaning the label is negative.
'''
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: 1. if x == 2 else 0.

'''
The line below means that subsequent computations and tensors will be moved to the
specified CUDA device, taking advantage of GPU acceleration if available.
'''
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Step 3-B: wrap the encode and transformation function
'''
Instead of loading all of the reviews into memory at once, which is far too expensive,
we load one batch at a time, which requires much less memory than loading the complete dataset.
Another reason for using mini-batches is that the model updates its parameters (weights and
biases) after every batch rather than only once per pass over the whole dataset, which makes
training faster and more stable.
'''
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))
    ## Convert lists to tensors
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    ## Pad the text sequences in text_list to the same length by adding padding tokens
    padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    return padded_text_list.to(device), label_list.to(device), lengths.to(device)

## Take a small batch to check that the wrapping works
from torch.utils.data import DataLoader

dataloader = DataLoader(train_dataset, batch_size=4, shuffle=False, collate_fn=collate_batch)
text_batch, label_batch, length_batch = next(iter(dataloader))
print(text_batch)
print(label_batch)
print(length_batch)
print(text_batch.shape)

## Step 4: batching the datasets
batch_size = 32
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
test_dl = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)

print(len(list(train_dl.dataset)))
print(len(list(valid_dl.dataset)))
print(len(list(test_dl.dataset)))

'''
The code below defines an RNN model that takes encoded text inputs,
processes them through an embedding layer and an LSTM layer,
and produces a binary output using fully connected layers and a sigmoid activation function.
The model is initialized with specific parameters and moved to the specified device for computation.
'''
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True)
        out, (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

vocab_size = len(vocab)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

torch.manual_seed(123)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size)
model = model.to(device)

def train(dataloader):
    model.train()
    total_acc, total_loss = 0, 0
    for text_batch, label_batch, lengths in dataloader:
        optimizer.zero_grad()
        pred = model(text_batch, lengths)[:, 0]
        loss = loss_fn(pred, label_batch)
        loss.backward()
        optimizer.step()
        total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
        total_loss += loss.item() * label_batch.size(0)
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)

def evaluate(dataloader):
    model.eval()
    total_acc, total_loss = 0, 0
    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            pred = model(text_batch, lengths)[:, 0]
            loss = loss_fn(pred, label_batch.float())  # Convert label_batch to float
            total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item() * label_batch.size(0)
    return total_acc/len(list(dataloader.dataset)), \
           total_loss/len(list(dataloader.dataset))

import time
start_time = time.time()

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 10

torch.manual_seed(123)
for epoch in range(num_epochs):
    acc_train, loss_train = train(train_dl)
    acc_valid, loss_valid = evaluate(valid_dl)
    print(f'Epoch {epoch} accuracy: {acc_train:.4f} val_accuracy: {acc_valid:.4f}')
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')

acc_test, _ = evaluate(test_dl)
print(f'test_accuracy: {acc_test:.4f}')

## Test with new movie reviews of Spider-Man: Across the Spider-Verse (2023)
def collate_single_text(text):
    processed_text = torch.tensor(text_pipeline(text), dtype=torch.int64)
    length = processed_text.size(0)
    padded_text = nn.utils.rnn.pad_sequence([processed_text], batch_first=True)
    return padded_text.to(device), length

text = "It is the first marvel movie to make me shed a tear. It has heart, it feels so alive with it's conveyance of emotions and feelings, it uses our nostalgia for the first movie AGAINST US it is on a completely new level of animation, there is a twist on every turn you make while watching this movie. "
padded_text, length = collate_single_text(text)

model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    lengths = torch.tensor([length])  # Use the actual sequence length, not the batch size
    output = model(padded_text, lengths)
    probability = output.item()  # Obtain the predicted probability

prediction = "Positive" if probability >= 0.5 else "Negative"
print(f"Text: {text}")
print(f"Prediction: {prediction} (Probability: {probability})")

text = "This movie was very boring and garbage this is why Hollywood has zero imagination. They rewrote Spiderman as Miles Morales so that they can fit the DEI agenda which was more important than time. "
padded_text, length = collate_single_text(text)

model.eval()
with torch.no_grad():
    lengths = torch.tensor([length])
    output = model(padded_text, lengths)
    probability = output.item()

prediction = "Positive" if probability >= 0.5 else "Negative"
print(f"Text: {text}")
print(f"Prediction: {prediction} (Probability: {probability})")

5. BERT Code

!pip install transformers

import gzip
import shutil
import time

import pandas as pd
import requests
import torch
import torch.nn.functional as F
import torchtext

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

torch.backends.cudnn.deterministic = True
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NUM_EPOCHS = 3

path = '/content/drive/MyDrive/data/movie_data.csv'
df = pd.read_csv(path)
df.head()
df.shape

train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values
valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values
test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
train_encodings[0]

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

optim = torch.optim.Adam(model.parameters(), lr=5e-5)

def compute_accuracy(model, data_loader, device):
    with torch.no_grad():
        correct_pred, num_examples = 0, 0
        for batch_idx, batch in enumerate(data_loader):
            ### Prepare data
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']
            predicted_labels = torch.argmax(logits, 1)
            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()
    return correct_pred.float()/num_examples * 100

start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, batch in enumerate(train_loader):
        ### Prepare data
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        ### Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs['loss'], outputs['logits']

        ### Backward pass
        optim.zero_grad()
        loss.backward()
        optim.step()

        ### Logging
        if not batch_idx % 250:
            print(f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d} | '
                  f'Batch {batch_idx:04d}/{len(train_loader):04d} | '
                  f'Loss: {loss:.4f}')

    model.eval()
    with torch.set_grad_enabled(False):
        print(f'Training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nValid accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')

    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')
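The original post only reports the test accuracy for BERT. For parity with the LSTM section, here is a hedged sketch of how the fine-tuned model could score a single new review; it reuses the model, tokenizer, and DEVICE defined above, and assumes (as in the data section) that label index 1 means positive.

# Sketch: classify one new review with the fine-tuned DistilBERT model (not in the original post).
text = ("It is the first marvel movie to make me shed a tear. It has heart, "
        "it feels so alive with it's conveyance of emotions and feelings.")

model.eval()
inputs = tokenizer(text, truncation=True, padding=True, return_tensors='pt').to(DEVICE)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
p_positive = probs[1].item()  # index 1 = positive, matching the 0/1 sentiment encoding

prediction = "Positive" if p_positive >= 0.5 else "Negative"
print(f"Prediction: {prediction} (P(positive) = {p_positive:.3f})")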

6. Results

[Figure: comparison of the accuracy and training time of the LSTM and BERT models]

7. Why Does BERT Outperform LSTM?

BERT achieves high accuracy for several reasons:

1) BERT captures the contextual meaning of a word by considering the surrounding words on both sides of it. This bidirectional approach allows the model to understand the nuances of language and effectively capture dependencies between words (see the short sketch after this list).

2) BERT uses the Transformer architecture, which is very effective at capturing long-range dependencies in sequential data. Transformers rely on a self-attention mechanism that lets the model weigh the importance of different words in a sentence. This attention mechanism helps BERT focus on the relevant information, leading to better representations and higher accuracy.

3) BERT is pre-trained on a large amount of unlabeled data. This pre-training allows the model to learn general language representations and acquire a broad understanding of grammar, semantics, and world knowledge. By leveraging this pre-trained knowledge, BERT adapts more easily to downstream tasks and achieves higher accuracy.
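A short illustrative sketch of point 1 (not from the original article): the same word "bad" receives a different contextual vector depending on its neighbors, something the static embedding table used by the LSTM model cannot express. The sketch assumes "bad" stays a single WordPiece token, which it does in the uncased BERT vocabulary.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def word_vector(sentence, word):
    # Return the contextual embedding of `word` in `sentence`.
    inputs = tok(sentence, return_tensors="pt")
    idx = tok.convert_ids_to_tokens(inputs.input_ids[0].tolist()).index(word)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[idx]

v_neg = word_vector("This movie is so bad, avoid it.", "bad")
v_pos = word_vector("I want to see this movie so bad.", "bad")
sim = torch.cosine_similarity(v_neg, v_pos, dim=0).item()
print(f"cosine similarity: {sim:.3f}")  # below 1.0: same word, different context, different vector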

8. Conclusion

BERT does take longer to fine-tune than an LSTM because its architecture is more complex and its parameter space is much larger. It is just as important, however, to consider that BERT outperforms LSTM on many tasks.
