Paper study notes: What is the difference between BERT and GPT?

Foundation Models, Transformers, BERT and GPT

To summarize:

  • BERT learns vector representations: the embedding of a word in a sentence is related to the other important words in that sentence, so what the model ultimately learns is a vector representation of each word. This is also why BERT is easy to apply to downstream tasks: for a downstream task you add something like an MLP on top of these features to do classification and so on, which is the so-called fine-tuning. In training, BERT uses the MASK (cloze) idea: the other words in the sentence are used to predict the masked-out word — self-supervised learning (no labels are needed for the sentence, you only mask words out). This is also why BERT does not need a decoder.

  • GPT does generation: the output is the probability of each candidate word being the next word. Given a sentence, the model generates the next token, appends it to the sentence, feeds the result back into the model, and generates the next token again, over and over. I can see why a decoder alone can handle this task, but why is no encoder added to the process? — to be filled in once I have read further. (A minimal sketch of both objectives follows this list.)
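
A minimal sketch of the two objectives, assuming the Hugging Face transformers library and the public bert-base-uncased and gpt2 checkpoints (my own illustration, not part of the original notes): a fill-mask pipeline shows BERT's cloze-style prediction, and a text-generation pipeline shows GPT's next-word loop.

```python
# Sketch only: requires `pip install transformers torch`; the model names are the public
# bert-base-uncased and gpt2 checkpoints, chosen here purely for illustration.
from transformers import pipeline

# BERT-style cloze: predict the masked word from the rest of the sentence.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Sarah went to a [MASK] to meet her friend that night.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))

# GPT-style generation: repeatedly predict the next token and feed it back in.
gen = pipeline("text-generation", model="gpt2")
print(gen("Sarah went to a restaurant to meet her", max_new_tokens=10)[0]["generated_text"])
```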

BERT and GPT are both pretrained models; in the pretraining stage they differ only in the choice of objective function: BERT trains with a cloze-style (masked word) objective, while GPT trains by predicting the next word given a sentence prefix. In the fine-tuning stage, GPT fine-tunes with a combination of the two objective functions, whereas BERT adds some task-specific layers on top of the semantic features.

The task GPT chose is harder than BERT's — predicting the future is much harder than predicting a masked-out middle — and this is also why OpenAI had to keep scaling the model up before reaching the level of GPT-3.5 and GPT-4.


Addendum
Mu Li's earlier videos actually covered this, but it didn't stick. A lesson in the importance of learning with a question in mind -_-

A transformer has two components: an encoder and a decoder. The difference is that when the encoder extracts features for the i-th element, it can see every element in the sequence, whereas the decoder, because of the mask, can only see the current element and the elements before it — the positions after the current one are masked so that their attention weights become 0. Since this is a standard language model that only predicts forward, the model must not see later words when predicting the i-th word. That is why GPT (Generative Pre-Training) uses only the decoder.
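
As a small illustration of that mask (my own PyTorch sketch, not from the original notes): future positions are set to -inf before the softmax, so their attention weights come out as exactly 0.

```python
import torch
import torch.nn.functional as F

seq_len = 5
scores = torch.randn(seq_len, seq_len)   # raw attention scores: row i = query i against all keys

# Causal (decoder) mask: position i may only attend to positions <= i.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(scores, dim=-1)      # the upper triangle (future positions) is now exactly 0
print(weights)
```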


Study link (blog) — reposted in full

  • https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/

Since I’m excited by the incredible capabilities which technologies like ChatGPT and Bard provide, I’m trying to understand better how they work. This post summarizes my current understanding about foundation models, transformers, BERT and GPT.

Note that I’m only learning these concepts and not everything might be fully correct, but it might help some people to understand the high-level concepts.

I know that there are many more, and more modern, foundation models than BERT and GPT, but I want to start ‘simple’ and these two models are probably the best known ones these days.

The technologies below are not trivial and there are lots of articles, papers and full courses even on certain aspects of each technology only. Instead of going into detail, I try to explain what they do and what concepts they use.

Foundation Models

BERT and GPT are both foundation models. Let’s look at the definition and characteristics:

  • Pre-trained on different types of unlabeled datasets (e.g., language and images)
  • Self-supervised learning
  • Generalized data representations which can be used in multiple downstream tasks (e.g., classification and generation)
  • The Transformer architecture is mostly used, but not mandatory

Read my blog Foundation Models at IBM to find out more.

Transformer Architecture

Most foundation models use the transformer architecture. Let’s look at the definition:

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing and computer vision.

In 2017 transformers were introduced: Attention is all you need. They are the successor to Recurrent Neural Network and Long Short-Term Memory architectures and have several benefits:

  • Parallel processing: Increases performance and scalability
  • Bidirectionality: Allows understanding of ambiguous words and coreferences

The original transformer architecture defines two main parts, an encoder and a decoder. However, not all foundation models use both parts. BERT only uses encoders, GPT only decoders. More on this later.

Attention

Both encoders and decoders use the concept of ‘attention’. Attention basically means focusing on the important pieces of information and filtering out the unimportant pieces. I like to compare this with ‘fast reading’. Rather than reading full articles or even full books, I often browse chapter titles, first words of paragraphs and scan paragraphs for keywords to find what I’m looking for.

The words of an article, the parts of an image or the words in a sentence that should get most attention change dependent on what you are looking for. Let’s look at a simple example sentence:

“Sarah went to a restaurant to meet her friend that night.”

The following words should get attention for the following queries:

  • What? -> ‘went’, ‘meet’
  • Where? -> ‘a restaurant’
  • Who? -> ‘Sarah’, ‘her friend’
  • When? -> ‘that night’

To determine the attention of words (more exactly tokens), ‘queries’, ‘keys’ and ‘values’ are used by encoders and decoders in transformers. All of them are represented as vectors. Keys match certain queries when they are closest to the query vector. Keys are an encoded representation of values; in simple cases they can be the same.

There are different algorithms to implement the attention concept. I think an easy way to understand how this can work is to rank words highly that are often used together in sentences. For example, ‘where’ and ‘restaurant’ probably have a closer relation than ‘restaurant’ and ‘faith’. So, for the query ‘where’ the word ‘restaurant’ gets more attention.
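
A small sketch of the query/key/value mechanics described above (scaled dot-product attention in PyTorch; the random projection matrices stand in for learned weights and are an assumption of this illustration):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d) tensors; a single attention head, no batching.
    scores = q @ k.T / math.sqrt(q.size(-1))  # how well each query matches each key
    weights = F.softmax(scores, dim=-1)       # attention: how much each value matters
    return weights @ v                        # weighted blend of the values

seq_len, d = 4, 8
x = torch.randn(seq_len, d)                             # token embeddings
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))   # stand-ins for learned projections
out = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(out.shape)  # torch.Size([4, 8])
```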

Encoders and Decoders

As mentioned, there are encoders and decoders. BERT uses encoders only, GPT uses decoders only. Both options understand language including syntax and semantics. Especially the next generation of large language models like GPT with billions of parameters does this very well.

The two models focus on different scenarios. However, since the field of foundation models is evolving, the differentiation is often fuzzier.

  • BERT (encoder): classification (e.g., sentiment), questions and answers, summarization, named entity recognition
  • GPT (decoder): translation, generation (e.g., stories)

The outputs of the core models are different:

  • BERT (encoder): Embeddings representing words with attention information in a certain context
  • GPT (decoder): Next words with probabilities

Both models are pretrained and can be reused without intensive training. Some of them are available as open source and can be downloaded from communities like Hugging Face, others are commercial. Reuse is important, since training is often very resource-intensive and expensive, which few companies can afford.

The pretrained models can be extended and customized for different domains and specific tasks. Layers can sometimes be reused without modifications and more layers are added on top. If layers need to be modified, the new training is more expensive. The technique to customize these models is called Transfer Learning, since the same generic model can easily be transferred to other domains.

BERT - Encoders

BERT uses the encoder part of the transformer architecture so that it understands semantic and syntactic language information. The outputs of BERT are embeddings, not predicted next words. To leverage these embeddings, other layer(s) need to be added on top, for example for text classification or question answering.
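
For instance, a sentiment classifier can be sketched by putting a single linear layer on top of BERT's embeddings (a toy illustration of my own, assuming the Hugging Face AutoModel API, the bert-base-uncased checkpoint and the common [CLS]-token convention; the added head is untrained here):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")   # encoder only: outputs embeddings
head = torch.nn.Linear(bert.config.hidden_size, 2)      # added layer, e.g. 2 sentiment classes

inputs = tokenizer("Sarah went to a restaurant to meet her friend.", return_tensors="pt")
with torch.no_grad():
    embeddings = bert(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
logits = head(embeddings[:, 0])                         # use the [CLS] token's embedding
print(logits.softmax(dim=-1))                           # class probabilities (head not yet trained)
```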

BERT uses a genius trick for the training. For supervised training it is often expensive to get labeled data, sometimes it’s impossible. The trick is to use masks as I described in my post Evolution of AI explained via a simple Sample. Let’s take a simple example, an unlabeled sentence:

“Sarah went to a restaurant to meet her friend that night.”

This is converted into:

  • Text: “Sarah went to a restaurant to meet her MASK that night.”
  • Label: “Sarah went to a restaurant to meet her friend that night.”

Note that this is a very simplified description only since there aren’t ‘real’ labels in BERT.

In other words, BERT produces labeled data for originally unlabeled data. This technique is called Self-Supervised Learning. It works very well for huge amounts of data.

In masked language models like BERT, each masked word (token) prediction is conditioned on the rest of the tokens in the sentence. These are processed by the encoder, which is why you don’t need a decoder.
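
The data preparation itself can be sketched in a few lines (my simplification: whole-word masking with a fixed probability, ignoring BERT's subword tokens and its 80/10/10 replacement rule):

```python
import random

def make_mlm_example(sentence, mask_token="[MASK]", mask_prob=0.15):
    # Turn one unlabeled sentence into a (masked input, original sentence) training pair.
    tokens = sentence.split()
    masked = [mask_token if random.random() < mask_prob else t for t in tokens]
    return " ".join(masked), sentence

random.seed(7)
text = "Sarah went to a restaurant to meet her friend that night."
masked_input, label = make_mlm_example(text)
print("Text: ", masked_input)
print("Label:", label)
```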

GPT - Decoders

In language scenarios decoders are used to generate next words, for example when translating text or generating stories. The outputs are words with probabilities.

Decoders also use the attention concept, even twice. First, when training models, they use Masked Multi-Head Attention, which means that only the preceding words of the target sentence are provided so that the model can learn without cheating. This mechanism is similar to the MASK concept from BERT.

After this the decoder uses Multi-Head Attention as it is also used in the encoder. Transformer-based models that utilize encoders and decoders use a trick to be more efficient: the output of the encoders is fed as input to the decoders, more precisely the keys and values. Decoders can issue queries to find the closest keys. This allows, for example, understanding the meaning of the original sentence and translating it into other languages even if the number of resulting words and their order changes.

GPT doesn’t use this trick though and only uses a decoder. This is possible since these types of models have been trained with massive amounts of data (Large Language Model). The knowledge of encoders is encoded in billions of parameters (also called weights). The same knowledge exists in decoders when trained with enough data.
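
The decoder-only generation loop can be made explicit with a few lines (a greedy-decoding sketch assuming the Hugging Face AutoModelForCausalLM API and the public gpt2 checkpoint; real systems usually sample rather than always taking the arg-max):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")    # decoder-only language model

ids = tokenizer("Sarah went to a restaurant to meet her", return_tensors="pt").input_ids
for _ in range(5):                                      # generate five tokens greedily
    with torch.no_grad():
        logits = model(input_ids=ids).logits[0, -1]     # scores for the next token only
    probs = logits.softmax(dim=-1)                      # "next words with probabilities"
    next_id = probs.argmax()                            # pick the most probable token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append it and feed the text back in
print(tokenizer.decode(ids[0]))
```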

Note that ChatGPT has evolved these techniques. To prevent hate, profanity and abuse, humans need to label some data first. Additionally, Reinforcement Learning is applied to improve the quality of the model (see ChatGPT: Optimizing Language Models for Dialogue).

Resources

There are many good articles, videos and courses. Here are some of the ones I read or watched:

  • Course: Natural Language Processing Demystified
  • YouTube channel: CodeEmporium
  • Article: What Is ChatGPT Doing … and Why Does It Work?
  • Article: 10 Things You Need to Know About BERT and the Transformer Architecture That Are Reshaping the AI Landscape
  • Article: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture
  • NLP - BERT & Transformer
