A Brief History: from GPT-1 to GPT-3

These are my reading notes on "Developing Apps with GPT-4 and ChatGPT".

In this section, we will introduce the evolution of the OpenAI GPT models from GPT-1 to GPT-3.

GPT-1

In mid-2018, OpenAI published a paper titled “Improving Language Understanding by Generative Pre-Training” by Radford, Alec, et al. in which they introduced the Generative Pre-trained Transformer, also known as GPT-1.

The full name of GPT is Generative Pre-trained Transformer.

Before GPT-1, the common approach to building high-performance NLP neural models relied on supervised learning, which requires large amounts of manually labeled data. This reliance on well-annotated data limited the performance of those techniques, because such datasets are both difficult and expensive to produce.

The authors of GPT-1 proposed a new learning process in which an unsupervised pre-training step is introduced. In this step, no labeled data is needed. Instead, the model is trained to predict the next token.

The GPT-1 model used the BookCorpus dataset for pre-training, a dataset containing the text of approximately 11,000 unpublished books.

In the unsupervised learning phase, the model learned to predict the next token in the texts of the BookCorpus dataset.

However, because the model was small, it was unable to perform complex tasks without fine-tuning.

To adapt the model to a specific target task, a second supervised learning step, called fine-tuning, was performed on a small set of manually labeled data.

The process of fine-tuning allowed the parameters learned in the initial pre-training phase to be modified to fit the task at hand better.

In contrast to other NLP neural models, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning.

NOTE

GPT-1 was trained in two stages:


Stage 1: Unsupervised Pre-training
Goal: To learn general language patterns and representations.
Method: The model is trained to predict the next token in the sentence.
Data: A large unlabeled text dataset
Type of Learning: Unsupervised learning – no manual labels are needed.
Outcome: The model learns a strong general understanding of language, but it’s not yet specialized for specific tasks (e.g., sentiment analysis or question answering).


Stage 2: Supervised Fine-tuning
Goal: To adapt the pre-trained model to a specific downstream task.
Method: The model is further trained on a small labeled dataset.
Type of Learning: Supervised learning – the data includes input-output pairs (e.g., a sentence and its sentiment label).
Outcome: The model’s parameters are fine-tuned so it performs better on that particular task.


Summary:
  • Pre-training teaches the model how language works (general knowledge).
  • Fine-tuning teaches the model how to perform a specific task (specialized skills).

A good analogy would be:
The model first reads lots of books to become knowledgeable (pre-training), and then takes a short course to learn a particular job (fine-tuning).
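
To make the two stages concrete, below is a minimal toy sketch in PyTorch. It is my own illustration rather than the actual GPT-1 training code: the backbone, data, and labels are placeholders, but the two losses mirror the recipe above (next-token prediction on unlabeled text, then a small classification head trained on labeled examples).

```python
# Toy illustration of the two-stage recipe (assumes PyTorch is installed).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, num_classes = 100, 32, 2

backbone = nn.Embedding(vocab_size, embed_dim)   # stand-in for the transformer
lm_head = nn.Linear(embed_dim, vocab_size)       # predicts the next token
cls_head = nn.Linear(embed_dim, num_classes)     # added later for fine-tuning

# Stage 1: unsupervised pre-training -- predict the next token in unlabeled text.
tokens = torch.randint(0, vocab_size, (8, 16))        # fake unlabeled token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]       # shift by one position
logits = lm_head(backbone(inputs))                    # (batch, seq_len - 1, vocab)
pretrain_loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Stage 2: supervised fine-tuning -- reuse the backbone on a small labeled dataset.
labels = torch.randint(0, num_classes, (8,))          # fake sentiment labels
features = backbone(tokens).mean(dim=1)               # pooled representation
finetune_loss = F.cross_entropy(cls_head(features), labels)

print(pretrain_loss.item(), finetune_loss.item())
```

In a real setup, both losses would be minimized with an optimizer over many batches, and the fine-tuning stage would start from the pre-trained backbone weights rather than from scratch.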

The architecture of GPT-1 was similar to the decoder of the original transformer, introduced in 2017, and had 117 million parameters.

This first GPT model paved the way for future models with larger datasets and more parameters to take better advantage of the potential of the transformer architectures.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model, increasing the number of parameters and the size of the training dataset tenfold.

The number of parameters of this new version was 1.5 billion, trained on 40 GB of text.

In November 2019, OpenAI released the full version of the GPT-2 language model.

GPT-2 is publicly available and can be downloaded from Hugging Face or GitHub.
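
As a quick illustration, here is a short sketch of loading the public GPT-2 checkpoint with the Hugging Face transformers library (assuming it is installed; "gpt2" refers to the smallest released checkpoint).

```python
# Minimal sketch: generate text with the publicly released GPT-2 weights.
# Requires `pip install transformers torch`.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("In early 2019, OpenAI proposed GPT-2,", max_new_tokens=30)
print(result[0]["generated_text"])
```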

GPT-2 showed that training a larger language model on a larger dataset improves its ability to understand tasks and lets it outperform the state of the art on many of them.

GPT-3

GPT-3 was released by OpenAI in June 2020.

The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for the training.

GPT-3 is a much larger model, with 175 billion parameters, allowing it to capture more complex patterns.

In addition, GPT-3 is trained on a more extensive dataset.

This includes Common Crawl, a large web archive containing text from billions of web pages, and other sources such as Wikipedia.

This training dataset, which includes content from websites, books, and articles, allows GPT-3 to develop a deeper understanding of the language and context.

As a result, GPT-3 improved performance on a variety of linguistic tasks.

GPT-3 eliminates the need for a fine-tuning step that was mandatory for its predecessors.

NOTE

How GPT-3 eliminates the need for fine-tuning:

GPT-3 is trained on a massive amount of data, and it’s much larger than GPT-1 and GPT-2 – with 175 billion parameters.
Because of the scale, GPT-3 learns very strong general language skills during pre-training alone.


Instead of fine-tuning, GPT-3 uses:
  1. Zero-shot learning
    Just give it a task description in plain text – no example needed.
  2. One-shot learning
    Give it one example in the prompt to show what kind of answer you want.
  3. Few-shot learning
    Give it a few examples in the prompt, and it learns the pattern on the fly.

So in short:

GPT-3 doesn’t need fine-tuning because it can understand and adapt to new tasks just by seeing a few examples in the input prompt — thanks to its massive scale and powerful pre-training.
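
As an illustration, here is a sketch of a few-shot prompt sent through the OpenAI Python SDK (v1.x assumed). The translation examples follow the style used in the GPT-3 paper, and the model name is only illustrative, since the original GPT-3 completion models have since been retired.

```python
# Few-shot prompting sketch: the "learning" happens entirely inside the prompt.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

few_shot_prompt = """Translate English to French.
sea otter => loutre de mer
cheese => fromage
peppermint =>"""

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",   # illustrative completion-style model
    prompt=few_shot_prompt,
    max_tokens=10,
)
print(response.choices[0].text.strip())
```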


GPT-3 is indeed capable of handling many tasks without traditional fine-tuning, but that doesn’t mean it lacks support for fine-tuning entirely or never uses it.

GPT-3’s default approach: Few-shot / Zero-shot Learning

What makes GPT-3 so impressive is that it can:

  • Perform tasks without retraining (fine-tuning)
  • Learn through prompts alone

Does GPT-3 support fine-tuning?

Yes! OpenAI eventually provided a fine-tuning API for GPT-3, which is useful in scenarios like:

  • When you have domain-specific data (e.g., legal, medical).

  • When you want the model to maintain a consistent tone or writing style.

  • When you need a stable and structured output format (e.g., JSON).

  • When prompt engineering isn’t sufficient.
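
For reference, here is a hedged sketch of the fine-tuning workflow with the current OpenAI Python SDK. The file format and supported base models have changed since the original GPT-3 fine-tuning API, so treat the model name and file name below as illustrative.

```python
# Fine-tuning workflow sketch: upload labeled examples, then start a job.
# Assumes `pip install openai`, an OPENAI_API_KEY, and a local train.jsonl file.
from openai import OpenAI

client = OpenAI()

# 1. Upload a JSONL file of training examples.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Launch the fine-tuning job on a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",   # illustrative base model name
)
print(job.id, job.status)
```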


To summarize:
  1. Does GPT-3 need fine-tuning?
    Usually not; few-shot/zero-shot learning is enough for most tasks.

  2. Does GPT-3 support fine-tuning?
    Yes, especially useful for domain-specific or high-requirement tasks.
