Text Classification with SentencePiece


Preface

Step 1: The Story

  • SentencePiece is an unsupervised text tokenizer and detokenizer (i.e., it can losslessly reconstruct the original text)
  • Mainly intended for text generation systems where the vocabulary size is predetermined
  • It extends direct training from raw sentences, implementing subword units such as BPE and the unigram language model
  • Technical highlights
    • Purely data driven: trains the tokenizer and detokenizer directly from sentences; pre-tokenization is not always required
    • Language independent: treats sentences as sequences of Unicode characters, with no language-specific logic
    • Multiple subword algorithms: BPE and unigram LM
    • Subword regularization: implements subword sampling for subword regularization and BPE-dropout, which help improve robustness and accuracy
    • Fast and lightweight: around 50k sentences per second, with a memory footprint of roughly 6MB
    • Self-contained: the same model file always produces the same tokenization
    • Fast vocabulary-id generation
    • NFKC-based normalization (see the snippet after this list)
      • NFC: composed form; characters are normalized to single precomposed characters
      • NFD: decomposed form; characters are normalized to a base character plus combining marks — original character: é —> NFD form: e + ´
      • NFKC: compatibility composed form; like NFC, but certain formatting distinctions may be removed during normalization
      • NFKD: compatibility decomposed form; like NFD, but certain formatting distinctions may be removed during normalization
  • **Aside:** HF's tokenizers can do all of this...... and the Tokenizers library does even more
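
A quick way to see these normalization forms in action is Python's built-in unicodedata module; a minimal illustration, independent of SentencePiece itself:

import unicodedata

s = "café"                                    # 'é' as a single precomposed character
nfd = unicodedata.normalize("NFD", s)         # decomposes 'é' into 'e' + combining acute accent
print(len(s), len(nfd))                       # 4 5
print(unicodedata.normalize("NFKC", "①ﬁ"))    # compatibility mapping: '1fi'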

1. What is SentencePiece

  • A re-implementation of subword units that alleviates the open-vocabulary problem
  • The number of unique tokens is predetermined, e.g. 8k, 16k, 32k
  • Trains from raw, unprocessed sentences
    • Earlier subword implementations required the input sentences to be pre-tokenized in order to train efficiently.
    • SentencePiece's implementation is fast enough to train a model from raw sentences directly, which is very useful for Chinese or Japanese
  • Whitespace is treated as a basic symbol
    • Before: (word.) == (word .)
    • Now: (word.) != (word▁.)
    • Because whitespace is preserved in the segmented sentence, detokenization is unambiguous; previously it was not reversible
    • This is what makes language-independent tokenization possible (see the round-trip sketch below)
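
A minimal round-trip sketch with the Python wrapper, assuming a model file m.model trained as in Step 2 (the exact pieces depend on your training data):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # assumed model from Step 2
pieces = sp.encode("Hello world.", out_type=str)
print(pieces)                   # e.g. ['▁Hello', '▁world', '.']; '▁' is the whitespace symbol
print(sp.decode(pieces))        # 'Hello world.', recovered unambiguously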

2. Subword Regularization and BPE-Dropout

  • Purpose: applied during subword segmentation and model training, aiming to improve the model's generalization and robustness
  • Subword regularization:
    • Principle: instead of fixing a single segmentation during training, one segmentation is randomly sampled from multiple candidate segmentations, strengthening the model's ability to handle diverse inputs
    • Advantages: introduces segmentation uncertainty, improving robustness and generalization; friendly to low-resource scenarios with little data
  • BPE-Dropout
    • Principle: regular BPE always merges the highest-frequency pair at each step, whereas BPE-dropout randomly drops some merge steps; as a result, the same word may be segmented into different subword sequences across training iterations
    • Advantages: introduces randomness and robustness, improving generalization; friendly to the OOV problem (see the sampling sketch below)
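
A minimal sampling sketch, again assuming a trained m.model; enable_sampling makes encode draw a different segmentation on each call, which is the subword-regularization sampling the Python wrapper exposes:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # assumed model from Step 2
for _ in range(3):
    # nbest_size=-1 samples from all candidate segmentations; alpha smooths the distribution
    print(sp.encode("I saw a girl with a telescope.", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))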

3. Installation

  • Install via pip
pip install sentencepiece
  • Build from C++ source
git clone https://github.com/google/sentencepiece.git 
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo update_dyld_shared_cache   # macOS
# sudo ldconfig -v              # Ubuntu/Linux

Step 2: Usage Guide

1. Training a SentencePiece Model

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
--input: corpus file with one sentence per line. NFKC normalization is applied by default. A comma-separated list of files can also be passed.
--model_prefix: output model name prefix. Generates <model_name>.model and <model_name>.vocab
--vocab_size: vocabulary size, e.g. 8000, 16000, 32000
--character_coverage: the fraction of characters covered by the model. Good defaults are 0.9995 for languages with rich character sets (such as Chinese or Japanese) and 1.0 for languages with small character sets
--model_type: model type; choose from unigram (default), bpe, char, or word. The full flag list follows; a Python training sketch appears after it.
--input (comma separated list of input sentences)  type: std::string default: ""
--input_format (Input format. Supported format is `text` or `tsv`.)  type: std::string default: ""
--model_prefix (output model prefix)  type: std::string default: ""
--model_type (model algorithm: unigram, bpe, word or char)  type: std::string default: "unigram"
--vocab_size (vocabulary size)  type: int32 default: 8000
--accept_language (comma-separated list of languages this model can accept)  type: std::string default: ""
--self_test_sample_size (the size of self test samples)  type: int32 default: 0
--character_coverage (character coverage to determine the minimum symbols)  type: double default: 0.9995
--input_sentence_size (maximum size of sentences the trainer loads)  type: std::uint64_t default: 0
--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0)  type: bool default: true
--seed_sentencepiece_size (the size of seed sentencepieces)  type: int32 default: 1000000
--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss)  type: double default: 0.75
--num_threads (number of threads for training)  type: int32 default: 16
--num_sub_iterations (number of EM sub-iterations)  type: int32 default: 2
--max_sentencepiece_length (maximum length of sentence piece)  type: int32 default: 16
--max_sentence_length (maximum length of sentence in byte)  type: int32 default: 4192
--split_by_unicode_script (use Unicode script to split sentence pieces)  type: bool default: true
--split_by_number (split tokens by numbers (0-9))  type: bool default: true
--split_by_whitespace (use a white space to split sentence pieces)  type: bool default: true
--split_digits (split all digits (0-9) into separate pieces)  type: bool default: false
--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.)  type: bool default: false
--allow_whitespace_only_pieces (allow pieces that only contain (consecutive) whitespace tokens)  type: bool default: false
--control_symbols (comma separated list of control symbols)  type: std::string default: ""
--control_symbols_file (load control_symbols from file.)  type: std::string default: ""
--user_defined_symbols (comma separated list of user defined symbols)  type: std::string default: ""
--user_defined_symbols_file (load user_defined_symbols from file.)  type: std::string default: ""
--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage)  type: std::string default: ""
--required_chars_file (load required_chars from file.)  type: std::string default: ""
--byte_fallback (decompose unknown pieces into UTF-8 byte pieces)  type: bool default: false
--vocabulary_output_piece_score (Define score in vocab file)  type: bool default: true
--normalization_rule_name (Normalization rule name. Choose from nfkc or identity)  type: std::string default: "nmt_nfkc"
--normalization_rule_tsv (Normalization rule TSV file. )  type: std::string default: ""
--denormalization_rule_tsv (Denormalization rule TSV file.)  type: std::string default: ""
--add_dummy_prefix (Add dummy whitespace at the beginning of text)  type: bool default: true
--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace)  type: bool default: true
--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.)  type: bool default: true
--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.)  type: bool default: false
--unk_id (Override UNK (<unk>) id.)  type: int32 default: 0
--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.)  type: int32 default: 1
--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.)  type: int32 default: 2
--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.)  type: int32 default: -1
--unk_piece (Override UNK (<unk>) piece.)  type: std::string default: "<unk>"
--bos_piece (Override BOS (<s>) piece.)  type: std::string default: "<s>"
--eos_piece (Override EOS (</s>) piece.)  type: std::string default: "</s>"
--pad_piece (Override PAD (<pad>) piece.)  type: std::string default: "<pad>"
--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.)  type: std::string default: " ⁇ "
--train_extremely_large_corpus (Increase bit depth for unigram tokenization.)  type: bool default: false
--random_seed (Seed value for random generator.)  type: uint32 default: 4294967295
--enable_differential_privacy (Whether to add DP while training. Currently supported only by UNIGRAM model.)  type: bool default: false
--differential_privacy_noise_level (Amount of noise to add for DP)  type: float default: 0
--differential_privacy_clipping_threshold (Threshold for clipping the counts for DP)  type: std::uint64_t default: 0
--help (show help)  type: bool default: false
--version (show version)  type: bool default: false
--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere)  type: int default: 0
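
The same training run can be launched from Python; a minimal sketch mirroring the command above (file names are placeholders):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",           # one sentence per line
    model_prefix="m",             # writes m.model and m.vocab
    vocab_size=8000,
    character_coverage=0.9995,    # use 1.0 for small character sets
    model_type="unigram",         # or "bpe", "char", "word"
)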

2. Encoding Raw Text into Sentence Pieces / Ids

spm_encode --model=<model_file> --output_format=piece < input > output
spm_encode --model=<model_file> --output_format=id < input > output
  • Use --extra_options to add BOS/EOS markers or to reverse the input sequence
spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

3. Decoding Sentence Pieces / Ids

spm_decode --model=<model_file> --input_format=piece < input > output
spm_decode --model=<model_file> --input_format=id < input > output
  • Use --extra_options to decode the text in reverse order
spm_decode --extra_options=reverse < input > output

4. End-to-End Example

spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
'''
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
'''echo "I saw a girl with a telescope." | spm_encode --model=m.model
'''
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
'''echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
'''
9 459 11 939 44 11 4 142 82 8 28 21 132 6
'''echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
'''
I saw a girl with a telescope.
'''

5. Exporting the Vocabulary List

spm_export_vocab --model=<model_file> --output=<output file>

6. Examples

notebook

Step 3: Experiment

1. Dataset

A hotel-review dataset, preprocessed into one sentence per line.


2. Usage

  1. Training (I used a vocabulary size of 12,800; a hedged training sketch follows)

    Two files are generated: a model file and a vocab file. The vocab file lists one piece per line together with its score.
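
A sketch of that training call (paths and the model prefix are placeholders, not the actual files used):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="hotel_reviews.txt",    # hotel-review corpus, one sentence per line
    model_prefix="hotel",         # writes hotel.model and hotel.vocab
    vocab_size=12800,
    character_coverage=0.9995,    # recommended default for Chinese
)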

  2. Segmentation

    • Just segment directly; since the task is classification, there is no need to insert eos and bos (a segmentation sketch follows this item).

    • Segmenting into ids
      • note: the order of pieces in the generated vocab file corresponds exactly to the auto-incrementing token ids
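
A minimal segmentation sketch; the model name hotel.model follows the placeholder prefix above, and the sample sentence is invented:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="hotel.model")  # placeholder name from the training sketch
line = "房间很干净,服务也不错。"                             # an invented sample review
print(sp.encode(line, out_type=str))   # subword pieces, no bos/eos inserted
print(sp.encode(line, out_type=int))   # ids; row order in hotel.vocab matches these ids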

  3. There is no accompanying word-vector file, so these subword pieces still need embedding training; fastText will do.

    • Write a script to convert the segmented output into the form fastText expects (a sketch follows).
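
A sketch of such a script, assuming the corpus file reviews.txt; it writes the space-separated-token plain-text format that fastText's unsupervised training reads:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="hotel.model")     # placeholder name
with open("reviews.txt", encoding="utf-8") as fin, \
     open("reviews_ft.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")                   # one segmented review per line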

  4. With both ids and word vectors in hand, the embedding matrix can be constructed

    • The correspondence is

      • fast_vec —> word : vec
      • spm_vec —> id : word
      • constructed embedding —> id : vec
      • Of the ~12k entries, 20 rows came back empty; not many, so empty values are filled with random vectors

      • Write a script that saves the embedding matrix as an .npy file for the model to use (a sketch follows this list)
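
A sketch of the matrix-building script, assuming a fastText model trained on the segmented corpus (hotel_ft.bin is a placeholder name) and the tab-separated piece/score layout of the .vocab file:

import numpy as np
import fasttext

# SentencePiece ids follow vocab-file order, so row i of the matrix belongs to piece i
pieces = [line.split("\t")[0] for line in open("hotel.vocab", encoding="utf-8")]
ft = fasttext.load_model("hotel_ft.bin")       # placeholder fastText model name
dim = ft.get_dimension()

emb = np.zeros((len(pieces), dim), dtype=np.float32)
for i, piece in enumerate(pieces):
    vec = ft.get_word_vector(piece)
    # fill the handful of empty entries with random values, as described above
    emb[i] = vec if np.any(vec) else np.random.normal(scale=0.1, size=dim)
np.save("embedding.npy", emb)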

    • The idea

      • Use SentencePiece as the tokenizer to obtain a sequence of ids
      • Feed the ids to the model
      • Train the model
      • At inference time, segment with SentencePiece as well
    • Now let the practice begin~

      • Code
        available from the resources above

      • Result: accuracy essentially converged to 96%

      • After epoch 30, fine-tuning the embedding layer together with the rest for another 10 epochs lifted accuracy by another percentage point (an embedding-layer sketch follows)
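
For reference, a minimal sketch of wiring the saved matrix into a model; PyTorch is an assumption here, since the original does not name a framework:

import numpy as np
import torch
import torch.nn as nn

weights = torch.from_numpy(np.load("embedding.npy"))            # shape (vocab_size, dim)
embedding = nn.Embedding.from_pretrained(weights, freeze=True)  # frozen for the first 30 epochs

# later, to fine-tune the embedding layer together with the rest of the model:
embedding.weight.requires_grad = True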
