Demystifying state-of-the-art text classification: an introduction to word embeddings

Natural language processing (NLP) is an old science that started in the 1950s. The Georgetown-IBM experiment in 1954 was a big step toward fully automated text translation: more than 60 Russian sentences were translated into English using simple reordering and replacement rules.

The statistical revolution in NLP started in the late 1980s. Instead of hand-crafting a set of rules, a large corpus of text was analyzed to create rules using statistical approaches. Different metrics were calculated for the given input data, and predictions were made using decision trees or regression-based calculations.

Today, complex hand-crafted metrics have been replaced by more holistic approaches that produce better results and are easier to maintain.

This post is about word embeddings, which is the first part of my machine learning for coders series (with more to follow!).

What are word embeddings?

Traditionally, in natural language processing (NLP), words were replaced with unique IDs to do calculations. Let’s take the following example:

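A minimal Python sketch of such an ID mapping could look like this (the words and IDs here are purely illustrative):

# Each word in the vocabulary gets a unique integer ID.
word_to_id = {
    'the':     0,
    'word':    1,
    'example': 2,
    'is':      3,
}

# A sentence then becomes a list of IDs instead of strings.
ids = [word_to_id[w] for w in 'the word example'.split()]  # [0, 1, 2]
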
This approach has the disadvantage that you need to create a huge list of words and give each element a unique ID. Instead of using unique numbers for your calculations, you can also use vectors that represent the words’ meaning, so-called word embeddings:

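Sketched in Python, using the (4 2 6) vector for “example” mentioned in the paragraph below and invented values for the other words:

# Each word maps to a vector instead of a single ID.
word_embeddings = {
    'the':     [1, 0, 3],
    'word':    [2, 5, 1],
    'example': [4, 2, 6],  # the (4 2 6) vector discussed in the text
}
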
In this example, each word is represented by a vector. Vectors can have different lengths: the bigger a vector is, the more context information it can store. However, the calculation costs also go up as the vector size increases.

The element count of a vector is also called the number of vector dimensions. In the example above, the word “example” is expressed as (4 2 6), whereby 4 is the value of the first dimension, 2 of the second, and 6 of the third.

In more complex examples, there might be more than 100 dimensions that can encode a lot of information. Things like:

  • gender,
  • race,
  • age,
  • type of word

will be stored.

A word such as “one” describes a quantity, just like “many”. Therefore, their two vectors are closer to each other than to the vectors of words that are used very differently.

Simplified: if two vectors are similar, then the words they represent are used similarly. For other NLP tasks this has a lot of advantages, because calculations can be made on a single vector with only a few hundred parameters, in comparison to a huge dictionary with hundreds of thousands of IDs.

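One common way to measure this closeness is cosine similarity. A small sketch with invented three-dimensional embeddings for “one”, “many”, and “king”:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction;
    # values near 0 mean the words are used very differently.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

one  = np.array([0.9, 1.2, 0.4])   # hypothetical embedding for "one"
many = np.array([0.8, 1.1, 0.5])   # hypothetical embedding for "many"
king = np.array([-0.7, 0.2, 1.5])  # hypothetical embedding for "king"

print(cosine_similarity(one, many))  # high: similar usage
print(cosine_similarity(one, king))  # lower: different usage
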
Additionally, unknown words that were never seen before are no problem. You just need a good word embedding for the new word, and the calculations work the same way. The same applies to other languages. This is basically the magic of word embeddings that enables things like fast learning, multi-language processing, and much more.

Creation of word embeddings

It’s very popular to extend the concept of word embeddings to other domains. For example, a movie rental platform can create movie embeddings and do calculations upon vectors instead of movie IDs.

But how do you create such embeddings?

There are various techniques out there, but all of them follow the key idea that the meaning of a word is defined by its usage.

Let’s say we have a set of sentences:

text_for_training = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'she is a daughter',
    'he is a son',
]

The sentences contain 10 unique words, and we want to create a word embedding for each word.

{
    0: 'he', 1: 'a', 2: 'is', 3: 'daughter', 4: 'man',
    5: 'woman', 6: 'king', 7: 'she', 8: 'son', 9: 'queen',
}

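A short sketch of how such a dictionary could be built from the training sentences (note that the exact word-to-ID assignment depends on iteration order and may differ from the one shown above):

# Collect the unique words from all sentences and number them.
words = set()
for sentence in text_for_training:
    words.update(sentence.split())

id_to_word = dict(enumerate(sorted(words)))
word_to_id = {word: i for i, word in id_to_word.items()}
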
There are various approaches for how to create embeddings out of them. Let’s pick one of the most used approaches, called word2vec. The concept behind this technique uses a very simple neural network to create vectors that represent the meanings of words.

Let’s start with the target word “king”. It is used within the context of the masculine pronoun “he”. Context in this example simply means being part of the same sentence. The same applies to “queen” and “she”. It also makes sense to apply the same approach to more generic words: the word “he” can be the target word and “is” the context word.

If we do this for every combination, we can actually get simple word embeddings. More holistic approaches add more complexity and calculations, but they are all based on this approach.

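Under the simple definition of context used here (any other word in the same sentence), the (context, target) combinations can be sketched like this; note that real word2vec implementations instead use a sliding context window:

# Every ordered pair of distinct words in a sentence
# becomes one (context, target) training example.
pairs = []
for sentence in text_for_training:
    tokens = sentence.split()
    for target in tokens:
        for context in tokens:
            if context != target:
                pairs.append((context, target))

# pairs now contains examples such as ('he', 'king') and ('is', 'he').
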
To use a word as an input for a neural network, we need a vector. We can encode a word's unique ID as a vector by putting a 1 at the position of the word in our dictionary and keeping every other index at 0. This is called a one-hot encoded vector:

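For our 10-word dictionary, a one-hot vector can be built like this (a small NumPy sketch):

import numpy as np

def one_hot(word_id, vocab_size=10):
    # A vector of zeros with a single 1 at the word's position.
    vec = np.zeros(vocab_size)
    vec[word_id] = 1.0
    return vec

print(one_hot(0))  # 'he' -> [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
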
Between the input and the output is a single hidden layer. This layer contains as many elements as the word embedding should have. The more elements word embeddings have, the more information they can store.

You might think: then just make it very big. But we have to consider that we need to store an embedding for each existing word, which quickly adds up to a decent amount of data. Additionally, bigger embeddings mean a lot more calculations for neural networks that use them.

In our example, we will just use 5 as an embedding vector size.

The magic of neural networks lies in what's in between the layers, called weights. They store information between layers, where each node of the previous layer is connected with each node of the next layer.

Each connection between the layers is a so-called parameter. These parameters contain the important information of the neural network. 100 parameters (50 between the input and hidden layer, and 50 between the hidden and output layer) are initialized with random values and adjusted by training the model.

In this example, all of them are initialized with 0.1 to keep it simple. Let’s think through an example training round, also called an epoch:

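A rough NumPy sketch of a single training step for the pair (context “he”, target “king”), with all weights initialized to 0.1 as described above; this is only meant to make the mechanics concrete, and real word2vec implementations are far more optimized:

import numpy as np

vocab_size, embedding_size = 10, 5

# 50 parameters between input and hidden layer, 50 between hidden and output.
W1 = np.full((vocab_size, embedding_size), 0.1)  # input  -> hidden
W2 = np.full((embedding_size, vocab_size), 0.1)  # hidden -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

context = np.zeros(vocab_size)
context[0] = 1.0                 # one-hot vector for 'he' (ID 0)
target = np.zeros(vocab_size)
target[6] = 1.0                  # one-hot vector for 'king' (ID 6)

# Forward pass: the hidden layer simply selects the row of W1 for 'he'.
hidden = context @ W1
output = softmax(hidden @ W2)    # predicted probability for every word

# The error is the difference between prediction and expected output.
error = output - target

# Backward pass: nudge both weight matrices to reduce the error.
learning_rate = 0.05
grad_hidden = error @ W2.T
W2 -= learning_rate * np.outer(hidden, error)
W1 -= learning_rate * np.outer(context, grad_hidden)
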
At the end of the neural network’s calculation, we don’t get the expected output that tells us, for the given context “he”, that the target is “king”.

This difference between the result and the expected result is called the error of the network. By finding better parameter values, we can adjust the neural network so that future context inputs produce the expected target output.

The contents of our layer connections change as we search for better parameters that get us closer to the expected output vector. The error is minimized as soon as the network predicts correctly for the different target and context words. The weights between the input and hidden layer will then contain all our word embeddings.

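In the sketch above, that means each row of W1 is the embedding of the word with the corresponding ID once training has finished:

# 'king' has ID 6, so its 5-dimensional embedding is row 6 of W1.
king_embedding = W1[6]
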
You can find the complete example with executable code here. You can create a copy and play with it if you press “Open in playground.”

If you are not familiar with notebooks, it’s pretty simple: a notebook can be read from top to bottom, and you can click and edit the Python code directly.

By pressing “SHIFT+Enter,” you can execute code snippets. Just make sure to start at the top: click into the first snippet and press SHIFT+Enter, wait a bit, press SHIFT+Enter in the next one, and so on.

Conclusion

In a nutshell, word embeddings are used to build neural networks in a more flexible way. They can be created using neural networks that have a certain task, such as the prediction of a target word for a given context word. The weights between the layers are parameters that are adjusted over time. Et voilà, there are your word embeddings.



I hope you enjoyed the article. If you like it and feel the need for a round of applause, follow me on Twitter.

I am a co-founder of our revolutionary journey platform called Explore The World. We are a young startup located in Dresden, Germany, and will target the German market first. Reach out to me if you have feedback and questions about any topic.

Happy AI exploring :)



References

  • Wikipedia: Natural language processing
    https://en.wikipedia.org/wiki/Natural_language_processing
  • Great paper about text classification created by co-founders of fastai
    https://arxiv.org/abs/1801.06146
  • Google's state-of-the-art approach for NLP tasks
    https://arxiv.org/abs/1810.04805

Translated from: https://www.freecodecamp.org/news/demystify-state-of-the-art-text-classification-word-embeddings/
