语义分析 文本矛盾点解析_关于解析文本的几点思考

语义分析 文本矛盾点解析

Yesterday I wrote about three course modules in Oslo, and the fact that most of the presentation material is online. Today I will be writing about one lesson in the curriculum about ‘Parsing’. First I will share a few general thoughts. Consider this in a format as learning notes, from this presentation.

昨天我在奥斯陆写了三个课程模块,而且大多数演示材料都在线。 今天,我将在“解析”课程中写一堂课。 首先,我将分享一些一般想法。 从此演示文稿中,将其视为学习笔记的格式。

In a discussion on Stackoverflow there were several interesting answers to this:

在关于Stackoverflow的讨论中,对此有几个有趣的答案:

“I’d explain parsing as the process of turning some kind of data into another kind of data. In practice, for me this is almost always turning a string, or binary data, into a data structure inside my Program. For example, turning:

“我将解析解释为将某种数据转换为另一种数据的过程。 实际上,对我而言,这几乎总是将字符串或二进制数据转换为程序内部的数据结构。 例如,转向:

":Nick!User@Host PRIVMSG #channel :Hello!"

into (C)

进入(C)

struct irc_line {
char *nick;
char *user;
char *host;
char *command;
char **arguments;
char *message;
} sample = { "Nick", "User", "Host", "PRIVMSG", { "#channel" }, "Hello!" }

Another user explained it as:

另一位用户解释为:

“Parsing is the process of analyzing text made of a sequence of tokens to determine its grammatical structure with respect to a given (more or less) formal grammar. The parser then builds a data structure based on the tokens. This data structure can then be used by a compiler, interpreter or translator to create an executable program or library.”

“解析 是分析由一系列标记组成的文本以确定相对于给定(或更少)形式语法的语法结构的过程。 然后,解析器基于令牌构建数据结构。 然后,编译器,解释器或翻译器可以使用此数据结构来创建可执行程序或库。”

Even providing a model:

甚至提供一个模型:

Image for post
wikimedia.orgwikimedia.org

Slightly more complicated, but perhaps accurate:

稍微复杂一点,但也许准确:

“In computer science, parsing is the process of analysing text to determine if it belongs to a specific language or not (i.e. is syntactically valid for that language’s grammar). It is an informal name for the syntactic analysis process.”

“在计算机科学中,解析是分析文本以确定其是否属于特定语言(即, 对于该语言的语法语法上有效 )的过程。 这是句法分析过程的非正式名称。”

The same user made an argument as to what is not:

同一用户对什么不是参数提出了争论:

  • Parsing is not transform one thing into another. Transforming A into B, is, in essence, what a compiler does. Compiling takes several steps, parsing is only one of them.

    解析不是将一件事变成另一件事。 本质上,将A转换为B是编译器的工作。 编译需要几个步骤,解析只是其中之一。

  • Parsing is not extracting meaning from a text. That is semantic analysis, a step of the compiling process.

    解析不是从文本中提取含义。 那就是语义分析 ,这是编译过程的一步。

On Wikipedia it is:

在Wikipedia上是:

“Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).”

解析语法分析句法分析是按照自然语法,计算机语言或数据结构分析一串符号的过程,符合形式语法规则。 解析一词来自拉丁语pars ( orationis ),意思是(语音的一部分)。”

As a general explanation that works, we have several, however we might want to consider this more closely.

作为可行的一般解释,我们有几种解释,但是我们可能需要更仔细地考虑。

There is additionally a video accompanying this, although this video is in Norwegian.

尽管此视频是挪威语的 ,但还附带有一个视频。

依赖解析 (Dependency parsing)

There is a relationships between words, i.e., dependency relations.

单词之间存在关系,即依赖关系。

A dependency structure can be defined as a labeled, directed graph G.

依赖项结构可以定义为标记的有向图G。

Image for post
IN2110IN2110中的演示文稿

The Principles outlined in the IN2110 presentation are as follows:

IN2110演示文稿中概述的原理如下:

  • “Syntactic structure is complete (Connectedness).

    “句法结构完整( 连通性 )。

  • Syntactic structure is hierarchical (Acyclicity).

    句法结构是分层的(非 循环性 )。

  • Every word has at most one syntactic head (Single-Head).”

    每个单词最多具有一个语法头( Single-Head )。”

Connectedness can be enforced by adding a special root node (node 0).

可以通过添加特殊的根节点(节点0)来增强连接性。

树库:普遍依赖 (Treebanks: Universal Dependencies)

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing more than 150 treebanks in 90 languages. If you’re new to UD, you should start by reading the first part of the Short Introduction and then browsing the annotation guidelines.”

通用依赖关系(UD) 是一个框架,用于在不同人类语言之间一致地注释语法(词性,词法特征和句法依赖关系)。 UD是一个开放的社区活动,有300多个贡献者以90种语言生成了150多个树库。 如果您不熟悉UD,则应先阅读简短介绍的第一部分,然后浏览注释准则。”

Image for post
IN2110IN2110中的演示文稿

(程度)跨语言一致性 ((Degrees of) Cross-Linguistic Consistency)

On this topic there is an interesting paper that may be worth checking out from Google Research.

关于这个主题,有一篇有趣的论文可能值得从Google Research查阅。

Sentences across certain languages could all for example start with a big letter and end with punctuation.

例如,某些语言的句子都可以以一个大字母开头并以标点符号结尾。

Image for post
IN2110IN2110中的演示文稿

回顾过去(90年代) (Back in the days (90s))

How were parsing different in the 1990's?

解析在1990年代有何不同?

  • Parsers assigned linguistically detailed syntactic structures (based on linguistic theories).

    解析器分配了详细的语言句法结构(基于语言理论)。
  • Grammar-driven parsing: possible trees defined by the grammar.

    语法驱动的解析:语法定义的可能树。
  • Problems with coverage.

    覆盖问题。
  • Only around 70% of all sentences were assigned an analysis.

    所有句子中只有大约70%被分配了分析。
  • Most sentences were assigned very many analyses by a grammar and there is no way of choosing between them.

    大多数句子都被一个语法分配了很多分析,并且没有办法在它们之间进行选择。

输入数据驱动的(统计)解析 (Enter data-driven (statistical) parsing)

Compared to this what is modern parsing like in 2020?

与此相比,2020年的现代解析是什么样的?

  • Today data-driven/statistical parsing is available for a range of languages and syntactic frameworks.

    如今,数据驱动/统计解析可用于多种语言和语法框架。
  • Data-driven approaches: possible trees defined by the treebank (may also involve a grammar).

    数据驱动的方法:由树库定义的可能的树(也可能涉及语法)。
  • Produce one analysis (hopefully the most likely one) for any sentence and get most of them correct.

    对任何句子进行一项分析(希望是最有可能的一项分析),并使其大部分正确。
  • Still an active field of research, improvements are still possible.

    仍然是一个活跃的研究领域,改进仍然是可能的。

Further to this what is data-driven dependency parsing?

除此之外,什么是数据驱动的依赖项解析?

Data-driven dependency parsing

数据驱动的依赖项解析

  • M defined by formal conditions on dependency graphs (labeled directed graphs that are): I connected I acyclic I single-head I (projective)

    M由依存关系图(带标签的有向图)的形式条件定义:I连接I非循环I单头I(射影)
  • I may be defined in different ways I parsing method (deterministic, non-deterministic) I machine learning algorithm, feature representations.

    我可能以不同的方式定义我的解析方法(确定性,非确定性),机器学习算法,特征表示。

Two main approaches:

两种主要方法:

  1. Graph-based.

    基于图。
  2. Transition-based models.

    基于过渡的模型。

The IN2110 lecture focus on transition-based approaches.

IN2110讲座重点介绍基于过渡的方法。

Transition-based approaches.

基于过渡的方法。

Basic idea: define a transition system for mapping a sentence to its dependency graph.

基本思想 :定义一个将句子映射到其依赖图的转换系统。

  • Learning: induce a model for predicting the next state transition, given the transition history.

    学习 :根据给定的转换历史,得出一个用于预测下一个状态转换的模型。

  • Parsing: Construct the optimal transition sequence, given the induced model.

    解析:给定诱导模型,构造最佳​​过渡序列

Shift-Reduce解析的改编。 (An Adaptation of Shift–Reduce Parsing.)

  • Originally developed for non-ambiguous languages: deterministic.

    最初是为非歧义语言开发的:确定性。
  • Shift (‘read’) tokens from input buffer, one at a time, left-to-right;

    从输入缓冲区从左到右一次移位(“读取”)令牌;
  • Compare top n symbols on stack against rule RHS: reduce to LHS.

    比较规则RHS堆栈上的前n个符号:简化为LHS。
  • Dependencies: create arcs between top of stack and front of buffer.

    相关性:在堆栈顶部和缓冲区前端之间创建弧。

Architecture: Stack and Buffer Configurations.

体系结构:堆栈和缓冲区配置。

Image for post
IN2110IN2110中的演示文稿

So within this workspace one has to navigate in parsing:

因此,在此工作空间中,必须进行解析:

  • Transition system ensures formal wellformedness of dependency trees;

    过渡系统确保依赖树的形式良好;
  • The arc-eager system can generate all projective trees (and only those);

    弧线渴望系统可以生成所有投射树(并且仅生成那些);
  • A specific sequence of transitions determines the final parsing result.

    特定的过渡顺序决定了最终的解析结果。

Towards a Parsing Algorithm:

迈向解析算法:

  • Abstract goal: Find transition sequence that yields the ‘correct’ tree.

    抽象目标:找到产生“正确”树的过渡序列。
  • Learn from treebanks: output dependency tree with high probability.

    向树库学习:高概率输出依赖树。
  • Probability distributions over transitions sequences (rather than trees).

    过渡序列(而不是树)上的概率分布。

架构摘要 (Architecture Summary)

Image for post
IN2110IN2110中的演示文稿

Data is labeled in the test set and attempted predictions are made.

在测试集中标记数据并进行尝试的预测。

数据驱动的依赖解析器 (Data-driven dependency parsers)

There are a number of freely available dependency parsers:

有许多免费的依赖项解析器:

  • Pre-trained models and trainable for any language (given available training data)

    预先训练的模型并且可以针对任何语言进行训练(如果有可用的训练数据)
  1. Stanford CoreNLP (English)

    斯坦福大学CoreNLP(英语)
  2. SpaCy (A number of languages)

    SpaCy(多种语言)
  3. Google SyntaxNet I UDParse I etc.

    Google SyntaxNet I UDParse I等

There does however need to be evaluation.

但是,确实需要进行评估。

I wrote about this previously in regards to how it might need to change in NLP:

之前我就如何更改NLP中的内容写过文章:

However, the status is that currently the metrics are often what counts.

但是,目前的状态是通常很重要的指标。

UAS: Unlabeled Attachment Score I For each token, does it have correct head (source of incoming edge)?

UAS :未标记的附着分数I对于每个令牌,它是否具有正确的头(传入边缘的来源)?

LAS: Labeled Attachment Score I In addition to the head, is the dependency type (edge label) correct?

LAS :带标签的附件分数I除头部之外,依存关系类型(边缘标签)是否正确?

结论 (In Conclusion)

Data-Driven Dependency Parsing:

数据驱动的依赖关系解析:

  • No notion of grammaticality (no rules): more or less probable trees.

    没有语法概念(没有规则):或多或少的可能树。
  • Much room for experimentation: Feature models and types of classifiers.

    有很大的实验空间:特征模型和分类器类型。
  • Decent results with Maximum Entropy or Support Vector Machines.

    使用最大熵或支持向量机可获得不错的结果。
  • In recent years, further advances with deep neural network classifiers.

    近年来,深度神经网络分类器取得了进一步的进步。

Variants on Data-Driven Dependency Parsing:

数据驱动的依赖项解析的变体:

  • Other transition systems (e.g. arc-standard; like ‘classic’ shift-reduce).

    其他过渡系统(例如,弧形标准;例如“经典”移位减少)。
  • Different techniques for non-projective trees; e.g. swap transitions.

    非投影树的不同技术; 例如交换过渡。
  • Can relax transition system further, to output general, non-tree graphs.

    可以进一步放松过渡系统,以输出一般的非树图。
  • Beam search: exploring the top-n transitions out of each configuration.

    光束搜索:探索每种配置的前n个过渡。

I currently need to work with a corpus of documents and thought it was interesting to consider parsing as a problem.

我目前需要处理大量文档,并且认为将解析视为问题很有趣。

This is #500daysofAI and you are reading article 433. I am writing one new article about or related to artificial intelligence every day for 500 days.

这是#500daysofAI,您正在阅读文章433。我连续500天每天都在撰写一篇有关人工智能或与人工智能有关的新文章。

翻译自: https://medium.com/swlh/a-few-thoughts-on-parsing-text-b496a0f99dde

语义分析 文本矛盾点解析

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/242162.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

专家建议用南方的养老金拿去救济东北,网友炸锅了

随着我国老龄化的不断加剧,养老资金已经成为了社会要面临的一个艰巨问题,特别是在东北地区,这些年来东北地区经济发展比较缓慢,导致养老资金收入也跟着放慢,目前有的省份养老基金结余已经为负数,这对于如何…

明明知道银行存款会贬值,为什么还有那么多人把钱放在银行?

钱存在银行肯定是贬值的,但是对于那些风险意识比较低,或者从风险承受能力比较低的人来说,你除了存在银行,还有别的更好选择吗?没有!银行存款利率基本上跑不赢通货膨胀。最近几年,我国货币量的不…

pandas 机器学习_机器学习的PANDAS

pandas 机器学习Pandas is one of the tools in Machine Learning which is used for data cleaning and analysis. It has features which are used for exploring, cleaning, transforming and visualizing from data.Pandas是机器学习中用于数据清理和分析的工具之一。 它具…

家族信托是什么东东?为何受到富豪们的大力吹捧?

说到信托相信很多人都知道是怎么回事,但是说到家族信托就未必有人知道是啥回事了。不过大家对于家族信托不了解,并不妨碍家族信托的迅猛发展,最近几年家族信托这个概念在富豪圈已经成为了一个热门的话题,很多有钱的人都在尝试搞家…

安卓SlidingDrawer

本文由PurpleSword(jzj1993)原创&#xff0c;转载请注明原文网址 http://blog.csdn.net/jzj1993Layout<RelativeLayout xmlns:android"http://schemas.android.com/apk/res/android"android:layout_width"match_parent"android:layout_height"wrap…

鸭子目标检测数据集VOC格式300张

鸭子是一种较为常见的水禽&#xff0c;属于鸟纲鸭科动物&#xff0c;是人们生活中的常见动物之一。鸭子的外形特征独特&#xff0c;身体小巧灵活&#xff0c;身长约30-40厘米左右&#xff0c;体重在1-2千克左右。鸭子的嘴宽大而扁平&#xff0c;呈圆形&#xff0c;嘴边有着细小…

为什么支付宝不提供房贷业务?原因在这里

大家都知道房贷是银行的一块肥肉&#xff0c;也是银行利润最丰厚最稳定的一部分&#xff0c;目前房贷业务占到银行整体业务基本上都是在30%以上&#xff0c;有部分银行甚至达到50%以上。尽管房贷业务和很肥&#xff0c;但是并不是所有银行都会开展房贷业务&#xff0c;比如蚂蚁…

文本摘要提取_了解自动文本摘要-1:提取方法

文本摘要提取Text summarization is commonly used by several websites and applications to create news feed and article summaries. It has become very essential for us due to our busy schedules. We prefer short summaries with all the important points over read…

用户细分_基于购买历史的用户细分

用户细分介绍 (Introduction) The goal of this analysis was to identify different user groups based on the deals they have availed, using a discount app, in order to re-target them with offers similar to ones they have availed in the past.该分析的目的是使用折…

一个字节的网络漫游故事独白

大家好&#xff0c;给大家介绍一下&#xff0c;我是一个字节。相比于你们人类据说即将达到的百岁人生的寿命&#xff0c;我的一生简直不直一提&#xff08;我只能存活零点几个毫秒&#xff09;。也许只有那些码农才会了解我&#xff0c;而且也只有一部分码农。那些整天做业务的…

swap最大值和平均值_SWAP:Softmax加权平均池

swap最大值和平均值Blake Elias is a Researcher at the New England Complex Systems Institute.Shawn Jain is an AI Resident at Microsoft Research.布莱克埃里亚斯 ( Blake Elias) 是 新英格兰复杂系统研究所的研究员。 Shawn Jain 是 Microsoft Research 的 AI驻地 。 …

该酷的酷该飒的飒,穿出自己的潮流前线

精选匈牙利白鸭绒填充&#xff0c;柔软蓬松 舒适感很强&#xff0c;回弹性好 没有什么异味很干净安全 宝贝穿上去保暖又舒适 树脂拉链&#xff0b;金属按扣&#xff0c;松紧帽檐&#xff0b;袖口 下摆还做了可调节抽绳&#xff0c;细节满满防风保暖很nice 短款设计相较于…

pytorch卷积可视化_使用Pytorch可视化卷积神经网络

pytorch卷积可视化Filter and Feature map Image by the author筛选和特征图作者提供的图像 When dealing with image’s and image data, CNN are the go-to architectures. Convolutional neural networks have proved to provide many state-of-the-art solutions in deep l…

Golang之轻松化解defer的温柔陷阱

defer是Go语言提供的一种用于注册延迟调用的机制&#xff1a;让函数或语句可以在当前函数执行完毕后&#xff08;包括通过return正常结束或者panic导致的异常结束&#xff09;执行。深受Go开发者的欢迎&#xff0c;但一不小心就会掉进它的温柔陷阱&#xff0c;只有深入理解它的…

u-net语义分割_使用U-Net的语义分割

u-net语义分割Picture By Martei Macru On Unsplash图片由Martei Macru On Unsplash拍摄 Semantic segmentation is a computer vision problem where we try to assign a class to each pixel . Unlike the classic image classification task where only one class value is …

我国身家超过亿元的有多少人?

目前我国身家达到亿元以上的人数&#xff0c;从公开数据来看大概有13万人&#xff0c;但如果把那些统计不到的隐形亿万富翁计算在内&#xff0c;我认为至少有20万以上。公开资料显示目前我国亿万富翁人数达到133000人根据胡润2018财富报告显示&#xff0c;目前我国&#xff08;…

地理空间数据

摘要 (Summary) In this article, using Data Science and Python, I will show how different Clustering algorithms can be applied to Geospatial data in order to solve a Retail Rationalization business case.在本文中&#xff0c;我将使用数据科学和Python演示如何将…

嵌入式系统分类及其应用场景_词嵌入及其应用简介

嵌入式系统分类及其应用场景Before I give you an introduction on Word Embeddings, take a look at the following examples and ask yourself what is common between them:在向您介绍Word Embeddings之前&#xff0c;请看一下以下示例并问问自己它们之间的共同点是什么&…

山东男子5个月刷信用卡1800次,被银行处理后他选择29次取款100元

虽然我国实行的是存款自愿&#xff0c;取款自由的储蓄政策&#xff0c;客户想怎么取款&#xff0c;在什么时候取&#xff0c;取多少钱&#xff0c;完全是客户的权利&#xff0c;只要客户的账户上有钱&#xff0c;哪怕他每次取一毛钱取个100次都是客户的权利。但是明明可以一次性…

深发银行为什么要更名为平安银行?

深圳发展银行之所以更名为平安银行&#xff0c;最直接的原因是平安银行收购了深圳发展银行&#xff0c;然后又以平安集团作为主体&#xff0c;以深圳发展银行的名义收购了平安银行&#xff0c;最后两个人合并之后统一命名为平安银行。深圳发展银行更名为平安银行&#xff0c;大…