Extractive Text Summarization
Text summarization is commonly used by websites and applications to create news feeds and article summaries. Given our busy schedules, it has become essential: we prefer short summaries containing all the important points over reading an entire report and summarizing it ourselves. Several attempts have therefore been made to automate the summarization process. In this article, we will look at some of them and see how they work.
What is summarization?
Summarization is a technique to shorten long texts such that the summary has all the important points of the actual document.
There are mainly four types of summaries:
- Single Document Summary: a summary generated from a single document
- Multi-Document Summary: a summary drawn from multiple documents
- Query Focused Summary: a summary answering a specific query
- Informative Summary: a summary that conveys all the salient information of the source
Approaches to Automatic Summarization
There are mainly two approaches to summarization:
Extraction-based Summarization: The extractive approach involves picking the most important phrases and sentences from the documents and combining them to create the summary. In this case, every line and word of the summary actually belongs to the original document being summarized.
Abstraction-based Summarization: The abstractive approach involves generating the summary, typically with deep learning. It uses new phrases and terms that differ from the actual document while keeping the points the same, just as we do when we summarize ourselves. It is therefore much harder than the extractive approach.
It has been observed that extractive summaries sometimes work better than abstractive ones, probably because extractive methods do not require natural language generation or semantic representations.
Evaluation Methods
There are two types of evaluation:
- Human Evaluation
- Automatic Evaluation
Human Evaluation: Scores are assigned by human experts based on how well the summary covers the points and answers the queries, along with other factors such as grammaticality and non-redundancy.
Automatic Evaluation
ROUGE: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It determines the quality of a summary by comparing it to other summaries written by humans as references. To evaluate a model, there are a number of human-created reference summaries and the machine-generated candidate summary. The intuition is that if a model creates a good summary, it must share overlapping portions with the human references. ROUGE was proposed by Chin-Yew Lin of the University of Southern California.
Common versions of ROUGE are:
ROUGE-n: This measures the comparison between the machine-generated output and the reference output based on n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech, i.e., simply a sequence of words: bigrams mean two words, trigrams mean three words, and so on. We normally use bigrams.
ROUGE-n = p / q

Where p is “the number of common n-grams between candidate and reference summary”, and q is “the number of n-grams extracted from the reference summary only”. -Source
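To make this concrete, here is a minimal Python sketch of ROUGE-n recall for a single reference; the helper names are my own, not part of any official ROUGE package.

from collections import Counter

def ngrams(text, n=2):
    # collect consecutive n-word tuples from whitespace-split tokens
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=2):
    # p: n-grams common to candidate and reference (counted with multiplicity)
    # q: total number of n-grams in the reference
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    p = sum(min(cand[gram], count) for gram, count in ref.items())
    q = sum(ref.values())
    return p / q if q else 0.0

For example, rouge_n_recall("the cat sat on the mat", "the cat lay on the mat") matches three of the five reference bigrams, giving 0.6.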
ROUGE-L: This states that the longer the longest common subsequence between two texts, the more similar they are. It is therefore more flexible than the n-gram approach. It assigns scores based on the length of the longest sequence common to the machine-generated candidate and the human reference.
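Here is a sketch of the recall side of ROUGE-L, using the classic dynamic-programming longest-common-subsequence routine over word sequences; again, the function names are illustrative.

def lcs_length(a, b):
    # dynamic-programming table for the longest common subsequence of two word lists
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[-1][-1]

def rouge_l_recall(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    # the longer the LCS relative to the reference, the more similar the texts
    return lcs_length(cand, ref) / len(ref) if ref else 0.0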
ROUGE-SU: This introduces the concepts of skip-bigrams and unigrams. Basically, it allows or counts a bigram even if there are other words between its two words, i.e., the bigram’s words need not be consecutive.
ROUGE-2 is the most popular version and is given by:

ROUGE-2 = Σ_{s ∈ refs} Σ_{bigram i ∈ s} min(count(i, X), count(i, s)) / Σ_{s ∈ refs} Σ_{bigram i ∈ s} count(i, s)
Here, for every bigram i we take the minimum of the number of times it occurs in the generated document X and in a reference document s, summed over all the reference documents given, divided by the total number of times each bigram appears across all of the reference documents. This clipped counting is based on BLEU scores.
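A sketch of this multi-reference computation, reusing the ngrams helper from the ROUGE-n example above; the min() clipping mirrors the formula.

from collections import Counter

def rouge_2(candidate, references):
    # numerator: candidate bigram counts, clipped against each reference
    # denominator: total bigram count across all references
    cand = Counter(ngrams(candidate, 2))
    numerator = denominator = 0
    for ref_text in references:
        ref = Counter(ngrams(ref_text, 2))
        numerator += sum(min(cand[gram], count) for gram, count in ref.items())
        denominator += sum(ref.values())
    return numerator / denominator if denominator else 0.0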
Feature-Based Summarization: Developed by H. P. Luhn at IBM in 1958. The paper proposed that the importance of a sentence is a function of the high-frequency words in the document. More concretely, the algorithm measures the frequency of words and phrases in the document and decides the importance of a sentence from the words it contains and their frequencies: a sentence containing higher-frequency words is considered more important, excluding common stopwords such as “a” and “the”.
Extractive Summarization: “Extractive summarization techniques produce summaries by choosing a subset of the sentences in the original text.” -Source
Extractive summarizers first create an intermediate representation whose main task is to highlight or extract the most important information of the text to be summarized. There are two main types of representations:
Topic Representations: These focus on representing the topics covered in the text. There are several kinds of approaches to obtain this representation; we will talk about two of them here. Others include Latent Semantic Analysis and Bayesian models. If you want to study the others as well, I encourage you to go through the references.
Frequency-Driven Approaches: In this approach, we assign weights to the words. If a word is related to the topic, we assign it 1, otherwise 0. Depending on the implementation, the weights may also be continuous. Two common techniques for topic representation are:
Word Probability: This simply uses the frequency of a word as an indicator of its importance. The probability of a word w is given by its frequency of occurrence, f(w), divided by the total number of words N in the input:

P(w) = f(w) / N
For sentence importance using word probabilities, the importance of a sentence is given by the average importance of the words it contains.
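Here is a minimal sketch of word-probability scoring; splitting sentences on periods is a deliberate simplification, and the function names are my own.

from collections import Counter

def summarize_by_word_probability(text, k=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = text.lower().split()
    freq = Counter(words)
    total = len(words)  # N, the total number of words in the input

    def sentence_score(sentence):
        tokens = sentence.lower().split()
        # average of P(w) = f(w) / N over the words of the sentence
        return sum(freq[t] / total for t in tokens) / len(tokens)

    ranked = sorted(sentences, key=sentence_score, reverse=True)
    return ". ".join(ranked[:k]) + "."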
TF-IDF (Term Frequency-Inverse Document Frequency): This method was devised as an improvement over the word-probability method; here, TF-IDF is used to assign the weights. TF-IDF assigns low weights to words that occur very frequently across most of the documents, on the intuition that they are stopwords or words like “the”. Conversely, due to the term-frequency component, a word that appears in one document uniquely and with high frequency is given high weight.
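A sketch of TF-IDF weighting with scikit-learn’s TfidfVectorizer (assuming scikit-learn is installed); treating each sentence as its own document is a simplification, so that words frequent everywhere receive low IDF.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "The stock market fell sharply today.",
]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(sentences)
# score each sentence by the sum of the TF-IDF weights of its words
scores = np.asarray(matrix.sum(axis=1)).ravel()
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(round(score, 3), sentence)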
Topic Word Approaches: This approach is similar to Luhn’s. “The topic word technique is one of the common topic representation approaches which aims to identify words that describe the topic of the input document.” -Source This method calculates the word frequencies and uses a frequency threshold to find the words that can potentially describe the topic. It classifies the importance of a sentence as a function of the number of topic words it contains.
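A small sketch of the topic-word idea: words whose frequency crosses a threshold become topic words, and each sentence is scored by how many topic words it contains. The threshold and names here are illustrative.

from collections import Counter

def topic_word_scores(text, threshold=2):
    freq = Counter(text.lower().split())
    # words above the frequency threshold are treated as topic words
    topic_words = {w for w, c in freq.items() if c >= threshold}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # a sentence's importance is a function of the topic words it contains
    return {s: sum(1 for t in s.lower().split() if t in topic_words)
            for s in sentences}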
Indicator Representations: This type of representation depends on the features of the sentences and ranks them on the basis of those features. Here, the importance of a sentence does not depend on the words it contains, as in topic representations, but directly on the sentence features. There are two methods for this type of representation; let’s look at them.
Graph-Based Methods: These are based on the PageRank algorithm. They represent a text document as a connected graph: the sentences are the nodes, and an edge between two nodes measures the similarity between the two sentences. We will talk about this in detail in the upcoming sections.
Machine-Learning Methods: Machine learning methods approach summarization as a classification problem: the models try to classify sentences, based on their features, into summary or non-summary sentences. For training, we have a set of documents and their corresponding human-reference extractive summaries. Naive Bayes, decision trees, and SVMs are normally used here.
Scoring and Sentence Selection
Once we have the intermediate representations, we assign a score to each sentence to specify its importance. For a topic representation, a sentence’s score depends on the topic words it contains; for an indicator representation, the score depends on the features of the sentence. Finally, the top-scoring sentences are picked and used to generate the summary.
Graph-Based Methods
Graph-based methods were first introduced in a paper by Rada Mihalcea and Paul Tarau of the University of North Texas. The method is called the TextRank algorithm and is influenced by Google’s PageRank algorithm. The algorithm primarily tries to find the importance of a vertex in a given graph.
Now, how does the algorithm work?
In this algorithm, each sentence is represented as a vertex. An edge joining two vertices (two sentences) denotes that the sentences are similar: if the similarity of any two sentences is greater than a particular threshold, the nodes representing them are joined by an edge.
When two vertices are joined, one vertex is casting a vote for the other. The more votes a particular node (vertex, i.e., sentence) receives, the more important that node, and hence the sentence it represents. The votes are also weighted: not every vote carries the same importance. The weight of a vote depends on the importance of the node casting it; the more important the voting node, the more important its vote. So, the number of votes cast for a sentence and the importance of those votes together determine the importance of the sentence. This is the same idea behind Google’s PageRank algorithm and how it ranks webpages, except that there the nodes represent webpages.
If we have a paragraph, we decompose it into a set of sentences. Say we represent each sentence as a vertex v_i; we then obtain a set of vertices V. As discussed, an edge joins a vertex with another vertex of the same set, so the edge set E can be represented as a subset of (V x V). For a directed graph, In(V_i) is the set of incoming edges of a node, Out(V_j) is the set of outgoing edges of a node, and the importance score of a vertex is given by S(V_j).
PageRank Algorithm
According to the Google PageRank algorithm:

S(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|
Where S(V_i) is the score of the node under consideration, and the S(V_j) are the scores of all the nodes with outgoing edges to V_i. The score of each V_j is divided by the out-degree of V_j, which accounts for the probability that the user will choose that particular webpage.
Elaborately, in an example graph where a user standing at A can go to both B and C, the chance of going to C is 1/2, i.e., 1/(out-degree of A). The factor d is called the damping factor. In the original PageRank algorithm, d incorporates randomness: 1 - d denotes the probability that the user moves to a random webpage rather than one of the connected ones. The factor is generally set to 0.85. The same scheme is implemented in the TextRank algorithm.
Now the question arises: how do we obtain the scores?
Let’s work through the PageRank algorithm first, then transform it for TextRank. Suppose the graph has four vertices. First, we assign random scores to all the vertices, say [0.8, 0.9, 0.9, 0.9]. Then, probability scores are assigned to the edges.
The edge probabilities form the adjacency matrix of the graph: each value is 1/out-degree of the node the edge leaves. So, the PageRank graph is effectively unweighted, since the equation contains only this probability term as the weight.
The whole update then becomes:

S_new = (1 - d) + d * A · S_old

where A is the adjacency matrix.
We can see that the old score matrix is multiplied by the adjacency matrix to get the new score matrix. We continue this until the L2 norm of the difference between the new and old score matrices becomes less than a given constant, usually 1e-8. This convergence property comes from linear algebra and the theory of eigenvalues and eigenvectors; we will skip the math to keep it simple. Once convergence is achieved, we obtain the final importance scores from the score matrix.
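Here is a sketch of that power iteration in NumPy. The four-node adjacency matrix is illustrative: entry (i, j) holds 1/out-degree(j) whenever node j links to node i, matching the probability values described above.

import numpy as np

d = 0.85  # damping factor
A = np.array([
    [0.0, 0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.0],
])
scores = np.array([0.8, 0.9, 0.9, 0.9])  # initial random scores
while True:
    new_scores = (1 - d) + d * A.dot(scores)
    if np.linalg.norm(new_scores - scores) < 1e-8:  # L2-norm convergence test
        break
    scores = new_scores
print(scores)  # final importance scores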
For the TextRank algorithm, the equation and the graph are modified to use a weighted graph, because here simply dividing by the out-degree does not convey the full importance. As a result, the equation becomes:

WS(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) * WS(V_j)
Here, w_ji represents the weight of the edge from V_j to V_i.
The implementation of TextRank consists of two different natural language processing tasks:
- A keyword extraction task, which selects keywords and phrases
- A sentence extraction task, which identifies the most important sentences
Keyword extraction task
Previously this was done using frequency factors, which gave comparatively poor results. The TextRank paper introduced a fully unsupervised algorithm. The natural language text is tokenized and part-of-speech tagged, and single words are added to the word graph as nodes. If two words are similar, the corresponding nodes are connected by an edge, where similarity is measured by co-occurrence: if two words occur within a window of N words, with N varying from 2 to 10, the two words are considered similar. The words with the greatest number of important incident edges are selected as the most important keywords.
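A sketch of the keyword graph using networkx (assuming it is installed); part-of-speech filtering is omitted for brevity, and the window size is illustrative.

import networkx as nx

def keyword_rank(text, window=3, top_k=5):
    tokens = [t.strip(".,").lower() for t in text.split()]
    graph = nx.Graph()
    # join words that co-occur within the given window
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if word != other:
                graph.add_edge(word, other)
    ranks = nx.pagerank(graph)  # networkx implements the PageRank iteration
    return sorted(ranks, key=ranks.get, reverse=True)[:top_k]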
Sentence extraction task
This works similarly to keyword extraction; the only difference is that while in keyword extraction the nodes represent keywords, here they represent entire sentences. To form the graph for sentence ranking, the algorithm creates a vertex for each sentence in the text and adds it to the graph. Sentences are too long for co-occurrence measures to apply, so the paper instead uses a “similarity” between two sentences based on their content overlap; in simpler words, the similarity depends on the number of word tokens common to the two sentences. The authors propose a very interesting “recommendation” insight here: an edge joining two similar sentences (vertices) can be read as recommending to the reader another line similar to the one currently being read. The similarity therefore denotes shared content or interest between the two sentences. To prevent long sentences from being recommended too often, the importances are multiplied by a normalizing factor.
The similarity between two sentences is given by:

Similarity(S_i, S_j) = |{ w_k : w_k ∈ S_i and w_k ∈ S_j }| / ( log(|S_i|) + log(|S_j|) )
Where, given two sentences S_i and S_j, each sentence is represented by the set of N_i words that appear in it, and the numerator counts the words common to both.
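A direct Python sketch of this overlap similarity; the guard avoids dividing by zero when both sentences contain a single word, since log(1) = 0 (non-empty sentences are assumed).

import math

def textrank_similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)  # words appearing in both sentences
    norm = math.log(len(w1)) + math.log(len(w2))  # normalizes against long sentences
    return overlap / norm if norm > 0 else 0.0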
The most important sentences are then obtained in the same way as for keyword extraction.
This is an overall view of how TextRank operates; please go through the original paper to explore more.
In practice, for summary extraction, we use cosine similarity to decide the similarity between two sentences. Using this method, we may obtain several connected subgraphs, each denoting an important topic in the document; the connected components of the subgraphs give the sentences important to the corresponding topics.
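Below is a sketch combining TF-IDF vectors, cosine similarity, and PageRank into a small extractive summarizer, assuming scikit-learn and networkx are available; the similarity threshold is illustrative.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(sentences, top_k=2, threshold=0.1):
    matrix = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(matrix)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:  # join sentences whose similarity crosses the threshold
                graph.add_edge(i, j, weight=sim[i, j])
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]  # keep original order in the output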
The “Pytextrank” library allows applying the TextRank algorithm directly in Python.
import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
# (pytextrank 2.x API; newer releases register the component with nlp.add_pipe("textrank"))
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)
Implementation using the Pytextrank library, from Source.
For application details, please refer to the GitHub link.
Conclusion
In this article, we have seen basic extractive summarization approaches and the details of the TextRank algorithm. For abstractive methods, feel free to go through Part 2 of the article.
I hope this helps.
Translated from: https://towardsdatascience.com/understanding-automatic-text-summarization-1-extractive-methods-8eb512b21ecc