NLP: Word Importance

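Contents

What Is an Important Word
TF*IDF
Other Versions of TF*IDF
- TF
- IDF
Algorithm Characteristics
- Advantages of TF*IDF
- Disadvantages of TF*IDF
Applications of TF*IDF
- Search Engine
- Text Summarization
- Text Similarity Calculation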

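The previous article covered new-word discovery, using internal cohesion and left/right entropy to find new words. At that point the machine has some grasp of the words in an article, so the next step is to move it up to understanding the article itself. How can a machine identify the more important words in an article? Can we simply count how many times a word appears in it? That would let many common, everyday words be counted as important. For example, when building a word cloud we only want to pick out the article's topic words, so how should we do that?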


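What Is an Important Word

If a word appears many times in texts of one category (call it category A) but rarely in texts of other categories (non-A), then it is an important (high-weight) word for category A. Conversely, if a word appears across many domains, it carries little importance for any particular category. For example, words such as "star" and "black hole" appear frequently in astronomy articles, so they are relatively important for the astronomy domain, whereas words such as "hello" and "we" appear frequently in most articles and need not be considered important.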

TF*IDF

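How can we describe such important words mathematically? There is a classic statistic in NLP: TF*IDF. TF is the term frequency: the number of times a word appears in a category divided by the total number of words in that category. IDF is the inverse document frequency, usually taken with a logarithm: $\mathrm{IDF} = \log\frac{N}{\text{number of documents containing the word} + 1}$
where $N$ is the number of documents. A high inverse document frequency means the word rarely appears in other documents. Every word gets a TF*IDF value for every category.
A high TF*IDF means the word is highly important for that domain; a low value means the opposite.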
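The definition above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the common logarithmic form $\log(N / (df + 1))$, it tokenizes English text on whitespace instead of using a Chinese segmenter, and `calculate_tfidf_sketch` is a hypothetical name that is not the `calculate_tfidf` module imported by the code later in this article.

```python
import math
from collections import Counter


def calculate_tfidf_sketch(tokenized_docs):
    """Return {doc_id: {word: tf * idf}} for a list of token lists."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    result = {}
    for doc_id, tokens in enumerate(tokenized_docs):
        counts = Counter(tokens)
        total = len(tokens)
        result[doc_id] = {
            # tf = count / total words; idf = log(N / (df + 1))
            word: (count / total) * math.log(n_docs / (doc_freq[word] + 1))
            for word, count in counts.items()
        }
    return result


# Toy corpus (assumption: whitespace tokenization stands in for real segmentation).
docs = [
    "the star collapsed into a black hole".split(),
    "the team won the game".split(),
    "the market fell again".split(),
    "the recipe needs salt".split(),
]
scores = calculate_tfidf_sketch(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is $\log(4/5)$, which is negative, while "star" occurs in only one document and gets a positive weight. This is the behavior the section describes: words spread across all documents are pushed down.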

Other Versions of TF*IDF

TF

[Figure: alternative formulas for TF]
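A few commonly cited TF variants can be sketched as follows. These are standard textbook forms and are not necessarily the ones this section originally listed; the function name `tf_variants` and the dictionary keys are made up for illustration.

```python
import math
from collections import Counter


def tf_variants(tokens, word):
    """Compute several common TF definitions for `word` in a non-empty token list."""
    counts = Counter(tokens)
    count = counts[word]
    max_count = max(counts.values())
    return {
        # Number of occurrences.
        "raw_count": count,
        # Occurrences divided by document length (the form used in this article).
        "relative_frequency": count / len(tokens),
        # Dampens the effect of very frequent words.
        "log_scaled": 1 + math.log(count) if count > 0 else 0.0,
        # Normalizes by the most frequent word in the document.
        "double_normalized": 0.5 + 0.5 * count / max_count,
    }


tokens = "star star star hole the".split()
variants = tf_variants(tokens, "star")
```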

IDF

[Figure: alternative formulas for IDF]
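Likewise, some widely used IDF variants are sketched below. They are given as common forms under the assumption that the logarithm is natural; they may differ from the variants this section originally showed, and `idf_variants` is a hypothetical helper.

```python
import math


def idf_variants(n_docs, doc_freq):
    """Compute several common IDF definitions; doc_freq must be at least 1."""
    return {
        # Plain inverse document frequency.
        "plain": math.log(n_docs / doc_freq),
        # The form used in this article: +1 in the denominator avoids division by zero.
        "plus_one_denominator": math.log(n_docs / (doc_freq + 1)),
        # Add-one smoothing on both counts plus a constant, so the value stays positive.
        "smoothed": math.log((1 + n_docs) / (1 + doc_freq)) + 1,
    }


rare = idf_variants(n_docs=1000, doc_freq=9)
common = idf_variants(n_docs=1000, doc_freq=999)
```

Whichever variant is chosen, a word found in few documents (`rare`) receives a larger value than a word found in almost all of them (`common`).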

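Algorithm Characteristics

1. The computation of $TF * IDF$ depends heavily on word segmentation; if the segmentation is wrong, the statistic loses much of its meaning.
2. Every word has a different $TF * IDF$ value for every document, so $TF * IDF$ cannot be discussed apart from the data.
3. With only a single text, $TF * IDF$ cannot be computed.
4. Balanced data across categories is important; keep the length of each article roughly the same.
5. It is easily affected by special symbols, so some preprocessing is advisable.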
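As an illustration of the preprocessing mentioned in the last point, the sketch below removes symbols before segmentation. The rule it applies (keep CJK ideographs, ASCII letters, and digits; turn everything else into a space) is an assumption chosen for this example, and `clean_text` is a hypothetical helper; real corpora may need different rules.

```python
import re

# Anything that is not a CJK ideograph, an ASCII letter, or a digit counts as noise.
_NOISE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def clean_text(text):
    """Replace runs of symbols with one space, trim the ends, and lowercase."""
    return _NOISE.sub(" ", text).strip().lower()


cleaned = clean_text("【快讯】Black-Hole 图像!!! 首次公布～")
```

Here `cleaned` is `"快讯 black hole 图像 首次公布"`: brackets, punctuation, and the hyphen no longer end up as tokens that distort the statistics.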

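Advantages of TF*IDF

1. Good interpretability: the keywords are clearly visible, and even when a prediction is wrong the cause is easy to find.
2. Fast to compute: word segmentation itself takes most of the time, and the rest is simple counting.
3. Little dependence on labeled data: part of the work can be done with an unlabeled corpus.
4. It combines well with many other algorithms, where it can serve as a word weight.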

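Disadvantages of TF*IDF

1. Strongly affected by segmentation quality.
2. No semantic similarity between words.
3. No word-order information (bag-of-words model).
4. Limited capability: it cannot handle complex tasks such as machine translation or entity mining.
5. Imbalanced samples strongly affect the results.
6. The distribution of samples within a class is not taken into account.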
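The loss of word order can be seen in a tiny example. The sketch assumes whitespace tokenization of English text, and `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count each word and discard the order in which the words appeared."""
    return Counter(tokens)


a = "dog bites man".split()
b = "man bites dog".split()
# The two sentences mean different things but produce identical bags.
same_bag = bag_of_words(a) == bag_of_words(b)
```

Because every statistic in TF*IDF is derived from such counts, the two sentences are indistinguishable to it. In the same way, synonyms such as "car" and "automobile" are treated as unrelated words.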

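Applications of TF*IDF

Search Engine

1. For all existing web pages (texts), compute the $TF * IDF$ value of every word in each page.
2. Segment the input query into words.
3. For a document D, sum the $TF * IDF$ values in D of the words in the query, and use the sum as the relevance score between the query and the document.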

"""
A simple search engine based on TF*IDF
"""
import json

import jieba

# calculate_tfidf is a local module that is not shown in this article.
from calculate_tfidf import calculate_tfidf

jieba.initialize()


# Load the documents (think of them as web pages) and compute a TF*IDF dict for each one.
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            corpus.append(document["title"] + "\n" + document["content"])
    tf_idf_dict = calculate_tfidf(corpus)
    return tf_idf_dict, corpus


def search_engine(query, tf_idf_dict, corpus, top=3):
    query_words = jieba.lcut(query)
    res = []
    for doc_id, tf_idf in tf_idf_dict.items():
        # Relevance score: the sum of the TF*IDF values of the query words in this document.
        score = 0
        for word in query_words:
            score += tf_idf.get(word, 0)
        res.append([doc_id, score])
    res = sorted(res, reverse=True, key=lambda x: x[1])
    # Slicing also handles a corpus with fewer than `top` documents.
    for doc_id, _ in res[:top]:
        print(corpus[doc_id])
        print("--------------")
    return res


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, corpus = load_data(path)
    while True:
        query = input("Enter your search query: ")
        search_engine(query, tf_idf_dict, corpus)
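To see the ranking step in isolation, without jieba or a news.json file, the sketch below scores a hand-written TF*IDF table. The weights in `toy_tf_idf` are invented for illustration, and `rank_documents` is a hypothetical helper that mirrors the scoring loop of `search_engine`.

```python
def rank_documents(query_words, tf_idf_dict, top=3):
    """Score each document by the summed weights of the query words, best first."""
    scored = [
        (doc_id, sum(weights.get(word, 0.0) for word in query_words))
        for doc_id, weights in tf_idf_dict.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Invented weights: {doc_id: {word: tf-idf value}}.
toy_tf_idf = {
    0: {"black": 0.20, "hole": 0.25, "star": 0.10},
    1: {"football": 0.30, "star": 0.05},
    2: {"recipe": 0.40},
}
ranking = rank_documents(["black", "star"], toy_tf_idf, top=2)
```

Document 0 ranks first because it contains both query words; document 1 follows with only "star", and document 2 scores zero and is cut off by `top=2`.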

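Text Summarization

1. Compute $TF * IDF$ values to obtain the keywords of each text.
2. Treat sentences that contain many keywords as key sentences.
3. Select several key sentences as the summary of the text.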

"""
A simple text summarizer based on TF*IDF
"""
import json
import re

import jieba

# calculate_tfidf is a local module that is not shown in this article.
from calculate_tfidf import calculate_tfidf

jieba.initialize()


# Load the documents (think of them as web pages) and compute a TF*IDF dict for each one.
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            # Title and content are joined with "\n" and split again later,
            # so neither may contain a newline.
            assert "\n" not in document["title"]
            assert "\n" not in document["content"]
            corpus.append(document["title"] + "\n" + document["content"])
    tf_idf_dict = calculate_tfidf(corpus)
    return tf_idf_dict, corpus


# Compute the summary of one article.
# Inputs: the article's TF*IDF dict and the article content.
# top is the manually chosen number of sentences to select.
# Articles whose body is too short are filtered out, since summarizing them is of little use.
def generate_document_abstract(document_tf_idf, document, top=3):
    sentences = re.split("？|！|。", document)
    # Skip articles with five sentences or fewer.
    if len(sentences) <= 5:
        return None
    result = []
    for index, sentence in enumerate(sentences):
        sentence_score = 0
        words = jieba.lcut(sentence)
        for word in words:
            sentence_score += document_tf_idf.get(word, 0)
        # Normalize by length so long sentences are not favored; +1 avoids division by zero.
        sentence_score /= (len(words) + 1)
        result.append([sentence_score, index])
    result = sorted(result, key=lambda x: x[0], reverse=True)
    # The highest-weighted sentences might be the 10th, 6th, and 3rd;
    # putting them back in document order (3, 6, 10) reads more naturally.
    important_sentence_indexs = sorted([x[1] for x in result[:top]])
    return "。".join([sentences[index] for index in important_sentence_indexs])


# Generate summaries for all articles.
def generate_abstract(tf_idf_dict, corpus):
    res = []
    for index, document_tf_idf in tf_idf_dict.items():
        title, content = corpus[index].split("\n")
        abstract = generate_document_abstract(document_tf_idf, content)
        if abstract is None:
            continue
        corpus[index] += "\n" + abstract
        res.append({"title": title, "content": content, "abstract": abstract})
    return res


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, corpus = load_data(path)
    res = generate_abstract(tf_idf_dict, corpus)
    with open("abstract.json", "w", encoding="utf8") as writer:
        writer.write(json.dumps(res, ensure_ascii=False, indent=2))
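The sentence-selection idea can be tried on a small English text without jieba. In this sketch the keyword weights are invented, sentences are split on `.`, `!`, and `?`, words are split on whitespace, and `top_sentences` is a hypothetical helper; it follows the same scoring rule as the code above (summed weights divided by sentence length plus one).

```python
import re


def top_sentences(text, weights, top=2):
    """Return the `top` highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    scored = []
    for index, sentence in enumerate(sentences):
        words = sentence.lower().split()
        score = sum(weights.get(w, 0.0) for w in words) / (len(words) + 1)
        scored.append((score, index))
    best = sorted(scored, reverse=True)[:top]
    keep = sorted(index for _, index in best)
    return [sentences[i] for i in keep]


text = (
    "The probe reached orbit. Fans cheered loudly. "
    "The probe mapped the crater. It was cold."
)
# Invented keyword weights standing in for real TF*IDF values.
weights = {"probe": 0.5, "orbit": 0.4, "crater": 0.6, "mapped": 0.3}
summary = top_sentences(text, weights)
```

The two sentences about the probe carry all the keyword weight, so they form the summary, and they are returned in the order they appear in the text.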

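Text Similarity Calculation

1. After computing $TF * IDF$ for all texts, take the top n words by $TF * IDF$ from each text to obtain a word set S.
2. For each text D, compute the frequency of each word in S and use these frequencies as the text's vector.
3. Compute the cosine of the angle between two vectors to obtain their similarity, and use it as the similarity of the texts.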
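The cosine of the angle between two vectors $x$ and $y$ is computed as:
$\cos\theta = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$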
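A short numeric check of the cosine measure, using the standard definition (dot product divided by the product of the two vector lengths). The helper name `cosine` and the example vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity close to 1.
parallel = cosine([1, 2, 0], [2, 4, 0])
# No shared non-zero dimension: similarity 0.
orthogonal = cosine([1, 0, 0], [0, 3, 0])
```

Because the measure depends only on direction, a long document and a short one with the same word proportions are treated as highly similar.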

"""
Text similarity calculation based on TF*IDF
"""
import json
import math

import jieba

# calculate_tfidf is a local module that is not shown in this article.
from calculate_tfidf import calculate_tfidf, tf_idf_topk

jieba.initialize()


# Load the documents (think of them as web pages) and compute a TF*IDF dict for each one.
# Then collect the top 5 words of each document into an important-word vocabulary.
# The vocabulary is used later to vectorize text.
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            corpus.append(document["title"] + "\n" + document["content"])
    tf_idf_dict = calculate_tfidf(corpus)
    topk_words = tf_idf_topk(tf_idf_dict, top=5, print_word=False)
    vocab = set()
    for words in topk_words.values():
        for word, score in words:
            vocab.add(word)
    print("Vocabulary size:", len(vocab))
    return tf_idf_dict, list(vocab), corpus


# passage is a text string.
# vocab is a list of words.
# Vectorization: the frequency of each important word in the document.
def doc_to_vec(passage, vocab):
    vector = [0] * len(vocab)
    passage_words = jieba.lcut(passage)
    # An empty passage has no words; return the zero vector rather than dividing by zero.
    if not passage_words:
        return vector
    for index, word in enumerate(vocab):
        vector[index] = passage_words.count(word) / len(passage_words)
    return vector


# Compute the vectors of all documents up front.
def calculate_corpus_vectors(corpus, vocab):
    corpus_vectors = [doc_to_vec(c, vocab) for c in corpus]
    return corpus_vectors


# Compute the cosine similarity of two vectors.
def cosine_similarity(vector1, vector2):
    x_dot_y = sum([x * y for x, y in zip(vector1, vector2)])
    sqrt_x = math.sqrt(sum([x ** 2 for x in vector1]))
    sqrt_y = math.sqrt(sum([x ** 2 for x in vector2]))
    # A zero vector has no direction; define its similarity as 0.
    if sqrt_x == 0 or sqrt_y == 0:
        return 0
    return x_dot_y / (sqrt_x * sqrt_y + 1e-7)


# Given an input text, find the most similar documents.
def search_most_similar_document(passage, corpus_vectors, vocab):
    input_vec = doc_to_vec(passage, vocab)
    result = []
    for index, vector in enumerate(corpus_vectors):
        score = cosine_similarity(input_vec, vector)
        result.append([index, score])
    result = sorted(result, reverse=True, key=lambda x: x[1])
    return result[:4]


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, vocab, corpus = load_data(path)
    corpus_vectors = calculate_corpus_vectors(corpus, vocab)
    passage = "魔兽争霸"
    for corpus_index, score in search_most_similar_document(passage, corpus_vectors, vocab):
        print("Similar article:\n", corpus[corpus_index].strip())
        print("Score:", score)
        print("--------------")