How do you analyze and identify the high-frequency words and keywords in an article or other content?


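To find the high-frequency words and keywords in an article, you can use Python's nltk library together with the standard-library collections module, or you can use the jieba library. This article shows how to do the analysis with each of the two approaches.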

nltk and the collections module

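First, install the nltk library. collections is part of the Python standard library, so it needs no separate installation:

```shell
pip install nltk
```

Next, download the stopwords and punkt data that nltk uses:

```python
import nltk

nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases may look for 'punkt_tab' instead of 'punkt';
# downloading both is harmless.
nltk.download('punkt_tab')
```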

Once the downloads finish, the following code reads the article and analyzes it:

```python
import collections

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Read the article
with open('article.txt', 'r', encoding='utf-8') as f:
    article = f.read()

# Tokenize
tokens = word_tokenize(article)

# Remove stop words (named stop_words so the imported stopwords module is not shadowed)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

# Count word frequencies
word_freq = collections.Counter(filtered_tokens)

# Print the high-frequency words
print('Top 10 frequent words:')
for word, freq in word_freq.most_common(10):
    print(f'{word}: {freq}')

# Extract the keywords
keywords = nltk.FreqDist(filtered_tokens).keys()

# Print the keywords
print('Keywords:')
for keyword in keywords:
    print(keyword)
```

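The code above first reads the article with open() and splits it into tokens with word_tokenize(). It then removes stop words using the stopwords corpus, counts word frequencies with collections.Counter(), and prints the ten most frequent words. Finally, it builds an nltk.FreqDist() and prints its keys as the keywords, that is, every distinct token that survived stop-word filtering.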

Note that article.txt in the code above must be replaced with the actual path to your article file.
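One practical point about the nltk listing: word_tokenize() returns punctuation marks as separate tokens, and the counter treats "Data" and "data" as different words, so commas and periods can crowd the top-10 list. The following is a minimal sketch of one way to normalize tokens before counting. To stay runnable without the NLTK data downloads, it uses a regular expression in place of word_tokenize() and a tiny hand-written stop-word set in place of stopwords.words('english'); both, along with the function name top_words and the sample sentence, are illustrative assumptions and not part of the original code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; in the pipeline above
# this would be set(stopwords.words('english')).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "are"}


def top_words(text, n=10):
    """Return the n most common lowercase alphabetic words that are not stop words."""
    # Stand-in for word_tokenize(): lowercase the text, then keep runs of
    # ASCII letters only, so punctuation and digits never become tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    filtered = [t for t in tokens if t not in STOP_WORDS]
    return Counter(filtered).most_common(n)


sample = "Data is the new oil. The data, and the tools to mine data, are key."
print(top_words(sample, 3))
```

With real input you would keep word_tokenize() and the full stop-word list, and apply the same two ideas: lowercase each token before counting, and drop tokens for which str.isalpha() is false.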

Implementation with the jieba library

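```python
# Import the required libraries
# (third-party packages: pip install jieba wordcloud matplotlib)
from collections import Counter

import jieba
import jieba.analyse
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the article
with open('./data/2.txt', 'r', encoding='utf-8') as f:
    article = f.read()

# Segment the text into words (lcut returns a list, so it can be reused below)
words = jieba.lcut(article)

# Count word frequencies
word_counts = Counter(words)

# Print the high-frequency words
print('High-frequency words:')
for word, count in word_counts.most_common(10):
    print(word, count)

# Print the keywords, restricted to nouns, person names, and place names
print('Keywords:')
keywords = jieba.analyse.extract_tags(article, topK=10, withWeight=True, allowPOS=('n', 'nr', 'ns'))
for keyword, weight in keywords:
    print(keyword, weight)

# Generate the word cloud. WordCloud does not segment Chinese text itself,
# so pass the segmented words joined by spaces. font_path must point to a
# font with Chinese glyphs (msyh.ttc is Microsoft YaHei on Windows).
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', width=800, height=600).generate(' '.join(words))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```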
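The two lists printed by the jieba code answer different questions: Counter ranks words by raw frequency, while jieba.analyse.extract_tags() by default ranks them by TF-IDF, which discounts words that are common everywhere. The sketch below illustrates that idea on pre-segmented toy documents using only the standard library. It is not jieba's implementation: jieba relies on a bundled IDF table, whereas here IDF is computed from a small made-up corpus, and the function name tfidf_keywords, the particular IDF formula, and the sample data are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Rank the words of doc_tokens by TF-IDF against a list of tokenized documents."""
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Document frequency: how many corpus documents contain the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (one common variant): a word found in every
        # document gets weight 0, rarer words get a larger weight.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[word] = (count / total) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up, already-segmented documents (the kind of lists jieba.lcut() returns).
corpus = [
    ["我们", "的", "模型", "的", "效果", "很", "好"],
    ["今天", "的", "天气", "很", "好"],
    ["模型", "训练", "需要", "的", "数据", "很", "多"],
]
doc = ["模型", "的", "训练", "数据", "的", "模型", "的", "数据"]

print("By raw frequency:", Counter(doc).most_common(3))
print("By TF-IDF:")
for word, score in tfidf_keywords(doc, corpus):
    print(word, round(score, 3))
```

In this toy data the particle "的" is the most frequent word in doc, yet it appears in every corpus document and therefore scores zero, which is why a keyword list can differ sharply from a high-frequency list.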