A Simple TfidfVectorizer and CountVectorizer Recommendation System for Beginners
The Goal
Recommendation systems are widely used in many industries to suggest items to customers. For example, a radio station may use a recommendation system to compile the top 100 songs of the month to suggest to its audience, or to identify songs of a similar genre to those the audience has requested. Given how widely recommendation systems are used in industry, we are going to create one for anime data. It would be nice if anime followers could see an updated top 100 anime chart every time they walk into an anime store, or receive an email suggesting anime based on the genres they like.
With the anime data, we will apply two different recommendation system models, a simple recommendation system and a content-based recommendation system, to analyse the data and create recommendations.
Overview
For the simple recommendation system, we need to calculate a weighted rating so that the same score backed by different numbers of votes carries unequal weight. For example, an average rating of 9.0 from 10 people will carry less weight than an average rating of 9.0 from 1,000 people. After we calculate the weighted rating, we can produce a top-chart list of anime.
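To make the effect of vote counts concrete, here is a minimal sketch that applies the weighted rating formula used later in this article to the two cases above. The values of m (minimum votes) and C (overall mean rating) are hypothetical, chosen only for illustration, and weighted_rating is an illustrative helper name rather than part of the original code.

```python
def weighted_rating(v, R, m, C):
    """Blend an item's mean rating R (from v votes) with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 100   # hypothetical minimum-vote threshold
C = 6.5   # hypothetical mean rating across all anime

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)

# The same 9.0 average is pulled toward C much more strongly with only 10 votes.
print(round(few_votes, 2), round(many_votes, 2))  # 6.73 8.77
```

With these assumed values, the 9.0 backed by 1,000 votes stays close to 9.0, while the 9.0 backed by 10 votes ends up near the overall mean.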
For the content-based recommendation system, we need to identify which features will be used in the analysis. We will then use sklearn to measure the similarity between the text of those features and create anime suggestions.
Data Overview
The anime data contains a total of 12,294 anime described by 7 columns: anime_id, name, genre, type, episodes, rating, and members.
Implementation
1. Import Data
We need to import pandas, as this will let us load the data neatly into a DataFrame.
import pandas as pd
anime = pd.read_csv('…/anime.csv')
anime.head(5)
anime.info()
anime.describe()
We can see that the minimum rating score is 1.67 and the maximum rating score is 10. The minimum number of members is 5 and the maximum is 1,013,917.
anime_dup = anime[anime.duplicated()]
print(anime_dup)
There are no duplicated rows that need to be cleaned.
type_values = anime['type'].value_counts()
print(type_values)
Most anime are broadcast on TV, followed by OVA.
2. Simple Recommendation System
First, we need to know how the weighted rating (WR) is calculated.
v is the number of votes for the anime; m is the minimum number of votes required to be listed in the chart; R is the average rating of the anime; C is the mean rating across the whole dataset.
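Written out in terms of these symbols, and matching the WR function implemented in the code below, the weighted rating is:

```latex
\[
WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
\]
```

The two weights sum to 1. As v grows relative to m, WR approaches the anime's own average R; with few votes it stays close to the overall mean C.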
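We need to determine what data will be used in this calculation. The dataset has no vote-count column, so the number of members stands in for v, and m is set to the 75th percentile of members.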
m = anime['members'].quantile(0.75)
print(m)
From the result, we are going to use only the anime that have more than 9,437 members to create the recommendation system.
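As a rough illustration of how a quantile cutoff selects rows, the sketch below applies the same idea to a tiny made-up table. The names and member counts are hypothetical and only meant to show the mechanics.

```python
import pandas as pd

# Hypothetical toy data, not taken from the anime dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "members": [10, 200, 3000, 40000, 500000],
})

# 75th percentile of members (pandas interpolates linearly by default).
m = toy["members"].quantile(0.75)

# Strictly greater than m, so a row sitting exactly on the cutoff is excluded.
qualified = toy.loc[toy["members"] > m]

print(m)
print(qualified["name"].tolist())
```

Because the comparison is strict, only rows above the cutoff qualify, which on real data keeps roughly the top quarter of anime by member count.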
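# Keep only anime with more members than the 75th-percentile cutoff m.
qualified_anime = anime.loc[anime['members'] > m].copy()
# C is the mean rating across the whole dataset.
C = anime['rating'].mean()

def WR(x, C=C, m=m):
    v = x['members']
    R = x['rating']
    return (v / (v + m) * R) + (m / (v + m) * C)

qualified_anime['score'] = WR(qualified_anime)
# sort_values returns a new DataFrame, so assign the result back.
qualified_anime = qualified_anime.sort_values('score', ascending=False)
qualified_anime.head(15)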
This is the list of the top 15 anime based on the weighted rating calculation.
3. Genre-Based Recommendation System
For genre-based recommendation, we will use the sklearn package to help us analyse the genre text. We need to compute the similarity between the genres of different anime. The two methods we are going to use are TfidfVectorizer and CountVectorizer.
TfidfVectorizer weights each word's frequency by how often the word occurs across all documents, so words that appear in most documents count for less. CountVectorizer is simpler: it only counts how many times each word occurs.
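The difference between the two vectorizers can be sketched on a few toy genre strings. The strings below are hypothetical and not taken from the anime dataset; both vectorizers are used with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy genre strings (hypothetical). "Action" appears in every document.
genres = [
    "Action, Comedy",
    "Action, Drama",
    "Action, Romance",
]

count = CountVectorizer()
count_matrix = count.fit_transform(genres).toarray()

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(genres).toarray()

# Raw counts: both words occur once in the first document.
count_action = count_matrix[0, count.vocabulary_["action"]]
count_comedy = count_matrix[0, count.vocabulary_["comedy"]]

# TF-IDF: "action" is in every document, so it is weighted below "comedy".
tfidf_action = tfidf_matrix[0, tfidf.vocabulary_["action"]]
tfidf_comedy = tfidf_matrix[0, tfidf.vocabulary_["comedy"]]

print(count_action, count_comedy)
print(tfidf_action < tfidf_comedy)
```

In this toy case the counts treat "action" and "comedy" identically, while TF-IDF gives the rarer genre more influence on similarity.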
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(lowercase=True, stop_words='english')
anime['genre'] = anime['genre'].fillna('')
tf_idf_matrix = tf_idf.fit_transform(anime['genre'])
tf_idf_matrix.shape
We can see that there are 46 different words across the 12,294 anime.
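One thing to keep in mind is that the default tokenizer splits on punctuation and whitespace, so the word count likely includes pieces of multi-word genres rather than whole genre labels. The sketch below shows this on a hypothetical sample, together with one possible alternative that splits on commas instead. The split_genres helper is an illustrative assumption, not something the code in this article uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

genres = ["Sci-Fi, Slice of Life", "Sci-Fi, Comedy"]  # hypothetical sample

# Default tokenization, with English stop words removed as in this article.
default_vec = CountVectorizer(stop_words="english")
default_vec.fit(genres)

def split_genres(text):
    """Split a comma-separated genre string into whole genre labels."""
    return [g.strip() for g in text.split(",") if g.strip()]

# token_pattern=None because the custom tokenizer replaces the default pattern.
genre_vec = CountVectorizer(tokenizer=split_genres, token_pattern=None)
genre_vec.fit(genres)

print(sorted(default_vec.vocabulary_))  # ['comedy', 'fi', 'life', 'sci', 'slice']
print(sorted(genre_vec.vocabulary_))    # ['comedy', 'sci-fi', 'slice of life']
```

Whether whole labels or word pieces work better depends on the data; with word pieces, two different genres that happen to share a word would be treated as partially similar.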
from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
# Map each name to its row position, keeping the first row for any repeated name.
indices = pd.Series(anime.index, index=anime['name'])
indices = indices[~indices.index.duplicated()]

def recommendations(name, cosine_sim=cosine_sim):
    idx = indices[name]
    similarity_scores = list(enumerate(cosine_sim[idx]))
    # Exclude the queried anime itself; with tied scores it is not always ranked first.
    similarity_scores = [s for s in similarity_scores if s[0] != idx]
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[:20]
    anime_indices = [i[0] for i in similarity_scores]
    return anime['name'].iloc[anime_indices]

recommendations('Kimi no Na wa.')
Based on the TF-IDF calculation, these are the top 20 anime recommendations most similar to 'Kimi no Na wa.'
Next, we are going to look at another vectorizer, CountVectorizer, and compare the results obtained with cosine_similarity and linear_kernel.
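Before comparing the two functions, it helps to see when they agree. linear_kernel computes a plain dot product, while cosine_similarity first normalizes each row. The minimal sketch below, on hypothetical genre strings with default vectorizer settings, checks both cases.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

genres = ["Action, Comedy, Drama", "Action, Comedy", "Romance"]  # hypothetical sample

tfidf_matrix = TfidfVectorizer().fit_transform(genres)
count_matrix = CountVectorizer().fit_transform(genres)

# TF-IDF rows are L2-normalized by default (norm='l2'), so the dot product is the cosine.
tfidf_same = np.allclose(linear_kernel(tfidf_matrix), cosine_similarity(tfidf_matrix))

# Raw count rows are not normalized, so the dot product differs from the cosine.
count_same = np.allclose(linear_kernel(count_matrix), cosine_similarity(count_matrix))

print(tfidf_same, count_same)  # True False
```

On raw counts, the dot product tends to favour anime with more genre tags, so the two functions can rank recommendations differently for the same query.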
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime['genre'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)
cosine_sim2 = linear_kernel(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)
Summary
In this article, we looked at the anime data and built two types of recommendation systems. The simple recommendation system lets us see the top-chart anime; we did this by applying the weighted rating calculation to the ratings and the number of members. We then built a recommendation system based on each anime's genre, applying both TfidfVectorizer and CountVectorizer to see the differences in their recommendations.
I hope you enjoyed this article!
1. https://www.datacamp.com/community/tutorials/recommender-systems-python
Translated from: https://medium.com/analytics-vidhya/recommendation-system-for-anime-data-784c78952ba5