This article is for dummies like me who have never tried to find out what machine learning is, or who left it halfway for the sole reason of being overwhelmed. Follow through every line and stay with me. I promise that by the end you'll be well enough acquainted to give yourself your first project on your resume.
Basic Essentials
We'll be doing the whole project in Python. The only essential is for you to understand basic programming. Nothing else is required. I prefer using Jupyter Notebook as an IDE, as it makes visualization easier.
Introduction
In this project, we shall make things interesting by using data on the newly released Amazon Echo. Various customers who've bought the new Amazon Echo have submitted their reviews. We aim to predict whether a given review is positive or negative, through sentiment analysis. Sounds cool, right? We shall use a dataset from Kaggle developed by Manu Siddhartha.
Importing Libraries
Your first step is to import libraries. Libraries give us extra functionality and reduce the bulkiness of the code. Type this into your first cell and press shift+enter.
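Here is such a cell, using the conventional aliases that the rest of this article assumes:

import pandas as pd                # data manipulation
import numpy as np                 # array operations
import seaborn as sns              # statistical visualizations
import matplotlib.pyplot as plt    # plotting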
Pandas is a software library written for Python. We shall use it for data manipulation to make our calculations easier.
NumPy is a high-performance library which we shall use to perform operations on datasets and arrays.
Seaborn and Matplotlib are visualization libraries. They help in providing insane visualizations from which we shall derive our insights.
When we write 'as' in the code above, we mean that the library shall be referred to by the alias that follows it, for ease of typing.
Importing Dataset
We shall use a Kaggle dataset. Use the link below to download the dataset into your project folder.
After you download, type the following and press shift+enter to import the dataset.
echo = pd.read_csv('amazon_alexa.tsv', sep='\t')
echo =: We assign the dataset to a variable named echo.
sep='\t': The dataset contains values separated by tabs (a .tsv file). We use the sep argument to specify the separator.
'amazon_alexa.tsv': If the file is anywhere other than the Jupyter Notebook folder, put the path of the file followed by /amazon_alexa.tsv inside the quotes.
Type echo in the cell below and you shall see the dataset you’ve imported.
The dataset consists of the rating given, the date of the review, the model variation, the reviews that the customers have written, and the feedback. Here, 1 = positive and 0 = negative. The feedback column shall be our target.
Viewing your dataset
Technical information on your dataset is essential, as it helps us understand what we're dealing with. Use the following methods; a combined example follows the list.
echo.info(): Gives us information on the null and non-null values in the dataset. It also mentions the datatype of each column.
echo.describe(): Gives us information about each column's count, mean, and a lot of other statistical summaries.
.head() and .tail(): Display the first 5 and the last 5 rows of the dataset respectively. Use .head(n), where n is the number of rows you'd want to display. The same applies to .tail() too.
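For reference, these calls look like this in notebook cells:

echo.info()      # column datatypes and non-null counts
echo.describe()  # count, mean, and other summary statistics
echo.head()      # first 5 rows
echo.tail()      # last 5 rows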
Data Visualization
How you visualize the data in a given dataset remains unique to each of us, as the conclusions we draw and the opinions we form will differ. But for the purpose of learning, let's look at a few examples and explore the dataset. Each visualization shall have its code at the top, and I shall explain the code in detail, as that will make it easier to create your own visualizations next time.
Review count for each model variant
plt.figure(figsize=[40,14]): Helps us set the dimensions of the figure. [40,14] implies that the width shall be 40 units and the height 14 units, to scale.
sns.countplot(x='variation', data=echo, palette='dark'): countplot shows us the count of each category on the x-axis. palette='dark' is the color palette we want our visualization to use.
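Putting those two lines together, the cell for this plot reads:

plt.figure(figsize=[40,14])
sns.countplot(x='variation', data=echo, palette='dark')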
Some of the inferences we could draw are that the Black Dot variant has the highest number of reviews and the Walnut Finish the lowest. It could follow that the Black Dot was purchased the most and the Walnut Finish the least.
Rating Majority
To understand the count of the ratings people have given, we can generate a histogram. From the resulting histogram, we can deduce that the majority of people have given the product a 5-star rating, which assures us of its success. No product is perfect and, obviously, we are going to have lower ratings too.
.hist(bins=n): Generates a histogram, where bins sets how many intervals the value range is divided into. You are free to experiment with numbers of your choice for varied graphs.
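A minimal sketch, assuming the ratings live in the dataset's rating column (the choice of 5 bins, one per star, is mine):

echo['rating'].hist(bins=5)  # one bin per star rating: an assumption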
Positive & Negative Reviews
Now that we've had a basic overview, let's get down to analyzing positive and negative reviews. To collect the rows containing positive reviews and those containing negative reviews into separate tables, do the following.
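A sketch of that filtering, using the feedback column described earlier (1 = positive, 0 = negative):

positive = echo[echo['feedback'] == 1]  # rows with positive feedback
negative = echo[echo['feedback'] == 0]  # rows with negative feedback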
This shall store the positive and negative reviews in their respective variables. To display the tables, just type the name of the variable into a new cell. The following shall be displayed.
Negative feedback count w.r.t. variation
The following code shall give us count-plots of both positive and negative feedback with respect to the different variants.
plt.figure(figsize=[40,14])
sns.countplot(x='variation', data=positive, palette='Blues')
plt.figure(figsize=[40,14])
sns.countplot(x='variation', data=negative, palette='Greens')
Feature Engineering
Feature engineering plays a vital role in developing the right model; it is how we improve the performance of machine learning algorithms. Amongst the numerous steps of feature engineering, we shall concentrate on a few major ones which shall help us in developing our model.
Evicting unnecessary data
Our first step is to drop those columns which we think are unnecessary for deriving whether a review is positive or negative. You are free to experiment with the columns you like, but here's what I've deduced. So let us have a look at all the columns we've got.
Rating: Important to know how well the customer likes the product. The review and the rating are largely directly proportional to each other.
Variation: The model variant shall play an important role, because there might be opinions about the color too.
Date: Not necessary. It has nothing to do with the emotion of the review.
Verified Reviews: Obviously Needed!
Feedback: Bah! Our target column. We need it.
So to eliminate the date column, use the following code. This shall remove the column 'date' from the dataset.
echo.drop('date', axis=1, inplace=True)
inplace=True: inplace is generally used in two ways. When we pass True, the column is removed from the dataset permanently; when we pass False, drop returns a copy of the dataset without the column, leaving the original untouched, so the next time you view your dataset, the column re-appears.
axis=1: ensures a column, not a row, is removed.
Dummy your data
Imagine a huge table where one of the columns has only the values A, B, and C. In simple terms, a machine cannot understand these A, B, and C values when we vectorize. So the cleverest way is to create 3 new additional columns, A, B, and C, and if a specific row contains that value, we mark it with a binary 1 or 0. This way we've depicted the value's presence in the row. Neat?
The same shall be applied to our dataset. We shall dummy our 'variation' column through the same procedure. It's a simple 3-step process.
var = pd.get_dummies(echo['variation'], drop_first=True)
This creates a dataset with columns named after the variation values, showing each value's presence through binaries. (drop_first=True drops the first category's column, since its presence is implied whenever all the other columns are 0.)
The next obvious step shall be to merge this dataset into our main dataset. To do this, let us type the following. And after we merge the datasets, there is no point in keeping the original column, so we shall drop it.
echo = pd.concat([echo, var], axis=1)
echo.drop('variation', axis=1, inplace=True)
Last Steps
We're nearly there. All it takes is a few careful steps and some understanding to get through. Now, as I've mentioned earlier, the machine does not understand text, and the only column with text now is the reviews column. Here's how we shall break it down. For example, suppose we have the first row as
row 1: The doughnut is amazing: 1
Our idea here is to generate a column for every word in the text and record each word's count. Very similar to dummying, but different. So after applying the same to several rows, the column 'THE' might have various 1s and 0s. The logic behind this is that the machine learns which words are used, and how frequently, in positive versus negative reviews. Not clear? The words used and their frequencies are what the machine analyzes. This can be achieved through a few steps. Follow through.
In the first line, we import CountVectorizer from the scikit-learn (sklearn) library.
We create an object, cv.
On the column verified_reviews in the echo dataset, we apply cv.fit_transform to create word counts for each review, as shown in the cell below.
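Put together, those three steps look like this (storing the result in alexa, the variable used below):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()                                # the vectorizer object
alexa = cv.fit_transform(echo['verified_reviews'])    # word counts per review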
This stores all the transformed data as rows of counts. Our next step is to convert it into an array, turn that array into a data frame, and then merge it into the main dataset. The final step is to drop the column which initially contained the reviews. Execute the following code and the results should look something similar to this.
reviews = pd.DataFrame(alexa.toarray()): This piece of code sees to it that a new data frame is created with the counts of all individual words.
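A sketch of the remaining steps, assuming the transformed matrix is stored in alexa as above:

reviews = pd.DataFrame(alexa.toarray())              # one column per word
echo = pd.concat([echo, reviews], axis=1)            # merge the counts into the main dataset
echo.drop('verified_reviews', axis=1, inplace=True)  # the raw text is no longer needed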
Training your Data
The first and most important step in training your data is to divide it into input and output. We shall take variable X as our input and variable Y as our output.
X shall contain all the columns except the feedback column, because that's what we have to predict.
Y shall contain the feedback column, as it is our target (see the sketch below).
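A sketch of that split:

X = echo.drop('feedback', axis=1)  # every column except the target
Y = echo['feedback']               # the target column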
Our next step is to split the data into training and testing sets. To check the accuracy of our predictions, we shall split our own dataset, because the reviews all come from the same origin. You could also experiment by creating your own testing dataset, but splitting makes things work faster. Type the following code.
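The split can be done with scikit-learn's train_test_split; a sketch, using the 20% test size mentioned below:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)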
Now our data has been split into input (training and testing) and output (training and testing). test_size=0.2 implies that we use 20% of the dataset as our testing dataset.
The final step is to apply an algorithm that shall learn from all the input data. We shall use a Random Forest Classifier to do this. Type the following code into your cells. The output should look something similar to this.
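A sketch of that cell (the n_estimators value of 100 is my assumption; the original does not state it):

from sklearn.ensemble import RandomForestClassifier

randomforest_classifier = RandomForestClassifier(n_estimators=100)  # 100 trees: an assumption
randomforest_classifier.fit(X_train, y_train)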
This step shall train your model completely; if you've got something similar to this, your model is ready for testing. Understanding the Random Forest Classifier algorithm is beyond the scope of this article, but the easiest way to think of it is as follows.
Given a set of inputs, the algorithm creates n_estimators decision trees, each trained on a different random combination of the data, and combines their individual predictions into one.
Now that you've completed the penultimate step, let's proceed to the final step.
Testing your Data
Remember how we split our data into training and testing? It's time to use our testing data. Type in the following code.
y_predict = randomforest_classifier.predict(X_test)
This assigns all the predicted values to the variable y_predict. Now our only remaining step is to visualize and check how well our model has performed. For this, we shall compare the predicted values against the original test values. We shall import a classification report and a confusion matrix for this; these shall be explained in forthcoming articles. For now, let's stick to reading them. Type the following code.
from sklearn.metrics import confusion_matrix, classification_report
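The numbers interpreted below come from displaying the matrix; a sketch, assuming a seaborn heatmap is used (its annotation format produces values like 5.8e+02), with sns and confusion_matrix imported above:

cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)  # annotated count for each true/predicted combination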
38: The number of negative reviews correctly predicted as negative (true negatives).
5.8e+02 (about 580): The number of positive reviews correctly predicted as positive (true positives).
9: The number of negative reviews wrongly predicted as positive (Type 1 error).
1: The number of positive reviews wrongly predicted as negative (Type 2 error).
To finally get the accuracy and the precision, type the following code:
print(classification_report(y_test, y_predict))
And there you go! Here’s to your first Machine Learning project.
Nikhilesh Garnepudy
Source: https://medium.com/analytics-vidhya/your-first-simple-machine-learning-project-f1d427c61760