Data Science in the Real World
Students are often worried and uncertain about their chances of admission to graduate school. This blog aims to help students shortlist universities that match their profiles using an ML model. The predicted output gives them a fair idea of their admission chances at a particular university.
Technical Review: Pooja Gramopadhye / ABCOM Team
Copy Editor: Anushka Devasthale
Level: Intermediate
Banner Image Source: Internet
Disclaimer: The purpose of this tutorial is to demonstrate the use of linear regression model on a multi-feature dataset and should not be used as is for predicting admissions.
Are you applying for a Master's degree program and wondering about your chances of admission to your dream university? What GRE score, TOEFL score, or CGPA do you need to get admitted to a university of your choice? Learn to apply Linear Regression to develop an ML model that answers these questions.
Many students who aspire to pursue a Master's degree at a good university turn to well-known coaching institutes and let them take care of everything: preparing for exams, drafting the SOP (Statement of Purpose) and LOR (Letter of Recommendation), training for visa interviews, and searching for the right universities. A few of you may prefer to do all of this on your own. In that case, searching for the right university is a daunting task. We look for universities that fit our profile on so-called "university hunt" websites, which hold data about universities around the world. These websites have a section known as "University Predictor," which is usually a paid section where you fill in your information to get a prediction. I present how to build your own University Admit Predictor, which estimates your chances of getting admitted to the desired university. You can also use this model before taking the exams to learn what score you need for admission to your dream university, and set your study targets accordingly.
By the end of this tutorial, you will be able to build and train a linear regression model to predict the chance of admission to a particular university.
Creating Project
Create a new Google Colab project and rename it to Admit Prediction. If you are new to Colab, then check out this short tutorial.
Import the following libraries in your Colab project:
# import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
We will use pandas for data handling, pyplot from matplotlib for charting, and sklearn for preparing the datasets and for its predefined machine learning models.
The dataset is taken from a Kaggle competition. Use the read_csv function of pandas to read the data file into your Colab project environment.
# loading the data from csv file saved at the url
data = pd.read_csv("https://raw.githubusercontent.com/abcom-mltutorials/Admit-Prediction/master/Admission_Predict_Ver1.1.csv")
Examine the data by printing the first few records:
data.head()
This command gives the following output:
As you can see, each row contains a student's GRE, TOEFL, SOP, LOR, and CGPA scores and Research activity, along with the university rating. The last column, Chance of Admit, indicates the chance (a probability value) of admission to a school of the given rating. You can check how many records the dataset contains by reading the shape attribute of the data frame:
# checking the number of rows and columns
data.shape
This command gives the output: (500, 9)
Thus, we have records for 500 students. We will now pre-process the data and make it ready for model training.
Data Pre-processing
We need to ensure that the data does not contain any null values. We do this by calling the isna method on the data frame and then taking the sum of the values in each column:
# checking null items
print(data.isna().sum())
This gives the following output:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
As all sums are zero, none of the columns has null values. From the above list of columns, you can easily see that Serial No. is of no significance for model training. We will drop this column from the dataset:
data = data.drop(["Serial No."], axis = 1)
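To see what the null check reports when a value really is missing, here is a minimal sketch on a toy data frame. The values are made up for illustration, and dropna is just one possible way to handle missing rows; the tutorial's dataset did not need it.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing CGPA entry
toy = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [320, 310, 330],
    "CGPA": [8.5, np.nan, 9.1],
})

# Count missing values per column; a non-zero count flags a problem column
null_counts = toy.isna().sum()
print(null_counts)

# One possible clean-up: drop incomplete rows, then drop the identifier column
clean = toy.dropna().drop(["Serial No."], axis=1)
print(clean.shape)
```

A non-zero count in any column tells you that column needs attention before training.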
Next, we will prepare our data for building the model.
Preparing Data
We will first extract the features and the target into two arrays, X and y. You create the X features array by slicing the first seven columns of the data frame with iloc:
X = data.iloc[:,:7]
You can get information on the extracted data by calling the info method on X. This is the output:
As you can see, it contains all our desired features. Now extract the target data using the following slicing:
y = data.iloc[:,7:]
Print the information on y:
y.info()
It shows the following output:
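The slicing above can be tried on a small frame first. This is a sketch with made-up values and only three columns; it also shows that a slice such as 2: keeps the target as a one-column DataFrame, while a plain integer index would return a Series.

```python
import pandas as pd

# Toy frame with hypothetical values: two features and one target column
toy = pd.DataFrame({
    "GRE Score": [320, 310],
    "CGPA": [8.5, 7.9],
    "Chance of Admit": [0.80, 0.65],
})

X_toy = toy.iloc[:, :2]      # all rows, first two columns (features)
y_frame = toy.iloc[:, 2:]    # column slice -> DataFrame with one column
y_series = toy.iloc[:, 2]    # integer position -> Series

print(X_toy.shape, y_frame.shape, y_series.shape)
```

The difference matters later: a model trained on a one-column DataFrame target returns two-dimensional predictions.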
We will now split the data into training and testing datasets by calling the train_test_split function of sklearn.
X_train,X_test,Y_train,Y_test = train_test_split(X, y,
                                                 random_state = 10,
                                                 shuffle = True,
                                                 test_size = 0.2)
I have split the dataset in the ratio 80:20. We use the X_train and Y_train arrays for training and the X_test and Y_test arrays for testing. The data is shuffled before splitting so that both sets are random samples. The random_state parameter sets the seed for shuffling, which ensures reproducible outputs on multiple runs.
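The effect of random_state can be checked directly. The following is a small sketch on synthetic arrays (not the admissions data): two calls with the same seed return identical splits, and a test_size of 0.2 on ten rows leaves eight for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 feature columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=10)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=10)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# True when every returned array matches between the two calls
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```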
We will now visualize the training data to decide which model to use.
Visualizing Data
We will create charts for each of our features versus the Chance of Admit. This gives us an idea of how the admission probability depends on each feature value. For example, how does the GRE score affect the admission probability? We can answer such questions by doing some charting. We first plot the GRE Score feature against the admit probability. We use matplotlib for plotting. The following code produces the desired plot.
# Visualize the effect of GRE Score on chance of getting an admit
plt.scatter(X_train["GRE Score"],Y_train, color = "red")
plt.xlabel("GRE Score")
plt.ylabel("Chance of Admission")
plt.legend(["GRE Score"])
plt.show()
The output is shown below:
You can see that a higher GRE score increases the chances of admission, and the relationship between the two is almost linear.
Now, try plotting a similar graph to see the relation between Chance of Admission and CGPA. You should get the following graph after successfully running the code:
As in the first graph, a higher CGPA goes with a higher chance of admission, and the relationship is once again roughly linear.
Likewise, try the other features, and you will see a roughly linear relationship between each of them and the admission probability.
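If you want a number to go with the visual impression, the Pearson correlation coefficient is one common choice. The sketch below uses synthetic GRE-like data generated for illustration (the slope, offset, and noise level are assumptions, not values from the dataset); on the real data you could call corr on the data frame in the same way.

```python
import numpy as np
import pandas as pd

# Synthetic data: a linear trend plus a little noise (all numbers are assumptions)
rng = np.random.default_rng(0)
gre = rng.integers(290, 341, size=200)
chance = 0.01 * (gre - 290) + 0.3 + rng.normal(0, 0.03, size=200)

demo = pd.DataFrame({"GRE Score": gre, "Chance of Admit": chance})

# Pairwise Pearson correlations; values near 1 suggest a strong linear relation
corr = demo.corr()
r = corr.loc["GRE Score", "Chance of Admit"]
print(round(r, 2))
```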
Lastly, let us plot the university rating versus the chance of admission.
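# Visualize the effect of CGPA on chance of getting an admit (the exercise above).
plt.scatter(X_train["CGPA"],Y_train, color = "green")
plt.xlabel("CGPA")
plt.ylabel("Chance of Admission")
plt.legend(["CGPA"])
plt.show()
# Visualize the effect of University Rating on chance of getting an admit.
plt.scatter(X_train["University Rating"],Y_train, color = "blue")
plt.xlabel("University Rating")
plt.ylabel("Chance of Admission")
plt.legend(["University Rating"])
plt.show()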
In the University Rating chart, the points are concentrated in five vertical bars, one per rating. For university ratings of 2, 3, and 4, the number of records is the highest, as shown by the density of dots in those three bars. There are few records for universities with rating 1. Similarly, the schools with rating 5 have a low intake, probably due to their strict selection criteria.
We will now build our model.
Model Building/Training
From the data visualization, we conclude that the relationship between the features and the chances of admission is linear. So, we can try a linear regression model for fitting this dataset.
Our model for this project is a pre-defined estimator from the sklearn library, which is open source and contains a large collection of pre-tested machine learning models. We will use LinearRegression from this collection. The variable below is named classifier, although LinearRegression is strictly a regression model rather than a classifier.
classifier = LinearRegression()
We call the fit method on the classifier to train it. Note the two parameters to the fit method: the training features and the training targets.
classifier.fit(X_train,Y_train)
The trained model is now ready for testing.
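It may help to see what fit actually learns. A linear regression model with several features predicts y = b0 + b1*x1 + b2*x2 + ..., and sklearn exposes b0 as intercept_ and the remaining weights as coef_. The sketch below uses noise-free synthetic data with weights chosen for illustration, so the fitted values can be compared with the ones used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from known weights (assumed for illustration)
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 10, size=(100, 2))
y_syn = 0.5 + 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1]

model = LinearRegression()
model.fit(X_syn, y_syn)

# The fitted intercept and weights should recover 0.5 and [2.0, -1.0]
print(model.intercept_, model.coef_)
```

On the admissions model, the same two attributes on classifier show how much each feature contributes to the predicted chance.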
Testing
To test the model, we use the test data generated in the earlier stage. We call the predict method on the created object and pass the X_test array of the test data, as shown in the following command:
prediction_of_Y = classifier.predict(X_test)
This generates an array with one prediction for each row of the X_test array. Round the predictions to three decimals and examine the first six entries by using the following commands:
prediction_of_Y = np.round(prediction_of_Y, decimals = 3)
prediction_of_Y[:6]
The output is:
If you want to compare the predicted values to the actual values, add the predictions to Y_test as a new column and print its contents on screen:
Y_test["Predicted chance of Admit"] = prediction_of_Y.tolist()
print(Y_test)
The output is as follows:
As you can see, the actual and predicted values almost match. We will now drop the added column for further visualizations.
Y_test = Y_test.drop(["Predicted chance of Admit"], axis = 1)
But just comparing values by eye is not enough to be sure about the accuracy. We need to verify the accuracy of the prediction.
Visualizing the Predictions
Before verifying the accuracy of the model, we will visualize and compare the actual chance of admission with the predicted chance of admission. This is important because most of the time we see a Linear Regression model predicting the result from only one parameter, and its plot is a single line that best fits the data points. In this tutorial, however, we are using multiple parameters, and the graph is more complex. So, I show each parameter's impact on the prediction individually and explain the graphs to make it more evident.
An important thing to note before we plot anything: each graph contains two plots. The first plots a particular parameter against the actual value of Chance of Admit from the testing dataset. The data points of this plot are either red or blue. The second plots the same parameter against the predicted value of Chance of Admit. The data points of this plot are either purple or orange.
Let’s plot the first set of graphs for the parameter GRE Score. Use the following code to plot the graphs:
# Visualize the difference in graph for same parameter "GRE Score" for actual chance & prediction chance.
plt.scatter(X_test["GRE Score"],Y_test, color = "red")
plt.scatter(X_test["GRE Score"], prediction_of_Y, color='purple')
plt.xlabel("GRE Score")
plt.ylabel("Chance of Admission")
plt.legend(["Actual chance for GRE Score","Predicted chance for GRE Score"])
plt.show()
Notice that the code contains two calls to the scatter function, one for each of the two variables.
The output is as follows:
Remember that we are plotting the graph from the testing dataset, which contains fewer values than the training dataset. Hence the density of data points is lower than in the visualizations of the training dataset. The plot shows how the same GRE Score values map to predicted chances that differ from the actual chances.
The model's outliers are the red dots at the bottom of the graph, because they have no corresponding purple dots around them. How did I infer this from the graph? Assuming an error margin of 5%, a red dot represents a correctly predicted data point if and only if a purple dot, which represents its predicted value, lies very near to it. So, the isolated red dots are outliers for the model, and the isolated purple dots are poorly predicted values.
This is how you visualize the results when you are building a Linear Regression model with multiple parameters. The above logic applies to most of the parameters in the model.
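The "dot has a neighbour" reading of the graph can also be done numerically. The sketch below is one possible interpretation, assuming that the 5% error margin means an absolute difference of at most 0.05 between the actual and predicted chance; the four value pairs are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted chances for four students
actual = np.array([0.80, 0.65, 0.90, 0.50])
predicted = np.array([0.78, 0.72, 0.91, 0.40])

# Assumed meaning of the 5% margin: absolute difference of at most 0.05
margin = 0.05
within = np.abs(actual - predicted) <= margin

print(within)          # which predictions fall inside the margin
print(within.mean())   # share of predictions inside the margin
```

With the tutorial's variables, the same comparison could be applied to the Y_test values and prediction_of_Y after flattening both to one dimension.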
Let’s plot another set of graphs for the parameter SOP. Use the following code to plot the graphs:
plt.scatter(X_test["SOP"],Y_test, color = "blue")
plt.scatter(X_test["SOP"], prediction_of_Y, color='orange')
plt.xlabel("SOP")
plt.ylabel("Chance of Admission")
plt.legend(["Actual chance for SOP","Predicted chance for SOP"])
plt.show()
The output is as follows:
Let me explain how to interpret the graph and relate it to real-world scenarios.
Consider SOP with rating 1.5: the actual chance of admission (blue dots) is near 60%, and the predicted chance (orange dots) is near 50%.
Consider SOP with rating 2.5: the actual chance of admission is lower than the predicted chance.
This continues for higher SOP values as well. Hence the model shows a lower chance of admission than the actual one for low values of SOP and a higher chance than the actual one for high values of SOP, which is plausible because SOP is a pivotal factor in getting admission.
Note that these observations are based on the graphs I produced with the parameter values given in the tutorial. If you change the values of the shuffle and random_state parameters, all the graphs will change too. You may discover new facts by studying your newly produced graphs, and I encourage you to experiment with the code.
Now, we will verify the accuracy of our prediction.
Verifying Accuracy
To test the accuracy of the model, use the score method on the classifier, as shown below. For a LinearRegression model, score returns the coefficient of determination (R²) of the predictions:
print('Accuracy: {:.2f}'.format(classifier.score(X_test, Y_test)))
The output is: Accuracy: 0.80
The score of 0.80 means that the model explains about 80% of the variance in the test data, which is considered good for this tutorial, so we do no further tuning. You can now try the model with real values to estimate the chance of getting admission to the desired university. In other words, now that we know the model is reasonably accurate, we should try inference on arbitrary values or, to be more precise, on real-world values specified by the user.
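To make the score less of a black box, the sketch below recomputes R² by hand as 1 - SS_res / SS_tot and compares it with the score method and with r2_score from sklearn.metrics. It also prints the mean absolute error, which some readers find easier to interpret. The data is synthetic, with weights and noise level chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic regression data (weights and noise are assumptions for illustration)
rng = np.random.default_rng(2)
X_syn = rng.uniform(0, 1, size=(200, 3))
y_syn = X_syn @ np.array([0.3, 0.2, 0.1]) + rng.normal(0, 0.05, size=200)

# Simple hold-out: first 160 rows for fitting, last 40 for evaluation
X_fit, X_eval = X_syn[:160], X_syn[160:]
y_fit, y_eval = y_syn[:160], y_syn[160:]

model = LinearRegression().fit(X_fit, y_fit)
pred = model.predict(X_eval)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_eval - pred) ** 2)
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_eval, y_eval), r2_score(y_eval, pred))
print(mean_absolute_error(y_eval, pred))
```

All three R² values agree, which confirms that score on a regression model is not a classification accuracy.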
Inference on Unseen Data
Let's assume that I have a GRE score of 332, a TOEFL score of 107, an SOP and LOR of 4.5 and 4.0 respectively, and a CGPA of 9.34, but I have not done any research. Let's see what my chances of getting an admit from a 5.0 rated university are. Use the following code to add these parameter values to the testing dataset:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to add the new row
my_data = pd.concat([X_test, pd.DataFrame([[332, 107, 5, 4.5, 4.0, 9.34, 0]], columns = X_test.columns)], ignore_index = True)
Check the added row by printing its value:
print(my_data[-1:])
Remember that the testing dataset already has some values present in it, and our data is added as the last row. The following image shows the output of the above code:
Now use the following code to get the chance of admission for the given data:
my_chance = classifier.predict(my_data)
my_chance[-1]
The output is as follows: array([0.8595167])
According to our model’s inference, I have an 85.95% chance of getting the admission.
Similarly, you can check admission chances for more than one record. Use the following code to add the parameter values for several records to the testing dataset:
list_of_records = [pd.Series([309, 90, 4, 4, 3.5, 7.14, 0], index = X_test.columns),
                   pd.Series([300, 99, 3, 3.5, 3.5, 8.09, 0], index = X_test.columns),
                   pd.Series([304, 108, 4, 4, 3.5, 7.91, 0], index = X_test.columns),
                   pd.Series([295, 113, 5, 4.5, 4, 8.76, 1], index = X_test.columns)]
# DataFrame.append was removed in pandas 2.0; build a frame from the series and concatenate
user_defined = pd.concat([X_test, pd.DataFrame(list_of_records)], ignore_index = True)
print(user_defined[-4:])
We use the Series data structure of pandas and add all the series to a copy of our testing dataset. The code to see the records and predictions is included in the above code. The following image displays the output of the above code:
Note that the first record is at index 50, and in the previous example with the single record, the index was also 50. This is because adding rows in this way does not modify X_test: the operation returns a new data frame that contains a copy of the original rows plus the new ones, leaving the original data frame intact.
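The copy behaviour is easy to confirm on a tiny frame. The sketch below uses made-up values and pd.concat; the original frame keeps its length, and the new row always lands right after the last original row, which is why both examples above start at the same index.

```python
import pandas as pd

# Small stand-in for X_test (hypothetical values)
base = pd.DataFrame({"GRE Score": [320, 310], "CGPA": [8.5, 7.9]})
new_row = pd.DataFrame([[332, 9.34]], columns=base.columns)

# concat returns a new frame; base itself is not changed
combined = pd.concat([base, new_row], ignore_index=True)

print(len(base), len(combined))
print(combined.index[-1])   # position of the added row
```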
From the above results, I infer that CGPA and Research are more important factors than GRE score for getting an admit. Try experimenting with the record values and check the impact on the chance of admission. Maybe you will arrive at a different conclusion of your own, or perhaps you will prove me wrong.
Finally, if you just want to do the inference on a single record without adding it to the test dataset, you would use the following code:
#Checking chances of single record without appending to previous record
single_record_values = {"GRE Score" : [327], "TOEFL Score" : [95], "University Rating" : [4.0], "SOP": [3.5], "LOR" : [4.0], "CGPA": [7.96], "Research": [1]}
single_rec_df = pd.DataFrame(single_record_values, columns = ["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA", "Research"])
print(single_rec_df)
single_chance = classifier.predict(single_rec_df)
single_chance
This is the output:
Add more values to the list of each parameter in the dictionary to get the chances for multiple records without appending them to X_test.
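As a rough illustration of the multi-record case, the sketch below trains a toy model on two features with made-up values (toy_model and the numbers are assumptions, not the tutorial's trained classifier) and then predicts for three records supplied through a dictionary of lists.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data with hypothetical values and only two features
train = pd.DataFrame({"GRE Score": [300, 310, 320, 330],
                      "CGPA": [7.0, 8.0, 8.5, 9.5]})
target = [0.50, 0.65, 0.75, 0.90]
toy_model = LinearRegression().fit(train, target)

# Each list holds one value per record; three records here
records = pd.DataFrame({"GRE Score": [305, 325, 335],
                        "CGPA": [7.5, 9.0, 9.6]})

# One predicted chance per record
chances = toy_model.predict(records)
print(chances.shape)
```

The column names and their order must match the ones used during training, which is why the tutorial passes an explicit columns list.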
Summary
In this tutorial, you learned how to develop a linear regression model to create an admission predictor. The first step was selecting an appropriate dataset with all the data needed to build the model. The second step was cleansing the data: eliminating unwanted fields and selecting the appropriate fields for model development. After this, you used the train_test_split function to split the data into training and testing sets. To build the model, you used the linear regression model provided in the sklearn library. You trained it on 80% of the data and used the rest for testing. You saw how to visualize the training data by using graphs with the matplotlib library, how to visualize the results of a Linear Regression model with multiple parameters, and how to test the accuracy of the model, which turned out to be good. Finally, you saw how to enter user-defined records and predict the chance of admission. This is a very simple model, and the same problem can be solved using many different algorithms, each of which has its pros and cons. Try using some other algorithm for solving this problem.
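As a starting point for that experiment, here is a hedged sketch that fits two sklearn estimators on the same synthetic data and compares their R² scores. RandomForestRegressor is only one possible alternative, picked for illustration, and the data, weights, and settings are assumptions; on the admissions dataset you would reuse X_train, X_test, Y_train, and Y_test instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a linear signal plus noise (assumed for illustration)
rng = np.random.default_rng(3)
X_syn = rng.uniform(0, 1, size=(300, 3))
y_syn = X_syn @ np.array([0.3, 0.2, 0.1]) + rng.normal(0, 0.03, size=300)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=10)

# Fit each candidate on the same split and record its R^2 on the held-out part
scores = {}
for name, estimator in [("linear", LinearRegression()),
                        ("forest", RandomForestRegressor(n_estimators=50, random_state=10))]:
    estimator.fit(X_a, y_a)
    scores[name] = estimator.score(X_b, y_b)

print(scores)
```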
Source: Download the project source from our Repository.
Originally published at http://education.abcom.com on August 17, 2020.
Translated from: https://medium.com/swlh/admit-predictor-97f29d4f0373