Suppose you are exploring a dataset and you want to examine if two categorical variables are dependent on each other.
The motivation could be a better understanding of the relationship between an outcome variable and a predictor, identification of dependent predictors, etc.
In this case, a Chi-square test can be an effective statistical tool.
In this post, I will discuss how to do this test in Python (both from scratch and using SciPy) with examples on a popular HR analytics dataset — the IBM Employee Attrition & Performance dataset.
Table of Curiosities
What is the Chi-square test?
What are the categorical variables that we want to examine?
How to perform this test from scratch?
Is there a shortcut to do this?
What else can we do?
What are the limitations?
Overview
A Chi-square test is a statistical hypothesis test used when the test statistic follows a Chi-square distribution under the null hypothesis. In particular, the Chi-square test for independence is often used to examine whether two categorical variables are independent [1].
The key assumptions associated with this test are: 1. the data are a random sample from the population; 2. each subject falls into only one category of each variable.
To better illustrate this test, I have chosen the IBM HR dataset from Kaggle (link), which includes a sample of employee HR information regarding attrition, work satisfaction, performance, etc. People often use it to uncover insights about the relationship between employee attrition and other factors.
Note that this is a fictional data set created by IBM data scientists [2].
To see the full Python code, check out my Kaggle kernel.
Without further ado, let’s get to the details!
Exploration
Let’s first check out the number of employees and the number of attributes:
data.shape
--------------------------------------------------------------------
(1470, 35)
There are 1470 employees and 35 attributes.
Next, we can check what these attributes are and see whether any of them has missing values:
data.isna().any()
--------------------------------------------------------------------
Age False
Attrition False
BusinessTravel False
DailyRate False
Department False
DistanceFromHome False
Education False
EducationField False
EmployeeCount False
EmployeeNumber False
EnvironmentSatisfaction False
Gender False
HourlyRate False
JobInvolvement False
JobLevel False
JobRole False
JobSatisfaction False
MaritalStatus False
MonthlyIncome False
MonthlyRate False
NumCompaniesWorked False
Over18 False
OverTime False
PercentSalaryHike False
PerformanceRating False
RelationshipSatisfaction False
StandardHours False
StockOptionLevel False
TotalWorkingYears False
TrainingTimesLastYear False
WorkLifeBalance False
YearsAtCompany False
YearsInCurrentRole False
YearsSinceLastPromotion False
YearsWithCurrManager False
dtype: bool
Identify Categorical Variables
Suppose we want to examine if there is a relationship between ‘Attrition’ and ‘JobSatisfaction’.
Counts for the two categories of ‘Attrition’:
data['Attrition'].value_counts()
--------------------------------------------------------------------
No 1233
Yes 237
Name: Attrition, dtype: int64
Counts for the four categories of ‘JobSatisfaction’ ordered by frequency:
data['JobSatisfaction'].value_counts()
--------------------------------------------------------------------
4 459
3 442
1 289
2 280
Name: JobSatisfaction, dtype: int64
Note that for ‘JobSatisfaction’, 1 is ‘Low’, 2 is ‘Medium’, 3 is ‘High’, and 4 is ‘Very High’.
Null Hypothesis and Alternate Hypothesis
For our Chi-square test for independence here, the null hypothesis is that ‘Attrition’ and ‘JobSatisfaction’ are independent, i.e. there is no relationship between them.
The alternative hypothesis is that there is a relationship between ‘Attrition’ and ‘JobSatisfaction’.
Contingency Table
In order to compute the Chi-square test statistic, we would need to construct a contingency table.
We can do that using the ‘crosstab’ function from pandas:
ct = pd.crosstab(data.Attrition, data.JobSatisfaction, margins=True)
ct
The numbers in this table represent frequencies. For example, the ‘46’ shown under both ‘2’ in ‘JobSatisfaction’ and ‘Yes’ in ‘Attrition’ means that out of the 1470 employees, 46 of them rated their job satisfaction as ‘Medium’ and they did leave the company.
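To make the mechanics of `crosstab` concrete, here is a minimal sketch on a tiny made-up frame. The six rows below are hypothetical and are not taken from the IBM dataset; the names `toy` and `toy_ct` are mine. With `margins=True`, pandas appends an ‘All’ row and column that hold the totals.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not from the IBM dataset)
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "JobSatisfaction": [1, 1, 2, 2, 2, 3],
})

# Rows: Attrition, columns: JobSatisfaction, plus 'All' margins with totals
toy_ct = pd.crosstab(toy.Attrition, toy.JobSatisfaction, margins=True)
print(toy_ct)
```

Each inner cell counts the rows that share that pair of categories, which is exactly what the Chi-square test treats as the ‘observed’ frequencies.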
Chi-square Statistic
The formula for calculating the Chi-square statistic (X²) is shown as follows:
X² = sum of [(observed-expected)² / expected]
The term ‘observed’ refers to the numbers we have seen in the contingency table, and the term ‘expected’ refers to the expected numbers when the null hypothesis is true.
Under the null hypothesis, ‘Attrition’ and ‘JobSatisfaction’ are independent, which means the percentage of attrition should be consistent across the four categories of job satisfaction. As an example, the expected frequency for ‘4’ in ‘JobSatisfaction’ and ‘Yes’ in ‘Attrition’ should be the number of employees who rated their job satisfaction as ‘Very High’ * (total attrition/total employee count), which is 459*237/1470, or about 74.
Let’s compute all the expected numbers and store them in a list called ‘exp’:
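# Row totals for 'No' and 'Yes' (the 'All' column of the first two rows)
row_sum = ct.iloc[0:2,4].values
exp = []
for j in range(2):
    # Column totals for the four satisfaction levels (the 'All' row)
    for val in ct.iloc[2,0:4].values:
        exp.append(val * row_sum[j] / ct.loc['All', 'All'])
print(exp)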
--------------------------------------------------------------------
[242.4061224489796,
234.85714285714286,
370.7387755102041,
384.99795918367346,
46.593877551020405,
45.142857142857146,
71.26122448979592,
74.00204081632653]
Note that the last term (about 74) matches the value we worked out by hand above, which is a quick check that the calculation is correct.
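The same expected counts can also be obtained without a loop. The sketch below is an alternative, not the approach used in the rest of this post: it takes only the row and column totals reported by `value_counts()` earlier and uses `np.outer` to form every (row total * column total / grand total) product at once. The variable names are my own.

```python
import numpy as np

# Marginal totals read off the value_counts() output above
row_totals = np.array([1233, 237])           # Attrition: No, Yes
col_totals = np.array([289, 280, 442, 459])  # JobSatisfaction: 1, 2, 3, 4
n = row_totals.sum()                         # 1470 employees

# Expected count for each cell = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
```

The result is a 2 x 4 array whose rows correspond to ‘No’ and ‘Yes’, in the same order as the `exp` list.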
Now we can compute X²:
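obs = ct.iloc[0:2, 0:4].values.flatten()  # observed counts, in the same order as exp
chi_sq_stats = ((obs - np.array(exp))**2 / np.array(exp)).sum()
print(chi_sq_stats)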
--------------------------------------------------------------------
17.505077010348
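The steps above can be wrapped into one reusable function. This is a minimal pure-Python sketch rather than the code from the original kernel: the name `chi_square_statistic` is my own, and the input is assumed to be a list of rows of counts without the ‘All’ margins. As sample input it uses the R&D ‘Attrition’ by ‘WorkLifeBalance’ counts printed in the department breakdown later in this post.

```python
def chi_square_statistic(observed):
    """Return X² for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs_count in enumerate(row):
            # Expected count under independence
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs_count - expected) ** 2 / expected
    return statistic

# R&D table from the department breakdown (rows: No, Yes; columns: 1-4)
rd_table = [[41, 203, 507, 77],
            [19, 32, 68, 14]]
print(chi_square_statistic(rd_table))
```

A table whose rows are exactly proportional, such as `[[10, 20], [20, 40]]`, gives a statistic of zero, because every observed count then equals its expected count.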
Degree of Freedom
One parameter we need apart from X² is the degrees of freedom, which is computed as (number of categories in the first variable - 1) * (number of categories in the second variable - 1). In this case it is (2-1)*(4-1), or 3.
dof = (len(row_sum)-1)*(len(ct.iloc[2,0:4].values)-1)
dof
--------------------------------------------------------------------
3
Interpretation
With both X² and degrees of freedom, we can use a Chi-square table/calculator to determine its corresponding p-value and conclude if there is a significant relationship given a specified significance level of alpha.
In other words, under the null hypothesis the ‘observed’ counts should be close to the ‘expected’ counts, which means X² should be reasonably small for the given degrees of freedom. When X² exceeds the critical value for the chosen significance level, the p-value (the probability of seeing an X² at least this large if the null hypothesis is true) falls below alpha, and we reject the null hypothesis.
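The threshold idea can be checked directly with SciPy. The sketch below assumes a significance level of 0.05 and plugs in the X² and degrees of freedom computed above; `stats.chi2.ppf` gives the critical value and `stats.chi2.sf` (the survival function, equal to 1 - cdf) gives the p-value.

```python
from scipy import stats

alpha = 0.05
dof = 3
chi_sq_stats = 17.505077010348  # X² computed above

# Critical value: X² must exceed this to reject the null hypothesis at alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)

# Survival function = 1 - cdf, i.e. the p-value of the observed X²
p_value = stats.chi2.sf(chi_sq_stats, dof)

print(critical_value, p_value, chi_sq_stats > critical_value)
```

Comparing X² with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.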
In Python, we can compute the p-value as follows:
1 - stats.chi2.cdf(chi_sq_stats, dof)
--------------------------------------------------------------------
0.000556300451038716
Suppose the significance level is 0.05. Since the p-value (about 0.0006) is below 0.05, we reject the null hypothesis and conclude that there is a significant relationship between ‘Attrition’ and ‘JobSatisfaction’.
Using SciPy
There is a shortcut to perform this test in Python, which leverages the SciPy library (documentation).
obs = np.array([ct.iloc[0][0:4].values,
                ct.iloc[1][0:4].values])
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(17.505077010348, 0.0005563004510387556, 3)
Note that the three terms are X² statistic, p-value, and degree of freedom, respectively. These results are consistent with the ones we computed by hand earlier.
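`chi2_contingency` also returns the table of expected frequencies as a fourth value, so there is no need to compute it by hand. As a small self-contained sketch, the block below feeds it the Sales ‘Attrition’ by ‘WorkLifeBalance’ counts printed in the department breakdown later in this post; the variable names are mine. SciPy's default Yates continuity correction only applies when there is one degree of freedom (a 2 x 2 table), so it has no effect here.

```python
import numpy as np
from scipy import stats

# Sales table from the department breakdown (rows: No, Yes; columns: 1-4)
sales_obs = np.array([[10, 78, 226, 40],
                      [6, 24, 50, 12]])

# Unpack all four return values: statistic, p-value, dof, expected counts
chi2, p, dof, expected = stats.chi2_contingency(sales_obs)
print(chi2, p, dof)
print(expected.round(2))
```

Pass the table without the ‘All’ row and column; including the margins would change both the statistic and the degrees of freedom.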
‘Attrition’ and ‘Education’
It is somewhat intuitive that whether an employee leaves the company is related to job satisfaction. Now let’s look at another example where we examine if there is a significant relationship between ‘Attrition’ and ‘Education’:
ct = pd.crosstab(data.Attrition, data.Education, margins=True)
obs = np.array([ct.iloc[0][0:5].values,
                ct.iloc[1][0:5].values])
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(3.0739613982367193, 0.5455253376565949, 4)
The p-value is over 0.5, so at the significance level of 0.05, we fail to reject the null hypothesis that ‘Attrition’ and ‘Education’ are independent.
Break Down the Analysis by Department
We can also check whether a significant relationship holds within each department. For example, we know there is a significant relationship between ‘Attrition’ and ‘WorkLifeBalance’, but we want to examine whether it holds regardless of department. First, let’s see what the departments are and how many employees each of them has:
dep_counts = data['Department'].value_counts()
dep_counts
--------------------------------------------------------------------
Research & Development 961
Sales 446
Human Resources 63
Name: Department, dtype: int64
To ensure enough samples for the Chi-square test, we will only focus on R&D and Sales in this analysis.
alpha = 0.05
for i in dep_counts.index[0:2]:
    sub_data = data[data.Department == i]
    ct = pd.crosstab(sub_data.Attrition, sub_data.WorkLifeBalance, margins=True)
    obs = np.array([ct.iloc[0][0:4].values,ct.iloc[1][0:4].values])
    print("For " + i + ": ")
    print(ct)
    print('With an alpha value of {}:'.format(alpha))
    if stats.chi2_contingency(obs)[1] <= alpha:
        print("Dependent relationship between Attrition and Work Life Balance")
    else:
        print("Independent relationship between Attrition and Work Life Balance")
    print("")
--------------------------------------------------------------------
For Research & Development:
WorkLifeBalance 1 2 3 4 All
Attrition
No 41 203 507 77 828
Yes 19 32 68 14 133
All 60 235 575 91 961
With an alpha value of 0.05:
Dependent relationship between Attrition and Work Life Balance
For Sales:
WorkLifeBalance 1 2 3 4 All
Attrition
No 10 78 226 40 354
Yes 6 24 50 12 92
All 16 102 276 52 446
With an alpha value of 0.05:
Independent relationship between Attrition and Work Life Balance
From this output, we can see that there is a significant relationship in the R&D department, but not in the Sales department.
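The test result alone does not show where the difference comes from. One simple follow-up, sketched below with counts copied from the two tables above, is to compute the share of leavers at each ‘WorkLifeBalance’ level. The dictionary layout and names are my own and this is a descriptive summary, not a further statistical test.

```python
# Counts copied from the two tables above (WorkLifeBalance levels 1-4)
tables = {
    "Research & Development": {"Yes": [19, 32, 68, 14], "All": [60, 235, 575, 91]},
    "Sales": {"Yes": [6, 24, 50, 12], "All": [16, 102, 276, 52]},
}

attrition_rates = {}
for department, counts in tables.items():
    # Share of employees at each level who left the company
    rates = [yes / total for yes, total in zip(counts["Yes"], counts["All"])]
    attrition_rates[department] = rates
    print(department, [round(r, 3) for r in rates])
```

In both departments the attrition rate is highest at level 1, but the Sales table has only 16 employees at that level, so the test has less evidence to work with there.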
Caveats and Limitations
There are a few caveats when conducting this analysis as well as some limitations of this test:
- In order to draw a meaningful conclusion, the number of samples in each scenario needs to be sufficiently large, which might not be the case in reality.
- A significant relationship does not imply causality.
- The Chi-square test itself does not provide additional insights besides ‘significant relationship or not’. For example, the test does not tell us that as job satisfaction increases, the proportion of employees who leave the company tends to decrease.
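For the first caveat, a commonly cited rule of thumb (not something stated in the original analysis) is that every cell should have an expected count of at least 5. The sketch below checks this for the Sales table shown earlier, using `expected_freq` from `scipy.stats.contingency`; the variable names are mine.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Sales table from the department breakdown (rows: No, Yes; columns: 1-4)
sales_obs = np.array([[10, 78, 226, 40],
                      [6, 24, 50, 12]])

# Expected counts under independence, then count the cells below 5
expected = expected_freq(sales_obs)
small_cells = int((expected < 5).sum())
print(expected.round(2))
print("cells with expected count below 5:", small_cells)
```

When some cells fall below the threshold, options include merging sparse categories or treating the result with extra caution.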
Summary
Let’s quickly recap.
We performed a Chi-square test for independence to examine the relationship between variables in the IBM HR Analytics dataset. We discussed two ways to do it in Python, both from scratch and using SciPy. Last, we showed that when a significant relationship exists, we can also stratify it and check if it is true for each level.
I hope you enjoyed this blog post and please share any thoughts that you may have :)
Check out my other post on building an image classification through Streamlit and PyTorch:
Translated from: https://towardsdatascience.com/chi-square-test-for-independence-in-python-with-examples-from-the-ibm-hr-analytics-dataset-97b9ec9bb80a