Employee Attrition Analysis using Python

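Employee attrition analysis studies the behaviour of employees who have left a company and compares them with the employees who remain. It helps identify which employees may leave soon. If you want to learn how to analyze employee attrition, this article is for you. In it, I will take you through the task of employee attrition analysis using Python.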

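Employee Attrition Analysis

Employee attrition analysis is a type of behavioural analysis. We study the behaviour and characteristics of the employees who left the company and compare them with those of the current employees, in order to find the employees who may leave soon.

A high attrition rate is expensive for any company in terms of recruitment and training costs, lost productivity, and reduced employee morale. By identifying the causes of attrition, a company can take steps to reduce it and retain valuable employees.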

Employee Attrition Analysis using Python

Let's start this task by importing the necessary Python libraries and the dataset:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"

data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
print(data.head())

Output

   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1   
1                 8          1  Life Sciences              1               2   
2                 2          2          Other              1               4   
3                 3          4  Life Sciences              1               5   
4                 2          1        Medical              1               7   

   ...  RelationshipSatisfaction StandardHours  StockOptionLevel  \
0  ...                         1            80                 0   
1  ...                         4            80                 1   
2  ...                         2            80                 0   
3  ...                         3            80                 0   
4  ...                         4            80                 1   

   TotalWorkingYears  TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
0                  8                      0               1               6   
1                 10                      3               3              10   
2                  7                      3               3               0   
3                  8                      3               3               8   
4                  6                      3               3               2   

   YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager  
0                  4                        0                     5  
1                  7                        1                     7  
2                  0                        0                     0  
3                  7                        3                     0  
4                  2                        2                     2  

[5 rows x 35 columns]

Let's check whether this dataset contains any missing values:

print(data.isnull().sum())

Output

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

Now let's look at the distribution of age in the dataset:

sns.displot(data['Age'], kde=True)
plt.title('Distribution of Age')
plt.show()

[Figure: histogram "Distribution of Age"]

Now let's look at how attrition is split across departments:

# Filter the data to show only "Yes" values in the "Attrition" column
attrition_data = data[data['Attrition'] == 'Yes']
# Calculate the count of attrition by department
attrition_by = attrition_data.groupby(['Department']).size().reset_index(name='Count')
# Create a donut chart
fig = go.Figure(data=[go.Pie(labels=attrition_by['Department'], values=attrition_by['Count'],
                             hole=0.4, marker=dict(colors=['#3CAEA3', '#F6D55C']),
                             textposition='inside')])
# Update the layout
fig.update_layout(title='Attrition by Department', font=dict(size=16),
                  legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
# Show the chart
fig.show()
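
The donut chart counts leavers only, so a large department can dominate it simply because it employs more people. As a complement, here is a minimal sketch of a per-department attrition rate (leavers divided by headcount). The `toy` frame and the helper `attrition_rate_by` are hypothetical and only illustrate the idea; on the real data you would pass `data` before the `Attrition` column is label-encoded.

```python
import pandas as pd

# Hypothetical toy data standing in for the HR dataset (values are made up).
toy = pd.DataFrame({
    "Department": ["Sales"] * 4 + ["Research & Development"] * 8,
    "Attrition": ["Yes", "Yes", "No", "No"] + ["Yes"] * 3 + ["No"] * 5,
})

def attrition_rate_by(df, column):
    """Return leavers, headcount and attrition rate for each value of `column`."""
    return (
        df.assign(Left=df["Attrition"].eq("Yes"))   # True for employees who left
          .groupby(column)["Left"]
          .agg(Leavers="sum", Headcount="count", Rate="mean")
          .sort_values("Rate", ascending=False)
    )

rates = attrition_rate_by(toy, "Department")
print(rates)
```

In this toy example Research & Development has more leavers in absolute terms, while Sales has the higher rate, which is the distinction the donut chart alone cannot show.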

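[Figure: donut chart "Attrition by Department"]

We can see that the Research & Development department accounts for the largest share of the employees who left (a share of all departures, not a rate within the department). Now let's look at the percentage of attrition by education field: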

attrition_by = attrition_data.groupby(['EducationField']).size().reset_index(name='Count')
# Create a donut chart
fig = go.Figure(data=[go.Pie(labels=attrition_by['EducationField'], values=attrition_by['Count'],
                             hole=0.4, marker=dict(colors=['#3CAEA3', '#F6D55C']),
                             textposition='inside')])
# Update the layout
fig.update_layout(title='Attrition by Educational Field', font=dict(size=16),
                  legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
# Show the chart
fig.show()

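[Figure: donut chart "Attrition by Educational Field"]

We can see that employees with a Life Sciences background account for the largest share of attrition. Now let's look at the percentage of attrition by the number of years spent at the company: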

attrition_by = attrition_data.groupby(['YearsAtCompany']).size().reset_index(name='Count')
# Create a donut chart
fig = go.Figure(data=[go.Pie(labels=attrition_by['YearsAtCompany'], values=attrition_by['Count'],
                             hole=0.4, marker=dict(colors=['#3CAEA3', '#F6D55C']),
                             textposition='inside')])
# Update the layout
fig.update_layout(title='Attrition by Years at Company', font=dict(size=16),
                  legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
# Show the chart
fig.show()

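[Figure: donut chart "Attrition by Years at Company"]

We can see that the largest group of leavers left after one year at the company. Now let's look at the percentage of attrition by the number of years since the last promotion: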

attrition_by = attrition_data.groupby(['YearsSinceLastPromotion']).size().reset_index(name='Count')
# Create a donut chart
fig = go.Figure(data=[go.Pie(labels=attrition_by['YearsSinceLastPromotion'], values=attrition_by['Count'],
                             hole=0.4, marker=dict(colors=['#3CAEA3', '#F6D55C']),
                             textposition='inside')])
# Update the layout
fig.update_layout(title='Attrition by Years Since Last Promotion', font=dict(size=16),
                  legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
# Show the chart
fig.show()

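[Figure: donut chart "Attrition by Years Since Last Promotion"]

We can see that employees who have not been promoted leave the company more often than employees who have been promoted. Now let's look at the percentage of attrition by gender: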

attrition_by = attrition_data.groupby(['Gender']).size().reset_index(name='Count')
# Create a donut chart
fig = go.Figure(data=[go.Pie(labels=attrition_by['Gender'], values=attrition_by['Count'],
                             hole=0.4, marker=dict(colors=['#3CAEA3', '#F6D55C']),
                             textposition='inside')])
# Update the layout
fig.update_layout(title='Attrition by Gender', font=dict(size=16),
                  legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
# Show the chart
fig.show()

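[Figure: donut chart "Attrition by Gender"]

Men account for a larger share of attrition than women. Now let's look at attrition through the relationship between monthly income and the age of the employees: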

# Note: trendline="ols" requires the statsmodels package to be installed
fig = px.scatter(data, x="Age", y="MonthlyIncome", color="Attrition", trendline="ols")
fig.update_layout(title="Age vs. Monthly Income by Attrition")
fig.show()

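[Figure: scatter plot "Age vs. Monthly Income by Attrition"]

We can see that monthly income increases with age. We can also see that attrition is high among employees with a low monthly income.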

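So this is how we can analyze employee attrition.

Employee Attrition Prediction Model

Now let's prepare a machine learning model to predict employee attrition. The dataset has many features with categorical values, so I will convert these categorical variables into numerical values: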

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Attrition'] = le.fit_transform(data['Attrition'])
data['BusinessTravel'] = le.fit_transform(data['BusinessTravel'])
data['Department'] = le.fit_transform(data['Department'])
data['EducationField'] = le.fit_transform(data['EducationField'])
data['Gender'] = le.fit_transform(data['Gender'])
data['JobRole'] = le.fit_transform(data['JobRole'])
data['MaritalStatus'] = le.fit_transform(data['MaritalStatus'])
data['Over18'] = le.fit_transform(data['Over18'])
data['OverTime'] = le.fit_transform(data['OverTime'])
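
Because the same `le` object is refitted for every column, only the mapping of the last column stays available afterwards. A minimal sketch of an alternative, assuming you want to look the codes up later: keep one encoder per column in a dictionary. The `toy` frame, the `encoders` dictionary and the column list are hypothetical names used for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns (values are made up).
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "OverTime": ["No", "Yes", "No", "No"],
    "Age": [41, 49, 37, 33],
})

categorical_columns = ["Attrition", "OverTime"]  # assumed list of columns to encode
encoders = {}
for column in categorical_columns:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder  # keep the fitted encoder for later lookups

# LabelEncoder assigns codes in sorted order of the labels.
attrition_mapping = {
    label: code for code, label in enumerate(encoders["Attrition"].classes_)
}
print(attrition_mapping)

# Codes can be turned back into the original labels when needed.
print(encoders["OverTime"].inverse_transform([0, 1]))
```

Knowing that "Yes" maps to 1 makes the sign of the correlations with `Attrition` easier to read: a positive value means the feature rises with the likelihood of leaving.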

Now let's look at the correlation of each feature with attrition:

correlation = data.corr()
print(correlation["Attrition"].sort_values(ascending=False))

Output

Attrition                   1.000000
OverTime                    0.246118
MaritalStatus               0.162070
DistanceFromHome            0.077924
JobRole                     0.067151
Department                  0.063991
NumCompaniesWorked          0.043494
Gender                      0.029453
EducationField              0.026846
MonthlyRate                 0.015170
PerformanceRating           0.002889
BusinessTravel              0.000074
HourlyRate                 -0.006846
EmployeeNumber             -0.010577
PercentSalaryHike          -0.013478
Education                  -0.031373
YearsSinceLastPromotion    -0.033019
RelationshipSatisfaction   -0.045872
DailyRate                  -0.056652
TrainingTimesLastYear      -0.059478
WorkLifeBalance            -0.063939
EnvironmentSatisfaction    -0.103369
JobSatisfaction            -0.103481
JobInvolvement             -0.130016
YearsAtCompany             -0.134392
StockOptionLevel           -0.137145
YearsWithCurrManager       -0.156199
Age                        -0.159205
MonthlyIncome              -0.159840
YearsInCurrentRole         -0.160545
JobLevel                   -0.169105
TotalWorkingYears          -0.171063
EmployeeCount                    NaN
Over18                           NaN
StandardHours                    NaN
Name: Attrition, dtype: float64
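
The NaN entries for EmployeeCount, Over18 and StandardHours appear because a column that holds a single value has zero variance, so its correlation is undefined. A minimal sketch of how such columns could be detected and dropped, using a hypothetical `toy` frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame; StandardHours is constant, as in the output above.
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1],
    "Age": [41, 49, 37, 33],
    "StandardHours": [80, 80, 80, 80],
})

# A column with a single distinct value carries no information for a model.
constant_columns = [column for column in toy.columns if toy[column].nunique() == 1]
print(constant_columns)

cleaned = toy.drop(columns=constant_columns)
print(cleaned.corr()["Attrition"])
```

Whether to drop these columns before training is a design choice; tree-based models simply ignore them, but removing them keeps the feature list easier to interpret.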

Let's add a new feature to this data, a satisfaction score:

data['SatisfactionScore'] = data['EnvironmentSatisfaction'] + data['JobSatisfaction'] + data['RelationshipSatisfaction']

Now let's split the data into training and test sets:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X = data.drop(['Attrition'], axis=1)
y = data['Attrition']
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)
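
If leavers are a minority class, a purely random split can give the training and test sets slightly different attrition proportions. One option, shown here as a sketch on made-up data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The names `X_toy` and `y_toy` and the 20% positive share are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20 of them labelled as leavers (1).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 20 + [0] * 80, name="Attrition")

# stratify=y_toy preserves the share of each class in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("Share of leavers in train:", y_tr.mean())
print("Share of leavers in test: ", y_te.mean())
```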

Here is how we can train an employee attrition prediction model:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(xtrain, ytrain)

# Evaluate the model's performance
ypred = model.predict(xtest)
accuracy = accuracy_score(ytest, ypred)
print("Accuracy:", accuracy)
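
Accuracy alone can look good on imbalanced data, because a model that always predicts "stays" is already right most of the time. The sketch below, on synthetic data generated with `make_classification` (not the HR dataset, and with an assumed class imbalance), compares a random forest with a majority-class baseline and prints a confusion matrix and per-class report.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data; the class weights are an assumption for illustration.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

# Random forest with a fixed seed so the result is reproducible.
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
forest_pred = forest.predict(X_te)
forest_accuracy = accuracy_score(y_te, forest_pred)

print("Baseline accuracy:", baseline_accuracy)
print("Random forest accuracy:", forest_accuracy)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_te, forest_pred)
print(matrix)
print(classification_report(y_te, forest_pred, zero_division=0))
```

On the attrition data, the same calls could be applied to `xtest`, `ytest` and `ypred`; the recall for the leaver class is usually the figure of most interest to an HR team.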