2019年，互联网寒冬真的来了吗？我们用决策树来做一个员工离职率的预测分析。

sklearn 中的决策树

决策树的应用场景是非常广泛的，在各行各业都有应用，并且有非常良好的表现。金融行业的风险贷款评估，医疗行业的疾病诊断，电商行业的销售预测等等。

首先我们先来了解下如何在 sklearn 中使用决策树模型。在 sklearn 中，可以使用如下方式来构建决策树分类器。




from sklearn.tree import DecisionTreeClassifierclf = DecisionTreeClassifier(criterion='entropy')

其中的 criterion 参数，就是决策树算法，可以选择 entropy，就是基于信息熵；而 gini，就是基于基尼系数。

构建决策树的一些重要参数，整理如下：

员工离职率预测

现在开始员工离职率的预测分析实战，相关数据集可以在这里下载

https://github.com/zhouwei713/DataAnalyse/tree/master/Decision_tree

— 数据预处理 —

导入数据并查看是否存在缺失值






import pandas 
as pd


import numpy 
as np


import matplotlib.pyplot 
as plt


import matplotlib 
as matplot


import seaborn 
as sns





df = pd.read_csv(
'HR.csv', index_col=
None)


# 检测是否有缺失数据

df.isnull().any()

>>>

satisfaction_level       
False

last_evaluation          
False

number_project           
False

average_montly_hours     
False

time_spend_company       
False

Work_accident            
False

left                     
False

promotion_last_5years    
False

sales                    
False

salary                   
False

dtype: bool

数据很完整，没有缺失值

查看数据

df.head()

satisfaction_level：员工对公司满意度last_evaluation：上一次公司对员工的评价number_project：该员工同时负责多少项目average_montly_hours：每个月工作的时长time_spend_company：在公司工作多久Work_accident：是否有个工作事故left：是否离开公司（1表示离职）promotion_last_5years：过去5年是否又被升职sales：部门salary：工资水平

— 数据分析 —

查看数据形状

print(df.shape)>>>(14999, 10)

共有14999条数据，每条数据10个特征。不过需要把 left 作为标签，实际为9个特征。

查看离职率

turnover_rate = df.left.value_counts() / len(df)print(turnover_rate)>>>0    0.7619171    0.238083Name: left, dtype: float64

离职率为0.24

分组统计

turnover_Summary = df.groupby('left')print(turnover_Summary.mean())

— 相关性分析 —

我们可以通过热力图，来了解下哪些特征的影响最大，哪些特征之间的相关性又是最强的。

corr = df.corr()sns.heatmap(corr,            xticklabels=corr.columns.values,            yticklabels=corr.columns.values)

两个特征的交叉处，颜色越浅，说明相关性越大。

— 数据集字符串转换成数字 —

将数据集中的字符串数据转换成数字类型数据

print(df.dtypes)>>>satisfaction_level       float64last_evaluation          float64number_project             int64average_montly_hours       int64time_spend_company         int64Work_accident              int64left                       int64promotion_last_5years      int64sales                     objectsalary                    objectdtype: object

sales 和 salary 两个特征需要转换

这里要介绍一个 Pandas 超级强大的功能，accessor。该方法可以返回属性，让我们知道某个数据有哪些属性可以使用

import pandas as pdpd.Series._accessors>>>{'cat', 'dt', 'sparse', 'str'}

可以看到，对于 series 类型数据，有如下四个属性可用。
cat：用于分类数据（Categorical data）
str：用于字符数据（String Object data）
dt：用于时间数据（datetime-like data）
sparse：用于系数矩阵



test = pd.Series([
'str1', 
'str2', 
'str3'])


print(test.str.upper())


print(dir(test.str))

>>>

0    STR1

1    STR2

2    STR3

dtype: object

[...
'_get_series_list', 
'_inferred_dtype', 
'_is_categorical', 
'_make_accessor', 
'_orig', 
'_parent', 
'_validate', 
'_wrap_result', 
'capitalize', 
'casefold', 
'cat', 
'center', 
'contains', 
'count', 
'decode', 
'encode', 
'endswith', 
'extract', 
'extractall', 
'find', 
'findall', 
'get', 
'get_dummies', 
'index'...]

现在我们要做的是把字符串转换成数字，所以可用使用 cat 这个属性，因为对于 sales 和 salary 两个特征，它们都是类别类型的数据，比如 sales 的 support，product_mng 和 salary 的 low，medium 等。

而 cat 属性的使用，是需要和数据类型 Category 相配合的，故我们需要先将上面两列的数据类型做下转换

df["sales"] = df["sales"].astype('category')df["salary"] = df["salary"].astype('category')然后再使用 cat.codes 来实现对整数的映射df["sales"] = df["sales"].cat.codesdf["salary"] = df["salary"].cat.codes

— 模型训练 —

划分标签和特征

target_name = 'left'X = df.drop('left', axis=1)y = df[target_name]划分训练集和测试集X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)

stratify 参数的作用是在训练集和测试集中，不同标签（离职与非离职）所占比例相同，即原数据集中比例是多少，训练集和测试集中比例也为多少。

模型训练和预测





from sklearn.metrics import roc_auc_score





clf = DecisionTreeClassifier(

    criterion='entropy',

    min_weight_fraction_leaf=0.01

    )

clf = clf.fit(X_train,y_train)

clf_roc_auc = roc_auc_score(y_test, clf.predict(X_test))

print (
"决策树 AUC = %2.2f" % clf_roc_auc)

>>>

 ---决策树---

决策树 AUC = 0.93

这里使用了 ROC 和 AUC 来检查分类器的准确率，我们来看看它们的含义。

— ROC —

ROC（Receiver Operating Characteristic）曲线，用来评价二值分类器的优劣。 ROC 曲线下的面积，表示模型准确率，用 AUC 来表示。

ROC 曲线有个很好的特性：当测试集中的正负样本的分布变化的时候，ROC 曲线能够保持不变。在实际的数据集中经常会出现类不平衡(class imbalance)现象，即负样本比正样本多很多(或者相反)，而且测试数据中的正负样本的分布也可能随着时间变化。

ROC 曲线






from sklearn.metrics 
import roc_curve

clf_fpr, clf_tpr, clf_thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:,
1])





plt.figure()






# 决策树 ROC

plt.plot(clf_fpr, clf_tpr, label=
'Decision Tree (area = %0.2f)' % clf_roc_auc)





plt.xlim([
0.0, 
1.0])

plt.ylim([
0.0, 
1.05])

plt.xlabel(
'False Positive Rate')

plt.ylabel(
'True Positive Rate')

plt.title(
'ROC Graph')

plt.legend(loc=
"lower right")

plt.show()

图中的曲线，越接近左上，说明模型准确率越高，也即是曲线下面的面积越大。

— 决策树应用 —

在构建好决策树之后，我们可以通过决策树来分析出不同特征的重要性，进而帮助做出决定。

在当前员工离职率分析的例子中，我们可以分析出哪几个特征是对员工离职起到觉得性作用的，那么公司就可以对想要留下的员工重点提高对应的特征。





importances = clf.feature_importances_

feat_names = df.drop([
'left'],axis=
1).columns









indices = np.argsort(importances)[::
-1]

plt.figure(figsize=(
12,
6))

plt.title(
"Feature importances by Decision Tree")

plt.bar(
range(
len(indices)), importances[indices], color=
'lightblue',  align=
"center")

plt.step(
range(
len(indices)), np.cumsum(importances[indices]), where=
'mid', label=
'Cumulative')

plt.xticks(
range(
len(indices)), feat_names[indices], rotation=
'vertical',fontsize=
14)

plt.xlim([
-1, 
len(indices)])

plt.show()

这图中可以非常直观的看出，satisfaction_level 起到了最大的作用，大概占到了0.5左右，其次就是 time_spend_company 和 last_evaluation。

那么这些影响是正相关还是负相关呢，就可以结合上面的热力图来分析

比如对于 satisfaction_level，从热力图中可以看到他与 left 是负相关的，所以我们就可以得出结论：满意度越高，越不容易离职。实际上，在真实生活中也是这样的，对吧。

最后再来看下那条折线，它的意思是说，每一个拐折的地方，都是前面特征影响度之和，比如图中红框的地方，就是前两个特征加一起对于离职的影响度，大概占到了0.7左右。

而前五个特征，基本已经占到了1，所以其他特征的影响几乎可以忽略不计了。

十月份，预测下年末的离职情况