Identifying Fraudulent Job Advertisements Using R Programming
Background
Online recruitment fraud (ORF) is a form of malicious behaviour that aims to inflict loss of privacy, economic damage or harm the reputation of the stakeholders via fraudulent job advertisements.
The aim of the analytics task was to identify fraudulent job advertisements from the data, determine key indicators of fraud and make recommendations on how to identify fraudulent job advertisements in the future.
Dataset
We will use the Employment Scam Aegean Dataset (EMSCAD), which can be downloaded at http://icsdweb.aegean.gr/emscad. A description of how the data was collected and a data dictionary is available on this page.
The dataset contains 17,880 real-life job ads. Variables within the dataset include:
Methodology
It is firstly important to understand how the dataset can be utilized to distinguish between fraudulent and non-fraudulent ads, as this will govern the type of analytical method that will be employed. The response variable we are trying to predict is the binary field “fraudulent”, where t = “Yes” and f = “No”.
Understanding the dataset to choose the analytical approach
We have a set of variables which are not categorized and are essentially HTML strings — Benefits, Company profile, Description and Requirements. This textual data calls for sentiment and emotion analysis or word-frequency analysis.
The categorical or factor variables, of which there are 11 (location, company logo, industry, function, salary range, department, required education, required experience, employment type, telecommuting, and questions), will be inputs into machine learning algorithms such as the gradient boosting machine (GBM), distributed random forest (DRF) and generalized linear model (GLMNET) to determine the top predictors that can be used to distinguish between fraudulent and non-fraudulent ads.
As two analytical approaches will be used (one for the string variables and the other for the factor variables), there will be two sets of output as follows:
- HTML variables: Sentiment, emotion and word frequency plots
- Nominal & Binary Variables: Top predictors, coefficients
Some variables do not contribute any information and as such, they were excluded from the analysis. These include title and in-balanced, as title is identifying information and in-balanced is only used to include and exclude records to balance the dataset.
Data ETL
Prior to modelling, a number of steps have to be carried out to cleanse the dataset. The flow diagram below shows the steps that were carried out to prepare the dataset for modelling.
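As an illustration, a minimal cleansing sketch is shown below. It assumes the EMSCAD CSV export and the dplyr package; the file name, the HTML-stripping regex and the in-balanced column name (written here as in_balanced) are assumptions rather than the exact code used for the original analysis.

# Minimal cleansing sketch: file name and column names are assumptions
library(dplyr)
emscad <- read.csv("emscad_v1.csv", stringsAsFactors = FALSE)

strip_html <- function(x) {
  x <- gsub("<[^>]+>", " ", x)      # drop HTML tags
  x <- gsub("&[a-z]+;", " ", x)     # drop HTML entities such as &amp;
  trimws(gsub("\\s+", " ", x))      # collapse repeated whitespace
}

emscad <- emscad %>%
  mutate(across(c(company_profile, description, requirements, benefits), strip_html)) %>%
  mutate(fraudulent = factor(fraudulent, levels = c("f", "t"), labels = c("No", "Yes"))) %>%
  select(-title, -in_balanced)      # drop variables excluded from the analysis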
Results
Output — Text Analytics
Word clouds
Word clouds were created for each of the HTML strings — company profile, job description, requirements and benefits.
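A hedged sketch of how each pair of word clouds could be produced with the tm and wordcloud packages follows; the cleaning steps and the 100-word cap are assumptions, not the author's exact settings.

# Word-cloud sketch using tm + wordcloud (cleaning steps are assumptions)
library(tm)
library(wordcloud)

plot_wordcloud <- function(text_vector) {
  corpus <- VCorpus(VectorSource(text_vector))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm <- TermDocumentMatrix(corpus)
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
}

plot_wordcloud(emscad$company_profile[emscad$fraudulent == "No"])   # non-fraudulent ads
plot_wordcloud(emscad$company_profile[emscad$fraudulent == "Yes"])  # fraudulent ads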
The word clouds below are for company profile for non-fraudulent ads (left) and fraudulent ads (right).
- Non-fraudulent ads emphasize a work life balance (“home”, “life”, “care”) and company culture (“team”, “experience”)
- Fraudulent ads are largely missing the company profile with an emphasis on monetary perks (“cell phones”, “money”, “cost”)
The word clouds below are for job description for non-fraudulent ads (left) and fraudulent ads (right).
- Non-fraudulent ads emphasize company offerings (“gas”, “oil”, “operations”)
- Fraudulent ads emphasize monetary value (“money”, “financially”, “discounts”)
Below are word clouds for job requirements — non-fraudulent on the left and fraudulent on the right.
- Non-fraudulent ads emphasize years of experience, skills, degree qualifications and project orientation
- Fraudulent ads emphasize the above attributes to a lesser extent
Finally, the word clouds below are based on the text for job benefits.
- Non-fraudulent ads emphasize benefits such as “sick leave”, “hours” and “vacation”
- Fraudulent ads appear to offer monetary perks such as accommodation, holidays, food, competitive salary and visas, among others.
Sentiment Analysis
Another way to analyse text is via sentiment analysis, which identifies the type of emotion (positive or negative) associated with each word in the text.
For instance, looking at the emotion categories below for non-fraudulent and fraudulent ads for job requirements, we can see that a greater proportion of non-fraudulent ads (left) are positive (“joy”, “surprise”), whereas the opposite is true for fraudulent ads (right).
We can also look at the polarity of these ads, that is, their orientation towards a specific emotion category, positive or negative. A greater proportion of non-fraudulent ads are positive than fraudulent ads.
As shown by the examples of word clouds and bar graphs of textual sentiment, text information is very useful in predicting certain behaviour. The next logical step would be to tag these ads as positive or negative based on their emotion/polarity and introduce this information as binary variables into a machine learning model to determine the importance of these variables for prediction.
For instance, you would create four variables: job requirements, description, benefits and company profile. For each variable, each ad would be assigned a “0” or “1” to signify “positive” or “negative” sentiment.
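A minimal sketch of this tagging step, assuming the syuzhet package for scoring and a cut-off at zero (both assumptions, not part of the original analysis):

# Tag each ad's requirements text with a binary sentiment flag (syuzhet assumed)
library(syuzhet)
req_score <- get_sentiment(emscad$requirements, method = "syuzhet")
emscad$requirements_negative <- ifelse(req_score < 0, 1, 0)  # 1 = negative, 0 = positive
# repeat for description, benefits and company_profile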
Now, let’s move on to utilizing the numerical variables in a model for predicting which ads are fraudulent and non-fraudulent.
Machine Learning Models
Overview
It is always good to run a number of different types of models and then select the one, or combination of models, that provides not only the highest accuracy but also meaningful results that can readily be explained to business stakeholders and are likely to be accepted by them.
For this problem, I ran three types of models:
Distributed random forest (DRF): Essentially a random forest, which is an ensemble of classification trees, but run in parallel on the h2o server, hence the word “distributed”.
Gradient boosting machine (GBM): Like the random forest, it is also a classification method consisting of an ensemble of trees. The difference is that random forests build deep independent trees (i.e. each tree is run on a random set of variables on a random subset of the data — the “bagging” method), whereas GBMs build lots of shallow, weak, dependent, successive trees. In this approach, each tree learns from the previous tree and tries to improve on it by reducing the amount of error and increasing the amount of variation in the response variable explained by the predictive variables.
Generalized linear model (GLM): GLMs are an extension of linear models that can be run on a non-normally distributed dependent variable. As this is a classification problem, the link function used is the one for logistic regression. The output of a logistic regression algorithm is a set of coefficients for the predictors in logits, where a one-unit change in the predictor variable leads to a change of the coefficient value in the log odds. These logits can be converted to odds ratios to provide more meaningful information.
To calculate the odds ratio, we exponentiate each coefficient, i.e. compute e^b.
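For example, with an illustrative coefficient of 0.9 (not a value from this model):

exp(0.9)   # odds ratio of about 2.46: the odds of fraud are roughly 2.5 times higher per unit increase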
Now that you have some understanding of the three types of models, let’s compare their model accuracy.
Methodology
The dataset was split into training (80% of dataset) and test (20% of dataset) sets using a random seed where the goal is to train the model on the training set and test its accuracy on the test set.
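A sketch of this split using h2o is shown below; the frame names match the GBM call further down, while the specific seed value is an assumption.

# 80/20 split of the cleansed data on the h2o cluster
library(h2o)
h2o.init()
model.h2o <- as.h2o(emscad)
splits <- h2o.splitFrame(model.h2o, ratios = 0.8, seed = 1234)
model_train.h2o <- splits[[1]]   # 80% training set
model_test.h2o  <- splits[[2]]   # 20% test set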
The GBM was run with the following parameters: the maximum depth of the tree was set to 4 (four levels), a small learning rate, and five-fold cross-validation.
Cross-validation is a technique used to validate our training model before we apply it to the test set. By specifying five folds, we build five different models, where each model is trained on four parts of the data and tested on the fifth. So, the first model is trained on parts 1, 2, 3, and 4 and tested on part 5; the second model is trained on parts 1, 3, 4, and 5 and tested on part 2; and so on.
This method is called k-fold cross-validation and allows us to be more confident in the performance of the modelling method utilised. When we create five different models, we are testing them on five different/unseen datasets. If we only test the model once, for example on our test set, then we only have a single evaluation, which may be a biased result.
gbm_model <-h2o.gbm(y=y_dv, x=x_iv, training_frame = model_train.h2o,
ntrees =500, max_depth = 4, distribution="bernoulli", #for 0-1 outcomes
learn_rate = 0.01, seed = 1234, nfolds = 5, keep_cross_validation_predictions = TRUE)
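The DRF and GLM were fitted in the same fashion. A hedged sketch of the equivalent calls is shown below; the specific hyper-parameters are assumptions that mirror the GBM settings rather than the author's exact code, and lambda_search for the GLM is an assumption consistent with the sparse set of non-zero coefficients reported later.

# Distributed random forest: parameters assumed to mirror the GBM call
drf_model <- h2o.randomForest(y = y_dv, x = x_iv, training_frame = model_train.h2o,
                              ntrees = 500, max_depth = 4, seed = 1234, nfolds = 5,
                              keep_cross_validation_predictions = TRUE)

# Generalized linear model with a binomial family (logistic regression)
glm_model <- h2o.glm(y = y_dv, x = x_iv, training_frame = model_train.h2o,
                     family = "binomial", lambda_search = TRUE, seed = 1234, nfolds = 5)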
To measure model accuracy, I used the ROC-AUC metrics. The ROC, or Receiver Operating Characteristic, is a probability curve, and the AUC, Area Under the Curve, is a measure of the degree of separation between classes. In our case, the AUC reflects how accurately the given model can distinguish between non-fraudulent and fraudulent ads. The higher the AUC, the more accurate the model is at classifying the ads correctly.
library(data.table); library(ggplot2)                        # needed for data.table() and ggplot()
perf <- h2o.performance(gbm_model, newdata = model_test.h2o)
fpr <- h2o.fpr(perf)[['fpr']]
tpr <- h2o.tpr(perf)[['tpr']]
gbm.auc <- h2o.auc(perf)                                      # AUC on the test set, used in the plot title
ggplot(data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr)) +
  geom_line() + theme_bw() + ggtitle(sprintf('AUC: %f', gbm.auc))
The AUC is derived from a few underlying counts used to assess model accuracy:
- True Positives (TP): Fraudulent ads that were correctly predicted as fraudulent
- True Negatives (TN): Non-fraudulent ads that were correctly predicted as non-fraudulent
- False Positives (FP): Non-fraudulent ads that were incorrectly predicted as fraudulent
- False Negatives (FN): Fraudulent ads that were incorrectly predicted as non-fraudulent
These metrics can then be combined to calculate sensitivity and specificity.
Sensitivity is a measure of what proportion of fraudulent ads were correctly classified.
Sensitivity = count(TP) / (count(TP) + count(FN))
Specificity is a measure of what proportion of non-fraudulent ads were correctly identified.
Specificity = count(TN) / (count(TN) + count(FP))
When determining which measure is more important for your analysis, ask yourself whether it is more important to correctly identify positives (sensitivity matters more) or negatives (specificity matters more). In our case, we want a model with higher sensitivity, as we are more interested in correctly identifying fraudulent ads.
All these metrics can be summarized in a confusion matrix, which is a table comparing the number of cases that were correctly and incorrectly predicted against the actual number of fraudulent and non-fraudulent cases. This information can be used to supplement our understanding of the ROC and AUC metrics.
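For reference, a one-line sketch of producing such a matrix on the test set, assuming the glm_model object sketched earlier:

h2o.confusionMatrix(glm_model, newdata = model_test.h2o)   # actual vs. predicted counts on the test set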
Another aspect of the ROC-AUC metrics is the threshold used to determine whether an ad is fraudulent or non-fraudulent. To determine the best threshold t that maximizes the number of true positives, we can use the ROC curve, where we plot the TPR (True Positive Rate) on the y-axis against the FPR (False Positive Rate) on the x-axis.
The AUC also allows for comparison of models: we can compare their ROC curves for accuracy on the test set, as shown in the model output below.
Model Output
Model Accuracy Comparison
The table below shows that the DRF produces the model with the highest AUC of 0.962 on the test set. All three models have high AUC values, well above the 0.5 expected from random prediction.
However, let’s dig deeper into what this AUC means in terms of correctly classifying ads as fraudulent by looking at the confusion matrix below for the GLM model as an example.
The confusion matrix for the GLM on the test set indicates an error rate of 8.15% for fraudulent cases. The sensitivity for this model is 327/(327+29) = 92%, which is very good.
Now, let’s look at the remaining output from the models, more specifically the top predictors for classifying fraudulent and non-fraudulent ads.
Most Important Predictors
The variable importance rank in a classification problem tells us how accurately a predictor variable can classify fraudulent ads over non-fraudulent ads relative to all other predictors that were used in the model.
For both (a) GBM and (b) DRF, the top three variables — location, company logo and industry — are the same in terms of how useful they are in classifying job ads as fraudulent or non-fraudulent. Likewise, the has questions and telecommuting variables are the least important in both models.
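A short sketch of how these importance rankings can be pulled from the fitted h2o models; the num_of_features value is an arbitrary choice, and drf_model refers to the object sketched earlier.

h2o.varimp(gbm_model)                              # variable importance table for the GBM
h2o.varimp_plot(drf_model, num_of_features = 10)   # top-10 importance plot for the DRF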
Now, let’s plot the dataset to better understand how the top predictors vary for fraudulent and non-fraudulent ads.
Let’s look at the top variable, location, where we can see that a greater proportion of fraudulent than non-fraudulent ads are from the USA and Australia as indicated by the circled bars.
A greater proportion of fraudulent than non-fraudulent ads do not display a company logo in their job ads.
Understanding the model coefficients
Now, let’s try to numerically understand the relationship between the predictor variables and the classification of ads.
As shown in the table below, the highlighted predictors are best at distinguishing fraudulent and non-fraudulent ads.
Of the 767 variables entered into the model, only 48 have a non-zero coefficient (top predictors shown).
The greater the probability, the higher the chance of the ad being fraudulent.
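A hedged sketch of pulling the non-zero GLM coefficients and converting them to odds ratios, assuming the glm_model object sketched earlier:

coefs <- h2o.coef(glm_model)                                 # named vector of GLM coefficients (logits)
coefs <- coefs[coefs != 0 & names(coefs) != "Intercept"]     # keep only the non-zero predictors
odds_ratios <- sort(exp(coefs), decreasing = TRUE)           # convert logits to odds ratios
head(odds_ratios, 10)                                        # strongest indicators of fraud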
Conclusion and Next Steps
You should now have a good understanding of how to use both textual and numerical predictors in a classification problem, employing text analytics tools alongside machine learning classification algorithms.
So, what can we do next?
- A combination of textual analysis and predictive modelling should be used to classify job ads as fraudulent or non-fraudulent
- To improve the accuracy of the textual analysis, the following methods can be introduced:
N-grams modelling: Look at combinations of words that occur together to identify patterns (see the bigram sketch after this list)
- Look for trends in capitalization and punctuation
- Look for trends in emphasized text (bold, italicized)
- Look for trends in types of HTML tags used (raw text lists vs. list text wrapped in list elements)
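As referenced in the n-grams point above, here is a hedged sketch of bigram extraction with the tidytext package; the choice of the description field and of bigrams rather than longer n-grams are assumptions.

# Count bigrams in the description text, split by class
library(dplyr)
library(tidytext)
bigrams <- emscad %>%
  select(fraudulent, description) %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  count(fraudulent, bigram, sort = TRUE)
head(bigrams, 20)   # most frequent word pairs per class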
Predictive model accuracy can be improved by:
- Working with a larger dataset
- Splitting the dataset into three: training, test and validation sets
- Splitting salary range into numerical variables: minimum & maximum (see the sketch after this list)
- Removing variables that are associated with each other (e.g. using a chi-squared test of independence)
- Splitting location into country, state, and city
- Reducing the number of variables by grouping industry and function categories
- Expanding the dataset to include online behaviour — i.e. number of times the ad was clicked on, IP location, time the ad was uploaded, etc.
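A hedged sketch of the salary-range and location splits mentioned above, using tidyr; the column names follow the EMSCAD data dictionary, and the separators are assumptions.

library(tidyr)
emscad <- separate(emscad, salary_range, into = c("salary_min", "salary_max"),
                   sep = "-", convert = TRUE, fill = "right")      # e.g. "40000-50000"
emscad <- separate(emscad, location, into = c("country", "state", "city"),
                   sep = ",", fill = "right", extra = "merge")     # e.g. "US, NY, New York"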
For all the code used to generate results, see my GitHub repository — https://github.com/shedoesdatascience/fraudanalytics
Translated from: https://towardsdatascience.com/identifying-fraudulent-job-advertisements-using-r-programming-230daa20aec7