Identifying Fraudulent Job Advertisements Using R Programming

Background

Online recruitment fraud (ORF) is a form of malicious behaviour that aims to cause loss of privacy, economic damage, or reputational harm to stakeholders via fraudulent job advertisements.

The aim of the analytics task was to identify fraudulent job advertisements from the data, determine key indicators of fraud and make recommendations on how to identify fraudulent job advertisements in the future.

Dataset

We will use the Employment Scam Aegean Dataset (EMSCAD), which can be downloaded at http://icsdweb.aegean.gr/emscad. A description of how the data was collected and a data dictionary are available on this page.

The dataset contains 17,880 real-life job ads. Its variables include free-text HTML fields, categorical and binary fields, and the binary response “fraudulent”, all of which are described in the Methodology section below.

Methodology

It is important first to understand how the dataset can be used to distinguish between fraudulent and non-fraudulent ads, as this governs the type of analytical method that will be employed. The response variable we are trying to predict is the binary field “fraudulent”, where t = “Yes” and f = “No”.

Understanding the dataset to choose the analytical approach

Four variables are uncategorized and are essentially HTML strings: benefits, company profile, description and requirements. Textual data of this kind calls for sentiment and emotion analysis or word frequency analysis.

There are 11 categorical (factor) variables: location, company logo, industry, function, salary range, department, required education, required experience, employment type, telecommuting, and questions. These will be inputs into machine learning algorithms such as the gradient boosting machine (GBM), distributed random forest (DRF) and generalized linear model (GLMNET) to determine the top predictors that can be used to distinguish between fraudulent and non-fraudulent ads.

As two analytical approaches will be used, one for the string variables and the other for the factor variables, there will be two sets of output:

  • HTML variables: Sentiment, emotion and word frequency plots
  • Nominal & Binary Variables: Top predictors, coefficients

Some variables contribute no useful information and were excluded from the analysis: title, because it is identifying information, and in-balanced, because it is only used to include or exclude records when balancing the dataset.

Data ETL

Prior to modelling, a number of steps have to be carried out to cleanse the dataset. The flow diagram below shows the steps that were carried out to prepare the dataset for modelling.

Figure 1: Data ETL Flow Diagram

Results

Output: Text Analytics

Word clouds

Word clouds were created for each of the HTML strings: company profile, job description, requirements and benefits.
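The word counts behind a word cloud can be sketched in base R. This is a minimal illustration and not the author's code: the word_freq helper, the short stop-word list and the two sample company profiles are all hypothetical.

```r
# Count word frequencies in HTML strings (hypothetical helper, base R only)
word_freq <- function(html,
                      stopwords = c("the", "a", "and", "of", "to", "in",
                                    "is", "for", "we", "our")) {
  text  <- gsub("<[^>]+>", " ", html)               # drop HTML tags
  text  <- tolower(gsub("[^A-Za-z ]", " ", text))   # keep letters only
  words <- unlist(strsplit(text, "\\s+"))           # split on whitespace
  words <- words[nchar(words) > 0 & !(words %in% stopwords)]
  sort(table(words), decreasing = TRUE)             # most frequent first
}

# Made-up company profiles, for illustration only
profiles <- c("<p>Our <b>team</b> values work-life balance and care.</p>",
              "<ul><li>Great team</li><li>Home office</li></ul>")

freq <- word_freq(profiles)
print(head(freq, 5))
```

A frequency table like this could then be handed to a plotting package (for example wordcloud) to draw the cloud, once for fraudulent and once for non-fraudulent ads.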

The word clouds below are for the company profile of non-fraudulent ads (left) and fraudulent ads (right).

  • Non-fraudulent ads emphasize work-life balance (“home”, “life”, “care”) and company culture (“team”, “experience”)
  • Fraudulent ads are largely missing a company profile and instead emphasize monetary perks (“cell phones”, “money”, “cost”)

Figure 2.1: Company Ads (Non-Fraudulent)
Figure 2.2: Company Ads (Fraudulent)

The word clouds below are for the job description of non-fraudulent ads (left) and fraudulent ads (right).

  • Non-fraudulent ads emphasize company offerings (“gas”, “oil”, “operations”)
  • Fraudulent ads emphasize monetary value (“money”, “financially”, “discounts”)

Figure 3.1: Job description (Non-fraudulent ads)
Figure 3.2: Job description (Fraudulent ads)

Below are word clouds for job requirements, with non-fraudulent ads on the left and fraudulent ads on the right.

  • Non-fraudulent ads emphasize years of experience, skills, degree qualifications and project orientation
  • Fraudulent ads emphasize the above attributes to a lesser extent

Figure 4.1: Job requirements (Non-fraudulent ads)
Figure 4.2: Job requirements (Fraudulent ads)

Finally, the word clouds below are based on the text for job benefits.

  • Non-fraudulent ads emphasize benefits such as “sick leave”, “hours” and “vacation”
  • Fraudulent ads appear to offer monetary perks such as accommodation, holidays, food, competitive salary and visa, among others

Figure 5.1: Job benefits (Non-fraudulent ads)
Figure 5.2: Job benefits (Fraudulent ads)

Sentiment Analysis

Another way to analyse text is sentiment analysis, which identifies the type of emotion (positive or negative) associated with each word in the text.

For instance, looking at the emotion categories below for the job requirements of non-fraudulent and fraudulent ads, we can see that a greater proportion of non-fraudulent ads (left) carry positive emotions (“joy”, “surprise”), whereas the opposite is true for fraudulent ads (right).

Figure 6: Emotion sentiment analysis: Non-fraudulent vs. Fraudulent ads

We can also look at the polarity of these ads, that is, their orientation towards positive or negative sentiment. A greater proportion of non-fraudulent ads than fraudulent ads are positive.

Figure 7: Emotion (Polarity) analysis for non-fraudulent vs. fraudulent ads

As the word clouds and bar graphs of textual sentiment show, text information is very useful for predicting certain behaviour. The next logical step would be to tag these ads as positive or negative based on their emotion or polarity, and to introduce this information as binary variables into a machine learning model to determine how important they are for prediction.

For instance, you would create four variables: job requirements, description, benefits and company profile. For each variable, each ad would be assigned a “0” or “1” to signify “positive” or “negative” sentiment.
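The tagging step described above could look roughly like this in base R. It is only a sketch under stated assumptions: the two word lists are a toy lexicon invented for the example (a real analysis would use an established sentiment lexicon), the sample ads are made up, and the coding 1 = negative, 0 = positive or neutral is one possible reading of the scheme in the text.

```r
# Toy lexicon (hypothetical, for illustration only)
positive_words <- c("great", "growth", "support", "benefit", "team")
negative_words <- c("risk", "urgent", "fee", "debt", "problem")

# Return 1 if negative words outnumber positive words, else 0
polarity_flag <- function(text) {
  words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", " ", text)), "\\s+"))
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  as.integer(score < 0)
}

# Made-up ads with two of the four text fields
ads <- data.frame(
  description = c("Great team with growth and support",
                  "Urgent: pay a fee, no risk"),
  benefits    = c("Benefit package for the team",
                  "Debt problem solved fast"),
  stringsAsFactors = FALSE
)

# One binary sentiment variable per text field
ads$description_negative <- vapply(ads$description, polarity_flag,
                                   integer(1), USE.NAMES = FALSE)
ads$benefits_negative    <- vapply(ads$benefits, polarity_flag,
                                   integer(1), USE.NAMES = FALSE)
print(ads[, c("description_negative", "benefits_negative")])
```

The resulting 0/1 columns can sit alongside the other predictors when the model is trained.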

Now, let’s move on to using the categorical variables in a model for predicting which ads are fraudulent and which are non-fraudulent.

Machine Learning Models

Overview

It is always good to run a number of different types of models and then select the model, or combination of models, that provides not only the highest accuracy but also meaningful results that can readily be explained to business stakeholders and are likely to be accepted by them.

For this problem, I ran three types of models:

  • Distributed random forest (DRF): Essentially a random forest, which is an ensemble of classification trees, but run in parallel on the h2o server; hence the word “distributed”.

  • Gradient boosting machine (GBM): Like the random forest, it is a classification method consisting of an ensemble of trees. The difference is that random forests build deep, independent trees (i.e. each tree is fitted on a random set of variables and a random subset of the data, the “bagging” method), whereas GBMs build many shallow, weak, dependent trees in succession. In this approach, each tree learns from the previous tree and tries to improve on it by reducing the error and increasing the amount of variation in the response variable explained by the predictor variables.

  • Generalized linear model (GLM): GLMs are an extension of linear models that can be fitted to a non-normally distributed dependent variable. As this is a classification problem, the logit link function is used, which gives logistic regression. The output of a logistic regression is a coefficient for each predictor in logits: a one-unit change in the predictor changes the log odds by the value of the coefficient. These logits can be converted to odds ratios to provide more meaningful information.

  • To calculate the odds ratio, we exponentiate each coefficient b, i.e. raise e to the power of b: e^b
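The conversion from logits to odds ratios can be illustrated with base R's glm rather than h2o. The counts below are hypothetical and are not taken from EMSCAD; the no_logo predictor is an invented stand-in for a binary variable such as the company logo flag.

```r
# Hypothetical counts: 100 ads without a logo (30 fraudulent),
# 200 ads with a logo (10 fraudulent)
ads <- data.frame(
  no_logo    = rep(c(1, 1, 0, 0), times = c(30, 70, 10, 190)),
  fraudulent = rep(c(1, 0, 1, 0), times = c(30, 70, 10, 190))
)

# Logistic regression: binomial family with the logit link
fit <- glm(fraudulent ~ no_logo, data = ads,
           family = binomial(link = "logit"))

log_odds    <- coef(fit)       # coefficients in logits
odds_ratios <- exp(log_odds)   # e^b for each coefficient b
print(round(odds_ratios, 3))
```

With a single binary predictor, the odds ratio from the model matches the one computed directly from the counts, (30/70) / (10/190), or about 8.14: in this toy data, the odds of an ad being fraudulent are roughly eight times higher when the logo is missing.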

Now that you have some understanding of the three types of models, let’s compare their model accuracy.

Methodology

The dataset was split into a training set (80% of the dataset) and a test set (20% of the dataset) using a random seed. The goal is to train the model on the training set and test its accuracy on the test set.
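A seeded 80/20 split can be sketched in base R as follows. The ads data frame here is a made-up stand-in for the prepared dataset, and the seed value is arbitrary; the article's own split may have been done differently (for example directly on an h2o frame).

```r
set.seed(1234)  # fixing the seed makes the split reproducible

# Stand-in for the cleaned dataset (hypothetical)
ads <- data.frame(id = 1:100,
                  fraudulent = rep(c("t", "f"), times = c(10, 90)))

# Draw 80% of the row indices at random for training
train_idx   <- sample(seq_len(nrow(ads)), size = floor(0.8 * nrow(ads)))
model_train <- ads[train_idx, ]
model_test  <- ads[-train_idx, ]   # the remaining 20%

cat(nrow(model_train), "training rows,", nrow(model_test), "test rows\n")
```

Because fraudulent ads are rare, it is worth checking that both subsets contain some of them after splitting.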

The GBM was run with the parameters shown below: a maximum tree depth of 4 (4 levels), a small learning rate, and five-fold cross-validation.

Cross-validation is a technique used to validate our training model before we apply it to the test set. Specifying five folds means that the training data is divided into five parts and five different models are built, each trained on four parts and tested on the fifth. So, the first model is trained on parts 1, 2, 3, and 4 and tested on part 5. The second model is trained on parts 1, 3, 4, and 5 and tested on part 2, and so on.

This method is called k-fold cross-validation and allows us to be more confident in the performance of the modelling method used. When we create five different models, we are testing the method on five different unseen datasets. If we only test the model once, for example on our test set, then we have only a single evaluation, which may give a biased result.
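The fold bookkeeping behind k-fold cross-validation can be sketched in a few lines of base R. This only shows how rows are assigned to folds; h2o does this internally when nfolds is set, and the row count of 20 is an arbitrary toy value.

```r
set.seed(1234)
n <- 20   # number of training rows (toy value)
k <- 5    # number of folds

# Assign each row to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  train_rows    <- which(fold_id != fold)  # four parts used for training
  validate_rows <- which(fold_id == fold)  # held-out part used for validation
  cat(sprintf("Fold %d: train on %d rows, validate on %d rows\n",
              fold, length(train_rows), length(validate_rows)))
  # A model would be fitted on train_rows and scored on validate_rows here
}
```

Every row is used for validation exactly once, so the five scores together give a less optimistic picture than a single evaluation.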

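library(h2o)  # assumes h2o.init() has been run and model_train.h2o is an H2OFrame
gbm_model <- h2o.gbm(
  y = y_dv, x = x_iv, training_frame = model_train.h2o,  # response and predictor column names
  ntrees = 500, max_depth = 4, distribution = "bernoulli",  # bernoulli for 0-1 outcomes
  learn_rate = 0.01, seed = 1234, nfolds = 5,  # small learning rate, 5-fold cross-validation
  keep_cross_validation_predictions = TRUE)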

To measure model accuracy, I used the ROC-AUC metrics. The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate across classification thresholds, and the AUC (Area Under the Curve) is a measure of the degree of separation between classes. In our case, the AUC measures how accurately a given model can distinguish between non-fraudulent and fraudulent ads. The higher the AUC, the better the model is at classifying the ads correctly.

library(ggplot2); library(data.table)
perf <- h2o.performance(gbm_model, newdata = model_test.h2o)  # evaluate once on the test set
gbm.auc <- h2o.auc(perf)  # AUC shown in the plot title
fpr <- h2o.fpr(perf)[['fpr']]
tpr <- h2o.tpr(perf)[['tpr']]
ggplot(data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr)) +
  geom_line() + theme_bw() + ggtitle(sprintf('AUC: %f', gbm.auc))

The ROC curve, and therefore the AUC, is built from four counts that describe model accuracy:

  • True Positives (TP): Fraudulent ads that were correctly predicted as fraudulent
  • True Negatives (TN): Non-fraudulent ads that were correctly predicted as non-fraudulent
  • False Positives (FP): Non-fraudulent ads that were incorrectly predicted as fraudulent
  • False Negatives (FN): Fraudulent ads that were incorrectly predicted as non-fraudulent

These metrics can then be combined to calculate sensitivity and specificity.

Sensitivity is a measure of what proportion of fraudulent ads were correctly classified.

Sensitivity = count(TP) / (count(TP) + count(FN))

Specificity is a measure of what proportion of non-fraudulent ads were correctly identified.

Specificity = count(TN) / (count(TN) + count(FP))

When determining which measure is more important for your analysis, ask yourself whether it is more important to correctly classify positives (sensitivity is more important) or negatives (specificity is more important). In our case, we want a model with higher sensitivity, as we are more interested in correctly identifying fraudulent ads.

All these metrics can be summarized in a confusion matrix, a table that compares the number of cases predicted correctly and incorrectly against the actual number of fraudulent and non-fraudulent cases. This information can be used to supplement our understanding of the ROC and AUC metrics.
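A confusion matrix and the two rates derived from it can be sketched in base R. The actual and predicted labels below are invented for illustration, using the dataset's t/f coding with t as the fraudulent (positive) class.

```r
# Hypothetical actual and predicted labels (t = fraudulent, f = non-fraudulent)
actual    <- factor(c("t", "t", "t", "f", "f", "f", "f", "t", "f", "f"),
                    levels = c("t", "f"))
predicted <- factor(c("t", "t", "f", "f", "f", "t", "f", "t", "f", "f"),
                    levels = c("t", "f"))

# Rows are the actual classes, columns the predicted classes
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

TP <- cm["t", "t"]; FN <- cm["t", "f"]   # fraudulent ads: caught vs. missed
FP <- cm["f", "t"]; TN <- cm["f", "f"]   # genuine ads: flagged vs. cleared

sensitivity <- TP / (TP + FN)   # share of fraudulent ads correctly identified
specificity <- TN / (TN + FP)   # share of non-fraudulent ads correctly identified
cat(sprintf("Sensitivity: %.2f, Specificity: %.2f\n", sensitivity, specificity))
```

Reading the matrix row by row makes it clear which denominator each rate uses: sensitivity only looks at the actual-fraudulent row, specificity only at the actual-genuine row.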

Another aspect of the ROC-AUC metrics is the threshold used to determine whether an ad is fraudulent or non-fraudulent. To determine the threshold t that maximizes true positives while keeping false positives low, we can use the ROC curve, where we plot the TPR (True Positive Rate) on the y-axis against the FPR (False Positive Rate) on the x-axis.
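To make the link between thresholds, the ROC curve and the AUC concrete, here is a small base R sketch. The scores and labels are made up, the roc_points helper is hypothetical, and choosing the threshold by Youden's J (TPR minus FPR) is one common criterion brought in for illustration, not necessarily the one used in the article.

```r
# Hypothetical predicted fraud probabilities and true labels (1 = fraudulent)
score <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05)
label <- c(1,    1,    0,    1,    1,    0,    0,    1,    0,    0)

# TPR and FPR when every ad with score >= t is called fraudulent
roc_points <- function(score, label) {
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thresholds,
                function(t) sum(score >= t & label == 1) / sum(label == 1))
  fpr <- sapply(thresholds,
                function(t) sum(score >= t & label == 0) / sum(label == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

roc <- roc_points(score, label)

# Area under the ROC curve by the trapezoidal rule
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)

# Threshold with the largest gap between TPR and FPR (Youden's J)
best <- roc[which.max(roc$tpr - roc$fpr), ]
cat(sprintf("AUC: %.2f, chosen threshold: %.2f\n", auc, best$threshold))
```

Lowering the threshold moves along the curve towards the top right: more fraudulent ads are caught, but more genuine ads are flagged as well.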

The AUC also allows models to be compared: the ROC curves of the candidate models on the test set are shown in the model output below.

Model Output

Model Accuracy Comparison

The table below shows that the DRF produces the model with the highest AUC, 0.962, on the test set. All three models have high AUC values, well above the 0.5 expected from random prediction.

Figure 9: AUC curve for classification of ads by fraudulence

However, let’s dig deeper into what this AUC means in terms of ads correctly classified as fraudulent, using the confusion matrix for the GLM model as an example.

The confusion matrix for the GLM on the test set shows that 8.15% of fraudulent cases were classified incorrectly. The sensitivity of this model is 327/(327+29) = 92%, which is very good.
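The two figures quoted for the GLM can be checked by hand. The working below assumes that 327 is the number of fraudulent test ads predicted correctly (TP) and 29 the number that were missed (FN), which is consistent with the 8.15% error rate reported for the fraudulent class.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{327}{327 + 29} = \frac{327}{356} \approx 0.919,
\qquad
\text{Error rate (fraudulent class)} = \frac{FN}{TP + FN} = \frac{29}{356} \approx 0.0815
\]
```

The two values sum to 1 because they share the same denominator, the 356 fraudulent ads in the test set.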

Now, let’s look at the remaining output from the models, more specifically the top predictors for classifying fraudulent and non-fraudulent ads.

Most Important Predictors

The variable importance rank in a classification problem tells us how well a predictor variable separates fraudulent ads from non-fraudulent ads relative to all other predictors used in the model.

For both (a) GBM and (b) DRF, the top three variables in terms of how useful they are in classifying job ads as fraudulent or non-fraudulent are the same: location, company logo and industry. The two models also agree that the has questions and telecommuting variables are the least important.

Now, let’s plot the dataset to better understand how the top predictors vary between fraudulent and non-fraudulent ads.

Let’s look at the top variable, location, where we can see that a greater proportion of fraudulent than non-fraudulent ads are from the USA and Australia, as indicated by the circled bars.

Figure 10: Frequency of fraudulent vs. non-fraudulent ads by country

A greater proportion of fraudulent than non-fraudulent ads do not display a company logo.

Figure 11: Frequency of fraudulent vs. non-fraudulent ads by absence/presence of company logo in ad

Understanding the model coefficients

Now, let’s try to numerically understand the relationship between the predictor variables and the classification of ads.

As shown in the table below, the highlighted predictors are best at distinguishing fraudulent and non-fraudulent ads.

  • Of the 767 variables entered into the model, only 48 have a non-zero coefficient (top predictors shown)

  • The greater the probability, the higher the chance of the ad being fraudulent
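A likely reason that 11 categorical inputs turn into hundreds of model variables is that each factor level gets its own indicator column before the linear model is fitted. The base R sketch below shows that expansion on a made-up data frame; the column names and levels are hypothetical, and h2o performs its own encoding internally, so this is an illustration of the idea rather than of h2o's exact behaviour.

```r
# Made-up categorical predictors
ads <- data.frame(
  employment_type  = factor(c("Full-time", "Part-time", "Contract", "Full-time")),
  has_company_logo = factor(c("t", "f", "f", "t")),
  industry         = factor(c("Oil", "IT", "IT", "Retail"))
)

# One indicator column per factor level: no intercept, and contrasts
# switched off so that no level is dropped as a reference category
design <- model.matrix(
  ~ . - 1, data = ads,
  contrasts.arg = lapply(ads, contrasts, contrasts = FALSE)
)

print(colnames(design))
cat(ncol(ads), "factors expand to", ncol(design), "indicator columns\n")
```

With high-cardinality factors such as location or industry, the column count grows quickly, and a penalized GLM then shrinks most of those coefficients to zero.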

Conclusion and Next Steps

You should now have a good understanding of how to use both textual and categorical predictors in a classification problem, employing both text analytics tools and machine learning classification algorithms.

So, what can we do next?

  • A combination of textual analysis and predictive modelling should be used to classify job ads as fraudulent or non-fraudulent

To improve the accuracy of textual analysis, the following methods can be introduced:

  • N-grams modelling: Look at the combination of words that occur together to identify patterns
  • Look for trends in capitalization and punctuation
  • Look for trends in emphasized text (bold, italicized)
  • Look for trends in types of HTML tags used (raw text lists vs. list text wrapped in list elements)
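As an example of the first suggestion, bigrams (pairs of adjacent words) can be counted in base R. The bigrams helper and the two sample descriptions are hypothetical; a dedicated text-mining package would normally be used on the full dataset.

```r
# Return all pairs of adjacent words in a piece of text (hypothetical helper)
bigrams <- function(text) {
  words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", " ", text)), "\\s+"))
  words <- words[nchar(words) > 0]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))   # word i joined with word i + 1
}

# Made-up job descriptions
descriptions <- c("Earn money fast working from home",
                  "Work from home and earn money")

bigram_counts <- sort(table(unlist(lapply(descriptions, bigrams))),
                      decreasing = TRUE)
print(head(bigram_counts, 3))
```

Phrases such as "earn money" or "from home" only show up as patterns once words are counted in pairs, which single-word frequencies would miss.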

Predictive model accuracy can be improved by:

  • Working with a larger dataset
  • Splitting the dataset into three: training, test and validation sets
  • Splitting salary range into numerical variables: minimum and maximum
  • Removing variables that are associated with each other (e.g. as identified by a chi-squared test of independence)
  • Splitting location into country, state, and city
  • Reducing the number of variables by grouping industry and function categories
  • Expanding the dataset to include online behaviour, e.g. number of times the ad was clicked on, IP location, time the ad was uploaded, etc.
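Two of the suggestions above, splitting salary range and splitting location, can be sketched in base R. The formats assumed here ("min-max" for salary range and "country, state, city" for location) and the sample rows are assumptions for illustration; the real fields should be checked against the data dictionary before relying on them.

```r
# Made-up rows in the assumed formats
ads <- data.frame(
  salary_range = c("20000-28000", "", "45000-60000"),
  location     = c("US, NY, New York", "AU, NSW, Sydney", "GB, , London"),
  stringsAsFactors = FALSE
)

# Salary range -> numeric minimum and maximum (NA when the range is missing)
bounds <- strsplit(ads$salary_range, "-", fixed = TRUE)
ads$salary_min <- vapply(bounds, function(b) as.numeric(b[1]), numeric(1))
ads$salary_max <- vapply(bounds, function(b) as.numeric(b[2]), numeric(1))

# Location -> country, state and city, with surrounding spaces trimmed
parts <- lapply(strsplit(ads$location, ",", fixed = TRUE), trimws)
ads$country <- vapply(parts, function(p) p[1], character(1))
ads$state   <- vapply(parts, function(p) p[2], character(1))
ads$city    <- vapply(parts, function(p) p[3], character(1))

print(ads[, c("salary_min", "salary_max", "country", "state", "city")])
```

Splitting the fields this way turns one high-cardinality factor into several lower-cardinality ones and gives the model numeric salary bounds to work with.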

For all the code used to generate results, see my GitHub repository: https://github.com/shedoesdatascience/fraudanalytics

Translated from: https://towardsdatascience.com/identifying-fraudulent-job-advertisements-using-r-programming-230daa20aec7
