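Anomaly detection in transactions means identifying unusual or unexpected patterns in transactions or related activity. These patterns, called anomalies or outliers, deviate markedly from the expected norm and may indicate irregular or fraudulent behavior.
Anomaly detection plays an important role in many businesses, especially those involving financial transactions, online activity, and security-sensitive operations.
We can follow a systematic process to tackle the problem. We start by collecting and preparing the transaction data, making sure it is accurate and consistent. We then explore the data for patterns that hint at anomalies, and finally apply a dedicated anomaly detection algorithm (here, Isolation Forest) to detect them.
First, import the required packages and load the dataset.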
import pandas as pd
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
data = pd.read_csv("../data/transaction_anomalies_dataset.csv")
data.head()
Now check whether the data contains any null values:
data.isnull().sum()
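If the check above reports missing values, they need to be handled before modelling. The following is a minimal sketch on an invented toy frame (only the column names mirror the dataset). The imputation strategy, median for numeric columns and most frequent value for text columns, is an assumption chosen for illustration, not something the original workflow prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Transaction_Amount": [100.0, np.nan, 250.0, 90.0],
    "Account_Type": ["Savings", "Current", None, "Savings"],
})

# Count missing values per column, as data.isnull().sum() does above.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Assumed strategy: median for numeric columns, most frequent value otherwise.
cleaned = toy.copy()
for column in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[column]):
        cleaned[column] = cleaned[column].fillna(cleaned[column].median())
    else:
        cleaned[column] = cleaned[column].fillna(cleaned[column].mode().iloc[0])

# Total number of missing cells left after imputation.
print(cleaned.isnull().sum().sum())
```

Dropping incomplete rows with `dropna()` is an equally simple alternative when only a few rows are affected.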
Before moving on, look at the basic information for each column:
data.info()
And the summary statistics for each numeric column:
data.describe()
Now let's look at the distribution of transaction amounts:
fig_amount = px.histogram(data, x='Transaction_Amount',
                          nbins=20,
                          title='Distribution of Transaction Amount')
fig_amount.show()
Next, the distribution of transaction amounts for each account type:
fig_box_amount = px.box(data,
                        x='Account_Type',
                        y='Transaction_Amount',
                        title='Transaction Amount by Account Type')
fig_box_amount.show()
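The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the common 1.5 × IQR rule per account type on invented data. Plotly may compute quartiles slightly differently from pandas, so the flagged points are not guaranteed to match the plot exactly; the column `Beyond_Whiskers` is a hypothetical name used only here.

```python
import pandas as pd

# Invented amounts for two account types; one Savings value is extreme.
toy = pd.DataFrame({
    "Account_Type": ["Savings"] * 6 + ["Current"] * 6,
    "Transaction_Amount": [100.0, 110.0, 105.0, 95.0, 102.0, 400.0,
                           200.0, 210.0, 190.0, 205.0, 195.0, 198.0],
})

# Per-group quartiles, broadcast back to the original rows.
grouped = toy.groupby("Account_Type")["Transaction_Amount"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A point lies beyond the whiskers if it falls outside the 1.5 * IQR fences.
toy["Beyond_Whiskers"] = (
    (toy["Transaction_Amount"] < q1 - 1.5 * iqr)
    | (toy["Transaction_Amount"] > q3 + 1.5 * iqr)
)
print(toy[toy["Beyond_Whiskers"]])
```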
Now let's look at the average transaction amount by age:
# trendline='ols' requires the statsmodels package to be installed
fig_scatter_avg_amount_age = px.scatter(data, x='Age',
                                        y='Average_Transaction_Amount',
                                        color='Account_Type',
                                        title='Average Transaction Amount vs. Age',
                                        trendline='ols')
fig_scatter_avg_amount_age.show()
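A trendline can be read more precisely by looking at the fitted slope and the correlation coefficient. The sketch below does this with NumPy on synthetic data in which the amount is independent of age by construction; the numbers are invented and say nothing about the real dataset.

```python
import numpy as np

# Synthetic data: the average amount is generated independently of age.
rng = np.random.default_rng(7)
age = rng.integers(18, 70, size=500)
avg_amount = rng.normal(1000, 50, size=500)

# Least-squares line: avg_amount ~ slope * age + intercept.
slope, intercept = np.polyfit(age, avg_amount, deg=1)

# Pearson correlation between the two variables.
r = np.corrcoef(age, avg_amount)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.1f}, r={r:.3f}")
```

A slope close to zero together with a correlation close to zero is what a flat trendline looks like in numbers.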
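The average transaction amount shows no noticeable difference across ages. Now let's look at the number of transactions on each day of the week:
fig_day_of_week = px.bar(data, x='Day_of_Week',
                         title='Count of Transactions by Day of the Week')
fig_day_of_week.show()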
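The same counts can be computed directly with pandas, which also makes it easy to put the days in calendar order. This sketch assumes that `Day_of_Week` stores full English weekday names; the toy values are invented.

```python
import pandas as pd

# Invented sample; assumes full English weekday names in Day_of_Week.
toy = pd.DataFrame({
    "Day_of_Week": ["Monday", "Friday", "Monday", "Sunday", "Friday", "Monday"],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count rows per day, then reorder by calendar and fill absent days with 0.
counts = toy["Day_of_Week"].value_counts().reindex(weekday_order, fill_value=0)
print(counts)
```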
Now let's look at the correlations between some of the transaction-related columns:
correlation_matrix = data[['Transaction_Amount',
                           'Transaction_Volume',
                           'Average_Transaction_Amount',
                           'Frequency_of_Transactions',
                           'Time_Since_Last_Transaction',
                           'Age',
                           'Income']].corr()
fig_corr_heatmap = px.imshow(correlation_matrix,
                             title='Correlation Heatmap')
fig_corr_heatmap.show()
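A heatmap is easy to scan but hard to rank. As a complement, the sketch below lists column pairs by absolute correlation, keeping each pair once. The data is synthetic, with one pair deliberately made strongly correlated, so the output illustrates the technique only.

```python
import numpy as np
import pandas as pd

# Synthetic data: the second column is built from the first, Age is independent.
rng = np.random.default_rng(0)
n = 200
amount = rng.normal(1000, 200, n)
toy = pd.DataFrame({
    "Transaction_Amount": amount,
    "Average_Transaction_Amount": amount * 0.9 + rng.normal(0, 20, n),
    "Age": rng.integers(18, 70, n),
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```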
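Now we can visualize the anomalies in the data. Here a transaction is flagged as an anomaly when its amount is more than two standard deviations above the mean: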
# Calculate mean and standard deviation of Transaction Amount
mean_amount = data['Transaction_Amount'].mean()
std_amount = data['Transaction_Amount'].std()
# Define the anomaly threshold (2 standard deviations from the mean)
anomaly_threshold = mean_amount + 2 * std_amount
# Flag anomalies
data['Is_Anomaly'] = data['Transaction_Amount'] > anomaly_threshold
# Scatter plot of Transaction Amount with anomalies highlighted
fig_anomalies = px.scatter(data, x='Transaction_Amount', y='Average_Transaction_Amount',
                           color='Is_Anomaly', title='Anomalies in Transaction Amount')
# Enlarge the markers on every scatter trace
fig_anomalies.update_traces(marker=dict(size=12),
                            selector=dict(mode='markers'))
fig_anomalies.show()
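The rule above only flags unusually large amounts. The sketch below reproduces it on a small invented series and, as an assumed variation, also shows a two-sided version that would catch unusually small amounts.

```python
import pandas as pd

# Invented amounts: nine ordinary values and one extreme value.
amounts = pd.Series([100, 102, 98, 101, 99, 103, 97, 100, 500, 101], dtype=float)

mean_amount = amounts.mean()
std_amount = amounts.std()

# One-sided rule, as in the text: more than 2 standard deviations above the mean.
upper = mean_amount + 2 * std_amount
is_high = amounts > upper

# Assumed variation: flag values more than 2 standard deviations away on either side.
lower = mean_amount - 2 * std_amount
is_outlier = (amounts > upper) | (amounts < lower)

print(amounts[is_high])
```

Note that extreme values inflate the mean and the standard deviation themselves, so this kind of threshold becomes less sensitive as the data gets more contaminated.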
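Now we can count the anomalies and compute their share of the data. This ratio is useful when configuring anomaly detection algorithms such as Isolation Forest: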
num_anomalies = data['Is_Anomaly'].sum()
# Calculate the total number of instances in the dataset
total_instances = data.shape[0]
# Calculate the ratio of anomalies
anomaly_ratio = num_anomalies / total_instances
print(anomaly_ratio)  # about 0.02 for this dataset, the value passed as contamination below
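The ratio can be passed to the model instead of a hard-coded number. The sketch below shows this on synthetic one-dimensional data. Since scikit-learn requires a numeric `contamination` to lie in (0, 0.5], the value is clamped first; the lower bound of 0.001 is an arbitrary assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic amounts: 490 ordinary values and 10 extreme ones.
rng = np.random.default_rng(42)
normal = rng.normal(1000, 50, size=(490, 1))
extreme = rng.normal(3000, 50, size=(10, 1))
X = np.vstack([normal, extreme])

# Same rule as above: more than 2 standard deviations above the mean.
threshold = X.mean() + 2 * X.std()
ratio = float((X > threshold).mean())

# Clamp into the range scikit-learn accepts for a numeric contamination.
contamination = min(max(ratio, 1e-3), 0.5)

model = IsolationForest(contamination=contamination, random_state=42).fit(X)

# Share of training points the fitted model labels as anomalies (-1).
flagged_share = float((model.predict(X) == -1).mean())
print(ratio, flagged_share)
```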
Now we can train a machine learning model to detect anomalies:
relevant_features = ['Transaction_Amount',
'Average_Transaction_Amount',
'Frequency_of_Transactions']
# Split data into features (X) and target variable (y)
X = data[relevant_features]
y = data['Is_Anomaly']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Isolation Forest model
model = IsolationForest(contamination=0.02, random_state=42)
model.fit(X_train)
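Here we train an anomaly detection model using the Isolation Forest algorithm.
First, we select the features relevant to detection: Transaction_Amount, Average_Transaction_Amount, and Frequency_of_Transactions.
We split the dataset into features (X) and a target variable (y), where X holds the selected features and y holds the binary labels produced by the threshold rule above. We then split the data into training and test sets using an 80-20 ratio. Next, we create an Isolation Forest model with a contamination parameter of 0.02 (the expected share of anomalies) and a random seed for reproducibility, and fit it on the training set (X_train). Isolation Forest is unsupervised, so the labels are not used during fitting; they are only used for evaluation.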
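To make the model's output more concrete, the sketch below fits an Isolation Forest on synthetic two-dimensional data with a single obvious outlier and inspects both the continuous scores and the hard labels. The data and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 300 points around the origin plus one far-away point.
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(300, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([inliers, outlier])

forest = IsolationForest(contamination=0.02, random_state=0).fit(X)

# decision_function: lower means more anomalous; negative values are flagged.
scores = forest.decision_function(X)

# predict: -1 for anomalies, 1 for normal points.
labels = forest.predict(X)

print("score of the far-away point:", scores[-1])
print("label of the far-away point:", labels[-1])
print("number of flagged points:", int((labels == -1).sum()))
```

The `contamination` value only moves the cut-off on these scores; the ranking of points from most to least anomalous stays the same.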
Now let's look at how this anomaly detection model performs:
# Predict anomalies on the test set
y_pred = model.predict(X_test)
# Convert predictions to binary values (0: normal, 1: anomaly)
y_pred_binary = [1 if pred == -1 else 0 for pred in y_pred]
# Evaluate the model's performance
report = classification_report(y_test, y_pred_binary, target_names=['Normal', 'Anomaly'])
print(report)
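The numbers in a classification report come from four counts: true negatives, false positives, false negatives, and true positives. The sketch below derives them for a small hand-made example, using the same -1/1 to 1/0 conversion as above. The labels are invented and unrelated to the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented reference labels (1: anomaly) and raw Isolation Forest style predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred_raw = [1, 1, 1, 1, -1, 1, -1, -1, 1, 1]

# Same conversion as in the text: -1 becomes 1 (anomaly), 1 becomes 0 (normal).
y_pred = [1 if pred == -1 else 0 for pred in y_pred_raw]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(tn, fp, fn, tp, precision, recall)
```

In this tutorial the reference labels come from the two-standard-deviation rule, so the report measures agreement with that rule rather than with confirmed fraud cases.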
Finally, we use the trained model to check a new transaction entered by the user:
# Relevant features used during training
relevant_features = ['Transaction_Amount', 'Average_Transaction_Amount', 'Frequency_of_Transactions']
# Get user inputs for features
user_inputs = []
for feature in relevant_features:
    user_input = float(input(f"Enter the value for '{feature}': "))
    user_inputs.append(user_input)
# Create a DataFrame from user inputs
user_df = pd.DataFrame([user_inputs], columns=relevant_features)
# Predict anomalies using the model
user_anomaly_pred = model.predict(user_df)
# Convert the prediction to binary value (0: normal, 1: anomaly)
# predict() returns an array, so take its single element before comparing
user_anomaly_pred_binary = 1 if user_anomaly_pred[0] == -1 else 0
if user_anomaly_pred_binary == 1:
    print("Anomaly detected: This transaction is flagged as an anomaly.")
else:
    print("No anomaly detected: This transaction is normal.")
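For use outside an interactive session, the same check can be wrapped in a function. The sketch below is one possible shape: `is_anomalous` and `demo_model` are hypothetical names, and the model is fitted on invented data so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

FEATURES = ["Transaction_Amount",
            "Average_Transaction_Amount",
            "Frequency_of_Transactions"]


def is_anomalous(model, values):
    """Return True if the model flags the transaction described by `values`.

    `values` is a dict mapping each feature name to a number.
    """
    row = pd.DataFrame([values], columns=FEATURES)
    return bool(model.predict(row)[0] == -1)


# Invented training data, only so the sketch is self-contained.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "Transaction_Amount": rng.normal(1000, 100, 500),
    "Average_Transaction_Amount": rng.normal(1000, 50, 500),
    "Frequency_of_Transactions": rng.integers(5, 20, 500),
})
demo_model = IsolationForest(contamination=0.02, random_state=42).fit(train)

typical = {"Transaction_Amount": 1000.0,
           "Average_Transaction_Amount": 1000.0,
           "Frequency_of_Transactions": 12}
extreme = {"Transaction_Amount": 9000.0,
           "Average_Transaction_Amount": 5000.0,
           "Frequency_of_Transactions": 100}

typical_flag = is_anomalous(demo_model, typical)
extreme_flag = is_anomalous(demo_model, extreme)
print(typical_flag, extreme_flag)
```

Passing the values as a dict keyed by feature name avoids depending on the order in which the numbers are supplied.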
The dataset and code are available at:
https://github.com/realyinchen/MachineLearning