Adversarial Attacks on SMS Spam Detectors

Note: The methodology behind the approach discussed in this post stems from a collaborative publication between myself and Irene Anthi.

INTRODUCTION

Spam SMS text messages often show up unexpectedly on our phone screens. That’s aggravating enough, but it gets worse. Whoever is sending you a spam text message is usually trying to defraud you. Most spam text messages don’t come from another phone. They often originate from a computer and are delivered to your phone via an email address or an instant messaging account.

There exist several security mechanisms for automatically detecting whether an email or an SMS message is spam. These approaches often rely on machine learning. However, such systems may themselves be subject to attack.

The act of deploying attacks against machine learning based systems is known as Adversarial Machine Learning (AML). The aim is to exploit the weaknesses of the pre-trained model, which may have “blind spots” between the data points it has seen during training. More specifically, by automatically introducing slight perturbations to unseen data points, a sample may cross a decision boundary and be classified as a different class. As a result, the model’s effectiveness can be significantly reduced.

In the context of SMS spam detection, AML can be used to manipulate textual data by introducing perturbations that cause spam messages to be classified as not spam, thereby bypassing the detector.

DATASET AND DATA PRE-PROCESSING

The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains 5,574 English SMS text messages, each tagged according to whether it is spam (747 messages) or not spam (4,827 messages).
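
As a rough sketch, the corpus can be loaded with pandas, assuming the UCI distribution, which ships as a tab-separated file named SMSSpamCollection; the file labels legitimate messages as ham, renamed here to not_spam to match the rest of this post:

```python
# A minimal loading sketch; the file name and column names follow the UCI
# distribution of the SMS Spam Collection and are assumptions here.
import pandas as pd

df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "message"])
df["label"] = df["label"].replace({"ham": "not_spam"})  # align with the labels used below
print(df["label"].value_counts())
```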

Let’s first cover the pre-processing steps we need to consider before diving into any machine learning. We’ll apply pre-processing techniques that are standard for most Natural Language Processing (NLP) problems. These include:

  • Convert the text to lowercase.
  • Remove punctuation.
  • Remove additional white space.
  • Remove numbers.
  • Remove stop words such as “the”, “a”, “an”, “in”.
  • Lemmatisation.
  • Tokenisation.

Python’s Natural Language Toolkit (NLTK) can handle these pre-processing requirements. After these steps, each message is reduced to a list of lowercase, lemmatised tokens.
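
As a rough sketch, assuming the messages sit in a pandas DataFrame df with a message column (a name chosen here for illustration), the pipeline might look like this:

```python
# A minimal pre-processing sketch using NLTK.
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                    # remove numbers
    text = re.sub(r"\s+", " ", text).strip()                           # collapse extra whitespace
    tokens = word_tokenize(text)                                       # tokenise
    tokens = [t for t in tokens if t not in stop_words]                # remove stop words
    return [lemmatiser.lemmatize(t) for t in tokens]                   # lemmatise

df["tokens"] = df["message"].apply(preprocess)
```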

WORD EMBEDDINGS

Word embeddings are one of the most popular representations of text vocabulary. They are capable of capturing the context of a word in a document, its semantic and syntactic similarity to surrounding words, and its relation to other words.

But how are word embeddings captured in context? Word2Vec is one of the most popular techniques for learning word embeddings, using a two-layer Neural Network. The Neural Network takes in the corpus of text, analyses it, and, for each word in the vocabulary, generates a vector of numbers that encodes important information about the meaning of the word in relation to the context in which it appears.

There are two main models: the Continuous Bag-of-Words model and the Skip-gram model. The Word2Vec Skip-gram model is a shallow Neural Network with a single hidden layer that takes in a word as input and tries to predict its surrounding context words as output.

In this case, we will be using Gensim’s Word2Vec for creating the model. Some of the important parameters are as follows:

  • size: The number of dimensions of the embeddings. The default is 100.
  • window: The maximum distance between a target word and the words around it. The default is 5.
  • min_count: The minimum count of words to consider when training the model; words occurring fewer times than this are ignored. The default is 5.
  • workers: The number of worker threads used to train the model. The default is 3.
  • sg: The training algorithm, either Continuous Bag-of-Words (0) or Skip-gram (1). The default is Continuous Bag-of-Words.
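
As a rough illustration (not necessarily the exact settings used in the original notebook), training the model on the token lists produced earlier might look like the following; note that in gensim 4.0 and later the size parameter is named vector_size:

```python
# A sketch of training Word2Vec with Gensim on the pre-processed token lists.
# Parameter values are illustrative.
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=df["tokens"].tolist(),
    vector_size=100,  # called `size` in gensim versions before 4.0
    window=5,
    min_count=1,      # SMS messages are short, so rare words are kept here
    workers=3,
    sg=1,             # 1 = Skip-gram, 0 = Continuous Bag-of-Words
)
```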

Next, we’ll see how to use the Word2Vec model to generate vectors for the documents in the dataset. Word2Vec vectors are generated for each SMS message in the training data by traversing the dataset. By simply applying the model to each word of a text message, we retrieve the word embedding vectors for those words. We then represent a message in the dataset by averaging over all of the word vectors in the text.
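
A minimal sketch of this averaging step, assuming the w2v_model and token lists from above (names are illustrative):

```python
# Represent each message as the mean of its word vectors; out-of-vocabulary
# tokens are skipped, and empty messages fall back to a zero vector.
import numpy as np

def message_vector(tokens, model, dim=100):
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

X = np.vstack([message_vector(tokens, w2v_model) for tokens in df["tokens"]])
```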

MODEL TRAINING AND CLASSIFICATION

Let’s first encode our target labels, spam and not_spam. This involves converting the categorical values to numerical values. We’ll then assign the features to the variable X and the target labels to the variable y. Lastly, we’ll split the pre-processed data into two datasets.

  • Train dataset: For training the SMS text categorisation model.

  • Test dataset: For validating the performance of the model.

To split the data into two such datasets, we’ll use Scikit-learn’s train_test_split method from the model_selection module. In this case, we’ll split the data into 70% training and 30% testing, as sketched below.
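
A sketch of the encoding and split described above; the random_state and stratify arguments are choices added here for reproducibility, not necessarily those of the original notebook:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df["label"])  # e.g. not_spam -> 0, spam -> 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```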

For the purposes of this post, we’ll use a Decision Tree classifier. In reality, you’d want to evaluate a variety of classifiers using cross-validation to determine which performs best. The “no free lunch” theorem suggests that there is no universally best learning algorithm. In other words, the choice of an appropriate algorithm should be based on its performance for that particular problem and the properties of the data that characterise the problem.

Once the model is trained, we can evaluate its performance when it tries to predict the target labels of the test set. The classification report shows that the model can predict the test samples with a high weighted-average F1-score of 0.94.
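
A minimal sketch of this training and evaluation step (hyperparameters left at scikit-learn defaults):

```python
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
print(classification_report(y_test, y_pred, target_names=encoder.classes_))
```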

GENERATING ADVERSARIAL SAMPLES

A well-known use case of AML is image classification. This involves adding noise that may be imperceptible to the human eye but nevertheless fools the classifier.

[Figure: adversarial machine learning in image classification]

There are various methods by which adversarial samples can be generated. Such methods vary in complexity, the speed of their generation, and their performance. An unsophisticated approach towards crafting such samples is to manually perturb the input data points. However, manual perturbations are slow to generate and evaluate by comparison with automatic approaches.

One of the most popular techniques for automatically generating perturbed samples is the Jacobian-based Saliency Map Attack (JSMA). The method relies on the principle that adding small perturbations to the original sample can produce a sample with adversarial characteristics, i.e. one that the targeted model now classifies differently.

The JSMA method generates perturbations using saliency maps. A saliency map identifies which features of the input data are the most relevant to the model deciding on one class or another; these features, if altered, are most likely to affect the classification of the target values. More specifically, an initial percentage of features (gamma) is chosen to be perturbed by an amount of noise (theta). The attack then checks whether the added noise has caused the targeted model to misclassify. If it has not, another set of features is selected and a new iteration occurs, until a saliency map is found that can be used to generate an adversarial sample.

A pre-trained MLP is used as the underlying model for the generation of adversarial samples. Here, we explore how different combinations of the JSMA parameters affect the performance of the originally trained Decision Tree.
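
A sketch of this step, assuming IBM’s Adversarial Robustness Toolbox (ART) and a small Keras MLP as the surrogate model; the original notebook may use a different library, and the architecture below is illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from art.attacks.evasion import SaliencyMapMethod
from art.estimators.classification import KerasClassifier

tf.compat.v1.disable_eager_execution()  # ART's KerasClassifier expects graph mode

# Surrogate MLP trained on the same Word2Vec features (illustrative architecture)
y_train_onehot = keras.utils.to_categorical(y_train, 2)
mlp = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    keras.layers.Dense(2, activation="softmax"),
])
mlp.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
mlp.fit(X_train, y_train_onehot, epochs=20, batch_size=32, verbose=0)

# Wrap the MLP and run JSMA over the spam samples in the test set
surrogate = KerasClassifier(model=mlp, clip_values=(float(X.min()), float(X.max())))
attack = SaliencyMapMethod(surrogate, theta=0.5, gamma=0.3)
spam_mask = y_test == 1                  # assuming 1 == spam after label encoding
X_adv_spam = attack.generate(x=X_test[spam_mask])
```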

EVALUATION

To explore how different combinations of the JSMA parameters affect the performance of the trained Decision Tree, adversarial samples were generated from all spam data points present in the testing data using a range of combinations of gamma and theta. The adversarial samples were then joined with the non-spam testing data points and presented to the trained model. The heat map reports the overall weighted-average F1-scores for all adversarial combinations of JSMA’s gamma and theta parameters.

[Figure: heat map of weighted-average F1-scores across combinations of JSMA’s gamma and theta parameters]
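
A sketch of the evaluation grid behind the heat map, reusing the surrogate model, attack, and Decision Tree from the previous sketches; the grid values are illustrative:

```python
# For each (gamma, theta) pair, craft adversarial spam samples, merge them with
# the untouched non-spam test points, and score the original Decision Tree.
import numpy as np
from sklearn.metrics import f1_score

gammas = [0.1, 0.3, 0.5, 0.7]
thetas = [0.1, 0.3, 0.5, 0.7]
results = {}

for gamma in gammas:
    for theta in thetas:
        attack = SaliencyMapMethod(surrogate, theta=theta, gamma=gamma)
        x_adv = attack.generate(x=X_test[spam_mask])

        x_eval = np.vstack([x_adv, X_test[~spam_mask]])
        y_eval = np.concatenate([y_test[spam_mask], y_test[~spam_mask]])
        results[(gamma, theta)] = f1_score(y_eval, dt.predict(x_eval), average="weighted")

print(results)
```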

The classification performance of the Decision Tree model decreased across all combinations of the gamma and theta parameters. When gamma = 0.3 and theta = 0.5, the model’s classification performance decreased by 18 percentage points (F1-score = 0.759). In this case, based on this dataset, gamma = 0.3 and theta = 0.5 would be the optimal parameters to use to successfully reduce the accuracy of a machine learning based SMS spam detector.

CONCLUSION

So, what have I learnt from this analysis?

Due to their effectiveness and flexibility, machine learning based detectors are now recognised as fundamental tools for detecting whether SMS text messages are spam or not. Nevertheless, such systems are vulnerable to attacks that may severely undermine or mislead their capabilities. Adversarial attacks may have severe consequences in such infrastructures, as SMS texts may be modified to bypass the detector.

The next steps would be to explore how such samples can be used to improve the robustness of supervised models through adversarial training. This entails including adversarial samples in the training dataset, re-training the model, and evaluating its performance on all adversarial combinations of JSMA’s gamma and theta parameters.
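
As a brief sketch of what that retraining step could look like, reusing the attack object and Decision Tree from the earlier sketches and crafting adversarial versions of the training spam (so the test set stays untouched):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Craft adversarial versions of the spam messages in the training data
X_adv_train = attack.generate(x=X_train[y_train == 1])

# Fold them back in with their original (spam) labels and re-fit the model
X_train_aug = np.vstack([X_train, X_adv_train])
y_train_aug = np.concatenate([y_train, np.ones(len(X_adv_train), dtype=int)])

dt_robust = DecisionTreeClassifier(random_state=42)
dt_robust.fit(X_train_aug, y_train_aug)
```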

For the full notebook, check out my GitHub repo below: https://github.com/LowriWilliams/SMS_Adversarial_Machine_Learning

Original article: https://towardsdatascience.com/adversarial-attacks-on-sms-spam-detectors-12b16f1e748e
