Carrier IP mapping
Accurately predicting which carriers use which IP addresses is important for Wandera’s data cost management solution. Customers with dual-SIM/eSIM devices in their fleet need to know which SIM a device is using at any given time, so that Wandera can notify the user, or take other action, when the wrong SIM is being used in the wrong situation.
We therefore investigated whether we could predict which carriers use which IP addresses using only the Wandera dataset. This dataset contains millions of IP addresses, but we have been able to determine the associated carrier for comparatively few of them. IP addresses are one-dimensional, which makes it difficult to assign a carrier without additional information. We therefore propose a novel solution for predicting carriers from IP addresses alone, which utilises the Hilbert curve, cluster analysis and classification algorithms.
The Hilbert curve is a continuous fractal space-filling curve with a locality property: if two points are ‘close’ in one dimension then they will also be ‘close’ in two dimensions. Since carriers often purchase IP addresses in ranges, we may be able to use the Hilbert curve to create clusters of addresses from the same carrier.
We can therefore use cluster analysis to determine which IP addresses belong to a single carrier; each cluster can then be labelled using the carrier labels we already have in our dataset. Labelling clusters consequently improves the coverage of labelled IP addresses in our data.
For any IP addresses which are not in our dataset we can use the clustered data to train a classifier to predict which carrier is associated with a given unlabelled IP address.
IP address to Hilbert space
Mapping IP addresses into a Hilbert space is a common technique; see, for example, visualising IP geolocation. This is achieved by first converting the IP address into numeric form: each octet is represented as an 8-bit binary string, shifted left according to its position in the address, and the results are summed to give the integer representation. Since each shift of eight bits is equivalent to multiplying by 256, the integer for a general IP address a.b.c.d is given by the formula: a·256³ + b·256² + c·256 + d.
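The conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than Wandera's own code; the function names `ip_to_int` and `ip_to_int_shift` are hypothetical and exist only to show that the multiplication formula and the bit-shifting view agree.

```python
def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to an integer using a*256^3 + b*256^2 + c*256 + d."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {ip!r}")
    a, b, c, d = octets
    return a * 256**3 + b * 256**2 + c * 256 + d


def ip_to_int_shift(ip: str) -> int:
    """Same conversion, expressed as repeated 8-bit left shifts (no validation)."""
    value = 0
    for part in ip.split("."):
        # Shifting left by 8 bits is the same as multiplying by 256.
        value = (value << 8) | int(part)
    return value


print(ip_to_int("192.168.1.1"))        # 3232235777
print(ip_to_int_shift("192.168.1.1"))  # 3232235777
```

Python's standard library offers the same conversion via `int(ipaddress.IPv4Address("192.168.1.1"))`, which may be preferable in practice.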
From the integer form of an IP address we can determine the Hilbert space coordinates by applying a mapping algorithm. This algorithm views the entire space as 4 regions arranged 2 by 2. Each region is then subdivided into 4 smaller regions, and each of those is subdivided again. This continues until the grid is 2¹⁶ by 2¹⁶, since an order 16 Hilbert curve is sufficient to display all 2³² IPv4 addresses. At each iteration the curve is rotated within each region to create a curve which fills the space.
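As a rough sketch of such a mapping, the widely used iterative distance-to-coordinate conversion for the Hilbert curve is shown below. The article does not say which implementation was used, so the function name `hilbert_d2xy` and the orientation of the curve are assumptions; a different orientation would rotate or mirror the resulting picture without affecting locality.

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map a distance d along a Hilbert curve of the given order to (x, y).

    The curve covers a 2^order by 2^order grid, so order 16 covers all
    2^32 IPv4 addresses. Orientation is an assumption of this sketch.
    """
    n = 1 << order
    if not 0 <= d < n * n:
        raise ValueError("d is outside the curve")
    x = y = 0
    t = d
    s = 1
    while s < n:
        # Each pair of bits of d selects one of the 4 quadrants at this scale.
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/reflect the sub-curve so that consecutive quadrants join up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# The order-1 curve visits the four cells of a 2x2 grid in a "U" shape.
print([hilbert_d2xy(1, d) for d in range(4)])  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Passing the integer form of an IPv4 address as `d` with `order=16` gives its position on the 65536 by 65536 grid.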
By applying this transformation to the IP addresses in our dataset we begin to see clusters forming, where IP addresses in the same range are being grouped together:
Clustering
Clustering is a Machine Learning (ML) technique that involves grouping data points. With the IP addresses in a 2D Hilbert space, we hypothesise that each distinct group of IPs can be labelled with a single carrier. Around 30% of IP addresses in our dataset have carrier labels; therefore, if we can determine clusters of IP addresses in the Hilbert space, we may be able to use this information to label the remaining IP addresses.
Algorithm
In the Hilbert space we have an unknown number of clusters, which means that we cannot use clustering algorithms such as k-Means or Mean-Shift. Our dataset also has points that appear grouped but with very different densities, meaning that density-based clustering with a single density setting, such as density-based spatial clustering of applications with noise (DBSCAN), would not be appropriate.
The algorithm which we chose for clustering our data in the Hilbert space was dynamic method density-based spatial clustering of applications with noise (DMDBSCAN). This clustering method uses a k-Nearest Neighbour (kNN) search to determine suitable parameters for the different density levels found in the dataset and then applies the DBSCAN algorithm for each determined density level.
Optimal values for the densities, ε, are found using the kNN search, where k is determined from the minimum number of points that a cluster can contain. The kNN distances can then be plotted in ascending order; sudden changes in gradient correspond to changes in density and therefore indicate suitable values of ε.
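To illustrate the idea, the sketch below computes the sorted k-th nearest neighbour distances with a brute-force search and picks candidate ε values wherever the sorted curve jumps. The function names and the `min_jump` threshold are assumptions made for this example; the article does not describe how the changes in gradient were detected, and a real dataset would need a spatial index rather than an O(n²) search.

```python
import math


def sorted_k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbour, in ascending order."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    k_distances = []
    for i, p in enumerate(points):
        # Brute-force distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_distances.append(dists[k - 1])
    return sorted(k_distances)


def candidate_eps(k_distances, min_jump):
    """Pick the value just before each jump larger than min_jump, plus the final value.

    This jump heuristic is an assumption of the sketch, standing in for
    reading 'sudden changes in gradient' off the plotted curve.
    """
    eps = [a for a, b in zip(k_distances, k_distances[1:]) if b - a > min_jump]
    eps.append(k_distances[-1])
    return eps


# A dense group (spacing 1) next to a sparse group (spacing 4).
dense = [(0, 0), (0, 1), (1, 0), (1, 1)]
sparse = [(10, 10), (10, 14), (14, 10), (14, 14)]
curve = sorted_k_distances(dense + sparse, k=2)
print(curve)                             # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0]
print(candidate_eps(curve, min_jump=1))  # [1.0, 4.0]
```

Each candidate ε would then be used for one DBSCAN pass, starting with the densest level.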
Applying the DBSCAN algorithm for each density level gives us clusters of IP addresses, which we label using the most common (mode) carrier label from the labelled IP addresses in each cluster.
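The labelling step can be sketched as a majority vote per cluster. The function name `label_clusters`, the use of `None` for a missing label, and the use of `-1` as the noise cluster id (the convention scikit-learn's DBSCAN uses) are assumptions for this example.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, known_carriers, noise_id=-1):
    """Assign every point the most common known carrier in its cluster.

    cluster_ids: cluster id per point (noise_id marks noise).
    known_carriers: carrier label per point, or None if unknown.
    Returns a list of carrier labels; noise points and clusters with no
    labelled members get None.
    """
    votes = defaultdict(Counter)
    for cid, carrier in zip(cluster_ids, known_carriers):
        if cid != noise_id and carrier is not None:
            votes[cid][carrier] += 1
    # The mode of the known labels becomes the label of the whole cluster.
    cluster_label = {cid: c.most_common(1)[0][0] for cid, c in votes.items()}
    return [None if cid == noise_id else cluster_label.get(cid) for cid in cluster_ids]


cluster_ids = [0, 0, 0, 0, 1, 1, -1]
carriers = ["A", None, "A", "B", "B", None, "A"]
print(label_clusters(cluster_ids, carriers))
# ['A', 'A', 'A', 'A', 'B', 'B', None]
```

Note that the fourth point, originally labelled "B", is overruled by the majority of its cluster, while the noise point keeps no label.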
Clustering Effect
Cross validation is used to optimise the hyper-parameters for the DMDBSCAN algorithm; we want to ensure that as many IP addresses as possible are clustered but we don’t want to generate false classifications.
The cluster analysis increases the coverage of labelled points from around 30% to over 85%, with high balanced accuracy. The remaining unlabelled points are deemed to be ‘noise’ and do not have a carrier assigned.
This increase in label coverage can be seen in the plots below. In the first plot the data is coloured by the carrier labels in our original dataset, with black points representing unlabelled IPs; in the second plot the IPs are labelled after applying the clustering algorithm. Far more of the points now have a carrier label, and the clustering has also corrected some mislabelled IP addresses. However, there are still some unlabelled points, where the DMDBSCAN algorithm cannot confidently assign a carrier label.
The DMDBSCAN algorithm is designed to differentiate between clusters of differing densities; this can be seen below:
These points in the Hilbert space have very different densities, but are close to each other. Using different density values for the DBSCAN clustering algorithm we successfully create distinct clusters which are assigned different carrier labels.
Classification
From the clustered dataset we can train a predictive ML model to predict the carrier for the remaining unlabelled IP addresses and for any IP addresses not in our dataset. The training dataset for the ML model consists of all the IP addresses for which we were able to determine carrier labels using cluster analysis.
The features which we shall use for this classification are:
- Hilbert space x-coordinate
- Hilbert space y-coordinate
These coordinates define the position of an IP address in the Hilbert space, and we hypothesise that this position is sufficient to assign a carrier label.
The classification algorithm chosen for predicting the carrier of an IP address was a Random Forest classifier. We investigated several methods (support vector classifiers, decision trees with and without boosting, logistic regression and kNN) and found that training a Random Forest classifier was the optimal approach.
Using k-fold cross validation we are able to determine an optimal set of hyper-parameters for the Random Forest classifier. Since our dataset has varying proportions of IP addresses from different carriers, it is useful to calibrate our classifier. Calibration rescales the predicted probabilities to improve the probability-based predictions for each carrier in our model.
This is a multi-class problem, hence we have confidence in a prediction if its probability, p, satisfies: p > 1 / (number of carrier labels).
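As a small illustration of this rule, the sketch below takes the predicted probabilities for one IP address and returns the top carrier only if it beats the uniform-chance threshold. The function name `confident_prediction` and the dictionary input format are assumptions for the example, not part of the original pipeline.

```python
def confident_prediction(probabilities):
    """Return the most probable carrier if p > 1 / (number of carrier labels), else None.

    probabilities: mapping of carrier label -> predicted probability,
    with one entry per carrier label known to the model.
    """
    threshold = 1 / len(probabilities)
    carrier, p = max(probabilities.items(), key=lambda item: item[1])
    return carrier if p > threshold else None


print(confident_prediction({"A": 0.6, "B": 0.3, "C": 0.1}))                # A
print(confident_prediction({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}))  # None
```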
In the Hilbert space we can visualise the classifier’s probabilities for a given carrier using a heat-map. Below is an example of the probability heat-map for a single carrier, including the labelled points for that carrier. This demonstrates how the Random Forest algorithm predicts in the regions around the training data and how the density of points in a region leads to greater confidence.
Results
Applying this process to one day of unique IP addresses from the Wandera dataset:
We find that we can quite accurately predict carriers using only IP addresses.
Being able to predict which carriers use which IP addresses means that we have increased the coverage of IP addresses with carriers from 30% to over 97%, and this figure is increasing all the time. As a result, Wandera can much more easily identify when a customer’s device is using the wrong SIM in the wrong situation and notify that device accordingly, helping to save our customers money.
The classification process defined above is regularly retrained and optimised such that we ensure the predictions continue to be accurate and the models respond to re-provisioning of IP addresses. Along with this we monitor the predictions daily, such that we can identify false classifications and remediate effectively. This helps to ensure that we provide the most accurate carrier information for our customers.
Further Reading
Hilbert Curve
Visualising IP geolocation using the Hilbert curve
A dynamic method for discovering density varied clusters
Random Forest
Translated from: https://medium.com/wandera-engineering/how-we-mapped-the-internet-to-discover-carriers-ba9a9ad586e5