Training Workflow of a Supervised Learning Model
perming is a machine learning package built mainly for Windows systems with CUDA acceleration. Based on the perceptron model, it offers a solution for linearly non-separable datasets distributed in Euclidean space, and is designed on top of PyTorch's predefined callable functions as a general supervised learner for large-scale structured datasets. Since v1.4.2 it can monitor the change of the validation loss at a configurable interval and stop training early.
pip install perming --upgrade
pip install "perming>=1.4.2"
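After installation, a quick way to confirm the resolved version is the standard library's importlib.metadata; this is a minimal sketch independent of any attribute perming itself may expose:

from importlib.metadata import version  # Python 3.8+ standard library
print(version('perming'))               # e.g. 1.6.0; should be >= 1.4.2 for early stopping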
Feature Input after Data Cleaning
In a typical automated machine learning pipeline, a raw structured dataset passes through a series of functional data-cleaning operations to produce a feature dataset of fixed feature dimension. Since there is no dedicated way to detect linear non-separability in that feature dataset, nor to specify the corresponding linearly separable space, the user needs to specify the size of the potential linearly separable space together with a set of combined learning parameters. The following is an example of machine learning training with perming.Box:
import numpy
import pandas
df = pandas.read_csv('../data/bitcoin_heist_data.csv')
df = df.to_numpy()
labels = df[:,-1] # label column (last column of the table)
features = df[:,1:-1].astype(numpy.float64) # feature columns (first column dropped)
Download the dataset here.
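Before handing the arrays to perming, a few sanity checks help confirm that the cleaning step produced the expected shapes. This sketch assumes the layout used above: the first column is dropped as an identifier, the last column is the label, and the remaining eight columns are numeric features:

# hypothetical sanity checks on the cleaned arrays (not part of the perming API)
assert features.ndim == 2 and features.shape[1] == 8  # eight numeric feature columns
assert labels.shape[0] == features.shape[0]           # one label per sample
print(features.shape, numpy.unique(labels).size)      # expect 29 distinct classes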
Load perming and Configure the Hyperparameters
import perming # v1.6.0
main = perming.Box(8, 29, (60,), batch_size=256, activation='relu', inplace_on=True, solver='sgd', learning_rate_init=0.01)
main.print_config()
MLP(
  (mlp): Sequential(
    (Linear0): Linear(in_features=8, out_features=60, bias=True)
    (Activation0): ReLU(inplace=True)
    (Linear1): Linear(in_features=60, out_features=29, bias=True)
  )
)
Out[1]: OrderedDict([('torch -v', '1.7.1+cu101'),
                     ('criterion', CrossEntropyLoss()),
                     ('batch_size', 256),
                     ('solver', SGD (
                      Parameter Group 0
                          dampening: 0
                          lr: 0.01
                          momentum: 0
                          nesterov: False
                          weight_decay: 0
                      )),
                     ('lr_scheduler', None),
                     ('device', device(type='cuda'))])
Refer here for the parameter documentation of each model.
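The Out[1] display above suggests that print_config also returns the hyperparameter OrderedDict besides printing the network; assuming that is the case, the entries can be captured and read programmatically (keys taken from the output shown):

config = main.print_config()   # assumed to return the OrderedDict shown in Out[1]
print(config['batch_size'])    # 256
print(config['device'])        # device(type='cuda')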
Load the Dataset from numpy.ndarray with Multi-threading
main.data_loader(features, labels, random_seed=0)
# refer to main.data_loader.__doc__ for more information on the default parameters
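As the comments suggest, the full set of default keyword arguments for each stage can be read from the docstrings at runtime:

print(main.data_loader.__doc__)  # default parameters of the loading stage
print(main.train_val.__doc__)    # default parameters of the training stage below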
Training Stage and Accelerated Validation
main.train_val(num_epochs=1, interval=100, early_stop=True)
# refer to `main.train_val.__doc__` for more information on default parameters such as tolerance and patience
Epoch [1/1], Step [100/3277], Training Loss: 2.5657, Validation Loss: 2.5551
Epoch [1/1], Step [200/3277], Training Loss: 1.8318, Validation Loss: 1.8269
Epoch [1/1], Step [300/3277], Training Loss: 1.2668, Validation Loss: 1.2844
Epoch [1/1], Step [400/3277], Training Loss: 0.9546, Validation Loss: 0.9302
Epoch [1/1], Step [500/3277], Training Loss: 0.7440, Validation Loss: 0.7169
Epoch [1/1], Step [600/3277], Training Loss: 0.5863, Validation Loss: 0.5889
Epoch [1/1], Step [700/3277], Training Loss: 0.5062, Validation Loss: 0.5086
Epoch [1/1], Step [800/3277], Training Loss: 0.3308, Validation Loss: 0.4563
Epoch [1/1], Step [900/3277], Training Loss: 0.3079, Validation Loss: 0.4204
Epoch [1/1], Step [1000/3277], Training Loss: 0.4298, Validation Loss: 0.3946
Epoch [1/1], Step [1100/3277], Training Loss: 0.3918, Validation Loss: 0.3758
Epoch [1/1], Step [1200/3277], Training Loss: 0.4366, Validation Loss: 0.3618
Process stop at epoch [1/1] with patience 10 within tolerance 0.001
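The stop message above reports patience 10 and tolerance 0.001, which appear to be the defaults referred to in train_val.__doc__; assuming those keyword names, a variant that sets them explicitly might look like this (values are illustrative):

# hedged sketch: pass the early-stopping controls explicitly instead of relying on defaults
main.train_val(num_epochs=2, interval=100, early_stop=True, tolerance=0.001, patience=10)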
Predict and Evaluate the Model with the Built-in Returned Items
main.test()
# the default parameters of main.test only take effect for classification problems with a one-dimensional label column
# since loss metrics are broad and numerous, the strategies in torcheval are taken as the main reference
loss of Box on the 104960 test dataset: 0.3505959212779999.
Out[2]: OrderedDict([('problem', 'classification'),('accuracy', '95.99942835365853%'),('num_classes', 29),('column', ('label name', ('true numbers', 'total numbers'))),('labels',{'montrealAPT': [100761, 104857],'montrealComradeCircle': [100761, 104857],'montrealCryptConsole': [100761, 104857],'montrealCryptXXX': [100761, 104857],'montrealCryptoLocker': [100761, 104857],'montrealCryptoTorLocker2015': [100761, 104857],'montrealDMALocker': [100761, 104857],'montrealDMALockerv3': [100761, 104857],'montrealEDA2': [100761, 104857],'montrealFlyper': [100761, 104857],'montrealGlobe': [100761, 104857],'montrealGlobeImposter': [100761, 104857],'montrealGlobev3': [100761, 104857],'montrealJigSaw': [100761, 104857],'montrealNoobCrypt': [100761, 104857],'montrealRazy': [100761, 104857],'montrealSam': [100761, 104857],'montrealSamSam': [100761, 104857],'montrealVenusLocker': [100761, 104857],'montrealWannaCry': [100761, 104857],'montrealXLocker': [100761, 104857],'montrealXLockerv5.0': [100761, 104857],'montrealXTPLocker': [100761, 104857],'paduaCryptoWall': [100761, 104857],'paduaJigsaw': [100761, 104857],'paduaKeRanger': [100761, 104857],'princetonCerber': [100761, 104857],'princetonLocky': [100761, 104857],'white': [100761, 104857]}),('loss',{'train': 0.330683171749115,'val': 0.3547004163265228,'test': 0.3505959212779999}),('sorted',[('montrealAPT', [100761, 104857]),('montrealComradeCircle', [100761, 104857]),('montrealCryptConsole', [100761, 104857]),('montrealCryptXXX', [100761, 104857]),('montrealCryptoLocker', [100761, 104857]),('montrealCryptoTorLocker2015', [100761, 104857]),('montrealDMALocker', [100761, 104857]),('montrealDMALockerv3', [100761, 104857]),('montrealEDA2', [100761, 104857]),('montrealFlyper', [100761, 104857]),('montrealGlobe', [100761, 104857]),('montrealGlobeImposter', [100761, 104857]),('montrealGlobev3', [100761, 104857]),('montrealJigSaw', [100761, 104857]),('montrealNoobCrypt', [100761, 104857]),('montrealRazy', [100761, 104857]),('montrealSam', [100761, 104857]),('montrealSamSam', [100761, 104857]),('montrealVenusLocker', [100761, 104857]),('montrealWannaCry', [100761, 104857]),('montrealXLocker', [100761, 104857]),('montrealXLockerv5.0', [100761, 104857]),('montrealXTPLocker', [100761, 104857]),('paduaCryptoWall', [100761, 104857]),('paduaJigsaw', [100761, 104857]),('paduaKeRanger', [100761, 104857]),('princetonCerber', [100761, 104857]),('princetonLocky', [100761, 104857]),('white', [100761, 104857])])])
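Out[2] indicates that main.test() returns the evaluation OrderedDict, so individual items can be pulled out for downstream reporting; a minimal sketch using the keys visible above:

result = main.test()           # assumed to return the OrderedDict shown in Out[2]
print(result['accuracy'])      # e.g. '95.99942835365853%'
print(result['loss']['test'])  # test loss as a float
print(result['num_classes'])   # 29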
Save the Model Parameters Locally
main.save(con=False, dir='../models/bitcoin.ckpt')
# use main.unique and main.indices to establish the bidirectional conversion of labels
Load the Model Parameters into the Pretrained Algorithm
main.load(con=False, dir='../models/bitcoin.ckpt')
After loading the model, its training can be fine-tuned by changing the combined training parameters, such as the optimizer. See Multi-classification Task.ipynb for the model training file.
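A hedged fine-tuning sketch under these assumptions: the learner is re-created with a different solver (here 'adam', assuming it is an accepted value) and a smaller learning rate, the saved weights are restored, and training continues on the same data; only calls already shown in this document are used:

main = perming.Box(8, 29, (60,), batch_size=256, activation='relu', inplace_on=True,
                   solver='adam', learning_rate_init=0.001)  # 'adam' is an assumed solver value
main.load(con=False, dir='../models/bitcoin.ckpt')           # restore the saved parameters
main.data_loader(features, labels, random_seed=0)            # reuse the cleaned arrays
main.train_val(num_epochs=1, interval=100, early_stop=True)  # continue training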
Other Common Model Initialization Settings
main = perming.Box(10, 3, (30,), batch_size=8, activation='relu', inplace_on=True, solver='sgd', criterion="MultiLabelSoftMarginLoss", learning_rate_init=0.01)
# for multi-label ranking problems; after the user defines the bidirectional conversion of labels, data_loader can detect, split, and wrap the dataset
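For the multi-label setting above (3 output units), the labels are typically encoded as a multi-hot matrix before loading; this is a hedged, framework-independent sketch of that conversion step, with illustrative tag lists rather than real data:

import numpy

# hypothetical multi-hot encoding for 3 output units; raw_tags is illustrative only
raw_tags = [[0, 2], [1], [0, 1, 2]]                  # per-sample lists of active class indices
multi_hot = numpy.zeros((len(raw_tags), 3), dtype=numpy.float64)
for row, tags in enumerate(raw_tags):
    multi_hot[row, tags] = 1.0                       # mark the active labels for each sample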
Access the tests and algorithms of this software as follows:
git clone https://github.com/linjing-lab/easy-pytorch.git
cd easy-pytorch/released_box