NLPAUG
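This Python library helps you augment NLP data for your machine learning projects. See this introduction to learn about Data Augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that combines multiple augmenters.
Starter Guides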
Augmenter
| Target | Augmenter | Action | Description |
| --- | --- | --- | --- |
| Character | RandomAug | insert | Insert character randomly |
| | | substitute | Substitute character randomly |
| | | swap | Swap character randomly |
| | | delete | Delete character randomly |
| | OcrAug | substitute | Simulate OCR engine error |
| | KeyboardAug | substitute | Simulate keyboard distance error |
| Word | RandomWordAug | swap | Swap word randomly |
| | | delete | Delete word randomly |
| | SpellingAug | substitute | Substitute word according to a spelling mistake dictionary |
| | WordNetAug | substitute | Substitute word according to WordNet's synonyms |
| | WordEmbsAug | insert | Insert word randomly from word2vec, GloVe or fasttext dictionary |
| | | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
| | TfIdfAug | insert | Insert word randomly from a trained TF-IDF model |
| | | substitute | Substitute word based on TF-IDF score |
| | BertAug | insert | Insert word by feeding surrounding words to the BERT language model |
| | | substitute | Substitute word by feeding surrounding words to the BERT language model |
| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
| | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
| Audio | NoiseAug | substitute | Inject noise |
| | PitchAug | substitute | Adjust audio's pitch |
| | ShiftAug | substitute | Shift time dimension forward/backward |
| | SpeedAug | substitute | Adjust audio's speed |
| | CropAug | delete | Delete audio's segment |
| | LoudnessAug | substitute | Adjust audio's volume |
| | MaskAug | substitute | Mask audio's segment |
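To make the four character-level actions in the table concrete, here is a minimal pure-Python sketch of what insert, substitute, swap and delete could look like. It is not nlpaug's implementation and does not call the library; the function name `random_char_augment`, its parameters and the lowercase-only alphabet are assumptions made for illustration.

```python
import random


def random_char_augment(text, action, rng=None):
    """Apply one random character-level edit to `text`.

    Hypothetical helper (not part of nlpaug). `action` is one of
    "insert", "substitute", "swap" or "delete".
    """
    rng = rng or random.Random()
    alphabet = "abcdefghijklmnopqrstuvwxyz"  # assumed candidate characters
    chars = list(text)
    if not chars:
        return text  # nothing to edit in an empty string

    i = rng.randrange(len(chars))
    if action == "insert":
        # Insert a random character before position i.
        chars.insert(i, rng.choice(alphabet))
    elif action == "substitute":
        # Replace the character at position i (may pick the same character).
        chars[i] = rng.choice(alphabet)
    elif action == "swap":
        # Swap two adjacent characters; needs at least two characters.
        if len(chars) < 2:
            return text
        j = rng.randrange(len(chars) - 1)
        chars[j], chars[j + 1] = chars[j + 1], chars[j]
    elif action == "delete":
        # Remove the character at position i.
        del chars[i]
    else:
        raise ValueError("unknown action: %r" % (action,))
    return "".join(chars)


# Usage: pass a seeded generator to get reproducible edits.
rng = random.Random(0)
for act in ("insert", "substitute", "swap", "delete"):
    print(act, "->", random_char_augment("augmentation", act, rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which is useful when augmented data has to be regenerated for an experiment.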
Flow
| Pipeline | Description |
| --- | --- |
| Sequential | Apply list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
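The difference between the two pipelines can be sketched in a few lines of plain Python. The classes below only mirror the descriptions in the table; they are not nlpaug's classes. Treating augmenters as plain callables, the `augment` method name, and the `p` and `rng` parameters are all assumptions of this sketch.

```python
import random


class Sequential:
    """Hypothetical pipeline: apply every augmenter, in order."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, text):
        for aug in self.augmenters:
            text = aug(text)
        return text


class Sometimes:
    """Hypothetical pipeline: apply each augmenter with probability `p`."""

    def __init__(self, augmenters, p=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = rng or random.Random()

    def augment(self, text):
        for aug in self.augmenters:
            # rng.random() is in [0.0, 1.0), so p=0 never applies
            # an augmenter and p=1 always applies it.
            if self.rng.random() < self.p:
                text = aug(text)
        return text


# Usage with two toy augmenters (plain functions standing in for real ones).
def shout(s):
    return s.upper()


def exclaim(s):
    return s + "!"


print(Sequential([shout, exclaim]).augment("hello world"))
print(Sometimes([shout, exclaim], p=0.5, rng=random.Random(1)).augment("hello world"))
```

Because both classes expose the same `augment` method, a pipeline in this sketch can itself be passed as a step of another pipeline by handing over its bound `augment` method.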
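Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
pip install nlpaug
Or install the latest version (including beta features) directly from GitHub:
pip install git+https://github.com/makcedward/nlpaug.git
If you use BertAug, also install the following dependencies:
pip install pytorch_pretrained_bert torch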
If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.')  # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')  # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')  # Download fasttext model
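Recent Changes
BETA Aug 16, 2019:
Added new augmenters (CropAug, LoudnessAug, MaskAug)
QwertyAug is deprecated. It will be replaced by KeyboardAug
Removed StopWordsAug. It will be replaced by RandomWordAug
Code refactoring
Added model download function for word2vec, GloVe and fasttext
0.0.6 Jul 29, 2019:
See the changelog for more details.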
Test
Word2vec, GloVe and fasttext models are used in word insertion and substitution. These model files are required to run the test cases. Add a ".env" file to the root directory with the following content:
- MODEL_DIR={MODEL FILE PATH}
The model folder should be structured as follows:
-- root directory
- glove.6B.50d.txt
- GoogleNews-vectors-negative300.bin
- wiki-news-300d-1M.vec
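Before running the tests, it can help to confirm that the ".env" file and the model folder match the layout above. The following is a minimal sketch using only the standard library. It is not part of nlpaug: the helper names `read_model_dir` and `missing_model_files` are hypothetical, and tolerating a leading "- " before `MODEL_DIR=` is an assumption based on how the line is written above.

```python
import os

# File names taken from the folder structure listed above.
EXPECTED_MODEL_FILES = (
    "glove.6B.50d.txt",
    "GoogleNews-vectors-negative300.bin",
    "wiki-news-300d-1M.vec",
)


def read_model_dir(env_path):
    """Return the MODEL_DIR value from a .env file, or None if it is absent.

    Hypothetical helper. A leading "- " on the line is tolerated
    (assumption), since the README writes the entry that way.
    """
    with open(env_path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip().lstrip("- ").strip()
            if line.startswith("MODEL_DIR="):
                return line.split("=", 1)[1].strip()
    return None


def missing_model_files(model_dir):
    """Return the expected model files that are not present in model_dir."""
    return [
        name
        for name in EXPECTED_MODEL_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__" and os.path.isfile(".env"):
    model_dir = read_model_dir(".env")
    if model_dir is None:
        print("MODEL_DIR is not set in .env")
    else:
        print("Missing model files:", missing_model_files(model_dir))
```

An empty list from `missing_model_files` means all three model files were found in the configured folder.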
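Research Reference
Some of the augmenters above are inspired by the following research papers. However, for various reasons they do not always follow the original implementation. If you need the original implementation, please refer to the original source code.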
Data Source
Data captured from the Internet is used to build the augmenters and test cases.
See data source for more details.