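Imbalanced Data
Class imbalance is a classic problem that comes up often in data mining, computational advertising, NLP, and similar work. The article listed in the references summarizes approaches that may help and is worth a read: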
1. Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on the so-called natural (or stratified) distribution and sometimes it works without need for modification.
2. Balance the training set in some way:
   2.1 Oversample the minority class.
   2.2 Undersample the majority class.
   2.3 Synthesize new minority classes.
3. Throw away minority examples and switch to an anomaly detection framework.
4. At the algorithm level, or after it:
   4.1 Adjust the class weight (misclassification costs).
   4.2 Adjust the decision threshold.
   4.3 Modify an existing algorithm to be more sensitive to rare classes.
5. Construct an entirely new algorithm to perform well on imbalanced data.
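To make items 2.1, 2.2, 4.1, and 4.2 concrete, here is a minimal sketch using only the Python standard library. It is not taken from the referenced article: the function names (`random_oversample`, `random_undersample`, `balanced_class_weights`, `predict_with_threshold`) are hypothetical, and the weight formula `n_samples / (n_classes * class_count)` is one common heuristic, assumed here for illustration.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Group feature rows by their label (helper, hypothetical name)."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return groups


def random_oversample(X, y, seed=0):
    """2.1: duplicate minority rows (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """2.2: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


def balanced_class_weights(y):
    """4.1: weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def predict_with_threshold(scores, threshold=0.5):
    """4.2: turn positive-class scores into labels using a tunable cutoff."""
    return [1 if s >= threshold else 0 for s in scores]


# Toy data set: 8 majority examples (label 0) and 2 minority examples (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
weights = balanced_class_weights(y)
labels = predict_with_threshold([0.1, 0.3, 0.6], threshold=0.25)
```

On this toy set, oversampling yields 8 examples per class, undersampling yields 2 per class, and the minority class gets weight 2.5 against 0.625 for the majority. Lowering the threshold from 0.5 to 0.25 flips the middle score to the positive class, which trades precision for recall on the rare class. Resampling should be applied to the training split only, so that evaluation still reflects the natural distribution.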
References
https://svds.com/learning-imbalanced-classes/
Learning from Imbalanced Classes