Feature Preprocessing on Kaggle

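I'm new to data science and wanted to play around with Kaggle on my own. After trying the beginner Titanic and House Prices projects, I found that I can write a basic baseline, but getting the details right, and reaching a score worth showing, still requires some theoretical analysis.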

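This post introduces the principles and tricks used in Kaggle competitions. Everything here comes from a Coursera course; since the course is in English and fairly easy to follow, the notes below are written in English.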

  • Reference
    How to Win a Data Science Competition: Learn from Top Kagglers

Features: numeric, categorical, ordinal, datetime, coordinate, text

Numeric features

Models fall into two groups: tree-based models and non-tree-based models.

 

Scaling

For example, if we apply the KNN algorithm to a set of instances, we calculate the distance between each instance and the query object. The feature with the largest scale clearly dominates that distance.

 

Tree-based models don’t depend on scaling

Non-tree-based models depend heavily on scaling

How to do

sklearn:

  1. To [0,1]
    sklearn.preprocessing.MinMaxScaler
    X = ( X-X.min( ) )/( X.max()-X.min() )
  2. To mean=0, std=1
    sklearn.preprocessing.StandardScaler
    X = ( X-X.mean( ) )/X.std()

    • If you want to use KNN, you can go one step further: the larger a feature's scale, the more important it is for KNN. So you can tune the scaling parameter of each feature to boost the features that seem more important and check whether this helps.
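A minimal sketch of both scalers on a toy matrix (the array values are made up for illustration). It also checks the hand-written formulas against sklearn; note that StandardScaler divides by the population standard deviation (ddof=0), which is what `np.std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two features on very different scales (illustrative values).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# Rescale every column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Rescale every column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# The same results written out by hand, column by column.
X_minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_std_manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std uses ddof=0
```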

Outliers

Outliers can make a fitted model deviate from the trend of the remaining data (the red line in the original figure).

We can clip feature values between two chosen values, a lower bound and an upper bound.
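One possible way to choose the bounds is to take percentiles of the feature. The sketch below assumes the 1st and 99th percentiles, which is an illustrative choice rather than a rule, and uses made-up data.

```python
import numpy as np

# Toy feature with one extreme value at each end (illustrative data).
x = np.array([-500.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 900.0])

# Assumed bounds: the 1st and 99th percentiles of the feature.
lower, upper = np.percentile(x, [1, 99])

# Values below `lower` become `lower`; values above `upper` become `upper`.
x_clipped = np.clip(x, lower, upper)
```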

  • Rank Transformation

If we have outliers, it behaves better than min-max or standard scaling, because it moves the outliers closer to the other objects.

Linear models, KNN, and neural networks will benefit from this method.

rank([-100, 0, 1e5]) == [0, 1, 2]
rank([1000, 1, 10]) == [2, 0, 1]

scipy:

scipy.stats.rankdata
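A small sketch reproducing the two examples above. One detail to be aware of: `scipy.stats.rankdata` returns 1-based ranks, so 1 is subtracted here to match the 0-based ranks in the examples.

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([-100, 0, 1e5])
y = np.array([1000, 1, 10])

# rankdata returns 1-based ranks; subtract 1 to get 0-based ranks.
rank_x = rankdata(x) - 1
rank_y = rankdata(y) - 1
```

When the transform has to be applied to test data as well, one option is to concatenate train and test before ranking so that both share the same mapping.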

  • Other methods

    1. Log transform: np.log(1 + x)
    2. Raising to the power < 1: np.sqrt(x + 2/3)
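Both transforms pull large values closer to the rest of the data. A minimal sketch on made-up non-negative values; `np.log1p(x)` is used as a numerically safer equivalent of `np.log(1 + x)`.

```python
import numpy as np

# Toy non-negative feature with a long right tail (illustrative data).
x = np.array([0.0, 1.0, 10.0, 100.0, 10000.0])

x_log = np.log1p(x)           # same as np.log(1 + x), more accurate near 0
x_sqrt = np.sqrt(x + 2 / 3)   # power transform with exponent 1/2
```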

Feature Generation

Depends on

a. Prior knowledge
b. Exploratory data analysis


Ordinal features

Examples:

  • Ticket class: 1,2,3
  • Driver’s license: A, B, C, D
  • Education: kindergarten, school, undergraduate, bachelor, master, doctoral

Processing

1.Label Encoding

  • Alphabetical (sorted)
    [S,C,Q] -> [3, 1, 2]

sklearn.preprocessing.LabelEncoder

  • Order of appearance
    [S,C,Q] -> [1, 2, 3]

pandas.factorize

Both functions return 0-based codes, so the actual outputs are [2, 0, 1] and [0, 1, 2].

Both variants work well with tree-based models, because trees can split the feature and extract most of the useful values in the categories on their own. Non-tree-based models, on the other hand, usually can’t use this feature effectively.
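A short sketch of the two label-encoding variants on a toy column (the values are illustrative, loosely modelled on the Titanic `Embarked` feature).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(["S", "C", "Q", "S"])

# Alphabetical order: C -> 0, Q -> 1, S -> 2 (codes are 0-based).
alphabetical = LabelEncoder().fit_transform(embarked)

# Order of first appearance: S -> 0, C -> 1, Q -> 2 (codes are 0-based).
appearance, uniques = pd.factorize(embarked)
```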

2.Frequency Encoding
[S,C,Q] -> [0.5, 0.3, 0.2]

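encoding = titanic.groupby('Embarked').size()    # number of rows per category
encoding = encoding / len(titanic)               # convert counts to frequencies
titanic['enc'] = titanic.Embarked.map(encoding)  # replace each category by its frequency

If several categories share the same frequency, they become indistinguishable after this encoding; scipy.stats.rankdata can be applied to the frequencies in that case.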

It is also helpful for linear models: if the frequency of a category is correlated with the target value, a linear model will utilize this dependency.

3.One-hot Encoding

pandas.get_dummies

It creates a new column for every category of a feature and is often used for non-tree-based models.
With many categories it can slow down tree-based models, so the result is usually stored as a sparse matrix. Most libraries can work with sparse matrices directly, for example XGBoost and LightGBM.
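A minimal sketch of one-hot encoding with `pandas.get_dummies`, in dense and sparse form. The toy column and the `Embarked` prefix are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new 0/1 column per category.
dense = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)

# Same encoding, stored in pandas sparse columns to save memory.
sparse = pd.get_dummies(df["Embarked"], prefix="Embarked", sparse=True, dtype=int)
```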

Feature generation

Interactions of categorical features can help linear models and KNN

One way is to concatenate the string values of two features into a single combined category.
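A sketch of such an interaction feature, assuming two hypothetical categorical columns `pclass` and `sex`. The combined column is then one-hot encoded so that a linear model can learn one weight per pair.

```python
import pandas as pd

# Hypothetical columns and values, for illustration only.
df = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "male"],
})

# Concatenate the two categorical features into one combined category.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the combined feature: one column per (pclass, sex) pair.
interactions = pd.get_dummies(df["pclass_sex"], dtype=int)
```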



Datetime and Coordinates

Date and time

1.Periodicity
2.Time since

a. Row-independent moment
For example: since 00:00:00 UTC, 1 January 1970

b. Row-dependent important moment
For example: number of days left until the next holiday, or time passed since the last holiday.

3.Difference between dates

We can add a date_diff feature that gives the number of days between two dated events in the same row.
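The three kinds of date features can be sketched with pandas as follows. The column names `purchase_date` and `last_call_date` and their values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2017-01-02", "2017-03-15"]),
    "last_call_date": pd.to_datetime(["2016-12-30", "2017-03-01"]),
})

# 1. Periodicity: position of the date inside a repeating cycle.
df["day_of_week"] = df["purchase_date"].dt.dayofweek   # Monday = 0
df["month"] = df["purchase_date"].dt.month

# 2. Time since a row-independent moment (the Unix epoch), in days.
epoch = pd.Timestamp("1970-01-01")
df["days_since_epoch"] = (df["purchase_date"] - epoch).dt.days

# 3. Difference between two dates in the same row, in days.
df["date_diff"] = (df["purchase_date"] - df["last_call_date"]).dt.days
```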

Coordinates

1.Interesting places from train/test data or additional data

Compute the distance from each instance to a meaningful place, such as a particular flat or an old building.

2.Aggregated statistics

For example, the price of the surrounding buildings.

3.Rotation

Sometimes rotating the coordinates helps the model classify the instances more precisely.
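Tree splits are axis-aligned, so a boundary that runs diagonally in the original coordinates can be easier to learn after a rotation. A minimal sketch that adds rotated coordinates as extra features; the 45-degree angle and the points are assumptions for illustration, and `rotate` is a helper defined here, not a library function.

```python
import numpy as np

# Toy (x, y) coordinates; illustrative values.
coords = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

def rotate(points, degrees):
    """Rotate 2-D points counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return points @ rotation.T

# Add the rotated coordinates next to the original ones.
rot45 = rotate(coords, 45)
features = np.hstack([coords, rot45])
```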



Missing data

Hidden NaN, numeric

When drawing a histogram of a feature, we may see an isolated spike at -1, away from the rest of the values.

It is obvious that -1 is a hidden NaN that has no meaning for this feature.
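Once such a placeholder has been spotted, it can be turned back into a real NaN so that it is handled explicitly. A minimal sketch on made-up data, assuming -1 is the placeholder:

```python
import numpy as np
import pandas as pd

# Toy feature where -1 was used to mark missing values (illustrative data).
s = pd.Series([0.2, -1.0, 0.7, 0.5, -1.0, 0.9])

# Replace the placeholder with NaN and count the missing entries.
s_clean = s.replace(-1.0, np.nan)
n_missing = s_clean.isnull().sum()
```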

Fillna approaches

1.-999,-1,etc(outside the feature range)

It is useful because it lets tree-based models treat missing values as a separate category. The downside is that the performance of linear models and neural networks can suffer.

2.mean,median

The second method is usually beneficial for simple linear models and neural networks. But for trees it can become harder to select the objects that had missing values in the first place.

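3.Reconstruct:

  • Isnull: add a binary feature that marks which rows had a missing value.

  • Prediction: estimate the missing value from the other data.

  • Replace the missing data with the mean or median grouped by another feature.
But sometimes the group statistics can be distorted, for example when missing values were filled with a placeholder such as -999 before the means were computed.

The way to handle this is to ignore missing values while calculating means for each category.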

  • Handling values that do not appear in the training data

Just generate a new feature giving the number of occurrences of each value in the data (frequency).

  • XGBoost can handle NaN

4.Remove rows with missing values

This one is possible, but it can lead to loss of important samples and a quality decrease.
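The filling strategies above can be sketched on a toy frame. The column names and values are hypothetical; note that the group medians are computed on the unfilled column, so missing values are ignored rather than counted as -999.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "value": [1.0, np.nan, 4.0, 6.0, np.nan],
})

# Isnull: keep a binary flag so the model still knows which rows were missing.
df["value_isnull"] = df["value"].isnull().astype(int)

# 1. Constant outside the feature range.
df["value_const"] = df["value"].fillna(-999)

# 2. Global median (pandas skips NaN when computing it).
df["value_median"] = df["value"].fillna(df["value"].median())

# Median within each category, computed before any filling so NaN is ignored.
group_median = df.groupby("category")["value"].transform("median")
df["value_group_median"] = df["value"].fillna(group_median)
```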


Text

Bag of words

Text preprocessing

1.Lowercase

2.Lemmatization and Stemming

3.Stopwords

Examples:
1.Articles or prepositions
2.Very common words

sklearn.feature_extraction.text.CountVectorizer:
max_df

  • max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

CountVectorizer

The number of times a term occurs in a given document

sklearn.feature_extraction.text.CountVectorizer
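A minimal sketch of a bag-of-words matrix and of `max_df` acting as a corpus-specific stop-word filter. The three sentences are made up; `CountVectorizer` lowercases text by default, which covers the first preprocessing step.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Plain bag of words: one column per term, values are per-document counts.
bow = CountVectorizer()
X = bow.fit_transform(docs)

# max_df=0.9 drops terms that appear in more than 90% of the documents
# (here "the", which occurs in all three).
bow_trimmed = CountVectorizer(max_df=0.9)
X_trimmed = bow_trimmed.fit_transform(docs)
```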

TF-IDF

TF-IDF re-weights the count features into floating-point values that are better suited to a classifier.

  • Term frequency
    tf = 1 / x.sum(axis=1)[:, None]
    x = x * tf

  • Inverse Document Frequency
    idf = np.log(x.shape[0] / (x > 0).sum(0))
    x = x * idf
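The two steps above can be run on a small dense count matrix as follows (toy values). This is the plain formula from the notes; sklearn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import numpy as np

# Toy count matrix: rows are documents, columns are terms (illustrative values).
counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [3.0, 0.0, 0.0]])

# Term frequency: divide each row by its total so that rows sum to 1.
tf = counts / counts.sum(axis=1)[:, None]

# Inverse document frequency: log(number of documents / documents containing the term).
idf = np.log(counts.shape[0] / (counts > 0).sum(axis=0))

tfidf = tf * idf
```

A term that occurs in every document (the first column here) gets an IDF of 0 and therefore contributes nothing.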

N-gram


sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer

  • ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
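A short sketch of both parameters on a single made-up sentence: word-level unigrams plus bigrams, and character-level trigrams via `analyzer="char"`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a sentence"]

# Word unigrams and bigrams. The default tokenizer drops one-letter
# tokens, so "a" does not appear in the vocabulary.
word_ngrams = CountVectorizer(ngram_range=(1, 2))
word_ngrams.fit(docs)
word_features = sorted(word_ngrams.vocabulary_)

# Character trigrams; analyzer="char" switches from words to characters.
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(3, 3))
char_ngrams.fit(docs)
char_features = sorted(char_ngrams.vocabulary_)
```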

Embeddings(~word2vec)

It converts each word into a vector in some sophisticated space, which usually has several hundred dimensions.

a. Relatively small vectors

b. Values in vector can be interpreted only in some cases

c. Words with similar meanings often have similar embeddings


Reposted from: https://www.cnblogs.com/bjwu/p/8970821.html
