Feature Engineering & Feature Selection

📈Python for finance series

Warning: There is no magical formula or Holy Grail here, though a new world might open the door for you.


📈Python for finance series

  1. Identifying Outliers

  2. Identifying Outliers — Part Two

  3. Identifying Outliers — Part Three

  4. Stylized Facts

  5. Feature Engineering & Feature Selection

  6. Data Transformation

Following up the previous posts in this series, this time we are going to explore real Technical Analysis (TA) in the financial market. For a very long time, I have been fascinated by the inner logic of a branch of TA called Volume Spread Analysis (VSA). So far, I have found no articles on applying modern machine learning to this time-proven, long-lasting technique. Here I am throwing out a minnow to catch a whale: if I can make some noise in this field, it will have been worth the time I spent on this article.

Especially after I read David H. Weis’s Trades About to Happen, in which he writes:

“Instead of analyzing an array of indicators or algorithms, you should be able to listen to what any market says about itself.”¹


To listen closely to the market: just as it may not be possible to predict the future, it is also hard to ignore things that are about to happen. The key is to capture what is about to happen and follow the flow.

But how do we perceive things that are about to happen? A statement made long ago by Richard Wyckoff gives some clues:

“Successful tape reading [chart reading] is a study of Force. It requires ability to judge which side has the greatest pulling power and one must have the courage to go with that side. There are critical points which occur in each swing just as in the life of a business or of an individual. At these junctures it seems as though a feather’s weight on either side would determine the immediate trend. Any one who can spot these points has much to win and little to lose.”²


But how do we interpret market behaviours? One of Richard Wyckoff’s eloquent descriptions of market forces is very instructive:

“The market is like a slowly revolving wheel: Whether the wheel will continue to revolve in the same direction, stand still or reverse depends entirely upon the forces which come in contact with its hub and tread. Even when the contact is broken, and nothing remains to affect its course, the wheel retains a certain impulse from the most recent dominating force, and revolves until it comes to a standstill or is subjected to other influences.”²

David H. Weis gives a marvellous example of how to interpret the bars and relate them to market behaviour. Through his construction of a hypothetical sequence of bars, every single bar comes alive and rushes to tell you its story.

[Image: Hypothetical Behaviour]

For all the details of the analysis, please refer to David’s book.


[Image: Purchased this book before it was officially released and got David’s signature.]

Before we dive deep into the code, it is worth giving a bit more background on Volume Spread Analysis (VSA). VSA is the study of the relationship between volume and price to predict market direction by following the professional traders, the so-called market makers. All interpretations of market behaviour follow 3 basic laws:

  • The Law of Supply and Demand

  • The Law of Effort vs. Results

  • The Law of Cause and Effect

There are also three big names in VSA’s development history.


  • Jesse Livermore

  • Richard Wyckoff

  • Tom Williams

Tons of learning materials can be found online. For a beginner, I would recommend the following 2 books.

  1. Master the Markets by Tom Williams

  2. Trades About to Happen by David H. Weis

Also, if you only want a quick peek at this topic, there is a nice article on VSA here.

One of the great advantages of machine learning / deep learning lies in the reduced need for feature engineering. The basics of VSA are, as its name suggests, the volume, the spread of the price range, the location of the close, and the change of the stock price within a bar.

These features can be defined as follows (a combined code sketch follows the list):

[Image: Definition of bars]
  • Volume: pretty straightforward

  • Range/Spread: the difference between high and low

  • Closing Price Relative to Range: is the closing price near the top or the bottom of the price bar?

  • The change of stock price: pretty straightforward
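Putting the four together, here is a minimal sketch (my addition, not from the original post) that computes all of them on plain daily bars, assuming a DataFrame prices with lowercase high, low, close and volume columns as used throughout this article:

import pandas as pd

def basic_vsa_features(prices):
    features = pd.DataFrame(index=prices.index)
    features['volume'] = prices.volume                # raw volume
    features['spread'] = prices.high - prices.low     # price range
    # 0 = close at the high of the bar, 1 = close at the low
    features['close_loc'] = (prices.high - prices.close) / \
                            (prices.high - prices.low)
    features['close_change'] = prices.close.diff()    # price change
    return features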

There are many terminologies created by Richard Wyckoff, like Sign of Strength (SOS), Sign of Weakness (SOW), etc. However, most of those terminologies are purely combinations of those 4 basic features. I don’t believe that, with deep learning, over-engineering the features is a sensible thing to do, considering that one of the advantages of deep learning is that it largely automates what used to be the most crucial step in a machine-learning workflow: feature engineering. What we need to do is tell the algorithm where to look, rather than babysit it step by step. Without further ado, let’s dive into the code.

1. Data preparation

For consistency, in all the 📈Python for finance series, I will try to reuse the same data as much as I can. More details about data preparation can be found here, here and here.


# import all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf  # stock data from Yahoo Finance
import matplotlib.pyplot as plt

# set the parameters for plotting
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

# define a function to get data
def get_data(symbols, begin_date=None, end_date=None):
    df = yf.download(symbols, start=begin_date,
                     auto_adjust=True,  # only download adjusted data
                     end=end_date)
    # my convention: always lowercase
    df.columns = ['open', 'high', 'low', 'close', 'volume']
    return df

prices = get_data('AAPL', '2000-01-01', '2010-12-31')
prices.head()

✍Tip!

The data we download this time is adjusted data from yfinance, obtained by setting auto_adjust=True. If you have access to tick data, by all means use it; tick data would be much better, as articulated in Advances in Financial Machine Learning by Marcos López de Prado. In any case, ten years of adjusted daily data only gives 2,766 entries, which is a far cry from “Big Data”.
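A quick sanity check on the sample size (a small addition; the exact numbers may differ slightly depending on when you download):

print(prices.shape)                            # roughly (2766, 5)
print(prices.index.min(), prices.index.max())  # first and last trading days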


2. Feature Engineering

The key point of combining VSA with modern data science is that, by reading and interpreting the bars’ own actions, one (hopefully an algorithm) can construct a story of market behaviour. The story might not be easily understood by a human, but it works in a sophisticated way.

Volume, in conjunction with the price range and the position of the close, is easy to express in code.

  • Volume: pretty straightforward

  • Range/Spread: the difference between high and low

def price_spread(df):
    return (df.high - df.low)

  • Closing Price Relative to Range: is the closing price near the top or the bottom of the price bar?

def close_location(df):
    # 0 indicates the close is the high of the day, 1 means the close
    # is the low of the day; the smaller the value, the closer the
    # close is to the high.
    return (df.high - df.close) / (df.high - df.low)

  • The change of stock price: pretty straightforward

Now comes the tricky part:

“When viewed in a larger context, some of the price bars take on a new meaning.”

That means that to see the full picture, we need to observe those 4 basic features on different time scales.

To do that, we need to reconstruct the high (H), low (L), close (C) and volume (V) bars over varied time spans.

def create_HLCV(i):
    '''
    i: days
    As we don't care about the open that much, that leaves volume,
    high, low and close.
    '''
    df = pd.DataFrame(index=prices.index)
    df[f'high_{i}D'] = prices.high.rolling(i).max()
    df[f'low_{i}D'] = prices.low.rolling(i).min()
    # the last element of a backward-looking rolling window is simply
    # today's close, e.g. close_2D equals close
    df[f'close_{i}D'] = prices.close.rolling(i).apply(lambda x: x[-1],
                                                      raw=True)
    df[f'volume_{i}D'] = prices.volume.rolling(i).sum()

    return df

The next step is to create those 4 basic features on each time scale.

def create_features(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    features[f'volume_{i}D'] = volume
    features[f'price_spread_{i}D'] = high - low
    features[f'close_loc_{i}D'] = (high - close) / (high - low)
    features[f'close_change_{i}D'] = close.diff()

    return features

The time spans that I would like to explore are 1, 2 and 3 days, plus 1 week, 1 month, 2 months and 3 months, which are roughly [1, 2, 3, 5, 20, 40, 60] trading days. Now we can create a whole bunch of features:

def create_bunch_of_features():
    days = [1, 2, 3, 5, 20, 40, 60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    for day in days:
        f = create_features(day)
        bunch_of_features = bunch_of_features.join(f)

    return bunch_of_features

bunch_of_features = create_bunch_of_features()
bunch_of_features.info()

To make things easy to understand, our target outcome will simply be the next day’s return.

# next day's returns as outcomes
outcomes = pd.DataFrame(index=prices.index)
outcomes['close_1'] = prices.close.pct_change(-1)
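One subtlety worth flagging (my addition): pct_change(-1) computes close[t] / close[t+1] − 1, which is approximately the negative of the conventional next-day return close[t+1] / close[t] − 1. A quick check makes the relationship explicit:

# pct_change(-1) = close[t]/close[t+1] - 1 = -r/(1+r), where
# r = close[t+1]/close[t] - 1 is the conventional next-day return
check = pd.DataFrame({
    'pct_change_minus_1': prices.close.pct_change(-1),
    'next_day_return': prices.close.pct_change().shift(-1),
})
print(check.dropna().head())  # opposite signs, near-equal magnitudes

Either definition works for measuring the strength of a relationship, as long as you keep the sign convention in mind when interpreting the correlations.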

3. Feature Selection

Let’s have a look at how those features correlate with the outcome, the next day’s return.

corr = bunch_of_features.corrwith(outcomes.close_1)
corr.sort_values(ascending=False).plot.barh(title = 'Strength of Correlation');

It is hard to claim any real correlation here, as all the numbers are well below 0.8.

corr.sort_values(ascending=False)
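Since correlations with returns can be negative as well as positive, ranking by absolute value (a small addition on top of the article’s code) gives a cleaner view of which features carry the most signal:

# rank by the magnitude of correlation, ignoring the sign
corr.abs().sort_values(ascending=False).head(10)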

Next, let’s see how those features relate to each other.

corr_matrix = bunch_of_features.corr()

Instead of making a heatmap, I am going to use Seaborn’s clustermap to cluster row-wise and column-wise and see whether any pattern emerges. Seaborn’s clustermap function is great for making simple heatmaps as well as hierarchically-clustered heatmaps with dendrograms on the rows and/or columns. It reorganizes the rows and columns so that similar content is displayed next to one another, for even more depth of understanding of the data. A nice tutorial about the cluster map can be found here. To get a cluster map, all you need is actually one line of code.

sns.clustermap(corr_matrix)

If you carefully scrutinize the graph, some conclusions can be drawn:


  1. Price spread is closely related to volume, as clearly shown at the centre of the graph.

  2. The locations of the close at different time spans are related to each other, as indicated at the bottom right corner.

  3. From the pale colour of the top left corner, the close price change pairs with itself, which makes perfect sense. However, it looks a bit random, with no cluster pattern across time scales; I would expect the 2-day change to pair with the 3-day change.

The randomness of the close price difference can be attributed to the characteristics of the stock price itself: the absolute size of a one-day price change grows with the price level, so differences taken years apart are not comparable. A simple percentage return might be a better option. This can be realized by changing close.diff() to close.pct_change().

def create_features_v1(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    features[f'volume_{i}D'] = volume
    features[f'price_spread_{i}D'] = high - low
    features[f'close_loc_{i}D'] = (high - close) / (high - low)
    # the only change is here
    features[f'close_change_{i}D'] = close.pct_change()

    return features

and do everything again.


def create_bunch_of_features_v1():
    days = [1, 2, 3, 5, 20, 40, 60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    for day in days:
        f = create_features_v1(day)  # here is the only difference
        bunch_of_features = bunch_of_features.join(f)

    return bunch_of_features

bunch_of_features_v1 = create_bunch_of_features_v1()

# check the correlation
corr_v1 = bunch_of_features_v1.corrwith(outcomes.close_1)
corr_v1.sort_values(ascending=False).plot.barh(title='Strength of Correlation')

A little bit different, but not by much!

corr_v1.sort_values(ascending=False)

What happens to the correlations between the features?

corr_matrix_v1 = bunch_of_features_v1.corr()
sns.clustermap(corr_matrix_v1, cmap='coolwarm', linewidth=1)

Well, the pattern remains unchanged. Let’s change the default linkage method from “average” to “ward”. The two methods are similar, but “ward” is more like K-means clustering. A nice tutorial on this topic can be found here.

sns.clustermap(corr_matrix_v1, cmap='coolwarm', linewidth=1,
               method='ward')

To select features, we want to pick those that have the strongest, most persistent relationships with the target outcome, while at the same time minimizing the overlap, or collinearity, among the selected features to avoid noise and wasted computing power. For features that pair together in a cluster, I only keep the one that has the stronger correlation with the outcome. Just by looking at the cluster map, a few features can be picked out.
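Picking by eye works here, but the same one-feature-per-cluster logic can also be sketched in code with scipy’s hierarchical clustering. This is a sketch under stated assumptions rather than the article’s method: it assumes the correlation matrix is free of NaNs, keeps the feature with the largest absolute correlation with the outcome within each cluster, and the cut threshold t is a free parameter you would tune against the dendrogram:

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# turn correlations into distances: highly correlated features are "close"
dist = squareform((1 - corr_matrix_v1.abs()).values, checks=False)
labels = fcluster(linkage(dist, method='ward'), t=0.5, criterion='distance')

# within each cluster, keep the feature most correlated with the outcome
selected = []
for cluster_id in set(labels):
    members = corr_matrix_v1.columns[labels == cluster_id]
    selected.append(corr_v1[members].abs().idxmax())
print(selected)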

deselected_features_v1 = ['close_loc_3D', 'close_loc_60D',
                          'volume_3D', 'volume_60D',
                          'price_spread_3D', 'price_spread_60D',
                          'close_change_3D', 'close_change_60D']

selected_features_v1 = bunch_of_features_v1.drop(labels=deselected_features_v1,
                                                 axis=1)

Next, we are going to take a look at the pair plot. A pair plot is a great way to identify trends for follow-up analysis, allowing us to see both the distributions of single variables and the relationships between pairs of variables. Again, all we need is a single line of code.

sns.pairplot(selected_features_v1)

The graph is overwhelming and hard to read. Let’s take a small group as an example.

selected_features_1D_list = ['volume_1D', 'price_spread_1D',
                             'close_loc_1D', 'close_change_1D']
selected_features_1D = selected_features_v1[selected_features_1D_list]

sns.pairplot(selected_features_1D)

There are two things I noticed immediately: one is that there are outliers, and the other is that the distributions are nowhere near normal.
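The non-normality can be quantified quickly (my addition): skewness and excess kurtosis are both roughly zero for a normal distribution, which makes them a handy first check.

# both statistics are ~0 for normally distributed data
print(selected_features_1D.skew())
print(selected_features_1D.kurt())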

Let’s deal with the outliers for now. In order to do everything in one go, I will join the outcomes with the features and remove the outliers from both together.

features_outcomes = selected_features_v1.join(outcomes)
features_outcomes.info()

I will use the same method described here, here and here to remove the outliers.


stats = features_outcomes.describe()

def get_outliers(df, i=4):
    # i is the number of sigmas that defines the boundary around the mean
    outliers = {}
    for col in df.columns:
        mu = stats.loc['mean', col]
        sigma = stats.loc['std', col]
        condition = (df[col] > mu + sigma * i) | (df[col] < mu - sigma * i)
        outliers[f'{col}_outliers'] = df[col][condition]
    # pd.concat keeps the union of all flagged row indexes across columns
    return pd.concat(outliers, axis=1)

outliers = get_outliers(features_outcomes, i=1)
outliers.info()

I set 1 standard deviation as the boundary to dig out most of the outliers, then remove all the flagged rows along with the NaN values.
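Before dropping anything, it is worth checking how much data a 1-sigma boundary actually flags (my addition):

# how many rows fall outside the mu +/- 1*sigma band in at least one column?
n_flagged = len(outliers.index)
print(f'{n_flagged} of {len(features_outcomes)} rows flagged '
      f'({n_flagged / len(features_outcomes):.1%})')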

features_outcomes_rmv_outliers = features_outcomes.drop(index=outliers.index).dropna()
features_outcomes_rmv_outliers.info()

With the outliers removed, we can draw the pair plot again.

sns.pairplot(features_outcomes_rmv_outliers, vars=selected_features_1D_list);

Now the plots look much better, but it is still hard to draw any useful conclusions. It would be nice to see which points are down moves and which are up moves, in conjunction with those features. I can extract the sign of the stock price change and add an extra dimension to the plots.

features_outcomes_rmv_outliers['sign_of_close'] = features_outcomes_rmv_outliers['close_1'].apply(np.sign)

Now, let’s re-draw the pairplot() with a few tweaks to make the graph prettier.

sns.pairplot(features_outcomes_rmv_outliers,
             vars=selected_features_1D_list,
             diag_kind='kde',
             palette='husl', hue='sign_of_close',
             markers=['*', '<', '+'],
             plot_kws={'alpha': 0.3});  # transparency: 0.3

Now it looks much better. Clearly, when prices go up, the points (the blue spots) are denser and aggregate at certain locations, whereas on down days they spread out everywhere.
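The visual impression can be cross-checked with simple group statistics (my addition), comparing the dispersion of each 1-day feature across the two groups:

# dispersion of each 1D feature, grouped by the sign of close_1
features_outcomes_rmv_outliers.groupby('sign_of_close')\
    [selected_features_1D_list].std()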

I would really appreciate it if you could shed some light on the pair plot and leave your comments below, thanks.


Here is the summary of all the codes used in this article:


# import all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf  # stock data from Yahoo Finance
import matplotlib.pyplot as plt

# set the parameters for plotting
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

# define a function to get data
def get_data(symbols, begin_date=None, end_date=None):
    df = yf.download(symbols, start=begin_date,
                     auto_adjust=True,  # only download adjusted data
                     end=end_date)
    # my convention: always lowercase
    df.columns = ['open', 'high', 'low', 'close', 'volume']
    return df

prices = get_data('AAPL', '2000-01-01', '2010-12-31')

# create some features
def create_HLCV(i):
    # as we don't care about the open that much, that leaves volume,
    # high, low and close
    df = pd.DataFrame(index=prices.index)
    df[f'high_{i}D'] = prices.high.rolling(i).max()
    df[f'low_{i}D'] = prices.low.rolling(i).min()
    # the last element of a backward-looking rolling window is simply
    # today's close
    df[f'close_{i}D'] = prices.close.rolling(i).apply(lambda x: x[-1],
                                                      raw=True)
    df[f'volume_{i}D'] = prices.volume.rolling(i).sum()
    return df

def create_features_v1(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    features[f'volume_{i}D'] = volume
    features[f'price_spread_{i}D'] = high - low
    features[f'close_loc_{i}D'] = (high - close) / (high - low)
    features[f'close_change_{i}D'] = close.pct_change()
    return features

def create_bunch_of_features_v1():
    '''
    The time spans that I would like to explore are 1, 2, 3 days
    and 1 week, 1 month, 2 months, 3 months,
    which roughly are [1,2,3,5,20,40,60] days.
    '''
    days = [1, 2, 3, 5, 20, 40, 60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    for day in days:
        f = create_features_v1(day)
        bunch_of_features = bunch_of_features.join(f)
    return bunch_of_features

bunch_of_features_v1 = create_bunch_of_features_v1()

# define the outcome target
# here, to make things easy to understand, I will only try to predict
# the next day's return
outcomes = pd.DataFrame(index=prices.index)
outcomes['close_1'] = prices.close.pct_change(-1)

# decide which features are redundant from the cluster map
deselected_features_v1 = ['close_loc_3D', 'close_loc_60D',
                          'volume_3D', 'volume_60D',
                          'price_spread_3D', 'price_spread_60D',
                          'close_change_3D', 'close_change_60D']
selected_features_v1 = bunch_of_features_v1.drop(labels=deselected_features_v1,
                                                 axis=1)

# join the features and outcomes together to remove the outliers
features_outcomes = selected_features_v1.join(outcomes)
stats = features_outcomes.describe()

# define the method to identify outliers
def get_outliers(df, i=4):
    # i is the number of sigmas that defines the boundary around the mean
    outliers = {}
    for col in df.columns:
        mu = stats.loc['mean', col]
        sigma = stats.loc['std', col]
        condition = (df[col] > mu + sigma * i) | (df[col] < mu - sigma * i)
        outliers[f'{col}_outliers'] = df[col][condition]
    return pd.concat(outliers, axis=1)

outliers = get_outliers(features_outcomes, i=1)

# remove all the outliers and NaN values
features_outcomes_rmv_outliers = features_outcomes.drop(index=outliers.index).dropna()

I know this article has gone on too long, so I am better off leaving it here. In the next article, I will apply data transformations to see whether the distribution issue can be fixed. Stay tuned!

Translated from: https://towardsdatascience.com/feature-engineering-feature-selection-8c1d57af18d2

Woah, Tableau!哇&#xff0c;Tableau&#xff01; By now, almost everyone’s heard of the data visualization software that brought visual analytics to the public. Its intuitive drag and drop interface makes connecting to data, creating graphs, and sharing d…