Feature Engineering & Feature Selection

📈Python for finance series

Warning: There is no magical formula or Holy Grail here, though a new world might open the door for you.

  1. Identifying Outliers

  2. Identifying Outliers — Part Two

  3. Identifying Outliers — Part Three

  4. Stylized Facts

  5. Feature Engineering & Feature Selection

  6. Data Transformation

Following up on the previous posts in this series, this time we are going to explore real Technical Analysis (TA) in the financial market. For a very long time, I have been fascinated by the inner logic of a branch of TA called Volume Spread Analysis (VSA). I have found no articles on applying modern machine learning to this time-proven, long-lasting technique. Here I am trying to throw out a minnow to catch a whale: if I can make some noise in this field, it will have been worth the time I spent on this article.

In particular, after I read David H. Weis’s Trades About to Happen, in which he writes:

“Instead of analyzing an array of indicators or algorithms, you should be able to listen to what any market says about itself.”¹

To listen closely to the market: just as it may not be possible to predict the future, it is also hard to ignore things that are about to happen. The key is to capture what is about to happen and follow the flow.

But how do we perceive things about to happen? A statement made long ago by Richard Wyckoff gives some clues:

“Successful tape reading [chart reading] is a study of Force. It requires ability to judge which side has the greatest pulling power and one must have the courage to go with that side. There are critical points which occur in each swing just as in the life of a business or of an individual. At these junctures it seems as though a feather’s weight on either side would determine the immediate trend. Any one who can spot these points has much to win and little to lose.”²

But how do we interpret market behaviours? One of Richard Wyckoff’s eloquent descriptions of market forces is very instructive:

“The market is like a slowly revolving wheel: Whether the wheel will continue to revolve in the same direction, stand still or reverse depends entirely upon the forces which come in contact with its hub and tread. Even when the contact is broken, and nothing remains to affect its course, the wheel retains a certain impulse from the most recent dominating force, and revolves until it comes to a standstill or is subjected to other influences.”²

David H. Weis gives a marvellous example of how to interpret the bars and relate them to market behaviours. Through his construction of a hypothetical bar behaviour, every single bar comes alive and rushes to tell you its story.

[Image: Hypothetical Behaviour]

For all the details of the analysis, please refer to David’s book.

[Image: Purchased this book before it was officially released and got David’s signature.]

Before we dive deep into the code, it would be better to give a bit more background on Volume Spread Analysis (VSA). VSA is the study of the relationship between volume and price to predict market direction by following the professional traders, the so-called market makers. All the interpretations of market behaviours follow 3 basic laws:

  • The Law of Supply and Demand
  • The Law of Effort vs. Results
  • The Law of Cause and Effect

There are also three big names in VSA’s development history.

  • Jesse Livermore
  • Richard Wyckoff
  • Tom Williams

Tons of learning materials can be found online. For a beginner, I would recommend the following two books.

  1. Master the Markets by Tom Williams

  2. Trades About to Happen by David H. Weis

Also, if you only want a quick peek at this topic, there is a nice article on VSA here.

One of the great advantages of machine learning / deep learning lies in the reduced need for manual feature engineering. The basics of VSA are, as the name suggests, the volume, the spread of the price range, the location of the close, and the change of the stock price in a bar.

These features can be defined as:

[Image: Definition of bars]
  • Volume: pretty straightforward

  • Range/Spread: the difference between high and low

  • Closing Price Relative to Range: is the closing price near the top or the bottom of the price bar?

  • The change of stock price: pretty straightforward
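To make these definitions concrete before we build them properly in section 2, here is a minimal sketch (my own, not from the original article) that computes all four single-bar features, assuming an OHLCV DataFrame with lowercase columns like the one we download below:

#sketch: single-bar VSA features; df is assumed to have columns
#open, high, low, close, volume
import pandas as pd

def basic_vsa_features(df):
    out = pd.DataFrame(index=df.index)
    out['volume'] = df.volume                                     #raw volume
    out['spread'] = df.high - df.low                              #range/spread
    out['close_loc'] = (df.high - df.close) / (df.high - df.low)  #0 = close at high
    out['close_change'] = df.close.diff()                         #change of price
    return out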

There are many terminologies created by Richard Wyckoff, like Sign of Strength (SOS), Sign of Weakness (SOW), etc. However, most of those terminologies are purely combinations of those 4 basic features. I don’t believe that, with deep learning, over-engineering features is a sensible thing to do, considering that one of the advantages of deep learning is that it completely automates what used to be the most crucial step in a machine-learning workflow: feature engineering. The thing we need to do is tell the algorithm where to look, rather than babysit it step by step. Without further ado, let’s dive into the code.

1. Data preparation

For consistency, in all the 📈Python for finance series, I will try to reuse the same data as much as I can. More details about data preparation can be found here, here and here.

#import all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf #the stock data from Yahoo Finance
import matplotlib.pyplot as plt

#set the parameters for plotting
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

#define a function to get data
def get_data(symbols, begin_date=None, end_date=None):
    df = yf.download(symbols, start=begin_date,
                     auto_adjust=True, #only download adjusted data
                     end=end_date)
    #my convention: always lowercase
    df.columns = ['open','high','low',
                  'close','volume']
    return df

prices = get_data('AAPL', '2000-01-01', '2010-12-31')
prices.head()

✍Tip!

The data we download this time is adjusted data from yfinance, obtained by setting auto_adjust=True. If you have access to tick data, by all means use it; it would be much better, as articulated in Advances in Financial Machine Learning by Marcos López de Prado. Anyway, 10 years of adjusted data only gives 2766 entries, which is far from “Big Data”.

2. Feature Engineering

The key point of combining VSA with modern data science is that, through reading and interpreting the bars’ own actions, one (hopefully an algorithm) can construct a story of the market’s behaviour. The story might not be easily understood by a human, but it works in a sophisticated way.

Volume in conjunction with the price range and the position of the close is easy to express in code.

  • Volume: pretty straightforward

  • Range/Spread: the difference between high and low

def price_spread(df):
    return (df.high - df.low)

  • Closing Price Relative to Range: is the closing price near the top or the bottom of the price bar?

def close_location(df):
    return (df.high - df.close) / (df.high - df.low)
#0 indicates the close is the high of the day, and 1 means the close
#is the low of the day; the smaller the value, the closer the
#close price is to the high.

  • The change of stock price: pretty straightforward

Now comes the tricky part,

“When viewed in a larger context, some of the price bars take on a new meaning.”

That means that, to see the full picture, we need to observe those 4 basic features at different time scales.

To do that, we need to reconstruct the High (H), Low (L), Close (C) and Volume (V) bars over varied time spans.

def create_HLCV(i):
    '''
    i: days
    As we don't care about the open that much, that leaves
    volume, high, low and close.
    '''
    df = pd.DataFrame(index=prices.index)
    df[f'high_{i}D'] = prices.high.rolling(i).max()
    df[f'low_{i}D'] = prices.low.rolling(i).min()
    df[f'close_{i}D'] = prices.close.rolling(i).\
                        apply(lambda x: x[-1])
    #close_{i}D equals close, as rolling backwards means today is,
    #literally, the last day of the rolling window
    df[f'volume_{i}D'] = prices.volume.rolling(i).sum()
    return df
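As a quick aside (my own sanity check, not from the original article), you can verify the comment above: the rolling apply(lambda x: x[-1]) simply reproduces the current close, because the window always ends at the current row.

#sanity-check sketch, assuming prices and create_HLCV are defined as above
df2 = create_HLCV(2)
valid = df2['close_2D'].dropna()
assert (valid == prices.close.loc[valid.index]).all()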

Next, create those 4 basic features at each of the different time scales.

def create_features(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    features[f'volume_{i}D'] = volume
    features[f'price_spread_{i}D'] = high - low
    features[f'close_loc_{i}D'] = (high - close) / (high - low)
    features[f'close_change_{i}D'] = close.diff()
    return features

The time spans that I would like to explore are 1, 2, 3 days and 1 week, 1 month, 2 months, 3 months, which roughly are [1,2,3,5,20,40,60] days. Now, we can create a whole bunch of features,

def create_bunch_of_features():
    days = [1,2,3,5,20,40,60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    for day in days:
        f = create_features(day)
        bunch_of_features = bunch_of_features.join(f)
    return bunch_of_features

bunch_of_features = create_bunch_of_features()
bunch_of_features.info()

To make things easy to understand, our target outcome will only be the next day’s return.

# next day's returns as outcomes
outcomes = pd.DataFrame(index=prices.index)
outcomes['close_1'] = prices.close.pct_change(-1)
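A subtlety worth flagging here (my note, not the original author’s): pct_change(-1) compares today’s close with tomorrow’s, i.e. close_t / close_{t+1} - 1, which has the opposite sign convention to the usual forward return close_{t+1} / close_t - 1. A toy illustration, assuming only pandas:

#toy sketch: how pct_change(-1) aligns tomorrow's move with today's row
import pandas as pd

s = pd.Series([10.0, 11.0, 9.9], name='close')
backward = s.pct_change(-1)         #close_t / close_{t+1} - 1 -> [-0.0909, 0.1111, NaN]
forward = s.pct_change().shift(-1)  #conventional next-day return -> [0.1, -0.1, NaN]

If you prefer the conventional sign, pct_change().shift(-1) is a drop-in alternative; the correlation analysis below works either way, with the signs of the correlations roughly flipped.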

3. Feature Selection

Let’s have a look at how those features correlate with the outcome, the next day’s return.

corr = bunch_of_features.corrwith(outcomes.close_1)
corr.sort_values(ascending=False).plot.barh(title = 'Strength of Correlation');

It is hard to say there is much correlation, as all the numbers are well below 0.8.
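To put a number on that impression (my own quick check, reusing the corr series computed above; the 0.05 cutoff is an arbitrary choice):

#how strong is the strongest relationship, and what clears a modest cutoff?
print(corr.abs().max())
print(corr[corr.abs() > 0.05].sort_values(ascending=False))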

corr.sort_values(ascending=False)

Next, let’s see how those features relate to each other.

corr_matrix = bunch_of_features.corr()

Instead of making a heatmap, I am going to use Seaborn’s clustermap to cluster row-wise and column-wise and see whether any pattern emerges. Seaborn’s clustermap function is great for making simple heatmaps as well as hierarchically-clustered heatmaps with dendrograms on the rows and/or columns. It reorganizes the rows and columns so that similar content is displayed next to one another, for even more depth of understanding of the data. A nice tutorial about cluster maps can be found here. To get a cluster map, all you need is actually one line of code.

sns.clustermap(corr_matrix)

If you carefully scrutinize the graph, some conclusions can be drawn:

  1. Price spread is closely related to volume, as clearly shown at the centre of the graph.

  2. The locations of the close are related to each other across different timespans, as indicated at the bottom right corner.

  3. From the pale colour of the top left corner, close price change does pair with itself, which makes perfect sense. However, it is a bit random, as there is no cluster pattern across the varied time scales. I would expect the 2-day change to be paired with the 3-day change.

The randomness of the close price difference can be attributed to the characteristics of the stock price itself; a simple percentage return might be a better option. This can be realized by changing close.diff() to close.pct_change().

def create_features_v1(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    features[f'volume_{i}D'] = volume
    features[f'price_spread_{i}D'] = high - low
    features[f'close_loc_{i}D'] = (high - close) / (high - low)
    #only change here
    features[f'close_change_{i}D'] = close.pct_change()
    return features

and do everything again.

def create_bunch_of_features_v1():
    days = [1,2,3,5,20,40,60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    for day in days:
        f = create_features_v1(day) #here is the only difference
        bunch_of_features = bunch_of_features.join(f)
    return bunch_of_features

bunch_of_features_v1 = create_bunch_of_features_v1()

#check the correlation
corr_v1 = bunch_of_features_v1.corrwith(outcomes.close_1)
corr_v1.sort_values(ascending=False).plot.barh(title='Strength of Correlation')

A little bit different, but not much!

corr_v1.sort_values(ascending=False)

What happens to the correlation between features?

corr_matrix_v1 = bunch_of_features_v1.corr()
sns.clustermap(corr_matrix_v1, cmap='coolwarm', linewidth=1)

Well, the pattern remains unchanged. Let’s change the default method from “average” to “ward”. These two methods are similar, but “ward” is more like k-means clustering. A nice tutorial on this topic can be found here.

sns.clustermap(corr_matrix_v1, cmap='coolwarm', linewidth=1,
               method='ward')

To select features, we want to pick those that have the strongest, most persistent relationships to the target outcome, while at the same time minimizing the amount of overlap or collinearity among the selected features, to avoid noise and wasted computing power. For features that pair together in a cluster, I only keep the one that has the stronger correlation with the outcome. By just looking at the cluster map, a few features are picked out.

deselected_features_v1 = ['close_loc_3D','close_loc_60D',
                          'volume_3D', 'volume_60D',
                          'price_spread_3D','price_spread_60D',
                          'close_change_3D','close_change_60D']

selected_features_v1 = bunch_of_features_v1.drop(labels=deselected_features_v1, axis=1)
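If you would rather not eyeball the cluster map, the same “one feature per cluster” rule can be automated. Below is a minimal sketch (my own, not from the original article) that clusters the features hierarchically on a 1 - |correlation| distance and keeps, from each cluster, the feature most correlated with the outcome; the choice of 16 clusters is an arbitrary assumption.

#automated per-cluster feature selection (sketch)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pick_one_per_cluster(features, outcome, n_clusters=16):
    corr_abs = features.corr().abs()
    #condensed distance matrix: 1 - |corr|, diagonal is zero
    dist = squareform(1 - corr_abs, checks=False)
    labels = fcluster(linkage(dist, method='ward'),
                      n_clusters, criterion='maxclust')
    target_corr = features.corrwith(outcome).abs()
    #within each cluster, keep the feature closest to the outcome
    keep = [target_corr[corr_abs.columns[labels == k]].idxmax()
            for k in np.unique(labels)]
    return features[keep]

selected_features_auto = pick_one_per_cluster(bunch_of_features_v1,
                                              outcomes.close_1)

It should give a selection broadly similar to the hand-picked one above, and it scales better when the feature set grows.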

Next, we are going to take a look at a pair plot. A pair plot is a great way to identify trends for follow-up analysis, allowing us to see both the distributions of single variables and the relationships between pairs of variables. Again, all we need is a single line of code.

sns.pairplot(selected_features_v1)

The graph is overwhelming and hard to see. Let’s take a small group as an example.

selected_features_1D_list = ['volume_1D', 'price_spread_1D',
                             'close_loc_1D', 'close_change_1D']
selected_features_1D = selected_features_v1[selected_features_1D_list]

sns.pairplot(selected_features_1D)

There are two things I noticed immediately: one is that there are outliers, and the other is that the distributions are nowhere near normal.

Let’s deal with the outliers for now. In order to do everything in one go, I will join the outcome with the features and remove the outliers from both together.

features_outcomes = selected_features_v1.join(outcomes)
features_outcomes.info()

I will use the same method described here, here and here to remove the outliers.

stats = features_outcomes.describe()

def get_outliers(df, i=4):
    #i is the number of sigmas, which defines the boundary around the mean
    outliers = pd.DataFrame()
    for col in df.columns:
        mu = stats.loc['mean', col]
        sigma = stats.loc['std', col]
        condition = (df[col] > mu + sigma * i) | (df[col] < mu - sigma * i)
        outliers[f'{col}_outliers'] = df[col][condition]
    return outliers

outliers = get_outliers(features_outcomes, i=1)
outliers.info()

I set 1 standard deviation as the boundary, which digs out most of the outliers. Then I remove all the outliers along with the NaN values.

features_outcomes_rmv_outliers = features_outcomes.drop(index = outliers.index).dropna()
features_outcomes_rmv_outliers.info()

With the outliers removed, we can do the pair plot again.

sns.pairplot(features_outcomes_rmv_outliers, vars=selected_features_1D_list);

Now the plots are looking much better, but it is still hard to draw any useful conclusions. It would be nice to see which spots are down moves and which are up moves, in conjunction with those features. I can extract the sign of the stock price change and add an extra dimension to the plots.

features_outcomes_rmv_outliers['sign_of_close'] = features_outcomes_rmv_outliers['close_1'].apply(np.sign)

Now, let’s re-plot the pairplot() with a few tweaks to make the graph prettier.

sns.pairplot(features_outcomes_rmv_outliers,
             vars=selected_features_1D_list,
             diag_kind='kde',
             palette='husl', hue='sign_of_close',
             markers=['*', '<', '+'],
             plot_kws={'alpha':0.3}); #transparency: 0.3

Now it looks much better. Clearly, when prices go up, the points (the blue spots) are denser and aggregate at certain locations, whereas on down days they spread everywhere.

I would really appreciate it if you could shed some light on the pair plot and leave your comments below, thanks.

Here is a summary of all the code used in this article:

#import all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf #the stock data from Yahoo Finance
import matplotlib.pyplot as plt

#set the parameters for plotting
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

#define a function to get data
def get_data(symbols, begin_date=None, end_date=None):
    df = yf.download(symbols, start=begin_date,
                     auto_adjust=True, #only download adjusted data
                     end=end_date)
    #my convention: always lowercase
    df.columns = ['open','high','low',
                  'close','volume']
    return df

prices = get_data('AAPL', '2000-01-01', '2010-12-31')

#create some features
def create_HLCV(i):
    #as we don't care about the open that much, that leaves volume,
    #high, low and close
    df = pd.DataFrame(index=prices.index)
    df[f'high_{i}D'] = prices.high.rolling(i).max()
    df[f'low_{i}D'] = prices.low.rolling(i).min()
    df[f'close_{i}D'] = prices.close.rolling(i).\
                        apply(lambda x: x[-1])
    #close_{i}D equals close, as rolling backwards means today is
    #literally the last day of the rolling window
    df[f'volume_{i}D'] = prices.volume.rolling(i).sum()
    return df

def create_features_v1(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    features[f'volume_{i}D'] = volume
    features[f'price_spread_{i}D'] = high - low
    features[f'close_loc_{i}D'] = (high - close) / (high - low)
    features[f'close_change_{i}D'] = close.pct_change()
    return features

def create_bunch_of_features_v1():
    '''
    The time spans that I would like to explore
    are 1, 2, 3 days and 1 week, 1 month, 2 months, 3 months,
    which roughly are [1,2,3,5,20,40,60] days.
    '''
    days = [1,2,3,5,20,40,60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    for day in days:
        f = create_features_v1(day)
        bunch_of_features = bunch_of_features.join(f)
    return bunch_of_features

bunch_of_features_v1 = create_bunch_of_features_v1()

#define the outcome target
#here, to make things easy to understand, I will only try to predict
#the next day's return
outcomes = pd.DataFrame(index=prices.index)
outcomes['close_1'] = prices.close.pct_change(-1)

#decide which features are redundant from the cluster map
deselected_features_v1 = ['close_loc_3D','close_loc_60D',
                          'volume_3D', 'volume_60D',
                          'price_spread_3D','price_spread_60D',
                          'close_change_3D','close_change_60D']
selected_features_v1 = bunch_of_features_v1.drop(labels=deselected_features_v1, axis=1)

#join the features and outcome together to remove the outliers
features_outcomes = selected_features_v1.join(outcomes)
stats = features_outcomes.describe()

#define the method to identify outliers
def get_outliers(df, i=4):
    #i is the number of sigmas, which defines the boundary around the mean
    outliers = pd.DataFrame()
    for col in df.columns:
        mu = stats.loc['mean', col]
        sigma = stats.loc['std', col]
        condition = (df[col] > mu + sigma * i) | (df[col] < mu - sigma * i)
        outliers[f'{col}_outliers'] = df[col][condition]
    return outliers

outliers = get_outliers(features_outcomes, i=1)

#remove all the outliers and NaN values
features_outcomes_rmv_outliers = features_outcomes.drop(index=outliers.index).dropna()

I know this article has gone on too long, so I am better off leaving it here. In the next article, I will look at data transformations to see whether there is a way to fix the distribution issue. Stay tuned!

Original article: https://towardsdatascience.com/feature-engineering-feature-selection-8c1d57af18d2
