📈Python for finance series
Warning: There is no magic formula or Holy Grail here, though a new world might open up for you.
Identifying Outliers
Identifying Outliers — Part Two
Identifying Outliers — Part Three
Stylized Facts
Feature Engineering & Feature Selection
Data Transformation
Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, it has no built-in function for finding and removing outliers, which is something we would like to have. Here I will show you how to do it step by step.
The key to defining an outlier lies in the boundary we employ. Here I will give three different ways to define the boundary: the mean, the moving average, and the exponentially weighted moving average, each combined with a standard deviation. This post covers the first one.
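As a rough sketch of how the three boundaries differ, the snippet below builds each one with pandas on synthetic returns. The random data, the 21-day window, the span, and the helper name `make_bounds` are illustrative assumptions, not values taken from this series.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (assumption: a stand-in for real stock returns)
rng = np.random.default_rng(0)
rtn = pd.Series(rng.normal(0.0, 0.02, 500), name="simple_rtn")

def make_bounds(center, spread, n_sigmas=3):
    """Return (lower, upper) boundaries n_sigmas spreads away from center."""
    return center - n_sigmas * spread, center + n_sigmas * spread

# 1) Whole-sample mean and std: one constant boundary
lo_all, hi_all = make_bounds(rtn.mean(), rtn.std())

# 2) Moving average: the boundary follows a rolling window (21 days is an assumption)
roll = rtn.rolling(window=21)
lo_ma, hi_ma = make_bounds(roll.mean(), roll.std())

# 3) Exponentially weighted moving average (span=21 is an assumption)
ewm = rtn.ewm(span=21)
lo_ewm, hi_ewm = make_bounds(ewm.mean(), ewm.std())

print(lo_all, hi_all)
print(hi_ma.tail(3))
print(hi_ewm.tail(3))
```

The first boundary is a pair of constants, while the other two are Series that change over time, so each observation is compared with its own local boundary.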
1. Data preparation
Here I use roughly 10 years of Apple's historical stock prices and returns from Yahoo Finance as an example. Of course, you can use any data.
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 300

# Download AAPL daily prices from Yahoo Finance
df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31')
As we only care about the returns, a new DataFrame (d1) is created to hold the adjusted price and returns.
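d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close': 'adj_close'}, inplace=True)
# Simple return: day-over-day percentage change of the adjusted close
d1['simple_rtn'] = d1['adj_close'].pct_change()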
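The simple_rtn column used in the rest of this post is assumed to be the simple return, that is, the day-over-day percentage change of the adjusted close, r_t = P_t / P_{t-1} - 1. A minimal sketch with made-up prices (the numbers are illustrative only) shows that pandas' `pct_change` computes exactly this:

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([100.0, 110.0, 99.0], name="adj_close")

# Built-in percentage change
rtn_pct = prices.pct_change()

# The same thing written out: P_t / P_{t-1} - 1
rtn_manual = prices / prices.shift(1) - 1

print(rtn_pct)
print(rtn_manual)
```

The first value is NaN because there is no previous price to compare with; the other two are about +10% and -10%.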
2. Using mean and standard deviation as the boundary
Calculate the mean and std of the simple_rtn:
d1_mean = d1['simple_rtn'].agg(['mean', 'std'])
If we use the mean plus and minus one std as the boundary, the result looks like this:
fig, ax = plt.subplots(figsize=(10,6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax=ax)
plt.axhline(y=d1_mean.loc['mean'], c='r', label='mean')
# Boundary lines sit one std above and below the mean
plt.axhline(y=d1_mean.loc['mean'] + d1_mean.loc['std'], c='c', linestyle='-.', label='mean + std')
plt.axhline(y=d1_mean.loc['mean'] - d1_mean.loc['std'], c='c', linestyle='-.', label='mean - std')
plt.legend(loc='lower right')
What happens if I use three times the std instead?
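To get a feel for the effect of widening the boundary, the sketch below counts how many points fall outside the band for one and for three standard deviations. It uses synthetic, fat-tailed returns (Student's t draws stand in for real stock returns; this is an assumption), and `count_outside` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic fat-tailed returns (assumption: not real AAPL data)
rng = np.random.default_rng(42)
rtn = pd.Series(0.02 * rng.standard_t(df=4, size=2500), name="simple_rtn")

mu, sigma = rtn.mean(), rtn.std()

def count_outside(series, mu, sigma, n_sigmas):
    """Count observations outside mu +/- n_sigmas * sigma."""
    outside = (series > mu + n_sigmas * sigma) | (series < mu - n_sigmas * sigma)
    return int(outside.sum())

n_outside_1 = count_outside(rtn, mu, sigma, 1)
n_outside_3 = count_outside(rtn, mu, sigma, 3)
print(n_outside_1, n_outside_3)
```

With one std, a large share of ordinary days is flagged; with three, only the extreme moves remain, which is closer to what we mean by an outlier.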
Looks good! Now it is time to look for those outliers:
mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']

def get_outliers(df, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    df: a row of the DataFrame
    mu: mean
    sigma: std
    n_sigmas: number of std as boundary
    '''
    x = df['simple_rtn']
    # Flag the row if the return lies outside mu +/- n_sigmas * sigma
    if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma):
        return 1
    return 0
After applying the get_outliers rule to the stock returns, a new column is created:
d1['outlier'] = d1.apply(get_outliers, axis=1)
✍Tip!
# The above code snippet can be refactored as follows:
import numpy as np
cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)
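As a quick check that a row-wise rule and a vectorized `np.where` condition flag the same points, here is a sketch on synthetic data. The data, the two injected extremes, and the name `flag_row` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic returns with two injected extreme moves (illustrative only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"simple_rtn": rng.normal(0, 0.02, 1000)})
toy.loc[10, "simple_rtn"] = 0.25
toy.loc[20, "simple_rtn"] = -0.25

mu = toy["simple_rtn"].mean()
sigma = toy["simple_rtn"].std()

def flag_row(row, n_sigmas=3):
    """Row-wise rule: 1 if the return is outside mu +/- n_sigmas * sigma."""
    x = row["simple_rtn"]
    return 1 if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma) else 0

# Row-by-row version
by_apply = toy.apply(flag_row, axis=1)

# Vectorized version
cond = (toy["simple_rtn"] > mu + 3 * sigma) | (toy["simple_rtn"] < mu - 3 * sigma)
by_where = pd.Series(np.where(cond, 1, 0), index=toy.index)

print((by_apply == by_where).all())
```

Both versions produce the same flags; the vectorized one avoids calling a Python function once per row, which usually matters on long price histories.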
Let's have a look at the outliers. We can check how many we found by doing a value count on the outlier column.
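The value count can be done with `value_counts()` on the flag column. Below is a minimal sketch on toy data; the numbers are made up so that exactly two points are extreme. On `d1`, the equivalent call would be `d1['outlier'].value_counts()`.

```python
import numpy as np
import pandas as pd

# 200 small alternating returns with two injected extremes (illustrative data)
values = np.tile([0.01, -0.01], 100)
values[50], values[150] = 0.50, -0.50
toy = pd.DataFrame({"simple_rtn": values})

mu, sigma = toy["simple_rtn"].mean(), toy["simple_rtn"].std()
cond = (toy["simple_rtn"] > mu + 3 * sigma) | (toy["simple_rtn"] < mu - 3 * sigma)
toy["outlier"] = np.where(cond, 1, 0)

# Number of rows per flag value: 198 normal points, 2 outliers
counts = toy["outlier"].value_counts()
print(counts)
```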
We found 30 outliers with three times the std as the boundary. We can pick those outliers out, put them into another DataFrame, and show them in the graph:
outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]

fig, ax = plt.subplots()
ax.plot(d1.index, d1.simple_rtn,
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn,
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')
plt.tight_layout()
In the above plot, the outliers are marked with red dots. In the next post, I will show you how to use the moving average and moving standard deviation as the boundary.
Happy learning, happy coding!
Translated from: https://medium.com/python-in-plain-english/identifying-outliers-part-one-c0a31d9faefa