Identifying and Handling Missing Values and Outliers: Identifying Outliers, Part One

📈Python for finance series

Warning: There is no magic formula or Holy Grail here, though a new world might open its doors to you.

Identifying Outliers
Identifying Outliers — Part Two
Identifying Outliers — Part Three
Stylized Facts
Feature Engineering & Feature Selection
Data Transformation

Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, finding and removing outliers is a feature we would like to have that does not exist yet. Here I will show you how to do it step by step.

The key to defining an outlier lies in the boundary we choose. Here I will give three different ways to define the boundary: the mean, the moving average, and the exponentially weighted moving average, each combined with the matching standard deviation.
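The three boundary definitions map onto three pandas calls. Here is a rough sketch on synthetic returns. The data, the 21-day window and the 3-std multiplier are assumptions for illustration, not values taken from this series:

```python
import numpy as np
import pandas as pd

# Synthetic daily returns stand in for real data (an assumption for illustration)
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.02, 500), name="simple_rtn")

window = 21  # hypothetical window length, roughly one trading month

# 1. Global mean and std: one constant band for the whole sample
mean_all, std_all = returns.mean(), returns.std()

# 2. Moving average and moving std: the band follows recent data
mean_roll = returns.rolling(window).mean()
std_roll = returns.rolling(window).std()

# 3. Exponentially weighted moving average and std: recent data weighs more
mean_ewm = returns.ewm(span=window).mean()
std_ewm = returns.ewm(span=window).std()

# Upper boundaries at three standard deviations for each definition
bands = pd.DataFrame({
    "global_upper": mean_all + 3 * std_all,
    "rolling_upper": mean_roll + 3 * std_roll,
    "ewm_upper": mean_ewm + 3 * std_ewm,
})
print(bands.tail())
```

The global band is a single number repeated on every row, while the rolling band is undefined until a full window of data is available.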

1. Data preparation

Here I use Apple's stock price history from Yahoo Finance (2000 through 2010) and the returns derived from it as an example; of course, you can use any data.

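import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

plt.style.use('seaborn')  # named 'seaborn-v0_8' from matplotlib 3.6 on
plt.rcParams['figure.dpi'] = 300

df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31',
                 auto_adjust=False)  # keep 'Adj Close'; newer yfinance adjusts by default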

Since we only care about the returns, we create a new DataFrame (d1) to hold the adjusted close price and the simple returns.

d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close':'adj_close'}, inplace=True)
d1['simple_rtn']=d1.adj_close.pct_change()
d1.head()
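For reference, pct_change computes the simple return r_t = P_t / P_{t-1} - 1. A tiny sketch with made-up prices (not real AAPL data) shows the equivalence, and also why the first row of simple_rtn is NaN:

```python
import pandas as pd

# Made-up prices, only to show what pct_change computes
prices = pd.Series([100.0, 110.0, 99.0], name="adj_close")

by_pct_change = prices.pct_change()
by_hand = prices / prices.shift(1) - 1  # r_t = P_t / P_{t-1} - 1

print(by_pct_change)
print(by_hand)
```

The first value has no previous price to compare with, so it is NaN in both versions.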

2. Using mean and standard deviation as the boundary

Calculate the mean and std of the simple_rtn:


d1_mean = d1['simple_rtn'].agg(['mean', 'std'])

If we use the mean plus or minus one std as the boundary, the result looks like this:

fig, ax = plt.subplots(figsize=(10,6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax=ax)
ax.axhline(y=d1_mean.loc['mean'], c='r', label='mean')
# Boundary lines at mean + std and mean - std (only one gets a legend entry)
ax.axhline(y=d1_mean.loc['mean'] + d1_mean.loc['std'], c='c', linestyle='-.', label='mean +/- std')
ax.axhline(y=d1_mean.loc['mean'] - d1_mean.loc['std'], c='c', linestyle='-.')
ax.legend(loc='lower right')

What happens if I use three standard deviations instead?
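To get a feel for how the band width changes what falls outside it, here is a minimal sketch on synthetic returns. The helper name sigma_bounds and the random data are assumptions for illustration, not part of the original code:

```python
import numpy as np
import pandas as pd

def sigma_bounds(series, n_sigmas):
    """Return (lower, upper) = mean -/+ n_sigmas * std of a series."""
    mu, sigma = series.mean(), series.std()
    return mu - n_sigmas * sigma, mu + n_sigmas * sigma

# Synthetic returns stand in for d1['simple_rtn'] (an assumption)
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0, 0.02, 2500))

for k in (1, 2, 3):
    lower, upper = sigma_bounds(returns, k)
    share_outside = ((returns < lower) | (returns > upper)).mean()
    print(f"{k} std: [{lower:.4f}, {upper:.4f}], share outside = {share_outside:.2%}")
```

The returned pair could be passed to two ax.axhline calls to draw the wider band on the returns plot.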

Looks good! Now it is time to look for those outliers:

mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']

def get_outliers(row, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    row: one row of the DataFrame (passed in by apply with axis=1)
    mu: mean
    sigma: std
    n_sigmas: number of std as boundary
    '''
    x = row['simple_rtn']

    # Flag returns that fall outside mu +/- n_sigmas * sigma
    if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma):
        return 1
    else:
        return 0

After applying the get_outliers rule to the stock returns, a new column is created:

d1['outlier'] = d1.apply(get_outliers, axis=1)
d1.head()

✍Tip!

# The above code snippet can be refactored as follows:
import numpy as np

cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)
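The vectorized condition in the tip could also be wrapped in a small reusable function. This is a minimal sketch: flag_outliers is a hypothetical helper name and the returns are made up for illustration:

```python
import numpy as np
import pandas as pd

def flag_outliers(series, n_sigmas=3):
    """Return a 0/1 integer Series marking values outside mean +/- n_sigmas * std."""
    mu, sigma = series.mean(), series.std()
    cond = (series > mu + n_sigmas * sigma) | (series < mu - n_sigmas * sigma)
    return pd.Series(np.where(cond, 1, 0), index=series.index, name="outlier")

# Made-up returns: a leading NaN (as pct_change produces), twenty quiet days, one spike
values = [np.nan] + [0.01, -0.01] * 10 + [0.50]
returns = pd.Series(values, name="simple_rtn")

flags = flag_outliers(returns)
print(flags.value_counts())
```

Comparisons against NaN evaluate to False, so the leading NaN row is flagged 0 rather than raising an error.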

Let’s have a look at the outliers. We can check how many outliers we found by doing a value count.


d1.outlier.value_counts()

We found 30 outliers when using three standard deviations as the boundary. We can pick those outliers out, put them into another DataFrame, and show them on a graph:

outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]

fig, ax = plt.subplots()

ax.plot(d1.index, d1.simple_rtn,
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn,
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()

In the above plot, the outliers are marked with red dots. In the next post, I will show you how to use the moving average and moving standard deviation as the boundary.
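The introduction also mentions removing outliers. Once a 0/1 flag column exists, two common options are dropping the flagged rows or blanking out their values. Here is a minimal sketch on a made-up frame; d1_demo and its numbers are assumptions, not results from the AAPL data:

```python
import numpy as np
import pandas as pd

# Small made-up frame with an 'outlier' flag column like the one built earlier
d1_demo = pd.DataFrame({
    "simple_rtn": [0.01, -0.02, 0.35, 0.00, -0.40],
    "outlier": [0, 0, 1, 0, 1],
})

# Option 1: drop the flagged rows entirely
dropped = d1_demo.loc[d1_demo["outlier"] == 0]

# Option 2: keep every row but replace the flagged returns with NaN
masked = d1_demo.copy()
masked.loc[masked["outlier"] == 1, "simple_rtn"] = np.nan

print(len(dropped), masked["simple_rtn"].isna().sum())
```

Which option fits depends on whether later steps need an unbroken date index.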

Happy learning, happy coding!


Translated from: https://medium.com/python-in-plain-english/identifying-outliers-part-one-c0a31d9faefa
