Identifying and handling missing values and outliers: Identifying Outliers, Part One

📈Python for finance series

Warning: There is no magic formula or Holy Grail here, though a new world might open its doors to you.

📈Python for finance series

  1. Identifying Outliers

  2. Identifying Outliers — Part Two

  3. Identifying Outliers — Part Three

  4. Stylized Facts

  5. Feature Engineering & Feature Selection

  6. Data Transformation

Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, finding and removing outliers is one of those functions we would like to have that does not exist yet. Here I would like to show you how to do it, step by step and in detail:

The key to defining an outlier lies in the boundary we choose. Here I will give 3 different ways to define the boundary, namely the mean, the moving average and the exponentially weighted moving average.
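As a rough sketch of these three definitions, the snippet below computes each kind of mean on a small made-up return series. The series values, the 3-observation window and the span of 3 are illustrative assumptions, not values taken from this series of posts.

```python
import pandas as pd

# Hypothetical daily returns, made up for illustration only
rtn = pd.Series([0.01, -0.02, 0.03, 0.00, 0.15, -0.01, 0.02])

# 1. Plain mean: a single number for the whole series
overall_mean = rtn.mean()

# 2. Moving average: mean of the 3 most recent observations (window is an assumption)
rolling_mean = rtn.rolling(window=3).mean()

# 3. Exponentially weighted moving average: recent observations
#    get more weight (span is an assumption)
ewm_mean = rtn.ewm(span=3).mean()

summary = pd.DataFrame({'rtn': rtn,
                        'overall_mean': overall_mean,
                        'rolling_mean': rolling_mean,
                        'ewm_mean': ewm_mean})
print(summary)
```

The plain mean gives one fixed boundary for the whole history, while the other two give a boundary that moves with the data. The rolling mean is NaN until a full window of observations is available.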

1. Data preparation

Here I use Apple's 10-year stock price history and returns from Yahoo Finance as an example, but you can use any data.

import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

# Matplotlib 3.6 renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31')

As we only care about the returns, a new DataFrame (d1) is created to hold the adjusted price and returns.

d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close':'adj_close'}, inplace=True)
d1['simple_rtn']=d1.adj_close.pct_change()
d1.head()
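To make the return calculation concrete: pct_change computes the simple return, (P_t - P_{t-1}) / P_{t-1}, for each row. Here is a minimal sketch on made-up prices (the numbers are hypothetical, not Apple data):

```python
import pandas as pd

# Hypothetical adjusted closing prices, for illustration only
prices = pd.Series([100.0, 110.0, 99.0], name='adj_close')

# pct_change() divides each price change by the previous price
simple_rtn = prices.pct_change()

# The same calculation written out by hand
manual_rtn = (prices - prices.shift(1)) / prices.shift(1)

print(simple_rtn)
```

The first row has no previous price, so its return is NaN. The same holds for the first row of simple_rtn in d1.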

2. Using mean and standard deviation as the boundary

Calculate the mean and standard deviation of simple_rtn:

d1_mean = d1['simple_rtn'].agg(['mean', 'std'])

If we use the mean plus or minus one standard deviation as the boundary, the result looks like this:

fig, ax = plt.subplots(figsize=(10,6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax=ax)
# The boundary is the mean plus or minus one standard deviation
upper = d1_mean.loc['mean'] + d1_mean.loc['std']
lower = d1_mean.loc['mean'] - d1_mean.loc['std']
ax.axhline(y=d1_mean.loc['mean'], c='r', label='mean')
ax.axhline(y=upper, c='c', linestyle='-.', label='mean + std')
ax.axhline(y=lower, c='c', linestyle='-.', label='mean - std')
ax.legend(loc='lower right')

What happens if I use three times the standard deviation instead?
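To get a feel for how much the width of the boundary matters, here is a rough sketch on synthetic, normally distributed returns. This is an assumption made for illustration: real stock returns tend to have fatter tails. It measures the share of points that fall outside the mean plus or minus n standard deviations. The seed, the sample size and the helper name share_outside are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic returns drawn from a normal distribution (assumption, not real data)
rng = np.random.default_rng(42)
rtn = pd.Series(rng.normal(loc=0.0, scale=0.02, size=10_000))

def share_outside(series, n_sigmas):
    """Fraction of observations outside mean +/- n_sigmas * std."""
    mu, sigma = series.mean(), series.std()
    outside = (series > mu + n_sigmas * sigma) | (series < mu - n_sigmas * sigma)
    return outside.mean()

for n in (1, 2, 3):
    print(n, round(share_outside(rtn, n), 4))
```

For normally distributed data, roughly 32% of points fall outside one standard deviation but only about 0.3% fall outside three, so a one-sigma boundary would flag far too many ordinary days as outliers.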

Looks good! Now it is time to look for those outliers:

mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']

def get_outliers(df, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    df: a row of the DataFrame (passed in by apply with axis=1)
    mu: mean
    sigma: std
    n_sigmas: number of std as boundary
    '''
    x = df['simple_rtn']
    # Flag the return if it falls outside mu +/- n_sigmas * sigma
    if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma):
        return 1
    else:
        return 0

After applying the get_outliers rule to the stock returns, a new column is created:

d1['outlier'] = d1.apply(get_outliers, axis=1)
d1.head()

✍Tip!

# The above code snippet can be refactored as follows:
import numpy as np
cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)
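The row-wise apply and the vectorized np.where version should produce the same flags when they use the same number of standard deviations. A small sketch checking this on toy data follows. The values and the names toy, row_rule, by_apply and by_where are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy returns with one obvious extreme value at the end (made up for illustration)
toy = pd.DataFrame({'simple_rtn': [0.01, -0.01, 0.02, -0.02, 0.0] * 6 + [0.50]})
mu = toy['simple_rtn'].mean()
sigma = toy['simple_rtn'].std()
n_sigmas = 3

def row_rule(row):
    """Row-wise rule, as used with DataFrame.apply(axis=1)."""
    x = row['simple_rtn']
    return 1 if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma) else 0

# Approach 1: call the rule once per row
toy['by_apply'] = toy.apply(row_rule, axis=1)

# Approach 2: evaluate the condition on the whole column at once
cond = (toy['simple_rtn'] > mu + n_sigmas * sigma) | (toy['simple_rtn'] < mu - n_sigmas * sigma)
toy['by_where'] = np.where(cond, 1, 0)

print((toy['by_apply'] == toy['by_where']).all())
```

The vectorized form avoids a Python-level function call per row, which usually makes it noticeably faster on long price histories.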

Let’s have a look at the outliers. We can check how many outliers we found by doing a value count.

d1.outlier.value_counts()
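As a small illustration of what value_counts returns, here is a sketch on a hypothetical flag column (the values are made up). Passing normalize=True gives the same breakdown as fractions, which can be handy for judging whether a boundary flags too many points.

```python
import pandas as pd

# Hypothetical flags: 1 marks an outlier, 0 a normal observation
flags = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], name='outlier')

counts = flags.value_counts()                # absolute count per flag value
shares = flags.value_counts(normalize=True)  # same, as a fraction of all rows

print(counts)
print(shares)
```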

We found 30 outliers when using three times the standard deviation as the boundary. We can pick those outliers out, put them into another DataFrame and show them on the graph:

# Select the rows flagged as outliers
outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]

fig, ax = plt.subplots()
ax.plot(d1.index, d1.simple_rtn,
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn,
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()

In the plot above, the outliers are marked with red dots. In the next post, I will show you how to use the moving average and the moving standard deviation as the boundary.

Happy learning, happy coding!

Translated from: https://medium.com/python-in-plain-english/identifying-outliers-part-one-c0a31d9faefa
