17 Strategies for Dealing with Data, Big Data, and Even Bigger Data

Dealing with big data can be tricky. No one likes out of memory errors. ☹️ No one likes waiting for code to run. ⏳ No one likes leaving Python. 🐍

Don’t despair! In this article I’ll provide tips and introduce up-and-coming libraries to help you efficiently deal with big data. I’ll also point you toward solutions for code that won’t fit into memory. And all while staying in Python. 👍

Python is the most popular language for scientific and numerical computing. Pandas is the most popular for cleaning code and exploratory data analysis.

Using pandas with Python allows you to handle much more data than you could with Microsoft Excel or Google Sheets.

SQL databases are very popular for storing data, but the Python ecosystem has many advantages over SQL when it comes to expressiveness, testing, reproducibility, and the ability to quickly perform data analysis, statistics, and machine learning.

Unfortunately, if you are working locally, the amount of data that pandas can handle is limited by the amount of memory on your machine. And if you’re working in the cloud, more memory costs more money.

Regardless of where your code is running, you want operations to happen quickly so you can GSD (Get Stuff Done)! 😀

Things to always do

If you’ve ever heard or seen advice on speeding up code you’ve seen the warning. ⚠️ Don’t prematurely optimize! ⚠️

This is good advice. But it’s also smart to know techniques so you can write clean fast code the first time. 🚀

Getting after it! Source: pixabay.com

The following are three good coding practices for any size dataset.

  1. Avoid nested loops whenever possible. Here’s a brief primer on Big-O notation and algorithm analysis. One for loop nested inside another for loop generally leads to polynomial time calculations. If you have more than a few items to search through, you’ll be waiting for a while. See a nice chart and explanation here.

  2. Use list comprehensions (and dict comprehensions) whenever possible in Python. Creating a list on demand is faster than loading the append attribute of the list and repeatedly calling it as a function — hat tip to the Stack Overflow answer here. However, in general, don’t sacrifice clarity for speed, so be careful with nesting list comprehensions. ⚠️ (See the sketch after this list.)

  3. In pandas, use built-in vectorized functions. The principle is really the same as the reason for dict comprehensions. Applying a function to a whole data structure at once is much faster than repeatedly calling a function.
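
As a quick illustration of point 2, here’s a minimal sketch comparing an append loop to a list comprehension (the names are just for illustration):

    # Building a list with repeated .append calls works, but it's slower.
    squares = []
    for x in range(1_000_000):
        squares.append(x ** 2)

    # The same list as a comprehension: faster and arguably clearer.
    squares = [x ** 2 for x in range(1_000_000)]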

If you find yourself reaching for apply, think about whether you really need to. It’s looping over rows or columns. Vectorized methods are usually faster and less code, so they are a win-win. 🚀
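
For example, here’s a minimal sketch of a vectorized operation versus apply on a toy DataFrame (the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

    # Row-by-row with apply: loops over rows under the hood.
    df["total"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

    # Vectorized: one operation over whole columns, much faster on big data.
    df["total"] = df["price"] * df["quantity"]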

Avoid the other pandas Series and DataFrame methods that loop over your data — applymap, iterrows, itertuples. Use the replace method on a DataFrame instead of any of those other options to save lots of time.

Notice that these suggestions might not hold for very small amounts of data, but in that case, the stakes are low, so who cares. 😉

This brings us to our most important rule

If you can, stay in pandas. 🐼

It’s a happy place. 😀

Don’t worry about these issues if you aren’t having problems and you don’t expect your data to balloon. But at some point, you’ll encounter a big dataset and then you’ll want to know what to do. Let’s see some tips.

Things to do with pretty big data (roughly millions of rows)

Like millions of grains of sand. Source: pixabay.com

  1. Use a subset of your data to explore, clean, and make a baseline model if you’re doing machine learning. Solve 90% of your problems fast and save time and resources. This technique can save you so much time!

  2. Load only the columns that you need with the usecols argument when reading in your DataFrame. Less data in = win! (See the sketch after this list.)

  3. Use dtypes efficiently. Downcast numeric columns to the smallest dtypes that make sense with pandas.to_numeric(). Convert columns with low cardinality (just a few values) to a categorical dtype. Here’s a pandas guide on efficient dtypes.

  4. Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine’s cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument n_jobs=-1 when doing cross validation with GridSearchCV and many other classes.

  5. Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code here.

  6. Use pd.eval to speed up pandas operations. Pass your usual code to the function as a string. It does the operation much faster. Here’s a chart from tests with a 100-column DataFrame.
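
Here’s a minimal sketch of tips 2, 3, and 5 together: loading only the needed columns, shrinking dtypes, and saving to feather (the file and column names are hypothetical):

    import pandas as pd

    # Tip 2: read only the columns you need.
    df = pd.read_csv("big_file.csv", usecols=["user_id", "state", "amount"])

    # Tip 3: downcast numerics and convert low-cardinality columns to category.
    df["amount"] = pd.to_numeric(df["amount"], downcast="float")
    df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
    df["state"] = df["state"].astype("category")
    print(df.memory_usage(deep=True))  # check the savings

    # Tip 5: save in feather format for fast reads and writes (needs pyarrow).
    df.to_feather("big_file.feather")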

Image from this good article on the topic by Tirthajyoti Sarkar

df.query is basically the same as pd.eval, but it’s a DataFrame method instead of a top-level pandas function.
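
A minimal sketch of both, on a toy DataFrame with hypothetical columns a and b:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

    # pd.eval: pass ordinary pandas code as a string.
    result = pd.eval("df.a + df.b")

    # df.query: the same engine, as a DataFrame method for filtering rows.
    subset = df.query("a > 0.5 and b < 0.1")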

See the docs because there are some gotchas. ⚠️

Pandas is using numexpr under the hood. Numexpr also works with NumPy. Hat tip to Chris Conlan in his book Fast Python for pointing me to numexpr. Chris’s book is an excellent read for learning how to speed up your Python code. 👍

Things to do with really big data (roughly tens of millions of rows and up)

Even more data! Source: pixabay.com

  1. Use numba. Numba gives you a big speed boost if you’re doing mathematical calcs. Install numba and import it. Then use the @numba.jit decorator function when you need to loop over NumPy arrays and can’t use vectorized methods. It works only with NumPy arrays. Use .to_numpy() on a pandas DataFrame to convert it to a NumPy array. (See the sketch after this list.)

  2. Use SciPy sparse matrices when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your data is mostly 0s or missing values, you can convert columns to sparse dtypes in pandas. Read more here.

    在合理的情况下使用SciPy稀疏矩阵 。 Scikit-learn使用某些转换器(例如CountVectorizer)自动输出稀疏数组。 当数据大部分为0或缺少值时,可以将列转换为熊猫中的稀疏dtype。 在这里。

  3. Use Dask to parallelize the reading of datasets into pandas in chunks. Dask can also parallelize data operations across multiple machines. It mimics a subset of the pandas and NumPy APIs. Dask-ML is a sister package to parallelize machine learning algorithms across multiple machines. It mimics the scikit-learn API. Dask plays nicely with other popular machine learning libraries such as XGBoost, LightGBM, PyTorch, and TensorFlow.

    使用Dask将数据集的读取并行化为大块的熊猫。 Dask还可以跨多台机器并行化数据操作。 它模仿了熊猫和NumPy API的子集。 Dask-ML是一个姊妹软件包,用于在多台机器之间并行化机器学习算法。 它模仿了scikit-learn API。 Dask与其他流行的机器学习库(例如XGBoost,LightGBM,PyTorch和TensorFlow)配合得很好。

  4. Use PyTorch with or without a GPU. You can get really big speedups by using PyTorch on a GPU, as I found in this article on sorting.
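
Here’s a minimal numba sketch, assuming a plain NumPy array of numbers (the function is hypothetical):

    import numba
    import numpy as np

    @numba.jit(nopython=True)
    def sum_of_squares(arr):
        # An explicit loop is fine here: numba compiles it to fast machine code.
        total = 0.0
        for x in arr:
            total += x ** 2
        return total

    arr = np.random.rand(10_000_000)
    print(sum_of_squares(arr))  # the first call compiles; later calls are fast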

Things to keep an eye on/experiment with for dealing with big data in the future

Keep an eye on them! Source: pixabay.com

The following three packages are bleeding edge as of mid-2020. Expect configuration issues and early stage APIs. If you are working locally on a CPU, these are unlikely to fit your needs. But they all look very promising and are worth keeping an eye on. 🔭

  1. Do you have access to lots of CPU cores? Does your data have more than 32 columns (necessary as of mid-2020)? Then consider Modin. It mimics a subset of the pandas library to speed up operations on large datasets. It uses Apache Arrow (via Ray) or Dask under the hood. The Dask backend is experimental. Some things weren’t fast in my tests — for example reading in data from NumPy arrays was slow and memory management was an issue. (See the sketch after this list.)

  2. You can use jax in place of NumPy. Jax is a bleeding-edge open source Google product. It speeds up operations by using five things under the hood: autograd, XLA, JIT, vectorizer, and parallelizer. It works on a CPU, GPU, or TPU and might be simpler than using PyTorch or TensorFlow to get speed boosts. Jax is good for deep learning, too. It has a NumPy version but no pandas version yet. However, you could convert a DataFrame to TensorFlow or NumPy and then use jax. Read more here.

  3. Rapids cuDF uses Apache Arrow on GPUs with a pandas-like API. It’s an open source Python package from NVIDIA. Rapids plays nicely with Dask so you could get multiple GPUs processing data in parallel. For the biggest workloads, it should provide a nice boost.
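
Because Modin aims to be a drop-in replacement for pandas, the usual sketch is just a changed import (hedged, since the API was still settling as of mid-2020; the file and column names are hypothetical):

    import modin.pandas as pd  # instead of `import pandas as pd`

    # The rest of your code stays pandas-style; Modin parallelizes it under the hood.
    df = pd.read_csv("big_file.csv")
    print(df.groupby("state").sum())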

Other stuff to know about code speed and big data

Timing operations

If you want to time an operation in a Jupyter notebook, you can use %time or %%timeit magic commands. They both work on a single line or an entire code cell.

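For example, in a notebook (a hedged reconstruction, since the original screenshot didn’t survive):

    # Cell 1: time a single statement once.
    %time sum(range(10_000_000))

    # Cell 2: %%timeit must be the first line of its cell.
    %%timeit
    total = sum(x ** 2 for x in range(100_000))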

%time runs once and %%timeit runs the code multiple times (the default is seven). Do check out the docs to see some subtleties.

If you are in a script or notebook you can import the time module, check the time before and after running some code, and find the difference.

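A minimal sketch of that pattern (reconstructing the lost code screenshot):

    import time

    start = time.time()
    result = sum(x ** 2 for x in range(10_000_000))  # the code you want to time
    end = time.time()

    print(f"Elapsed: {end - start:.2f} seconds")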

When timing code, note that different machines and software versions can cause variation. Caching will sometimes mislead you if you are doing repeated tests. As with all experimentation, hold everything you can constant. 👍

Storing big data

GitHub’s maximum file size is 100MB. You can use the Git Large File Storage extension if you want to version large files with GitHub.

Make sure you aren’t auto-uploading files to Dropbox, iCloud, or some other auto-backup service, unless you want to be.

Want to learn more?

The pandas docs have sections on enhancing performance and scaling to large datasets. Some of these ideas are adapted from those sections.

Have other tips? I’d love to hear them over on Twitter. 🎉

Wrap

You’ve seen how to write faster code. You’ve also seen how to deal with big data and really big data. Finally, you saw some new libraries that will likely continue to become more popular for processing big data.

I hope you’ve found this guide to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀

I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍

Happy big data-ing! 😀

Translated from: https://towardsdatascience.com/17-strategies-for-dealing-with-data-big-data-and-even-bigger-data-283426c7d260

I understand that type variance is not fundamental to writing Scala code. Its been more or less a year since Ive been using Scala for my day-to-day job, and honestly, Ive never had to worry much about it. 我了解类型差异并不是编写Scala代码的基础。 自从我在日…