17 Strategies for Dealing with Data, Big Data, and Even Bigger Data

Dealing with big data can be tricky. No one likes out of memory errors. ☹️ No one likes waiting for code to run. ⏳ No one likes leaving Python. 🐍

Don’t despair! In this article I’ll provide tips and introduce up-and-coming libraries to help you efficiently deal with big data. I’ll also point you toward solutions for code that won’t fit into memory. And all while staying in Python. 👍

Python is the most popular language for scientific and numerical computing. Pandas is the most popular for cleaning code and exploratory data analysis.

Using pandas with Python allows you to handle much more data than you could with Microsoft Excel or Google Sheets.

SQL databases are very popular for storing data, but the Python ecosystem has many advantages over SQL when it comes to expressiveness, testing, reproducibility, and the ability to quickly perform data analysis, statistics, and machine learning.

Unfortunately, if you are working locally, the amount of data that pandas can handle is limited by the amount of memory on your machine. And if you’re working in the cloud, more memory costs more money.

Regardless of where your code is running, you want operations to happen quickly so you can GSD (Get Stuff Done)! 😀

Things to always do

If you’ve ever heard or seen advice on speeding up code you’ve seen the warning. ⚠️ Don’t prematurely optimize! ⚠️

This is good advice. But it’s also smart to know techniques so you can write clean fast code the first time. 🚀

Getting after it! Source: pixabay.com

The following are three good coding practices for any size dataset.

  1. Avoid nested loops whenever possible. Here’s a brief primer on Big-O notation and algorithm analysis. One for loop nested inside another for loop generally leads to polynomial time calculations. If you have more than a few items to search through, you’ll be waiting for a while. See a nice chart and explanation here.

  2. Use list comprehensions (and dict comprehensions) whenever possible in Python. Creating a list on demand is faster than loading the append attribute of the list and repeatedly calling it as a function — hat tip to the Stack Overflow answer here. However, in general, don’t sacrifice clarity for speed, so be careful with nesting list comprehensions. ⚠️ (See the sketch after this list.)

  3. In pandas, use built-in vectorized functions. The principle is really the same as the reason for dict comprehensions. Applying a function to a whole data structure at once is much faster than repeatedly calling a function.
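
As a quick illustration of point 2, here’s a minimal sketch comparing an append loop to a list comprehension (the names are just for illustration):

    # Building a list with repeated .append calls works, but it's slower.
    squares = []
    for x in range(1_000_000):
        squares.append(x ** 2)

    # The same list as a comprehension: faster and arguably clearer.
    squares = [x ** 2 for x in range(1_000_000)]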

If you find yourself reaching for apply, think about whether you really need to. It’s looping over rows or columns. Vectorized methods are usually faster and less code, so they are a win-win. 🚀
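
For example, here’s a minimal sketch of a vectorized operation versus apply on a toy DataFrame (the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

    # Row-by-row with apply: loops over rows under the hood.
    df["total"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

    # Vectorized: one operation over whole columns, much faster on big data.
    df["total"] = df["price"] * df["quantity"]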

Avoid the other pandas Series and DataFrame methods that loop over your data — applymap, iterrows, itertuples. Use the replace method on a DataFrame instead of any of those other options to save lots of time.

Notice that these suggestions might not hold for very small amounts of data, but in that case, the stakes are low, so who cares. 😉

This brings us to our most important rule

If you can, stay in pandas. 🐼

It’s a happy place. 😀

Don’t worry about these issues if you aren’t having problems and you don’t expect your data to balloon. But at some point, you’ll encounter a big dataset and then you’ll want to know what to do. Let’s see some tips.

Things to do with pretty big data (roughly millions of rows)

Like millions of grains of sand. Source: pixabay.com

  1. Use a subset of your data to explore, clean, and make a baseline model if you’re doing machine learning. Solve 90% of your problems fast and save time and resources. This technique can save you so much time!

  2. Load only the columns that you need with the usecols argument when reading in your DataFrame. Less data in = win! (See the sketch after this list.)

  3. Use dtypes efficiently. Downcast numeric columns to the smallest dtypes that make sense with pandas.to_numeric(). Convert columns with low cardinality (just a few values) to a categorical dtype. Here’s a pandas guide on efficient dtypes.

  4. Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine’s cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument n_jobs=-1 when doing cross validation with GridSearchCV and many other classes.

  5. Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code here.

  6. Use pd.eval to speed up pandas operations. Pass your usual code to the function as a string. It does the operation much faster. Here’s a chart from tests with a 100-column DataFrame.
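
Here’s a minimal sketch of tips 2, 3, and 5 together: loading only the needed columns, shrinking dtypes, and saving to feather (the file and column names are hypothetical):

    import pandas as pd

    # Tip 2: read only the columns you need.
    df = pd.read_csv("big_file.csv", usecols=["user_id", "state", "amount"])

    # Tip 3: downcast numerics and convert low-cardinality columns to category.
    df["amount"] = pd.to_numeric(df["amount"], downcast="float")
    df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
    df["state"] = df["state"].astype("category")
    print(df.memory_usage(deep=True))  # check the savings

    # Tip 5: save in feather format for fast reads and writes (needs pyarrow).
    df.to_feather("big_file.feather")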

Image from this good article on the topic by Tirthajyoti Sarkar

df.query is basically the same as pd.eval, but it’s a DataFrame method instead of a top-level pandas function.
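
A minimal sketch of both, on a toy DataFrame with hypothetical columns a and b:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

    # pd.eval: pass ordinary pandas code as a string.
    result = pd.eval("df.a + df.b")

    # df.query: the same engine, as a DataFrame method for filtering rows.
    subset = df.query("a > 0.5 and b < 0.1")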

See the docs because there are some gotchas. ⚠️

Pandas is using numexpr under the hood. Numexpr also works with NumPy. Hat tip to Chris Conlan in his book Fast Python for pointing me to numexpr. Chris’s book is an excellent read for learning how to speed up your Python code. 👍

Things to do with really big data (roughly tens of millions of rows and up)

Even more data! Source: pixabay.com

  1. Use numba. Numba gives you a big speed boost if you’re doing mathematical calcs. Install numba and import it. Then use the @numba.jit decorator function when you need to loop over NumPy arrays and can’t use vectorized methods. It works only with NumPy arrays. Use .to_numpy() on a pandas DataFrame to convert it to a NumPy array. (See the sketch after this list.)

  2. Use SciPy sparse matrices when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your data is mostly 0s or missing values, you can convert columns to sparse dtypes in pandas. Read more here.

    在合理的情况下使用SciPy稀疏矩阵 。 Scikit-learn使用某些转换器(例如CountVectorizer)自动输出稀疏数组。 当数据大部分为0或缺少值时,可以将列转换为熊猫中的稀疏dtype。 在这里。

  3. Use Dask to parallelize the reading of datasets into pandas in chunks. Dask can also parallelize data operations across multiple machines. It mimics a subset of the pandas and NumPy APIs. Dask-ML is a sister package to parallelize machine learning algorithms across multiple machines. It mimics the scikit-learn API. Dask plays nicely with other popular machine learning libraries such as XGBoost, LightGBM, PyTorch, and TensorFlow.

    使用Dask将数据集的读取并行化为大块的熊猫。 Dask还可以跨多台机器并行化数据操作。 它模仿了熊猫和NumPy API的子集。 Dask-ML是一个姊妹软件包,用于在多台机器之间并行化机器学习算法。 它模仿了scikit-learn API。 Dask与其他流行的机器学习库(例如XGBoost,LightGBM,PyTorch和TensorFlow)配合得很好。

  4. Use PyTorch with or without a GPU. You can get really big speedups by using PyTorch on a GPU, as I found in this article on sorting.
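
Here’s a minimal numba sketch, assuming a plain NumPy array of numbers (the function is hypothetical):

    import numba
    import numpy as np

    @numba.jit(nopython=True)
    def sum_of_squares(arr):
        # An explicit loop is fine here: numba compiles it to fast machine code.
        total = 0.0
        for x in arr:
            total += x ** 2
        return total

    arr = np.random.rand(10_000_000)
    print(sum_of_squares(arr))  # the first call compiles; later calls are fast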

Things to keep an eye on/experiment with for dealing with big data in the future

Keep an eye on them! Source: pixabay.com

The following three packages are bleeding edge as of mid-2020. Expect configuration issues and early stage APIs. If you are working locally on a CPU, these are unlikely to fit your needs. But they all look very promising and are worth keeping an eye on. 🔭

  1. Do you have access to lots of CPU cores? Does your data have more than 32 columns (necessary as of mid-2020)? Then consider Modin. It mimics a subset of the pandas library to speed up operations on large datasets. It uses Apache Arrow (via Ray) or Dask under the hood. The Dask backend is experimental. Some things weren’t fast in my tests — for example reading in data from NumPy arrays was slow and memory management was an issue. (See the sketch after this list.)

  2. You can use jax in place of NumPy. Jax is a bleeding-edge open source Google product. It speeds up operations by using five things under the hood: autograd, XLA, JIT, vectorizer, and parallelizer. It works on a CPU, GPU, or TPU and might be simpler than using PyTorch or TensorFlow to get speed boosts. Jax is good for deep learning, too. It has a NumPy version but no pandas version yet. However, you could convert a DataFrame to TensorFlow or NumPy and then use jax. Read more here.

  3. Rapids cuDF uses Apache Arrow on GPUs with a pandas-like API. It’s an open source Python package from NVIDIA. Rapids plays nicely with Dask so you could get multiple GPUs processing data in parallel. For the biggest workloads, it should provide a nice boost.
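
Because Modin aims to be a drop-in replacement for pandas, the usual sketch is just a changed import (hedged, since the API was still settling as of mid-2020; the file and column names are hypothetical):

    import modin.pandas as pd  # instead of `import pandas as pd`

    # The rest of your code stays pandas-style; Modin parallelizes it under the hood.
    df = pd.read_csv("big_file.csv")
    print(df.groupby("state").sum())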

Other stuff to know about code speed and big data

Timing operations

If you want to time an operation in a Jupyter notebook, you can use %time or %%timeit magic commands. They both work on a single line or an entire code cell.

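For example, in a notebook (a hedged reconstruction, since the original screenshot didn’t survive):

    # Cell 1: time a single statement once.
    %time sum(range(10_000_000))

    # Cell 2: %%timeit must be the first line of its cell.
    %%timeit
    total = sum(x ** 2 for x in range(100_000))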

%time runs once and %%timeit runs the code multiple times (the default is seven). Do check out the docs to see some subtleties.

If you are in a script or notebook you can import the time module, check the time before and after running some code, and find the difference.

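A minimal sketch of that pattern (reconstructing the lost code screenshot):

    import time

    start = time.time()
    result = sum(x ** 2 for x in range(10_000_000))  # the code you want to time
    end = time.time()

    print(f"Elapsed: {end - start:.2f} seconds")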

When timing code, note that different machines and software versions can cause variation. Caching will sometimes mislead you if you are doing repeated tests. As with all experimentation, hold everything you can constant. 👍

Storing big data

GitHub’s maximum file size is 100MB. You can use the Git Large File Storage extension if you want to version large files with GitHub.

Make sure you aren’t auto-uploading files to Dropbox, iCloud, or some other auto-backup service, unless you want to be.

Want to learn more?

The pandas docs have sections on enhancing performance and scaling to large datasets. Some of these ideas are adapted from those sections.

Have other tips? I’d love to hear them over on Twitter. 🎉

Wrap

You’ve seen how to write faster code. You’ve also seen how to deal with big data and really big data. Finally, you saw some new libraries that will likely continue to become more popular for processing big data.

I hope you’ve found this guide to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀

I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍

Happy big data-ing! 😀

Translated from: https://towardsdatascience.com/17-strategies-for-dealing-with-data-big-data-and-even-bigger-data-283426c7d260

I understand that type variance is not fundamental to writing Scala code. Its been more or less a year since Ive been using Scala for my day-to-day job, and honestly, Ive never had to worry much about it. 我了解类型差异并不是编写Scala代码的基础。 自从我在日…