Python data science packages

Python is the most popular language for data science. Unfortunately, it can be tricky to know which of the many data science libraries to use when. ☹️

Understanding when to use which library is key for quickly getting up to speed. In this article, I’ll give you the lay of the land for important Python data science libraries. 😀

Every package you’ll see is free and open source software. 👍 Thank you to all the folks who create, support, and maintain these projects! 🎉 If you’re interested in learning about contributing fixes to open source projects, here’s a good guide. And if you’re interested in the foundations that support these projects, I wrote an overview here.

Let’s get to it! 🚀

Pandas

Pandas is a workhorse to help you understand and manipulate your data. Use pandas to manipulate tabular data (think spreadsheet tables). Pandas is great for data cleaning, descriptive statistics, and basic visualizations.
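
To make that concrete, here’s a minimal sketch of cleaning, describing, and aggregating a small table. The column names and values are made up for illustration; they don’t come from any real dataset.

```python
import pandas as pd

# Hypothetical sales table; the column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "region": ["north", "south", "north", "west"],
        "units": [10, None, 7, 3],
        "price": [2.5, 4.0, 2.5, 10.0],
    }
)

# Data cleaning: fill the missing unit count with 0 and store the column as integers.
df["units"] = df["units"].fillna(0).astype(int)

# Derived column, computed for every row at once.
df["revenue"] = df["units"] * df["price"]

# Descriptive statistics for the numeric columns.
summary = df[["units", "price", "revenue"]].describe()

# Spreadsheet-style aggregation: total revenue per region.
revenue_by_region = df.groupby("region")["revenue"].sum()

print(summary)
print(revenue_by_region)
```

Filling missing values with 0 is just one possible cleaning choice; dropping the row or imputing a typical value may suit your data better.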

Pandas is relatively brain-friendly, although the API is gigantic. Check out my book on pandas if you want to get started with the most important parts of the API.

Unlike a SQL database, pandas stores all your data in-memory. It’s sort of like a hybrid between Microsoft Excel and a SQL database. Pandas makes operations with lots of data fast and repeatable.

The amount of memory on your machine constrains how many rows and columns pandas can handle. As a rough guideline, if your data stays under thousands of columns and hundreds of millions of rows, pandas should work well on most computers.
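
If you want to check how close you are to that limit, one option is to ask pandas how many bytes a DataFrame occupies. The sketch below uses a made-up table of one million rows; the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one million rows with two numeric columns.
n_rows = 1_000_000
df = pd.DataFrame(
    {
        "user_id": np.arange(n_rows, dtype=np.int64),
        "score": np.zeros(n_rows, dtype=np.float64),
    }
)

# Bytes used by each column (deep=True also counts Python objects such as strings).
bytes_per_column = df.memory_usage(index=False, deep=True)
total_mb = bytes_per_column.sum() / 1e6

# Downcasting a column to a smaller dtype cuts that column's footprint in half here.
df["score"] = df["score"].astype(np.float32)
smaller_mb = df.memory_usage(index=False, deep=True).sum() / 1e6

print(f"before: {total_mb:.0f} MB, after: {smaller_mb:.0f} MB")
```

Keep in mind that many operations create temporary copies, so the memory you need in practice is usually larger than the size of the DataFrame itself.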

Like a real panda bear, pandas is warm and fuzzy to work with. 🐼

When you have more data than pandas can handle, you might want to drop down to NumPy.

NumPy

NumPy ndarrays are like more powerful Python lists. They are the data structure on which the edifice of machine learning is built. They hold the data you need in as many dimensions as you need.

Do you have video data with three color channels for each pixel and lots of frames? No problem. 😀
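
As a small sketch, a video clip can live in a single four-dimensional ndarray. The clip size and the (frames, height, width, channels) axis order below are assumptions for illustration; other libraries may order the axes differently.

```python
import numpy as np

# Hypothetical clip: 10 frames of 48x64 pixels with 3 color channels (RGB).
frames, height, width, channels = 10, 48, 64, 3
video = np.zeros((frames, height, width, channels), dtype=np.uint8)

# Set the red channel of every pixel in the first frame to full intensity.
video[0, :, :, 0] = 255

# Average brightness of each frame, reduced over the height, width, and channel axes.
mean_per_frame = video.mean(axis=(1, 2, 3))

print(video.shape, video.ndim, video.nbytes)
print(mean_per_frame[:2])
```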

NumPy doesn’t have handy methods for time series data and strings like pandas does. In fact, each NumPy ndarray can have only one data type (hat tip to Kevin Markham for suggesting I include that differentiator).
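
Here’s a quick sketch of what the one-data-type rule looks like in practice, next to a pandas DataFrame that keeps a separate type per column. The values are arbitrary.

```python
import numpy as np
import pandas as pd

# Mixing integers and floats: NumPy upcasts every element to one common dtype.
mixed_numbers = np.array([1, 2, 3.5])

# Mixing numbers and text: every element becomes a string.
mixed_text = np.array([1, "two", 3.5])

# A pandas DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "c"]})

print(mixed_numbers.dtype)  # a floating-point dtype
print(mixed_text.dtype)     # a Unicode string dtype
print(df.dtypes)
```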

For tabular data, NumPy is also harder for your brain to work with than pandas. You can’t do things like easily display column names in tables, as you can in pandas.

NumPy has a bit more speed/memory efficiency than pandas, because it doesn’t have the additional overhead. However, there are other approaches that might scale better if you have really big data. I have an outline on that topic so let me know if you’d be interested in hearing about it on Twitter. 👍

What else is NumPy good for?

mathematical functions for ndarrays.
basic statistical functions for ndarrays.
making random variables from common distributions. NumPy has 27 distributions to randomly sample from.
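
The three uses above can be sketched in a few lines. The distribution parameters and the seed are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator makes the random draws repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=42)

# Draw samples from two common distributions.
heights = rng.normal(loc=170.0, scale=10.0, size=1_000)
arrivals = rng.poisson(lam=3.0, size=1_000)

# Mathematical functions apply element-wise to the whole array.
log_heights = np.log(heights)

# Basic statistical functions.
print(heights.mean(), heights.std(), np.median(heights))
print(np.percentile(arrivals, [25, 50, 75]))
```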

NumPy is like pandas without convenience functions and column names, but with some speed gains. 🚀

Scikit-learn

The scikit-learn library is the Swiss Army Knife of machine learning. If you are doing a prediction task that does not involve deep learning, scikit-learn is what you want to use. It can handle NumPy arrays with no problems and pandas data structures pretty well.

Scikit-learn pipelines and model selection functions are great for preparing and manipulating data in ways that avoid accidentally peeking at your hold-out (test set) data.
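
Here’s a minimal sketch of that idea, assuming a synthetic dataset and a simple scaler-plus-logistic-regression model (both are stand-ins, not a recommendation). Because the scaler sits inside the pipeline, it never sees the hold-out data while it is being fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Set the hold-out data aside before any fitting happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler lives inside the pipeline, so in each cross-validation fold it is
# fit on that fold's training portion only.
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on all the training data, then score once on the untouched test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(cv_scores.mean(), test_accuracy)
```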

The scikit-learn API is very consistent for preprocessing transformers and for estimators. This makes it relatively easy to search for the best results over many machine learning algorithms. And it makes it easier to wrap your head around the library. 🧠
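
As a rough sketch of what that consistency buys you, the loop below runs the same search code over three different algorithms. The dataset is synthetic and the hyperparameter grids are small, arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Each candidate pairs an estimator with a small, arbitrary hyperparameter grid.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "k_nearest_neighbors": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "random_forest": (RandomForestClassifier(random_state=0), {"n_estimators": [25, 50]}),
}

# Because every estimator exposes the same fit/score interface, one loop covers them all.
results = {}
for name, (estimator, param_grid) in candidates.items():
    search = GridSearchCV(estimator, param_grid, cv=3)
    search.fit(X, y)
    results[name] = search.best_score_

best_name = max(results, key=results.get)
print(results)
print("best:", best_name)
```

GridSearchCV also accepts an n_jobs parameter if you want the search to run in parallel.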

Scikit-learn can run work in parallel across CPU cores (many estimators and search tools accept an n_jobs parameter), so you can speed up your searches. However, it wasn’t built for GPUs, so it can’t take advantage of speedups there.

Scikit-learn also contains handy basic NLP functions.

If you want to do machine learning, familiarity with scikit-learn is essential.

The next two libraries are primarily used for deep neural networks. They work well with GPUs, TPUs, and CPUs.

TensorFlow

TensorFlow is the most popular deep learning library. It is especially common in industry. It was developed by Google.

The Keras high-level API has been tightly integrated with TensorFlow since TensorFlow 2.0.

In addition to working on CPU chips, TensorFlow can use GPUs and TPUs. These matrix-algebra optimized chips provide big speedups for deep learning.

PyTorch

PyTorch is the second most popular deep learning library and is now the most common in academic research. It was developed by Facebook and has been growing in popularity. You can see my article on the topic here.

PyTorch and TensorFlow now provide very similar functionality. They both have data structures, called tensors, that are similar to NumPy ndarrays. Tensors can be converted into ndarrays easily. Both packages also contain some basic statistical functions.

PyTorch’s API is generally considered a bit more pythonic than TensorFlow’s API.

Skorch, FastAI, and PyTorch Lightning are packages that reduce the amount of code needed to use PyTorch models. PyTorch/XLA lets you use PyTorch with TPUs.

Both PyTorch and TensorFlow will allow you to make top-notch deep learning models. 👍

Statsmodels

Statsmodels is the statistical modeling library. It’s the place for doing inferential frequentist statistics.

Want to run a statistical test and get some p-values? Statsmodels is your tool. 🙂

Statisticians and scientists coming from R can use the statsmodels formula API for a smooth transition to Python land.

In addition to common statistical tests such as the t-test, ANOVA, and linear regression, what is statsmodels good for?

test how closely your data matches a well-known distribution.
do time series modeling with ARIMA, Holt-Winters, and other algorithms.

Scikit-learn has some overlap with statsmodels when it comes to common models such as linear regression. However, the APIs are different. Also, scikit-learn is much more focused on prediction, while statsmodels is much more focused on inference. ⚠️

Statsmodels is built on NumPy and SciPy and plays nicely with pandas. 🙂

Speaking of SciPy, let’s look at when to use it.

SciPy 🔬

“SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.” — the docs.

SciPy is like NumPy’s twin. Many NumPy array functions can alternatively be called through SciPy. The two packages even share the same docs website.

SciPy sparse matrices are used in scikit-learn. A sparse matrix is one that is optimized to use far less memory than a regular, dense matrix when most elements are zeros.
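
To get a feel for the savings, here’s a small sketch that compares a dense matrix with its sparse counterpart. The identity matrix and the CSR format are just convenient choices for illustration; SciPy offers several other sparse formats.

```python
import numpy as np
from scipy import sparse

# A 1,000 x 1,000 matrix where only the diagonal is non-zero.
dense = np.eye(1_000)

# Compressed Sparse Row format stores only the non-zero entries and their positions.
sparse_matrix = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = (
    sparse_matrix.data.nbytes
    + sparse_matrix.indices.nbytes
    + sparse_matrix.indptr.nbytes
)

print(f"dense: {dense_bytes:,} bytes, sparse: {sparse_bytes:,} bytes")

# Converting back recovers the original values.
assert np.array_equal(sparse_matrix.toarray(), dense)
```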

SciPy contains common constants and linear algebra capabilities. The scipy.stats sub-module is used for probability distributions, descriptive stats, and statistical tests. It has 125 distributions to randomly sample from, nearly 100 more than NumPy. 😲 However, unless you are doing lots of stats, as a practicing data scientist, you’ll likely be fine with the distributions in NumPy.
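
Here’s a brief sketch touching each of those three uses of scipy.stats. The gamma parameters, the two groups, and the seed are made up for illustration.

```python
import numpy as np
from scipy import stats

# Seeded generator so the draws are repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)

# Probability distribution: sample from a gamma distribution, which has a mean of
# shape * scale (2.0 * 1.5 = 3.0 here).
gamma_samples = stats.gamma.rvs(a=2.0, scale=1.5, size=5_000, random_state=rng)

# Descriptive stats for the sample.
description = stats.describe(gamma_samples)

# Statistical test: two-sample t-test on two hypothetical groups of measurements.
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=11.0, scale=2.0, size=200)
t_statistic, p_value = stats.ttest_ind(group_a, group_b)

print(description.mean, description.variance)
print(t_statistic, p_value)
```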

If statsmodels or NumPy doesn’t have the functionality you need, then go look in SciPy. 👀

Dask

Dask has an API that mimics pandas and NumPy. Use Dask when you want to use pandas or NumPy but have more data than you can keep in memory.

Dask can also speed up calculations for big datasets. It parallelizes work across multiple threads, processes, or machines. You can combine Dask with Rapids to gain the performance benefits of distributed computing over GPUs. 👍

PyMC3

PyMC3 is the package for Bayesian statistics. Like Markov Chain Monte Carlo (MCMC) simulations? PyMC3 is your jam. 🎉

It’s a powerful library, though I find the PyMC3 API a bit confusing. That might be because I don’t use it a lot.

Other Popular Data Science Packages

I won’t dive deeply into visualization, NLP, gradient boosting, time series, or model serving libraries, but I’ll highlight a few popular packages in each area.

Visualization libraries 📊

There are gobs of visualization libraries in Python. Matplotlib, seaborn, and Plotly are three of the most popular. I go through some of the options toward the end of this article.

NLP libraries 🔠

Natural language processing (NLP) is a huge and important area of machine learning. Either spaCy or NLTK will have most of the functionality you’ll need. Both are very popular.

Gradient boosting regression tree libraries 🌳

LightGBM is the most popular gradient boosting package. Scikit-learn includes histogram-based gradient boosting estimators inspired by it. XGBoost and CatBoost are other boosting packages that are similar to LightGBM. If you look at the leaderboard of a Kaggle machine learning competition where deep learning isn’t a good match for the problem, you’ll likely see one of these gradient boosting libraries used by the winners.

Time series libraries 📅

Pmdarima makes fitting an ARIMA time series model less painful. However, the process of choosing the hyperparameters is not entirely automated.

Prophet is another package for making predictions with time series data. “Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.” — the docs. It was created by Facebook. The API and docs are user-friendly. It’s worth a shot if your data fit the description above.

Model serving libraries 🚀

When it comes to model serving, Flask, FastAPI, and Streamlit are three popular libraries to actually do something with your model predictions. 😉 Flask is a basic, battle-tested framework for making an API or serving a website. FastAPI makes setting up REST endpoints faster and easier. Streamlit makes it quick to serve a model in a single-page app. If you’re interested in learning more about Streamlit, I wrote a guide to getting started with it here.

Wrap

Here’s a quick recap of when to use which major Python data science library:

pandas for tabular data exploration and manipulation.
NumPy for random samples from common distributions, to save memory, or to speed up operations.
scikit-learn for machine learning.
TensorFlow or PyTorch for deep learning.
statsmodels for statistical modeling.
SciPy for statistical tests or distributions you can’t find in NumPy or statsmodels.
Dask when you want pandas or NumPy but have really big data.
PyMC3 for Bayesian stats.

I hope you enjoyed this tour of key Python data science packages. If you did, please share it on your favorite social media so other folks can find it, too. 😀

Now you hopefully have a clearer mental model of how the different Python data science libraries relate to each other and when to reach for each of them.

I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍

Happy exploring! 😀

Translated from: https://towardsdatascience.com/which-python-data-science-package-should-i-use-when-e98c701364c