Data Scientists, Start Using Profilers

Data scientists often need to write a lot of complex, slow, CPU- and I/O-heavy code — whether you’re working with large matrices, millions of rows of data, reading in data files, or web-scraping.

Wouldn’t you hate to waste your time refactoring one section of your code, trying to wring out every last ounce of performance, when a few simple changes to another section could speed up your code tenfold?

If you’re looking for a way to speed up your code, a profiler can show you exactly which parts are taking the most time, allowing you to see which sections would benefit most from optimization.

A profiler measures the time or space complexity of a program. There’s certainly value in theorizing about the big O complexity of an algorithm but it can be equally valuable to examine the real complexity of an algorithm.

Where is the biggest slowdown in your code? Is your code I/O bound or CPU bound? Which specific lines are causing the slowdowns?

Once you’ve answered those questions you’ll A) have a better understanding of your code and B) know where to target your optimization efforts in order to get the biggest boon with the least effort.

Let’s dive into some quick examples using Python.

The Basics

You might already be familiar with a few methods of timing your code. You could check the time before and after a line executes like this:

In [1]: import time
   ...: start_time = time.time()
   ...: a_function()  # Function you want to measure
   ...: end_time = time.time()
   ...: time_to_complete = end_time - start_time
   ...: time_to_complete
Out[1]: 1.0110783576965332

Or, if you’re in a Jupyter Notebook, you could use the magic %time command to time the execution of a statement, like this:

In [2]: %time a_function()
CPU times: user 14.2 ms, sys: 41 µs, total: 14.2 ms
Wall time: 1.01 s

Or, you could use the other magic command %timeit, which gets a more accurate measurement by running the command multiple times, like this:

In [3]: %timeit a_function()
1.01 s ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Alternatively, if you want to time your whole script, you can use the bash command time, like so…

$ time python my_script.py
real 0m1.041s
user 0m0.040s
sys 0m0.000s

These techniques are great if you want to get a quick sense of how long a script or a section of code takes to run but they are less useful when you want a more comprehensive picture. It would be a nightmare if you had to wrap each line in time.time() checks. In the next section, we’ll look at how to use Python’s built-in profiler.

Diving Deeper with cProfile

When you’re trying to get a better understanding of how your code is running, the first place to start is cProfile, Python’s built-in profiler. cProfile will keep track of how often and for how long parts of your program were executed.

Just keep in mind that cProfile shouldn’t be used to benchmark your code. It’s written in C which makes it fast but it still introduces some overhead that could throw off your times.

There are multiple ways to use cProfile but one simple way is from the command line.
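
For reference, cProfile can also be driven from inside Python through its Profile class. Here’s a small sketch (a_function is a stand-in for whatever code you want to measure):

```python
import cProfile
import io
import pstats

def a_function():
    # Stand-in for the code you want to profile
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
a_function()
profiler.disable()

# Print the top entries, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

This can be handy in a notebook or test harness where shelling out to the command line is awkward.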

Before we demo cProfile, let’s start by looking at a basic sample program that will download some text files, count the words in each one, and then save each file’s top 10 words to a file. That said, what the code does isn’t too important; we’ll just be using it to show how the profiler works.

Demo code to test our profiler
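
The gist with the demo code isn’t reproduced here, so below is a minimal sketch consistent with the function names in the profiling output shown later; the regex, file layout, and placeholder URL list are assumptions:

```python
import re
import urllib.request
from collections import Counter

def get_book(url):
    # Download a plain-text book (I/O bound: one HTTP request per book)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def split_words(text):
    # Split the text into lowercase words
    return re.split(r"[^a-z]+", text.lower())

def count_words(words):
    # Tally how often each word appears
    return Counter(words)

def read_books(urls):
    # Download each book and keep its 10 most common words
    results = {}
    for url in urls:
        words = split_words(get_book(url))
        results[url] = count_words(words).most_common(10)
    return results

def save_results(results, path="results.txt"):
    # Write the top-10 lists to a file
    with open(path, "w") as f:
        for url, top_words in results.items():
            f.write(f"{url}: {top_words}\n")

if __name__ == "__main__":
    urls = []  # e.g. Project Gutenberg plain-text URLs (placeholder)
    save_results(read_books(urls))
```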

Now, with the following command, we’ll profile our script.

$ python -m cProfile -o profile.stat script.py

The -o flag specifies an output file for cProfile to save the profiling statistics.

Next, we can fire up Python to examine the results using the pstats module (also part of the standard library).

In [1]: import pstats
   ...: p = pstats.Stats("profile.stat")
   ...: p.sort_stats(
   ...:     "cumulative"   # sort by cumulative time spent
   ...: ).print_stats(
   ...:     "script.py"    # only show fn calls in script.py
   ...: )

Fri Aug 07 08:12:06 2020    profile.stat

         46338 function calls (45576 primitive calls) in 6.548 seconds

   Ordered by: cumulative time
   List reduced from 793 to 6 due to restriction <'script.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.008    0.008    5.521    5.521 script.py:1(<module>)
        1    0.012    0.012    5.468    5.468 script.py:19(read_books)
        5    0.000    0.000    4.848    0.970 script.py:5(get_book)
        5    0.000    0.000    0.460    0.092 script.py:11(split_words)
        5    0.000    0.000    0.112    0.022 script.py:15(count_words)
        1    0.000    0.000    0.000    0.000 script.py:32(save_results)

Wow! Look at all that useful info!

For each function called, we’re seeing the following information:

  • ncalls: number of times the function was called

  • tottime: total time spent in the given function (excluding calls to sub-functions)

  • percall: tottime divided by ncalls

    percalltottime除以ncalls

  • cumtime: total time spent in this function and all sub-functions

  • percall: (again) cumtime divided by ncalls

  • filename:lineno(function): the file name, line number, and function name

When reading this output, note that we’re hiding a lot of data; in fact, we’re only seeing 6 out of 793 rows. Those hidden rows are the sub-functions being called from within functions like urllib.request.urlopen or re.split. Also, note that the <module> row corresponds to the code in script.py that isn’t inside a function.

Now let’s look back at the results, sorted by cumulative duration.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.008    0.008    5.521    5.521 script.py:1(<module>)
        1    0.012    0.012    5.468    5.468 script.py:19(read_books)
        5    0.000    0.000    4.848    0.970 script.py:5(get_book)
        5    0.000    0.000    0.460    0.092 script.py:11(split_words)
        5    0.000    0.000    0.112    0.022 script.py:15(count_words)
        1    0.000    0.000    0.000    0.000 script.py:32(save_results)

Keep in mind the hierarchy of function calls. The top-level, <module>, calls read_books and save_results. read_books calls get_book, split_words, and count_words. By comparing cumulative times, we see that most of <module>’s time is spent in read_books and most of read_books’s time is spent in get_book, where we make our HTTP request, making this script (unsurprisingly) I/O bound.

Next, let’s take a look at how we can be even more granular by profiling our code line-by-line.

Profiling Line-by-Line

Once we’ve used cProfile to get a sense of what function calls are taking the most time, we can examine those functions line-by-line to get an even clearer picture of where our time is being spent.

For this, we’ll need to install the line-profiler library with the following command:

$ pip install line-profiler

Once installed, we just need to add the @profile decorator to the function we want to profile. Here’s the updated snippet from our script:
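
The updated snippet was embedded as a gist in the original article. The sketch below reconstructs the idea with stand-in helpers, plus a no-op fallback (an addition, not from the article) so the file still runs without kernprof:

```python
import re
from collections import Counter

try:
    profile  # injected into builtins by kernprof at runtime
except NameError:
    def profile(func):
        # No-op fallback so the script also runs without kernprof
        return func

def get_book(url):
    # Stand-in for the real HTTP download
    return "some book text " * 1000

def split_words(text):
    return re.split(r"[^a-z]+", text.lower())

def count_words(words):
    return Counter(words)

@profile
def read_books(urls):
    # Decorated so kernprof reports time spent on each line below
    results = {}
    for url in urls:
        words = split_words(get_book(url))
        results[url] = count_words(words).most_common(10)
    return results
```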

Note that we don’t need to import the profile decorator function; it will be injected by line-profiler.

Now, to profile our function, we can run the following:

$ kernprof -l -v script-prof.py

kernprof is installed along with line-profiler. The -l flag tells line-profiler to go line-by-line and the -v flag tells it to print the result to the terminal rather than save it to a file.

kernprofline-profiler一起安装。 -l标志告诉line-profiler逐行进行, -v标志告诉它将结果打印到终端,而不是将其保存到文件。

The result for our script would look something like this:

The key column to focus on here is % Time. As you can see, 89.5% of our time parsing each book is spent in the get_book function (making the HTTP request), further validating that our program is I/O bound rather than CPU bound.

Now, with this new info in mind, if we wanted to speed up our code we wouldn’t waste our time trying to make our word counter more efficient, since it takes only a fraction of the time of the HTTP requests. Instead, we’d focus on speeding up our requests, possibly by making them asynchronous.
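
As one sketch of that idea, here is an approach using a thread pool from the standard library rather than a full async framework; get_book stands in for the article’s download step, and the injectable fetch parameter is an addition for testability:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def get_book(url):
    # The original, sequential download step
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def fetch_all(urls, fetch=get_book, max_workers=5):
    # Issue the I/O-bound requests concurrently instead of one at a time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

pool.map preserves input order, so the results still line up with the urls list.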

Here, the results are hardly surprising, but on a larger and more complicated program, line-profiler is an invaluable tool in our programming tool belt, allowing us to peer under the hood of our program and find the computational bottlenecks.

Profiling Memory

In addition to profiling the time-complexity of our program, we can also profile its memory-complexity.

In order to do line-by-line memory profiling, we’ll need to install the memory-profiler library which also uses the same @profile decorator to determine which function to profile.

$ pip install memory-profiler
$ python -m memory_profiler script.py

The result of running memory-profiler on our same script should look something like the following:

There are currently some issues with the accuracy of the “Increment” column, so just focus on the “Mem usage” column for now.

Our script had peak memory usage on line 28 when we split the books up into words.
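
If you just need a quick read on peak usage, the standard library’s tracemalloc (a separate tool from memory-profiler, and not covered in the original article) gives a coarse picture with no extra installs:

```python
import tracemalloc

def split_words(text):
    # Splitting creates many small strings, which shows up as the peak
    return text.lower().split()

tracemalloc.start()
words = [split_words("pride and prejudice " * 20_000) for _ in range(5)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```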

Conclusion

Hopefully, now you’ll have a few new tools in your programming tool belt to help you write more efficient code and quickly determine how to best spend your optimization-time.

You can read more in the official documentation for cProfile, line-profiler, and memory-profiler. I also highly recommend the book High Performance Python by Micha Gorelick and Ian Ozsvald [1].

Thanks for reading! I’d love to hear your thoughts on profilers or data science or anything else. Comment below or reach out on LinkedIn or Twitter!

Translated from: https://towardsdatascience.com/data-scientists-start-using-profilers-4d2e08e7aec0
