5 Reasons Why You Should Switch from Jupyter Notebook to Scripts

Opinion

Motivation

Like most people, I started learning data science with Jupyter Notebook. Most online data science courses teach with Jupyter Notebook. This makes sense because it is easier for beginners to start writing code in a notebook's cells than to write a script with classes and functions.

Another reason Jupyter Notebook is so common in data science is that it makes it easy to explore and plot data. When we press Shift + Enter, we immediately see the result of the code, which makes it easy to tell whether the code works.

However, as I worked on more data science projects, I noticed several drawbacks of Jupyter Notebook:

  • Unorganized: As my code grows, it becomes increasingly difficult to keep track of what I have written. No matter how many Markdown cells I use to separate the notebook into sections, the disconnected cells make it difficult to concentrate on what the code does.

  • Difficult to experiment: You may want to try different ways of processing your data or different parameters for your machine learning algorithm to see whether the accuracy improves. But every time you experiment with a new method, you need to rerun the entire notebook. This is time-consuming, especially when the processing or the training takes a long time to run.

  • Not ideal for reproducibility: If you want to use new data with a slightly different structure, it would be difficult to identify the source of error in your notebook.


  • Difficult to debug: When you get an error in your code, it is difficult to know whether the reason for the error is the code or the change in data. If the error is in the code, which part of the code is causing the problem?


  • Not ideal for production: Jupyter Notebook does not play very well with other tools. It is not easy to run the code from Jupyter Notebook while using other tools.


I knew there must be a better way to handle my code, so I decided to give scripts a try. These are the benefits I found when using scripts:

Organized

The cells in Jupyter Notebook make it difficult to organize the code into distinct parts. With a script, we can create several small functions, each named for the one thing it does.
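The original screenshot of the code is not available, so here is a minimal sketch of the idea. The function names and the sample rows are hypothetical, and plain lists of dictionaries stand in for a DataFrame so that the example runs without extra libraries.

```python
from typing import Dict, List

Row = Dict[str, object]


def drop_columns(rows: List[Row], columns_to_drop: List[str]) -> List[Row]:
    """Return copies of the rows without the unwanted columns."""
    return [{k: v for k, v in row.items() if k not in columns_to_drop} for row in rows]


def drop_missing(rows: List[Row], required_columns: List[str]) -> List[Row]:
    """Keep only the rows where every required column has a value."""
    return [row for row in rows if all(row.get(c) is not None for c in required_columns)]


# Hypothetical sample data
rows = [
    {"text": "good", "sentiment": 1, "error": None},
    {"text": "bad", "sentiment": None, "error": "timeout"},
]
cleaned = drop_missing(drop_columns(rows, ["error"]), ["sentiment"])
print(cleaned)  # [{'text': 'good', 'sentiment': 1}]
```

Because each function does one thing, its name alone tells a reader what that step of the script is for.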


Better yet, if these functions belong to the same category, such as functions that process the data, we can put them in the same class!


Whenever we want to process our data, we know the methods of the Preprocess class can be used for this purpose.
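As a rough illustration of such a class, the sketch below reuses the constructor arguments that appear later in the article (columns_to_drop, datetime_column, dropna_columns). The method names, the sample rows, and the use of lists of dictionaries instead of a DataFrame are assumptions made for this example, not the author's actual implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


class Preprocess:
    """Groups the data-cleaning steps in one place (method names are hypothetical)."""

    def __init__(self, columns_to_drop: List[str], datetime_column: str,
                 dropna_columns: List[str]) -> None:
        self.columns_to_drop = columns_to_drop
        self.datetime_column = datetime_column
        self.dropna_columns = dropna_columns

    def drop_columns(self, rows: List[Row]) -> List[Row]:
        return [{k: v for k, v in row.items() if k not in self.columns_to_drop}
                for row in rows]

    def drop_na(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows
                if all(row.get(c) is not None for c in self.dropna_columns)]

    def truncate_datetime(self, rows: List[Row]) -> List[Row]:
        """Keep the date, hour, and minute (first 16 characters) of the datetime column."""
        col = self.datetime_column
        return [{**row, col: row[col][:16]} for row in rows]

    def process(self, rows: List[Row]) -> List[Row]:
        rows = self.drop_columns(rows)
        rows = self.drop_na(rows)
        return self.truncate_datetime(rows)


processor = Preprocess(["error"], "crawled", ["sentiment", "crawled"])
result = processor.process([
    {"crawled": "2020-07-30T23:25:31.036+03:00", "sentiment": "positive", "error": None},
    {"crawled": "2020-07-30T23:26:02.000+03:00", "sentiment": None, "error": "x"},
])
print(result)  # [{'crawled': '2020-07-30T23:25', 'sentiment': 'positive'}]
```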

Encourage Experimentation

When we want to experiment with a different approach to preprocessing the data, we can add or remove a step by commenting a function call in or out, without being afraid of breaking the code! Even if we happen to break the code, we know exactly where to fix it.
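A small sketch of what toggling a step can look like. The step functions here are hypothetical text-cleaning helpers chosen only to keep the example self-contained.

```python
def strip_whitespace(texts):
    return [t.strip() for t in texts]


def lowercase(texts):
    return [t.lower() for t in texts]


def remove_empty(texts):
    return [t for t in texts if t]


def preprocess(texts):
    texts = strip_whitespace(texts)
    # texts = lowercase(texts)  # experiment: skip lowercasing for this run
    texts = remove_empty(texts)
    return texts


print(preprocess(["  Great ", "", " BAD"]))  # ['Great', 'BAD']
```

Re-enabling the step is a one-line change, and the other steps are untouched.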


We can also experiment with different parameters by changing the inputs of the functions. For example, if we want to see how different methods of resampling a pandas Series affect the results, we can just switch from method_of_resample='sum' to method_of_resample='average'. How neat!
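To make the idea concrete, here is a dependency-free sketch of a function that takes method_of_resample as a parameter. The function name resample_by_day and the sample points are hypothetical; with pandas, the same switch would typically map to Series.resample(...).sum() versus Series.resample(...).mean().

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def resample_by_day(points: List[Tuple[str, float]],
                    method_of_resample: str = "sum") -> Dict[str, float]:
    """Group (ISO timestamp, value) pairs by day and aggregate each group."""
    aggregators = {"sum": sum, "average": mean}
    if method_of_resample not in aggregators:
        raise ValueError(f"unknown method_of_resample: {method_of_resample!r}")
    groups = defaultdict(list)
    for timestamp, value in points:
        groups[timestamp[:10]].append(value)  # first 10 characters are YYYY-MM-DD
    aggregate = aggregators[method_of_resample]
    return {day: aggregate(values) for day, values in groups.items()}


points = [("2020-07-30T23:25", 2.0), ("2020-07-30T23:40", 4.0), ("2020-07-31T00:05", 1.0)]
by_sum = resample_by_day(points, method_of_resample="sum")
by_average = resample_by_day(points, method_of_resample="average")
print(by_sum)      # {'2020-07-30': 6.0, '2020-07-31': 1.0}
print(by_average)  # {'2020-07-30': 3.0, '2020-07-31': 1.0}
```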


You can still use functions in a notebook, but when the number of functions gets really big, you might want to split them across different notebooks, and importing functions from one notebook into another is not easy.

Ideal for Reproducibility

With classes and functions, we could make the code general enough so that it will be able to work with other data.


For example, if we want to drop different columns in new data, we just need to change columns_to_drop to the list of columns we want to drop, and the code will run smoothly!

columns_to_drop = config.columns.to_drop
datetime_column = config.columns.datetime.sentiment
dropna_columns = config.columns.drop_na
processor = Preprocess(columns_to_drop, datetime_column, dropna_columns)

I can also create a pipeline that specifies steps to process and train the data! Once I have a pipeline, all I need to do is to use


pipeline.fit_transform(data)

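to apply the same processing to both the training and test data. (If the pipeline follows scikit-learn's conventions, call fit_transform on the training data and transform on the test data, so that the steps are not refitted on the test set.)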
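The article does not show how its pipeline is built. The toy classes below are an assumption-laden stand-in (they are not scikit-learn's Pipeline) that only illustrates the fit/transform convention: parameters are learned from the training data and then reused on the test data.

```python
from statistics import mean, pstdev
from typing import List


class Standardize:
    """Learns the mean and standard deviation in fit, applies them in transform."""

    def fit(self, values: List[float]) -> "Standardize":
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 to avoid dividing by zero
        return self

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mean_) / self.std_ for v in values]


class Pipeline:
    """Runs steps in order; a toy stand-in, not scikit-learn's Pipeline."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


pipeline = Pipeline([Standardize()])
train_scaled = pipeline.fit_transform([1.0, 2.0, 3.0])
test_scaled = pipeline.transform([2.0, 4.0])  # reuses the training mean and std
```

Calling fit_transform on the test data as well would recompute the mean and standard deviation from the test set, so the two sets would no longer be processed identically.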

Easy to Debug

With functions, it is easier to test whether each function produces the output we expect. We can quickly spot where in the code we should make a change to produce the output we want.

import numpy as np

def extract_date_hour_minute(string: str):
    '''Extract date, hour, and minute from a datetime string'''
    try:
        return string[:16]
    except TypeError:
        return np.nan

def test_extract_date_hour_minute():
    '''Test whether the function extracts the date, hour, and minute'''
    string = '2020-07-30T23:25:31.036+03:00'
    assert extract_date_hour_minute(string) == '2020-07-30T23:25'

If all of the tests pass but there is still an error in running our code, we know the data is where we should look next.


For example, after passing the test above, I still got a TypeError when running the script, which suggested that my data had null values. I just needed to take care of those for the code to run smoothly.
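Once the null values are identified as the cause, a second small test can pin down the expected behavior. This is a sketch under two assumptions: a null shows up as None, and math.nan from the standard library is used in place of np.nan so the example has no dependencies. The test name is hypothetical.

```python
import math


def extract_date_hour_minute(string):
    """Return the first 16 characters ('YYYY-MM-DDTHH:MM'), or NaN for a null value."""
    try:
        return string[:16]
    except TypeError:  # raised when the value is None or a float NaN
        return math.nan


def test_extract_date_hour_minute_with_missing_value():
    """A null entry should come back as NaN instead of raising."""
    assert math.isnan(extract_date_hour_minute(None))


test_extract_date_hour_minute_with_missing_value()
```

With a test runner such as pytest, the explicit call on the last line is unnecessary because functions named test_* are collected automatically.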

Ideal for Production

We can import functions from multiple scripts and combine them in one top-level script, like this:

from preprocess import preprocess
from model import run_model
from predict import predict

def main(config):
    df = preprocess(config)
    df = run_model(config)
    df, df_scale, min_day, max_day, accuracy = predict(df, config)

We can also add a config file to control the values of the variables. This saves us from wasting time tracking down a specific variable in the code just to change its value.

columns:
  to_drop:
    #- keywords
    #- entities
    - code
    - error
    - warnings
  binary_columns:
    - sentiment
    - Diff
  datetime:
    time: Date
    sentiment: crawled
  drop_na:
    - sentiment
    - usage
    - crawled
    - emotion
  to_predict: sentiment
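The earlier snippet reads these values with dotted access such as config.columns.to_drop. Configuration libraries like Hydra provide a config object that supports this style; the sketch below imitates it with the standard library only. The to_namespace helper is hypothetical, and the dictionary is written by hand to mirror part of the YAML file rather than loaded from it.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively turn nested dicts into objects with attribute access."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


# Hand-written equivalent of part of the config file (what a YAML loader would return).
raw_config = {
    "columns": {
        "to_drop": ["code", "error", "warnings"],
        "datetime": {"time": "Date", "sentiment": "crawled"},
        "drop_na": ["sentiment", "usage", "crawled", "emotion"],
    }
}
config = to_namespace(raw_config)
print(config.columns.to_drop)             # ['code', 'error', 'warnings']
print(config.columns.datetime.sentiment)  # crawled
```

Changing which columns are dropped then means editing the config file, not the processing code.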

We could also easily add tools to track experiments, such as MLflow, or tools to handle configuration, such as Hydra!

I Didn't Like the Idea of Using Scripts until I Pushed Myself out of My Comfort Zone

I used to use Jupyter Notebook all the time. When some data scientists advised me to switch from Jupyter Notebook to scripts to prevent some of the problems listed above, I didn't understand and felt resistant. I didn't like the uncertainty of not being able to see the outcome right away, as I could when running a cell.

But the disadvantages of Jupyter Notebook grew as I started my first real data science project at my new company, so I decided to push myself out of my comfort zone and experiment with scripts.

In the beginning, I felt uncomfortable, but I started to notice the benefits of using scripts. I felt more organized once my code was split into functions, classes, and multiple scripts, with each script serving a different purpose such as preprocessing, training, or testing.

So Are You Suggesting That I Stop Using Jupyter Notebook?

Don't get me wrong. I still use Jupyter Notebook if my code is small and I don't plan to put it into production. I use Jupyter Notebook when I want to explore and visualize data. I also use it to explain how to use some Python libraries. For example, I mostly use Jupyter Notebooks in this repository as the medium to explain the code mentioned in all of my articles.

If you don’t feel comfortable with coding everything in scripts, you could use both scripts and Jupyter Notebook for different purposes. For example, you could create classes and functions in scripts then import them in the notebook so that the notebook is less messy.


Another alternative is to turn the notebook into a script after writing it. I personally don't like this approach because it often takes me longer to organize the notebook code afterward, such as putting it into functions and classes and writing test functions.

I find that writing a small function and then a small test function for it is faster and safer. If I want to speed up my code with a new Python library, I can use the test function I already wrote to make sure the code still works as expected.

That said, I believe there are more ways to address the disadvantages of Jupyter Notebook than the ones I mentioned here, such as how Netflix puts notebooks into production and schedules them to run at a certain time.

Conclusion

Everybody has their own way of making their workflow more efficient; for me, it is to leverage scripts. If you have just switched from Jupyter Notebook to scripts, writing code in scripts might not feel intuitive, but trust me, you will get used to it eventually.

Once that happens, you will start to see the many benefits of scripts over a messy Jupyter Notebook and want to write most of your code in scripts.

If you don’t feel comfortable with the big change, start small.


Big changes start with small steps


I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.


Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed about my latest data science articles.

Translated from: https://towardsdatascience.com/5-reasons-why-you-should-switch-from-jupyter-notebook-to-scripts-cb3535ba9c95
