5 Reasons Why You Should Switch from Jupyter Notebook to Scripts

Opinion

Motivation

Like most people, the first tool I used when I started learning data science was Jupyter Notebook. Most online data science courses teach with Jupyter Notebook. This makes sense because it is easier for beginners to start writing code in Jupyter Notebook’s cells than to write a script with classes and functions.

Another reason Jupyter Notebook is so common in data science is that it makes it easy to explore and plot data. When we press Shift + Enter, we immediately see the result of the code, which makes it easy to tell whether the code works.

However, as I worked on more data science projects, I noticed several drawbacks of Jupyter Notebook:

  • Unorganized: As my code grows, it becomes increasingly difficult to keep track of what I have written. No matter how many Markdown cells I use to separate the notebook into sections, the disconnected cells make it difficult to concentrate on what the code does.

  • Difficult to experiment: You may want to try different methods of processing your data or choose different parameters for your machine learning algorithm to see if the accuracy increases. But every time you experiment with a new method, you need to rerun the entire notebook. This is time-consuming, especially when the processing or the training takes a long time to run.

  • Not ideal for reproducibility: If you want to use new data with a slightly different structure, it would be difficult to identify the source of error in your notebook.


  • Difficult to debug: When you get an error in your code, it is difficult to know whether the reason for the error is the code or the change in data. If the error is in the code, which part of the code is causing the problem?


  • Not ideal for production: Jupyter Notebook does not play very well with other tools. It is not easy to run the code from Jupyter Notebook while using other tools.


I knew there must be a better way to handle my code so I decided to give scripts a try. These are the benefits I found when using scripts:


Organized

The cells in Jupyter Notebook make it difficult to organize the code into distinct parts. With a script, we can create several small functions, each with a name that says what its code does.

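A minimal sketch of what such small functions can look like. The function names and the sample data are hypothetical, chosen only for illustration:

```python
from statistics import mean


def drop_missing(values):
    """Remove None entries from a list of readings."""
    return [v for v in values if v is not None]


def scale_to_unit_range(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def summarize(values):
    """Return the mean of the values."""
    return mean(values)


raw = [4, None, 10, 7]
clean = drop_missing(raw)
scaled = scale_to_unit_range(clean)
print(scaled)  # [0.0, 1.0, 0.5]
```

Each function does one thing, so its name alone tells a reader what that part of the script is for.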

Better yet, if these functions belong to the same category, such as functions that process the data, we could put them in the same class!

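A rough sketch of such a class, assuming the data is a plain list of dictionaries. The class name Preprocess and its three constructor arguments come from this article; the method names and the sample rows are hypothetical:

```python
class Preprocess:
    """Groups the data-cleaning steps in one place."""

    def __init__(self, columns_to_drop, datetime_column, dropna_columns):
        self.columns_to_drop = columns_to_drop
        self.datetime_column = datetime_column
        self.dropna_columns = dropna_columns

    def drop_columns(self, rows):
        """Remove the unwanted columns from every row."""
        return [
            {key: value for key, value in row.items() if key not in self.columns_to_drop}
            for row in rows
        ]

    def drop_na(self, rows):
        """Keep only rows that have a value in every required column."""
        return [
            row for row in rows
            if all(row.get(column) is not None for column in self.dropna_columns)
        ]

    def truncate_datetime(self, rows):
        """Keep 'YYYY-MM-DDTHH:MM', the first 16 characters of the timestamp."""
        column = self.datetime_column
        return [{**row, column: row[column][:16]} for row in rows]

    def process(self, rows):
        """Run all steps in order."""
        rows = self.drop_columns(rows)
        rows = self.drop_na(rows)
        return self.truncate_datetime(rows)


rows = [
    {"crawled": "2020-07-30T23:25:31.036+03:00", "sentiment": 0.4, "error": None},
    {"crawled": "2020-07-31T08:00:00.000+03:00", "sentiment": None, "error": None},
]
processor = Preprocess(["error"], "crawled", ["sentiment"])
processed = processor.process(rows)
print(processed)  # [{'crawled': '2020-07-30T23:25', 'sentiment': 0.4}]
```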

Whenever we want to process our data, we know the functions in the class Preprocess can be used for this purpose.


Encourages Experimentation

When we want to try a different approach to preprocessing the data, we can add or remove a step by commenting a function call in or out, without being afraid of breaking the code! Even if we do break the code, we know exactly where to fix it.

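As an illustration, here is a small sketch in which one step is switched off by commenting out a single line. The step names and the sample texts are hypothetical:

```python
def strip_whitespace(texts):
    """Trim leading and trailing spaces."""
    return [t.strip() for t in texts]


def lowercase(texts):
    """Convert every text to lower case."""
    return [t.lower() for t in texts]


def remove_short(texts, min_length=3):
    """Drop texts shorter than min_length characters."""
    return [t for t in texts if len(t) >= min_length]


def preprocess(texts):
    texts = strip_whitespace(texts)
    # texts = lowercase(texts)  # disabled for this experiment
    texts = remove_short(texts)
    return texts


result = preprocess(["  Great ", "ok", " Bad service "])
print(result)  # ['Great', 'Bad service']
```

Re-enabling the step is a one-line change, and the other steps do not need to be touched.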

We could also experiment with different parameters by changing the inputs of the functions. For example, if we want to see how different methods of resampling a pandas Series affect the results, we could just switch from method_of_resample='sum' to method_of_resample='average'. How neat!

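A hedged sketch of how such a parameter could be wired up with pandas. Only the parameter name method_of_resample and its two values come from this article; the function name, the rule argument, and the sample data are assumptions. Since pandas calls the average 'mean', the sketch maps 'average' to it:

```python
import pandas as pd


def resample_series(series, rule="D", method_of_resample="sum"):
    """Resample a time-indexed Series with the chosen aggregation."""
    # Map the article's option names to pandas aggregation names.
    aggregations = {"sum": "sum", "average": "mean"}
    return series.resample(rule).agg(aggregations[method_of_resample])


index = pd.to_datetime(["2020-07-30 09:00", "2020-07-30 15:00", "2020-07-31 10:00"])
series = pd.Series([1.0, 3.0, 5.0], index=index)

daily_sum = resample_series(series, method_of_resample="sum")
daily_avg = resample_series(series, method_of_resample="average")
print(daily_sum.tolist())  # [4.0, 5.0]
print(daily_avg.tolist())  # [2.0, 5.0]
```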

You can still use functions in a notebook, but when the number of functions gets really big, you might want to split them across different notebooks. Importing functions across notebooks is not easy.

Ideal for Reproducibility

With classes and functions, we could make the code general enough so that it will be able to work with other data.


For example, if we want to drop different columns in the new data, we just need to change columns_to_drop to the list of columns we want to drop, and the code will run smoothly!

columns_to_drop = config.columns.to_drop
datetime_column = config.columns.datetime.sentiment
dropna_columns = config.columns.drop_na
processor = Preprocess(columns_to_drop, datetime_column, dropna_columns)

I can also create a pipeline that specifies the steps to process and train on the data! Once I have a pipeline, all I need to do is use

pipeline.fit_transform(data)

to apply the same processing to both the train and test data.
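The pipeline itself is not shown in this article, so the following is only a toy stand-in written in plain Python. The step classes and the data are hypothetical, and the Pipeline class merely imitates the fit_transform/transform interface that scikit-learn's Pipeline exposes. It follows the common convention of learning the parameters on the train data and reusing them on the test data:

```python
class FillMissing:
    """Learns the mean of the training values and uses it to fill None entries."""

    def fit(self, values):
        present = [v for v in values if v is not None]
        self.fill_value = sum(present) / len(present)
        return self

    def transform(self, values):
        return [self.fill_value if v is None else v for v in values]


class MinMaxScale:
    """Learns the training minimum and maximum and rescales with them."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # avoid dividing by zero
        return [(v - self.low) / span for v in values]


class Pipeline:
    """Applies the steps in order."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


pipeline = Pipeline([FillMissing(), MinMaxScale()])
train = pipeline.fit_transform([2.0, None, 6.0])
test = pipeline.transform([None, 8.0])
print(train)  # [0.0, 0.5, 1.0]
print(test)   # [0.5, 1.5]
```

The test value 8.0 maps to 1.5 because it lies outside the range seen during training, which shows that the test data really is processed with the training parameters.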

Easy to Debug

With functions, it is easier to test whether a function produces the output we expect. We can quickly spot where in the code we should make a change to produce the output we want.

import numpy as np

def extract_date_hour_minute(string: str):
    '''Extract date, hour, and minute from a datetime string'''
    try:
        # Keep 'YYYY-MM-DDTHH:MM', the first 16 characters
        return string[:16]
    except TypeError:
        # Non-string input such as a null value
        return np.nan

def test_extract_date_hour_minute():
    '''Test whether the function extracts date, hour, and minute'''
    string = '2020-07-30T23:25:31.036+03:00'
    assert extract_date_hour_minute(string) == '2020-07-30T23:25'

If all of the tests pass but there is still an error in running our code, we know the data is where we should look next.


For example, after passing the test above, I still got a TypeError when running the script, which suggested that my data contained null values. I just needed to handle those values for the code to run smoothly.
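Once the null values are handled, a second small test can pin that behavior down. This is a sketch under assumptions: it uses float("nan") from the standard library in place of np.nan so that it runs without NumPy, and the test name is made up:

```python
import math


def extract_date_hour_minute(string):
    """Extract date, hour, and minute; return NaN for non-string input."""
    try:
        return string[:16]
    except TypeError:
        return float("nan")


def test_extract_date_hour_minute_handles_missing():
    """Null-like inputs should produce NaN instead of raising."""
    assert math.isnan(extract_date_hour_minute(None))
    assert math.isnan(extract_date_hour_minute(float("nan")))
    # Valid strings are still truncated as before.
    assert extract_date_hour_minute("2020-07-30T23:25:31.036+03:00") == "2020-07-30T23:25"


test_extract_date_hour_minute_handles_missing()
```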

Ideal for Production

We can import functions from multiple scripts into a single entry point like this

from preprocess import preprocess
from model import run_model
from predict import predict

def main(config):
    df = preprocess(config)
    df = run_model(config)
    df, df_scale, min_day, max_day, accuracy = predict(df, config)

We can also add a config file to control the values of the variables. This saves us from wasting time tracking down a specific variable in the code just to change its value.

columns:
  to_drop:
    #- keywords
    #- entities
    - code
    - error
    - warnings
  binary_columns:
    - sentiment
    - Diff
  datetime:
    time: Date
    sentiment: crawled
  drop_na:
    - sentiment
    - usage
    - crawled
    - emotion
  to_predict: sentiment

We could also easily add tools to track experiments, such as MLflow, or tools to handle configuration, such as Hydra!

I Didn’t Like the Idea of Using Scripts until I Pushed Myself out of My Comfort Zone

I used to use Jupyter Notebook all the time. When some data scientists advised me to switch from Jupyter Notebook to scripts to avoid some of the problems listed above, I didn’t understand why and resisted doing so. I didn’t like the uncertainty of not being able to see the outcome as soon as I ran a cell.

But the disadvantages of Jupyter Notebook grew when I started my first real data science project at my new company, so I decided to push myself out of my comfort zone and experiment with scripts.

In the beginning, I felt uncomfortable, but I started to notice the benefits of using scripts. I felt more organized once my code was split into functions, classes, and multiple scripts, with each script serving a different purpose such as preprocessing, training, or testing.

So Are You Suggesting That I Stop Using Jupyter Notebook?

Don’t get me wrong. I still use Jupyter Notebook if my code is small and if I don’t plan to put it into production. I use Jupyter Notebook when I want to explore and visualize the data. I also use it to explain how to use some Python libraries. For example, I mostly use Jupyter Notebooks in this repository as the medium to explain the code mentioned in all of my articles.

If you don’t feel comfortable with coding everything in scripts, you could use both scripts and Jupyter Notebook for different purposes. For example, you could create classes and functions in scripts then import them in the notebook so that the notebook is less messy.


Another alternative is to turn the notebook into a script after writing it. I personally don’t prefer this approach because it often takes me longer to reorganize the notebook code afterwards, such as putting it into functions and classes and writing test functions.

I find that writing a small function and then a small test function for it is faster and safer. If I later want to speed up my code with a new Python library, I can use the test function I already wrote to make sure the code still works as I expect.

That being said, I believe there are more ways to address the disadvantages of Jupyter Notebook than the ones I mentioned here, such as how Netflix puts notebooks into production and schedules them to run at a certain time.

Conclusion

Everybody has their own way to make their workflow more efficient; for me, it is to leverage the utility of scripts. If you have just switched from Jupyter Notebook to scripts, writing code in scripts might not feel intuitive, but trust me, you will get used to it eventually.

Once that happens, you will start to see the many benefits of scripts over messy notebooks and will want to write most of your code in scripts.

If you don’t feel comfortable with the big change, start small.


Big changes start with small steps


I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.


Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed about my latest data science articles.

Translated from: https://towardsdatascience.com/5-reasons-why-you-should-switch-from-jupyter-notebook-to-scripts-cb3535ba9c95
