Dockerless Notebook: The Long-Awaited Future of Data Science

Data science is hard. Data scientists spend hours figuring out how to install a Python package on their laptops. They read pages of Google search results to work out how to connect to a database. They write detailed documents so that engineers can deploy machine learning models into production. They prepare polished slides to convince business people of how to improve retention rates. They worry that breaks in their data pipelines will cause data quality issues.

The challenge of data science is real. There are steep learning curves for unfamiliar languages. There are business impact requirements that no one knows how to meet in the limited time available. There are engineering best practices to follow to ensure the quality of deliverables. And there is limited engineering support for the data science team.

What problems do Docker containers solve?

For individual data scientists and other data team members, setting up a development environment and keeping the operating environment consistent is a frustrating experience. Installation instructions often do not cover all required dependencies. Some GPU-based AI libraries require data scientists to be familiar with low-level hardware details. Error messages are often not informative enough to explain the cause of the error. Dependency conflicts between libraries make it hard to maintain a working development environment for multiple projects. Collaboration between data scientists and engineers requires extra, unnecessary work from both sides.

How about Python virtual environments?

Admittedly, Python virtual environments work nicely for some data scientists. However, they do not meet the diverse requirements of data science tasks:

  1. It has become common for data scientists to use Spark, R, and SQL daily. How can a Python virtual environment work for languages and frameworks other than Python?
  2. Some data scientists work mainly with their engineering teammates to deploy machine learning models to production. How does a Python virtual environment help if the dependency is on the operating system rather than on a Python library?

The arrival of conda alleviates these two issues, and conda is indeed quite popular in the data science community. Conda itself is not difficult to install, and it ships environments with many common data science packages.

However, not all packages that are available through pip are available through conda. If a package cannot be found on conda, data scientists may have to use pip alongside conda, which is a major source of confusion and unexpected issues. For example, an unresolved GitHub issue contains many arguments over how pip works with conda.
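One common form of that confusion is running a Python (and therefore a `pip`) that does not belong to the active conda environment. The following is a minimal sketch of a sanity check, assuming conda's convention of exporting the active environment path as `CONDA_PREFIX`. The function name `python_matches_conda_env` is hypothetical and not part of conda or pip.

```python
import os
import sys


def python_matches_conda_env(python_prefix, conda_prefix):
    """Return True if the interpreter prefix is the active conda environment.

    python_prefix: normally sys.prefix of the running interpreter.
    conda_prefix:  normally the CONDA_PREFIX environment variable
                   (None or empty when no conda environment is active).
    """
    if not conda_prefix:
        return False

    def canonical(path):
        # Resolve symlinks and normalize case and trailing separators.
        return os.path.normcase(os.path.realpath(path))

    return canonical(python_prefix) == canonical(conda_prefix)


if __name__ == "__main__":
    ok = python_matches_conda_env(sys.prefix, os.environ.get("CONDA_PREFIX"))
    print("interpreter belongs to the active conda environment:", ok)
```

If the check prints `False`, packages installed with `python -m pip install` are probably landing somewhere other than the conda environment you think you are working in.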

Ironically, the VP of Anaconda once gave a talk titled “Conda, Docker, and Kubernetes: The cloud-native future of data science”. Solving 99% of environment-related issues is not enough: it is the remaining 1% that makes the developer experience unacceptable.

How does a Docker container help?

Loosely speaking, a Docker container is a “lightweight virtual machine” that packages everything needed to run an application into one Docker image. A Docker image is designed to move between servers while guaranteeing that the environment stays consistent.

As a result, data scientists no longer need to worry about broken dependencies when deploying machine learning models into production. The new graduate who was onboarded last week can start contributing to the team as soon as the Docker container is running, rather than secretly searching for new positions at companies with better infrastructure for their data science teams.

Why are Docker containers not popular among the data science community?

Docker is not a new technology at all, so why have the majority of data scientists not adopted it? There are two main reasons:

  1. The learning curve is steep.
  2. The developer experience is bad.

To get started with Docker containers, one has to learn at least how to:

  1. start/stop a container
  2. attach a shell to a running container
  3. mount a local volume to a container
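To give a sense of what these three tasks involve, here is a minimal sketch that builds the corresponding `docker` command lines without running them. The container name, image name, paths, and port in the usage section are placeholder assumptions, and the helper function names are hypothetical. Only the `docker run`, `docker exec`, and `docker stop` subcommands and their flags come from the Docker CLI.

```python
import shlex


def start_container(name, image, host_dir, container_dir="/work", port=8888):
    """Build a `docker run` command: detached, named, with a bind mount and a port."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-v", f"{host_dir}:{container_dir}",  # mount a local directory into the container
        "-p", f"{port}:{port}",               # publish the port on the host
        image,
    ]


def attach_shell(name, shell="/bin/bash"):
    """Build a `docker exec` command that opens an interactive shell.

    Assumes the image actually contains the given shell.
    """
    return ["docker", "exec", "-it", name, shell]


def stop_container(name):
    """Build a `docker stop` command for a running container."""
    return ["docker", "stop", name]


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    commands = [
        start_container("nb", "my-team/notebook:latest", "/home/me/project"),
        attach_shell("nb"),
        stop_container("nb"),
    ]
    for argv in commands:
        print(shlex.join(argv))
```

Each list can be passed to `subprocess.run` on a machine where Docker is installed. Printing the commands first makes it easy to review them before executing anything.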

In reality, these are not enough. How do I use sudo inside a container when I do not know the password? Why did my Docker container lose all its data after it was stopped? How do I set up a private Docker registry so that I can pull the Docker image from my remote clusters? How can I kill the processes that are using port 8808?
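For three of these questions, the usual answers are short once you know them, which is exactly the knowledge a newcomer lacks. The sketch below builds the relevant command lines under a few assumptions: the image contains `/bin/bash`, the data loss happened because the container's writable layer was discarded when the container was removed, and the process holding the port is a container. The function names and the values in the usage section are hypothetical.

```python
import shlex


def root_shell(container, shell="/bin/bash"):
    """`docker exec -u root` starts the shell as root, so no sudo password is needed."""
    return ["docker", "exec", "-u", "root", "-it", container, shell]


def run_with_named_volume(name, image, volume, mount_point="/work"):
    """Mount a named volume so files under mount_point outlive the container."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-v", f"{volume}:{mount_point}",  # named volume, managed by Docker
        image,
    ]


def containers_publishing(port):
    """List running containers that publish the given port."""
    return ["docker", "ps", "--filter", f"publish={port}"]


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    for argv in (
        root_shell("nb"),
        run_with_named_volume("nb", "my-team/notebook:latest", "nb-data"),
        containers_publishing(8808),
    ):
        print(shlex.join(argv))
```

If no container shows up for the port, the process is running directly on the host and has to be found with an operating system tool instead.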

When it comes to writing a Dockerfile, one has to be familiar with Linux shell commands and Dockerfile syntax. And if every project uses its own Docker image, a data scientist may end up with more Docker images to manage than a software engineer has.

So data scientists either have a hard time fixing environment-related issues, giving up reproducibility and suffering from bad engineering practice, or spend too much time learning and operating Docker.

It is NOT data scientists’ job to take care of the environment

Data scientists should NOT spend time on environments, so that they can focus on what they are good at: building dashboards, developing machine learning models, and giving business teammates actionable insights.

Dockerless Notebook is the future

Imagine a smart and capable Docker helper that does everything for you. When you start the notebook, it automatically starts the container and attaches it to the notebook. When you want to move your notebook to run on a remote cluster, it commits your local Docker container, sends it to the remote cluster, and manages it automatically.
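The article does not say how such a helper would work internally. As one possible illustration, the sketch below lists the commands a helper could issue to move a local container to a remote machine, assuming SSH access to a host that has Docker installed. The function `plan_move_to_remote`, the archive name, the `/tmp` location, and the values in the usage section are all hypothetical.

```python
import shlex


def plan_move_to_remote(container, image_tag, remote_host, archive="notebook-image.tar"):
    """Return the ordered command lines for moving a local container to a remote host.

    Nothing is executed here; the caller decides whether and how to run each step.
    """
    remote_archive = f"/tmp/{archive}"
    return [
        # 1. Snapshot the local container's current state as an image.
        ["docker", "commit", container, image_tag],
        # 2. Export the image to a tar archive.
        ["docker", "save", "-o", archive, image_tag],
        # 3. Copy the archive to the remote host.
        ["scp", archive, f"{remote_host}:{remote_archive}"],
        # 4. Import the image on the remote host.
        ["ssh", remote_host, "docker", "load", "-i", remote_archive],
        # 5. Start a container from the image on the remote host.
        ["ssh", remote_host, "docker", "run", "-d", "--name", container, image_tag],
    ]


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    for step in plan_move_to_remote("nb", "nb-snapshot:latest", "me@cluster"):
        print(shlex.join(step))
```

A helper that hides these five steps behind a single notebook action is the kind of automation the paragraph above has in mind. Pushing the image to a registry instead of copying an archive would be an equally valid design.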

The idea of the “Dockerless notebook” is that it lets you develop and operate notebooks without thinking about Docker containers. It is tightly integrated with the notebooks data scientists use every day. It eliminates the need to learn Docker and to perform operating tasks such as starting and stopping containers, attaching a shell to containers, and mounting volumes to containers. You won’t even notice that Docker is running on your laptop, just as you don’t notice how Jupyter Notebook exchanges data between the browser and memory.

The “Dockerless notebook” will help the data science community move closer to “reproducible data science” and “frictionless data science” without unacceptable costs.

Translated from: https://towardsdatascience.com/dockerless-notebook-the-long-awaited-future-of-data-science-7cde7707f7ff
