How to connect a Jupyter Notebook to remote Spark clusters and run Spark jobs every day?

As a data scientist, you are developing notebooks that use Spark to process data too large to fit on your laptop. What would you do? This is not a trivial problem.

Let’s start with the most naive solutions, without installing anything on your laptop.

  1. “No notebook”: SSH into the remote cluster and use the Spark shell there.
  2. “Local notebook”: downsample the data and pull it to your laptop.

The problem with “No notebook” is that the developer experience in the Spark shell is unacceptable:

  1. You cannot easily change the code and immediately see the printed result, as you can in a Jupyter or Zeppelin notebook.
  2. It is hard to display images/charts from the shell.
  3. It is painful to do version control with git on a remote machine, because you have to set everything up from scratch and run git operations like git diff there.

The second option, “Local notebook”: you have to downsample the data and pull it to your laptop (downsampling: if you have 100GB of data on your clusters, you reduce it to 1GB without losing too much important information). Then you can process the data in your local Jupyter notebook.

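To make the downsampling step concrete, here is a minimal PySpark sketch of what it typically looks like; the input path, output path, and sampling fraction are placeholder assumptions, not from the original article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DownsampleForLaptop").getOrCreate()

# Read the full dataset on the cluster (placeholder path, ~100GB in this example)
df = spark.read.parquet("hdfs:///data/events/")

# Keep roughly 1% of the rows, e.g. 100GB -> ~1GB
sample = df.sample(withReplacement=False, fraction=0.01, seed=42)

# Write a single file somewhere you can copy down to your laptop
sample.coalesce(1).write.mode("overwrite").parquet("hdfs:///tmp/events_sample/")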

This works, but it creates a few new painful problems:

  1. You have to write extra code to downsample the data.
  2. Downsampling can lose vital information about the data, especially when you are working on visualizations or machine learning models.
  3. You have to spend extra hours making sure your code still works on the original data. If it doesn't, it takes even more hours to figure out what's wrong.
  4. You have to guarantee that the local development environment is the same as the remote cluster. If it isn't, the setup is error-prone and may cause data issues that are hard to detect.

OK, “No notebook” and “Local notebook” are obviously not the best approaches. What if your data team has access to the cloud, e.g. AWS? Yes, AWS provides Jupyter notebooks on its EMR clusters and in SageMaker. The notebook server is accessed through the AWS web console and is ready to use when the clusters are ready.

This approach is called “Remote notebook on a cloud”.

[Image: AWS EMR with Jupyter Notebook, by AWS]

The problems with “Remote notebook on a cloud” are:

  1. You have to set up your development environment every time a cluster spins up.
  2. If you want your notebook to run on different clusters or in different regions, you have to do the setup manually and repeatedly.
  3. If the clusters are terminated unexpectedly, you lose the work you did on them.

This approach, ironically, is the most popular one among the data scientists who have access to AWS. This can be explained by the principle of least effort: It provides one-click access to remote clusters so that data scientists can focus on their machine learning models, visualization, and business impact without spending too much time on clusters.

Besides “No notebook”, “Local notebook”, and “Remote notebook on cloud”, there are options that point the Spark on your laptop to remote Spark clusters. The code is submitted via a local notebook and sent to a remote Spark cluster. This approach is called “Bridge local & remote spark”.

You can set the remote master when you create the SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://123.456.789:7077")
  .getOrCreate()
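Since most notebook users discussed in this article work in Python, here is a rough PySpark equivalent of the same idea; the master host and port below are placeholders, not a real cluster address.

from pyspark.sql import SparkSession

# Point the local PySpark session at the remote standalone master (placeholder address)
spark = (
    SparkSession.builder
    .appName("SparkSample")
    .master("spark://spark-master-host:7077")
    .getOrCreate()
)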

The problems are

  1. You have to figure out how to authenticate your laptop to the remote Spark clusters.
  2. It only works when Spark is deployed as Standalone, not on YARN. If your Spark cluster is deployed on YARN, you have to copy the configuration files under /etc/hadoop/conf on the remote cluster to your laptop and restart your local Spark, assuming you have already figured out how to install Spark on your laptop (see the sketch after this list).
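As a sketch of that YARN setup, assuming the remote cluster's /etc/hadoop/conf has already been copied to a local directory (the local path below is a placeholder), local PySpark can be pointed at YARN roughly like this:

import os
from pyspark.sql import SparkSession

# Tell the local Spark where the copied cluster configuration lives (placeholder path);
# these must be set before the session (and its JVM) is created.
os.environ["HADOOP_CONF_DIR"] = os.path.expanduser("~/remote-cluster-conf")
os.environ["YARN_CONF_DIR"] = os.path.expanduser("~/remote-cluster-conf")

spark = (
    SparkSession.builder
    .appName("SparkSampleOnYarn")
    .master("yarn")
    .getOrCreate()
)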

If you have multiple Spark clusters, you have to switch back and forth by copying configuration files. If the clusters are ephemeral on the cloud, this easily becomes a nightmare.

“Bridge local & remote spark” does not work for most data scientists. Luckily, we can turn our attention back to the Jupyter notebook. There is a Jupyter notebook kernel called “Sparkmagic” which can send your code to a remote cluster, with the assumption that Livy is installed on the remote Spark clusters. This assumption is met for all cloud providers, and Livy is not hard to install on in-house Spark clusters with the help of Apache Ambari.

[Image: Sparkmagic architecture]
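For illustration, a typical Sparkmagic workflow in a local Jupyter notebook looks roughly like the sketch below. The Livy endpoint is a placeholder, and the exact magic commands and flags can vary between Sparkmagic versions and setups, so treat this as a hedged sketch rather than a definitive recipe.

# Cell 1: load the Sparkmagic magics into a regular IPython kernel
%load_ext sparkmagic.magics

# Cell 2: open the widget to register the Livy endpoint (e.g. http://<cluster>:8998,
# a placeholder here) and create a remote Spark session
%manage_spark

# Cell 3: run PySpark code on the remote cluster through Livy
%%spark
df = spark.read.parquet("hdfs:///data/events/")
df.groupBy("country").count().show()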

It seems that “Sparkmagic” is the best solution at this point, so why is it not the most popular one? There are two reasons:

  1. Many data scientists have not heard of “Sparkmagic”.

  2. There are installation, connection, and authentication issues that are hard for data scientists to fix.

To solve problem 2, Sparkmagic provides Docker containers that are ready to use. The Docker container has indeed solved some of the installation issues, but it also introduces new problems for data scientists:

  1. Docker was designed for shipping applications, and its learning curve is not considered friendly for data scientists.
  2. It is not designed to be used intuitively by data scientists who come from diverse technical backgrounds.

The discussion of Docker containers will stop here; another article explaining how to make Docker containers actually work for data scientists will be published in a few days.

To summarize, we have two categories of solutions:

  1. Notebook & notebook kernel: “No notebook”, “Local notebook”, “Remote notebook on cloud”, “Sparkmagic”

  2. Spark itself: “Bridge local & remote spark”.

Despite the installation and connection issues, “Sparkmagic” is the recommended solution. However, there are often other unsolved issues that reduce productivity and hurt the developer experience:

  1. What if other languages, such as Python and R, are required to run on the clusters?
  2. What if the notebook needs to run every day? What if the notebook should run only if another notebook run succeeds?

Let’s go over the current solutions:

  1. Set up a remote Jupyter server and SSH tunneling (reference). This definitely works, but it takes time to set up, and the notebooks live on the remote servers.

  2. Set up a cron scheduler. Most data scientists are OK with a cron scheduler, but what if the notebook fails to run? Yes, a shell script can help, but is the majority of data scientists comfortable writing shell scripts? Even if the answer is yes, data scientists still have to 1. get access to the finished notebook and 2. get a status update. And even if some data scientists are happy writing shell scripts, why should every data scientist write their own script to automate exactly the same thing? (A sketch of a scheduled notebook run follows this list.)

  3. Set up Airflow. This is a very popular solution among data engineers and it can get the job done. If there are Airflow servers supported by data engineers or data platform engineers, data scientists can manage to learn Airflow's operators and get it to work for Jupyter notebooks.

  4. Set up Kubeflow or other Kubernetes-based solutions. Admittedly Kubeflow can get the job done, but in reality how many data scientists have access to Kubernetes clusters, including managed solutions running on the cloud?
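As an illustration of option 2, one common way to run a notebook non-interactively (for example from a cron entry) is papermill; it is not mentioned in the original article, and the notebook names and parameters below are hypothetical.

import papermill as pm

# Execute the notebook and keep the executed copy, which is useful when debugging failed runs
pm.execute_notebook(
    "daily_report.ipynb",            # input notebook (hypothetical name)
    "runs/daily_report_out.ipynb",   # executed output notebook
    parameters={"run_date": "2020-01-01"},
)

Even with this, the failure notifications and chaining of dependent notebooks described above still have to be scripted by hand, which is exactly the gap the rest of the article discusses.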

Let’s reframe the problems:

  1. How to develop on the local laptop with access to remote clusters?

  2. How to operate on the remote clusters?

The solution implemented by Bayesnote (a new open-source notebook project, https://github.com/Bayesnote/Bayesnote) follows these principles:

  1. “Auto installation, not manual”: data scientists should not waste their time installing anything on remote servers.
  2. “Local notebook, not remote notebooks”: local notebooks make for a better development experience and easier version control.
  3. “Works for everyone, not someone”: assume data scientists have no access to help from an engineering team, and support data scientists from diverse technical backgrounds.
  4. “Works for every language/framework”: works for any language, including Python, SQL, R, and Spark.
  5. “Combining development and operation, not separating them”: development and operation of a notebook can be done in one place. Data scientists should not spend time fixing issues caused by the gap between development and operation environments.

These ideas are implemented by Bayesnote's “auto self-deployment” feature. In the development phase, the only input required from data scientists is authentication information, such as an IP address and password. Bayesnote then deploys itself to the remote servers and starts listening for socket messages. Code is sent to the remote server and the results come back to the user.

[Image: Bayesnote auto self-deployment]

In the operation phase, a YAML file is specified, and Bayesnote runs the notebooks on the remote servers, retrieves the finished notebooks, and sends a status update to email or Slack.

[Image: Workflow YAML by Bayesnote]

(Users will configure this by filling out forms rather than writing YAML files, and the dependencies between notebooks will be visualized nicely.)

The (partial) implementation can be found on GitHub: https://github.com/Bayesnote/Bayesnote

Free data scientists from tooling issues so they can be happy and productive in their jobs.

Translated from: https://towardsdatascience.com/how-to-connect-jupyter-notebook-to-remote-spark-clusters-and-run-spark-jobs-every-day-2c5a0c1b61df
