Spark 1.1.1 Submitting Applications


Submitting Applications

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.

Bundling Your Application’s Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.
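For example, with the sbt-assembly plugin or a Maven shade/assembly plugin already configured in your build, the build and submit steps might look like the following sketch (the application class, jar name, and master URL are hypothetical):

# Build an assembly jar with sbt (requires the sbt-assembly plugin)
sbt assembly

# Or build with Maven (requires a shade/assembly plugin configured in the pom)
mvn -DskipTests package

# Submit the resulting jar; the exact output path depends on your build layout
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  /path/to/my-app-assembly-1.0.jar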

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
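As a minimal sketch, assuming the helper modules live in a local directory named mylibs/, the packaging and submission steps could look like this:

# Package the Python dependencies into a single archive
zip -r mylibs.zip mylibs/

# Distribute the archive with the application
./bin/spark-submit \
  --master local[4] \
  --py-files mylibs.zip \
  my_app.py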

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)*
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (see the example below the deployment notes).
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any

*A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.
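For instance, a submission that sets the deploy mode explicitly and passes a configuration value containing spaces (quoted, as described above) might look like the following; the property value is only illustrative:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode client \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/examples.jar \
  100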

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

To enumerate all options available to spark-submit run it with --help. Here are a few examples of common options:

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \  # can also be `yarn-client` for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

Master URLs

The master URL passed to Spark can be in one of the following formats:

  • local - Run Spark locally with one worker thread (i.e. no parallelism at all).
  • local[K] - Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
  • local[*] - Run Spark locally with as many worker threads as logical cores on your machine.
  • spark://HOST:PORT - Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
  • mesos://HOST:PORT - Connect to the given Mesos cluster. The port must be whichever one your master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://....
  • yarn-client - Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
  • yarn-cluster - Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
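To illustrate, the same application can be pointed at different cluster managers purely by changing the master URL; the class name and hostnames below are hypothetical:

# Local mode, one thread per logical core
./bin/spark-submit --master "local[*]" --class com.example.MyApp /path/to/my-app.jar

# Mesos masters coordinated through ZooKeeper
./bin/spark-submit --master mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --class com.example.MyApp /path/to/my-app.jar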

Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
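As an illustration of this precedence, conf/spark-defaults.conf might contain entries like the following (the values are examples only), after which the corresponding flags can be dropped from the command line:

# conf/spark-defaults.conf
spark.master            spark://207.184.161.138:7077
spark.executor.memory   4g

# --master and --executor-memory can now be omitted
./bin/spark-submit --class org.apache.spark.examples.SparkPi /path/to/examples.jar 100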

If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.

Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

  • file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
  • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
  • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
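A submission mixing these schemes might look like the following sketch (paths and hostnames are hypothetical): a large dependency already present on every worker node is referenced with local:, while a shared jar is pulled from HDFS:

./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars local:/opt/libs/heavy-dep.jar,hdfs://namenode:8020/libs/shared-dep.jar \
  /path/to/my-app.jar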

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.
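For Spark standalone, the worker-side cleanup settings can be passed to each worker, for example through SPARK_WORKER_OPTS in conf/spark-env.sh. The snippet below is only a sketch and assumes that cleanup also needs to be switched on via spark.worker.cleanup.enabled; the TTL value is illustrative:

# conf/spark-env.sh on each worker (TTL of 7 days, expressed in seconds)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"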

For python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.

Reposted from: https://www.cnblogs.com/njuzhoubing/p/4170984.html
