Apache NiFi Cluster Installation

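Original article: https://pierrevillard.com/2016/08/13/apache-nifi-1-0-0-cluster-setup/
The article is well written, and its step-by-step English is easy to follow, so I have reposted it as is without translating it.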

As you may know, version 1.0.0-BETA of Apache NiFi was released a few days ago. The upcoming 1.0.0 release will be a great moment for the community, as it will mark the culmination of a lot of work over the last few months, with many new features being added.

The objective of the Beta release is to give people a chance to try this new version and to give feedback before the official major release, which will come shortly. If you want to preview this new version with a completely new look, you can download the binaries from the Apache NiFi download page, unzip them, and run NiFi (./bin/nifi.sh start, or ./bin/run-nifi.bat on Windows). Then you just have to access http://localhost:8080/nifi/.

The objective of this post is to briefly explain how to set up an unsecured NiFi cluster with this new version (a post on setting up a secured cluster will come shortly, with explanations on how to use a new tool that will be shipped with NiFi to ease the installation of a secured cluster).

One really important change with this new version is the new paradigm around cluster installation. From the NiFi documentation, we can read:

Starting with the NiFi 1.0 release, NiFi employs a Zero-Master Clustering paradigm. Each of the nodes in a NiFi cluster performs the same tasks on the data but each operates on a different set of data. Apache ZooKeeper elects one of the nodes as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator. The Cluster Coordinator is responsible for disconnecting and connecting nodes. As a DataFlow manager, you can interact with the NiFi cluster through the UI of any node in the cluster. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points to the cluster.

[Figure: zero-master cluster]

OK, let’s start with the installation. As you may know, it is strongly recommended to use an odd number of ZooKeeper instances, with at least 3 nodes (to maintain a majority, also called a quorum). NiFi comes with an embedded instance of ZooKeeper, but you are free to use an existing cluster of ZooKeeper instances if you want. In this article, we will use the embedded ZooKeeper option.
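To make the "odd number" advice concrete, here is a small sketch of the majority arithmetic. The function names are mine, not part of NiFi or ZooKeeper; the only fact used is that a quorum is a strict majority of the ensemble.

```shell
#!/bin/sh
# quorum_size N: smallest strict majority of an N-member ensemble.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: members that can be lost while a majority remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare a few ensemble sizes.
for n in 3 4 5; do
    echo "ensemble=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

A 4-member ensemble tolerates no more failures than a 3-member one, which is why an even-sized ensemble adds cost without adding resilience.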

I will use my computer as the first instance. I also launched two virtual machines (with a minimal CentOS 7). All 3 of my instances are able to communicate with each other on the required ports. On each machine, I configure my /etc/hosts file with:

192.168.1.17 node-3
192.168.56.101 node-2
192.168.56.102 node-1
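If you script this step, a sketch like the following could add the entries without duplicating them on a second run. The helper add_host_entry is hypothetical (my own name), and the example writes to a scratch file; pointing it at /etc/hosts would require root and is left to you.

```shell
#!/bin/sh
# add_host_entry FILE IP NAME: append "IP NAME" unless a line in FILE
# already ends with NAME (a simple check; it ignores multi-alias lines).
add_host_entry() {
    if ! grep -q "[[:space:]]$3\$" "$1" 2>/dev/null; then
        echo "$2 $3" >> "$1"
    fi
}

# Scratch file for illustration only; replace with /etc/hosts for real use.
hosts_file=$(mktemp)
add_host_entry "$hosts_file" 192.168.1.17 node-3
add_host_entry "$hosts_file" 192.168.56.101 node-2
add_host_entry "$hosts_file" 192.168.56.102 node-1
cat "$hosts_file"
```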

I deploy the binaries archive on my three instances and unzip it. I now have a NiFi directory on each of my nodes.

The first thing is to configure the list of the ZK (ZooKeeper) instances in the configuration file ./conf/zookeeper.properties. Since our three NiFi instances will run the embedded ZK instance, I just have to complete the file with the following properties:

server.1=node-1:2888:3888
server.2=node-2:2888:3888
server.3=node-3:2888:3888
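For a longer node list, the same lines can be generated rather than typed. This is only a sketch: zk_server_lines is a name I made up, and the ports 2888 and 3888 are simply the ones used above.

```shell
#!/bin/sh
# zk_server_lines HOST...: print one "server.N=HOST:2888:3888" line per host,
# numbering from 1 in the order the hosts are given.
zk_server_lines() {
    i=1
    for host in "$@"; do
        echo "server.$i=$host:2888:3888"
        i=$((i + 1))
    done
}

# Prints the three lines shown above; append them to
# ./conf/zookeeper.properties on every node.
zk_server_lines node-1 node-2 node-3
```

The order of the arguments matters, because the number N assigned here is the ID each node must later declare for itself.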

Then, everything happens in ./conf/nifi.properties. First, I specify that NiFi must run an embedded ZK instance, with the following property:

nifi.state.management.embedded.zookeeper.start=true

I also specify the ZK connect string:

nifi.zookeeper.connect.string=node-1:2181,node-2:2181,node-3:2181
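The connect string is just the host list joined with commas, each host followed by the ZK client port. A minimal sketch, assuming the client port 2181 used above (zk_connect_string is a hypothetical helper name):

```shell
#!/bin/sh
# zk_connect_string HOST...: join hosts as "HOST:2181,HOST:2181,...".
zk_connect_string() {
    result=""
    for host in "$@"; do
        # ${result:+$result,} adds a comma only when result is non-empty.
        result="${result:+$result,}$host:2181"
    done
    echo "$result"
}

echo "nifi.zookeeper.connect.string=$(zk_connect_string node-1 node-2 node-3)"
```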

As you may notice, the ./conf/zookeeper.properties file has a property named dataDir. By default, this value is set to ./state/zookeeper. If more than one NiFi node is running an embedded ZK, it is important to tell each server which one it is.

To do that, you need to create a file named myid and place it in ZK’s data directory. The content of this file should be the index of the server, as previously specified by the server.<number> property. On node-1, I’ll do:

mkdir -p ./state/zookeeper
echo 1 > ./state/zookeeper/myid

The same operation needs to be done on each node (don’t forget to change the ID).
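To avoid typing the wrong ID on a node, the ID can be looked up from the server.N lines themselves. The sketch below rests on two assumptions of mine: that the output of hostname matches the name used in zookeeper.properties, and that the script is run from the NiFi directory. zk_id_for is a hypothetical helper.

```shell
#!/bin/sh
# zk_id_for PROPS_FILE HOST: print N from the line "server.N=HOST:..." in
# PROPS_FILE, or nothing when HOST is not listed. HOST is used as a regular
# expression, so a dot in an FQDN matches any character (fine for a sketch).
zk_id_for() {
    sed -n "s/^server\.\([0-9][0-9]*\)=$2:.*/\1/p" "$1"
}

props=./conf/zookeeper.properties
if [ -f "$props" ]; then
    id=$(zk_id_for "$props" "$(hostname)")
    if [ -n "$id" ]; then
        mkdir -p ./state/zookeeper
        echo "$id" > ./state/zookeeper/myid
    else
        echo "no server.N entry found for $(hostname)" >&2
    fi
fi
```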

If you don’t do this, you may see the following kind of exception in the logs:

Caused by: java.lang.IllegalArgumentException: ./state/zookeeper/myid file is missing

Then we go to clustering properties. For this article, we are setting up an unsecured cluster, so we must keep:

nifi.cluster.protocol.is.secure=false

Then, we have the following properties:

nifi.cluster.is.node=true
nifi.cluster.node.address=node-1
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

I set the FQDN of the node I am configuring, and I choose the arbitrary 9999 port for the communication with the elected cluster coordinator. I apply the same configuration on my other nodes:

nifi.cluster.is.node=true
nifi.cluster.node.address=node-2
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

and

nifi.cluster.is.node=true
nifi.cluster.node.address=node-3
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

We have configured the exchanges between the nodes and the cluster coordinator. Now let’s move to the exchanges between the nodes (to balance the data of the flows).

We have the following properties:

nifi.remote.input.host=node-1
nifi.remote.input.secure=false
nifi.remote.input.socket.port=9998
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec

Again, I set the FQDN of the node I am configuring and I choose the arbitrary 9998 port for the Site-to-Site (S2S) exchanges between the nodes of my cluster. The same applies for all the nodes (just change the host property with the correct FQDN).

It is also important to set the FQDN for the web server property, otherwise we may get strange behaviors with all nodes identified as ‘localhost’ in the UI. Consequently, for each node, set the following property with the correct FQDN:

nifi.web.http.host=node-1
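Three properties carry the node’s own name, so it is easy to forget one when copying the file between nodes. A minimal sketch for setting them in one go follows. set_property is a hypothetical helper of mine, not something shipped with NiFi, and NODE_NAME is an environment variable I introduce for the example.

```shell
#!/bin/sh
# set_property FILE KEY VALUE: rewrite the existing "KEY=..." line in FILE.
# Assumes KEY is already present in FILE and VALUE contains no "|" or "&".
set_property() {
    tmp="$1.tmp.$$"
    sed "s|^$2=.*|$2=$3|" "$1" > "$tmp" && cat "$tmp" > "$1" && rm -f "$tmp"
}

# Name of the node being configured (hypothetical variable; defaults to node-1).
node=${NODE_NAME:-node-1}
conf=./conf/nifi.properties
if [ -f "$conf" ]; then
    set_property "$conf" nifi.cluster.node.address "$node"
    set_property "$conf" nifi.remote.input.host "$node"
    set_property "$conf" nifi.web.http.host "$node"
fi
```

Running it as NODE_NAME=node-2 sh set-node.sh (the script name is also just an example) on the second machine would apply the node-2 values.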

And that’s all! Easy, isn’t it?

OK, let’s start our nodes and let’s tail the logs to see what’s going on there!

./bin/nifi.sh start && tail -f ./logs/nifi-app.log

If you look at the logs, you should see that one of the nodes gets elected as the cluster coordinator, and then you should see heartbeats created by the three nodes and sent to the cluster coordinator every 5 seconds.
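Besides reading the logs, you could ask a node for the cluster state over HTTP. The sketch below is based on my recollection of the NiFi 1.x REST API: the /nifi-api/controller/cluster path and the "status":"CONNECTED" field are assumptions that you should check against the REST API documentation of your version. Both function names are mine.

```shell
#!/bin/sh
# count_connected: read cluster JSON on stdin and print how many nodes
# report the status CONNECTED (a plain text match, no JSON parser needed).
count_connected() {
    grep -o '"status" *: *"CONNECTED"' | wc -l | tr -d ' '
}

# check_cluster HOST: query one node on port 8080 and print the number
# of connected nodes. The endpoint path is an assumption (see above).
check_cluster() {
    curl -s "http://$1:8080/nifi-api/controller/cluster" | count_connected
}
```

If the endpoint behaves as assumed, check_cluster node-2 should print 3 once all three nodes have joined.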

You can connect to the UI using the node you want (you can have multiple users connected to different nodes; modifications will be applied on each node). Let’s go to:

http://node-2:8080/nifi

Here is what it looks like:

[Image: Screen Shot 2016-08-13 at 7.33.08 PM]

As you can see in the top-left corner, there are 3 nodes in our cluster. Besides, if we open the menu (button in the top-right corner) and select the cluster page, we have details on our three nodes:

[Image: Screen Shot 2016-08-13 at 7.35.28 PM]

We see that my node-2 has been elected as cluster coordinator, and that my node-3 is my primary node. This distinction is important because some processors must run on a single node (for data consistency), and in this case we will want them to run “On primary node” (example below).

We can display details on a specific node (“information” icon on the left):

[Image: Screen Shot 2016-08-13 at 7.35.48 PM]

OK, let’s add a processor like GetTwitter. Since the flow will run on all nodes (with balanced data between the nodes), this processor must run on a single node if we don’t want to duplicate data. So, in the scheduling strategy, we choose “On primary node”. This way, we don’t duplicate data, and if the primary node changes (because my node dies or gets disconnected), we won’t lose data: the workflow will still be executed.

[Image: Screen Shot 2016-08-13 at 7.45.19 PM]

Then I can connect my processor to a PutFile processor to save the tweets in JSON by setting a local directory (/tmp/twitter):

[Image: Screen Shot 2016-08-13 at 7.52.25 PM]

If I run this flow, all my JSON tweets will be stored on the primary node; the data won’t be balanced. To balance the data, I need to use an RPG (Remote Process Group). The RPG will connect to the coordinator to evaluate the load of each node and balance the data over the nodes. It gives us the following flow:

[Image: Screen Shot 2016-08-13 at 8.00.26 PM]

I have added an input port called “RPG”, then I have added a Remote Process Group that I connected to http://node-2:8080/nifi and I enabled transmission so that the Remote Process Group was aware of the existing input ports on my cluster. Then in the Remote Process Group configuration, I enabled the RPG input port. I then connected my GetTwitter to the Remote Process Group and selected the RPG input port. Finally, I connected my RPG input port to my PutFile processor.

When running the flow, I now have balanced data all over my nodes (I can check in the local directory /tmp/twitter on each node).

That’s all for this post. I hope you enjoyed it and that it will be helpful if you are setting up a NiFi cluster. All comments/remarks are very welcome, and I kindly encourage you to download Apache NiFi, try it, and give feedback to the community if you have any.
