Why Airflow Is a Great Fit for Rapido


Back in 2019, when we were building our data platform, we started with Hadoop 2.8 and Apache Hive, managing our own HDFS. A core requirement was managing workflows, whether data pipelines (ETLs), machine learning predictions, or general-purpose pipelines, where each task is different.

Some tasks simply run on a schedule on GCP, for example creating a Cloud Run instance or a Dataproc Spark cluster at a fixed time, while others can only be triggered based on data availability or Hive partition availability.

To sum up, the data platform requirements at that time that called for a scheduling system were:

  • ETL pipelines

  • Machine learning workflows

  • Maintenance: database backups

  • API calls: for example, managing Kafka Connect connectors

  • Naive cron-based jobs where the task is to spin up some infra in a public cloud, for example spinning up a new Dataproc cluster or scaling up an existing one, stopping instances, or scaling Cloud Run

  • Deployments: Git -> code deployments

ETL pipelines consist of a complex network of dependencies that are not just data dependencies, and these can vary with the use case: metrics, creating canonical forms of datasets, model training, and so on.

To create immutable datasets (no update, upsert, or delete), we started with a stack of Apache Hive, Apache Hadoop (HDFS), Apache Druid, Apache Kafka, and Apache Spark, with the following requirements or goals:

  • Creating reproducible pipelines: pipeline output needs to be deterministic, as with functional programming; if a task is retried or re-triggered, the outcome should be the same, i.e. pipelines and tasks need to be idempotent (see the sketch after this list)

  • Backfilling is a must since data can evolve

  • Robustness: easy changes to the configuration

  • Versioning of configuration, data, and tasks, i.e. the ability to easily add or remove tasks over time, or update existing DAG code

  • Transparency in data flow: we discussed that something similar to Jaeger tracing for the data platform would be tremendous, and looked at possible options like Atlas and Falcon
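
To make the reproducibility requirement above concrete, here is a minimal sketch of the pattern we aim for (not production code; `build_daily_metrics`, the paths, and the metric values are all hypothetical): a task whose output is keyed only by its logical run date and fully rewritten on every run, so a retry or re-trigger for the same date always produces the same partition.

```python
from datetime import date
from pathlib import Path

def build_daily_metrics(run_date: date, base_dir: str = "/data/metrics") -> Path:
    """Recompute the metrics partition for `run_date` from scratch.

    The output path is keyed only by the logical run date and the partition is
    fully overwritten on every run, so retries and manual re-triggers always
    produce the same result for the same date (idempotent).
    """
    partition = Path(base_dir) / f"dt={run_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)

    # Hypothetical aggregation step; in a real pipeline this would be a Spark
    # or Hive job reading the raw events for `run_date`.
    rows = [f"metric_a,{run_date},42", f"metric_b,{run_date},7"]

    # Write to a temp file first and then replace the target, so a failed
    # retry never leaves a half-written partition behind.
    tmp = partition / "part-00000.csv.tmp"
    tmp.write_text("\n".join(rows) + "\n")
    tmp.replace(partition / "part-00000.csv")
    return partition

if __name__ == "__main__":
    print(build_daily_metrics(date(2019, 11, 1)))
```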

Journey with Cloud Scheduler (Managed cron)

We started using Cloud Scheduler with shell and Python scripts, since it is fully managed by Google Cloud and setting up cron jobs takes just a few clicks. Cloud Scheduler is a fully managed cron job scheduler.

The main reason to go with Cloud Scheduler was that, unlike a self-managed cron instance, there is no single point of failure, and it is designed to provide "at least once" delivery for jobs, from plain cron tasks to automated resource creation; for example, we used to run jobs that created Virtual Machines, Cloud Run services, etc.

[Image: Cloud Scheduler Web UI]
[Image: GCP Cloud Scheduler, Create Job via UI]

Cloud Scheduler, or cron in general, doesn't offer dependency management, so we had to "hope dependent pipelines finished in the correct order". We also had to start from scratch for every pipeline or task, which doesn't scale. Although Cloud Scheduler supports time zones, we faced a few time zone problems in Druid ingestions and their downstream dependent tasks, and since jobs were submitted manually via the UI, human errors could creep into pipelines. Cloud Scheduler can retry on failure to reduce manual toil and intervention, but there is no task dependency management, and handling complex workflows or backfilling is not available out of the box. So within a few weeks we decided it was not suitable for running data and ML pipelines, since these involve a lot of backfilling requirements, cross-DAG dependencies, and may require data sharing between tasks.

Then we started looking into popular open-source workflow management platforms that can handle hundreds of tasks with failure handling and callback strategies, coded a few tasks, and deployed them in GCP to complete the POC. The projects considered were Apache Airflow, Apache Oozie, Luigi, Argo, and Azkaban.

Both Apache Oozie and Azkaban were top-level projects at the time with stable releases. Oozie is a reliable workflow scheduler system for managing Apache Hadoop jobs. It is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific tasks (such as Java programs and shell scripts).

[Image: Apache Oozie Job Browser]

Still, with Oozie we had to deal with XML definitions or zip up a directory containing the task-related dependencies, and the development workflow wasn't as convincing as Airflow's. Instead of managing multiple directories of XML configs and worrying about the dependencies between them, Python code can be tested, and since it is code, all the usual software best practices can be applied to it.
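
To illustrate the difference in developer experience, here is what a workflow looks like as Airflow code. This is only a minimal, hypothetical sketch (Airflow 1.10-style imports, made-up DAG and task names), not one of our production DAGs.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

def publish_metrics(**context):
    # Placeholder for the real publishing logic.
    print("publishing metrics for", context["ds"])

with DAG(
    dag_id="example_etl",
    default_args=default_args,
    start_date=datetime(2019, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting raw events")
    publish = PythonOperator(
        task_id="publish",
        python_callable=publish_metrics,
        provide_context=True,  # needed on Airflow 1.10-style PythonOperator
    )
    extract >> publish
```

Because a DAG file is just a Python module, it can be imported in a unit test to assert that it parses and that the task dependencies are what we expect, which is exactly the kind of practice that is hard to apply to a zipped directory of XML definitions.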

[Image: Azkaban Flow]

Azkaban has a distributed multiple-executor mode, beautiful UI visualizations, the option to retry failed jobs, good alerting options, and user action tracking (i.e. auditing), but since workflows are defined using property files, we ultimately did not consider this option.

Luigi was promising since deployment on Kubernetes was simple, and it handles dependency resolution, workflow management, visualization, failure handling, command-line integration, and much more.

[Image: Luigi Task Status]

But re-running old tasks or pipelines was not straightforward, i.e. there is no option to re-trigger tasks. Since Luigi relies on cron for scheduling, after updating the code we have to wait for the scheduler to pick it up again, whereas Airflow has its own scheduler, so we can re-trigger a DAG using the Airflow CLI or UI.

With Luigi being cron-scheduled, scaling seemed difficult, whereas Airflow's Kubernetes Executor was very promising. Creating a task was not as simple as in Airflow, and maintaining DAGs was a bit difficult since there were no resources for DAG versioning.

[Image: Feature comparison between Luigi, Airflow, and Oozie]

Finally, it was down to Airflow and Argo:

  • Both are designed for batch workflows built around a directed acyclic graph (DAG) of tasks.

  • Both provide flow control for error handling and conditional logic based on the output of upstream steps.

  • Both have great communities and are actively maintained by contributors.

Why choose Airflow over Argo?

A few main points at the time of the decision were that Airflow is tightly coupled to the Python ecosystem, and it is all about DAG code. Argo, on the other hand, provides the flexibility to schedule arbitrary steps, which is very useful since anything that can run in a container can be used with Argo. Still, the downside is longer development time, since we would have to prepare a Docker container for each task and push it via CI/CD to Google Container Registry, our private Docker repository.

[Image: Argo Workflow UI]

Argo natively schedules steps to run in a Kubernetes cluster, potentially across several hosts or nodes. Airflow also has the KubernetesPodOperator and the Kubernetes Executor, which sounded exciting since the executor creates a new pod for every task instance, so there is no need to worry about scaling Celery pods.
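
As a concrete illustration of the pod-per-task idea, here is a hedged sketch of a KubernetesPodOperator task (Airflow 1.10 contrib import path; the image, namespace, and DAG name are made up, not our actual configuration): the task's code lives entirely in the container image, and Airflow only schedules the pod.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="spark_aggregation_on_k8s",
    start_date=datetime(2019, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    spark_aggregation = KubernetesPodOperator(
        task_id="spark_aggregation",
        name="spark-aggregation",
        namespace="data-platform",
        image="gcr.io/our-project/spark-job:latest",  # hypothetical image in GCR
        arguments=["--date", "{{ ds }}"],             # templated logical date
        get_logs=True,                                # stream pod logs into the task log
        is_delete_operator_pod=True,                  # clean up the pod after it finishes
    )
```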

The Airflow and Argo CLIs are equally good. Airflow DAGs are expressed in a Python-based DSL, while Argo DAGs are expressed in Kubernetes YAML syntax, with Docker containers packaging all the task code.

[Image: Airflow DAG for an ETL job that runs a Spark job and updates a Hive table]

Airflow has colossal adoption; hence there is a massive list of "Operators" and "Hooks" supporting other runtimes like Bash, Spark, Hive, Druid, Pinot, etc. Hence, Airflow was the clear winner.
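
Here is an illustrative sketch using two of those community operators, mirroring the kind of job in the screenshot above (a Spark job followed by a Hive table update). Airflow 1.10 import paths are assumed; the connection ids, file paths, table name, and SQL are all made up, not our production pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.hive_operator import HiveOperator

with DAG(
    dag_id="rides_daily_etl",
    start_date=datetime(2019, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/aggregate_rides.py",   # hypothetical PySpark job
        conn_id="spark_default",                      # Spark connection defined in Airflow
        application_args=["--date", "{{ ds }}"],
    )

    refresh_partition = HiveOperator(
        task_id="refresh_partition",
        hql="ALTER TABLE rides_daily ADD IF NOT EXISTS PARTITION (dt='{{ ds }}')",
        hive_cli_conn_id="hive_cli_default",          # Hive CLI connection in Airflow
    )

    run_spark_job >> refresh_partition
```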

To sum up, Airflow provides:

  1. Reliability: Airflow provides retries, i.e. it can handle task failures by retrying them; if upstream dependencies succeed, downstream tasks can retry when they fail (several of the features in this list are illustrated in the sketch after it).

  2. Alerting: Airflow can report when a DAG fails or misses an SLA, and notify on any failure.

  3. Priority-based queue management, which ensures the most critical tasks are completed first.

  4. Resource pools, which can be used to limit the parallelism of arbitrary sets of tasks.

  5. Centralized configuration

  6. Centralized task metrics

  7. Centralized performance views: with views like the Gantt chart, we can look at the actual performance of a DAG and check, for example, whether a specific DAG spent five or ten minutes waiting for data to land before triggering the Spark job that aggregates it. These views help us analyze performance over time.
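
As a rough sketch of how several of these features are expressed in DAG code (Airflow 1.10-style imports; the emails, pool name, and commands are illustrative, and the "etl_pool" pool would have to be created in the Airflow UI or CLI before use), retries, failure alerting, SLAs, priority weights, and pools are all just task or default arguments:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def notify_failure(context):
    # Placeholder callback; in practice this could page on-call or post to a chat channel.
    print("task failed:", context["task_instance"].task_id)

default_args = {
    "owner": "data-platform",
    "retries": 3,                             # 1. reliability: retry failed tasks
    "retry_delay": timedelta(minutes=10),
    "email": ["data-alerts@example.com"],     # 2. alerting on failures and missed SLAs
    "email_on_failure": True,
    "sla": timedelta(hours=2),
    "on_failure_callback": notify_failure,
}

with DAG(
    dag_id="critical_etl",
    default_args=default_args,
    start_date=datetime(2019, 11, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    load_core_events = BashOperator(
        task_id="load_core_events",
        bash_command="echo loading core events",
        priority_weight=10,                   # 3. critical tasks leave the queue first
        pool="etl_pool",                      # 4. cap parallelism for this set of tasks
    )
```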

Future of Airflow:

We already have hundreds of DAGs running in production, and with Fluidity (our in-house Airflow DAG generator) we expect the number of DAGs to double or triple in the next few months. One of the most-watched features of Airflow 2.0 is the separation of DAG parsing from DAG scheduling, which can reduce the time wasted waiting (time when no tasks are running) and shorten task latency by letting workers fast-follow Airflow tasks.
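
For context on what a DAG generator does in general, here is a hedged sketch of the common config-driven DAG factory pattern. This is explicitly not how Fluidity is implemented, and the pipeline specs below are made up; it only illustrates why parsing and scheduling performance matters once DAGs are generated in bulk from declarative definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

PIPELINES = {
    "rides_hourly": {"schedule": "@hourly", "command": "echo aggregate rides"},
    "payments_daily": {"schedule": "@daily", "command": "echo reconcile payments"},
}

def build_dag(name, spec):
    dag = DAG(
        dag_id=name,
        start_date=datetime(2019, 11, 1),
        schedule_interval=spec["schedule"],
        catchup=False,
    )
    BashOperator(task_id="run", bash_command=spec["command"], dag=dag)
    return dag

# Airflow discovers DAGs by scanning module-level globals, so register each one there.
for pipeline_name, pipeline_spec in PIPELINES.items():
    globals()[pipeline_name] = build_dag(pipeline_name, pipeline_spec)
```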

Improved DAG versioning, to avoid the manual creation of versioned DAGs (i.e. to add a new task, we currently go from DAG_SCALE_UP to DAG_SCALE_UP_V1).

High availability of the scheduler, for performance, scalability, and resiliency reasons, ideally with an active-active model.

Development with Airflow is generally slow and involves a steep learning curve, which is why we built Fluidity (full blog coming soon), which spawns a dev replica that mirrors the production environment exactly using Docker and Minikube.

Work on data evaluation, reports, and data lineage with Airflow.

If you enjoyed this blog post, check out what we've posted so far over here, and keep an eye on this space for some really cool upcoming blogs in the near future. If you have any questions about the problems we face on the Data Platform at Rapido, or about anything else, please reach out to chethan@rapido.bike; we look forward to answering them!

*Special Thanks to the Rapido Data Team for making this blog possible.

Originally published at: https://medium.com/rapido-labs/why-is-airflow-a-great-fit-for-rapido-d8438ca0d1ab
