DataOps — Fully Automated, Low-Cost Data Pipelines Using AWS Lambda and Amazon EMR

Progression is continuous. Looking back over my 25-year career in information technology, I have experienced several phases of progression and adaptation.

I started as a newly hired recruit who carefully watched every single SQL command run to completion, then grew into a confident DBA who scripted hundreds of SQL statements and ran them together as batch jobs using the cron scheduler. In the modern era I adapted to DAG tools like Oozie and Airflow that not only provide job scheduling but can also run a series of jobs as data pipelines in an automated fashion.

Lately, the adoption of the cloud has changed the whole meaning of automation.

STORAGE is cheap, COMPUTE is expensive

In the cloud era, we can design automation methods that were previously unheard of. I admit that cloud storage resources are getting cheaper by the day, but compute resources (high CPU and memory) are still relatively expensive. Keeping that in mind, wouldn't it be super cool if DataOps could help us save on compute costs? Let's find out how this can be done:

Typically, we run data pipelines as follows:

Data is collected at regular time intervals (daily, hourly, or by the minute) and saved to storage such as S3. This is usually followed by data processing jobs that run on permanently provisioned distributed computing clusters like EMR.

Pros: Processing jobs run on a schedule. The permanent cluster can be utilized for other purposes such as querying with Hive, streaming workloads, etc.

Cons: There can be a delay between when data arrives and when it gets processed. Compute resources may not be optimally utilized; at times the cluster sits underutilized, wasting expensive $$$.

(Image by Author)

Here is an alternative that can help achieve the right balance between operations and cost. This method may not apply to all use cases, but when it does, rest assured it will save you a lot of $$$.

(Image by Author)

In this method the storage layer stays pretty much the same, except that an event notification is added to the storage bucket that invokes a Lambda function when new data arrives. In turn, the Lambda function triggers the creation of a transient EMR cluster for data processing. A transient EMR cluster is a special type of cluster that deploys itself, runs the data processing job, and then self-destructs.

Pros: Processing jobs can start as soon as the data is available, with no waiting. Compute resources are optimally utilized. You only pay for what you use and save $$$.

Cons: The cluster cannot be utilized for other purposes such as querying with Hive, streaming workloads, etc.

Here is how the entire process is handled technically:

Assume the data files are delivered to s3://<BUCKET>/raw/files/renewable/hydropower-consumption/<DATE>_<HOUR>

Clone my git repo:

$ git clone https://github.com/mkukreja1/blogs.git

Create a new S3 bucket for running the demo. Remember to change the bucket name, since S3 bucket names are globally unique.

$ S3_BUCKET=lambda-emr-pipeline  # Edit as per your bucket name
$ REGION='us-east-1'             # Edit as per your AWS region
$ JOB_DATE='2020-08-07_2PM'      # Do not edit this
$ aws s3 mb s3://$S3_BUCKET
$ aws s3 cp blogs/lambda-emr/emr.sh s3://$S3_BUCKET/bootstrap/
$ aws s3 cp blogs/lambda-emr/hydropower-processing.py s3://$S3_BUCKET/spark/

Create a role for the Lambda Function

$ aws iam create-role --role-name trigger-pipeline-role --assume-role-policy-document file://blogs/lambda-emr//lambda-policy.json
$ aws iam attach-role-policy --role-name trigger-pipeline-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
$ aws iam attach-role-policy --role-name trigger-pipeline-role --policy-arn arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess
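
The contents of blogs/lambda-emr/lambda-policy.json are not shown in this post. As a rough illustration only, a Lambda trust (assume-role) policy generally looks like the one in the sketch below, expressed here through boto3 rather than the CLI so you can see the document itself; the repo's actual file may differ.

# Hypothetical sketch: the same role setup via boto3, showing a standard Lambda trust policy.
# The repo's lambda-policy.json is assumed to contain something equivalent.
import json
import boto3

iam = boto3.client('iam')

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Create the role and attach the same managed policies as the CLI commands above
iam.create_role(
    RoleName='trigger-pipeline-role',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName='trigger-pipeline-role',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole',
)
iam.attach_role_policy(
    RoleName='trigger-pipeline-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess',
)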

Create the Lambda Function in AWS

$ ROLE_ARN=`aws iam get-role --role-name trigger-pipeline-role | grep Arn | sed 's/"Arn"://' | sed 's/,//' | sed 's/"//g'`;echo $ROLE_ARN
$ cat blogs/lambda-emr/trigger-pipeline.py | sed "s/YOUR_BUCKET/$S3_BUCKET/g" | sed "s/YOUR_REGION/'$REGION'/g" > lambda_function.py
$ zip trigger-pipeline.zip lambda_function.py
$ aws lambda delete-function --function-name trigger-pipeline
$ LAMBDA_ARN=`aws lambda create-function --function-name trigger-pipeline --runtime python3.6 --role $ROLE_ARN --handler lambda_function.lambda_handler --timeout 60 --zip-file fileb://trigger-pipeline.zip | grep FunctionArn | sed -e 's/"//g' -e 's/,//g' -e 's/FunctionArn//g' -e 's/: //g'`;echo $LAMBDA_ARN
$ aws lambda add-permission --function-name trigger-pipeline --statement-id 1 --action lambda:InvokeFunction --principal s3.amazonaws.com
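
The handler packaged above comes from blogs/lambda-emr/trigger-pipeline.py, which is not reproduced in this post. As a minimal sketch of what such a trigger function typically looks like: it reads the bucket and key from the S3 event and launches a transient EMR cluster through boto3's run_job_flow, with KeepJobFlowAliveWhenNoSteps=False so the cluster terminates itself once the Spark step completes. The EMR release, instance types, step arguments, and IAM role names below are illustrative assumptions and may differ from the repo.

# Illustrative sketch only -- the repo's trigger-pipeline.py may differ in detail.
import boto3

emr = boto3.client('emr', region_name='us-east-1')   # region is an assumption

def lambda_handler(event, context):
    # Bucket and key of the newly arrived object, taken from the S3 event notification
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # KeepJobFlowAliveWhenNoSteps=False plus a single Spark step makes the cluster
    # transient: it bootstraps, runs the step, and then terminates itself.
    response = emr.run_job_flow(
        Name='transient-hydropower-pipeline',
        ReleaseLabel='emr-5.30.1',                     # assumed EMR release
        Applications=[{'Name': 'Spark'}],
        Instances={
            'MasterInstanceType': 'm5.xlarge',         # assumed instance sizing
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'KeepJobFlowAliveWhenNoSteps': False,
        },
        BootstrapActions=[{
            'Name': 'bootstrap',
            'ScriptBootstrapAction': {'Path': f's3://{bucket}/bootstrap/emr.sh'},
        }],
        Steps=[{
            'Name': 'process-hydropower-data',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit',
                         f's3://{bucket}/spark/hydropower-processing.py',
                         f's3://{bucket}/{key}'],      # pass the new object as input (assumed)
            },
        }],
        JobFlowRole='EMR_EC2_DefaultRole',             # assumed default EMR roles
        ServiceRole='EMR_DefaultRole',
    )
    return response['JobFlowId']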

Finally, let's create the S3 event notification. This notification will invoke the Lambda function above.

$ cat blogs/lambda-emr/notification.json | sed "s/YOUR_LAMBDA_ARN/$LAMBDA_ARN/g" | sed "s/\    arn/arn/" > notification.json
$ aws s3api put-bucket-notification-configuration --bucket $S3_BUCKET --notification-configuration file://notification.json
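
The generated notification.json itself is not shown here. For reference, the equivalent configuration expressed through boto3 would look roughly like the sketch below; the s3:ObjectCreated:* event and the data/ prefix filter are assumptions about how the pipeline is wired, so check the repo's template for the exact values.

# Hypothetical boto3 equivalent of the put-bucket-notification-configuration call above.
# LAMBDA_ARN, the event type, and the prefix filter are illustrative assumptions.
import boto3

s3 = boto3.client('s3')

S3_BUCKET = 'lambda-emr-pipeline'
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:trigger-pipeline'  # placeholder ARN

s3.put_bucket_notification_configuration(
    Bucket=S3_BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': LAMBDA_ARN,
            'Events': ['s3:ObjectCreated:*'],
            # Only fire for new objects under the data/ prefix (assumed layout)
            'Filter': {
                'Key': {
                    'FilterRules': [{'Name': 'prefix', 'Value': 'data/'}]
                }
            },
        }]
    },
)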

(Image by Author)

Let's kick off the process by copying data to S3

$ aws s3 rm s3://$S3_BUCKET/curated/ --recursive
$ aws s3 rm s3://$S3_BUCKET/data/ --recursive
$ aws s3 sync blogs/lambda-emr/data/ s3://$S3_BUCKET/data/

If everything ran OK, you should be able to see a running cluster in EMR with Status=Starting.
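
If you prefer to verify this from a script rather than the console, a quick check with boto3 (listing clusters that are currently starting or running) might look like this:

# List EMR clusters that are currently spinning up or running.
import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

clusters = emr.list_clusters(ClusterStates=['STARTING', 'BOOTSTRAPPING', 'RUNNING'])
for c in clusters['Clusters']:
    print(c['Id'], c['Name'], c['Status']['State'])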

(Image by Author)

After some time the EMR cluster should change to Status=Terminated.

(Image by Author)

To check whether the Spark program was successful, inspect the S3 folder as shown below:

$ aws s3 ls s3://$S3_BUCKET/curated/2020-08-07_2PM/
2020-08-10 17:10:36 0 _SUCCESS
2020-08-10 17:10:35 18206 part-00000-12921d5b-ea28-4e7f-afad-477aca948beb-c000.snappy.parquet
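
The processing logic itself lives in blogs/lambda-emr/hydropower-processing.py and is not shown in the post. As a rough sketch only, a Spark job that produces partitioned, snappy-compressed Parquet output like the listing above might look something like this; the input format, transformation step, and the way the bucket and job date are supplied are assumptions for illustration.

# Hypothetical sketch of a Spark job writing curated Parquet output.
# Bucket name and job date are hard-coded here for illustration; the real
# script may receive them as arguments or have them substituted at deploy time.
from pyspark.sql import SparkSession

S3_BUCKET = 'lambda-emr-pipeline'
JOB_DATE = '2020-08-07_2PM'

spark = SparkSession.builder.appName('hydropower-processing').getOrCreate()

# Read the raw hydropower-consumption files delivered under the data/ prefix
raw = (spark.read
       .option('header', 'true')
       .option('inferSchema', 'true')
       .csv(f's3://{S3_BUCKET}/data/'))

# Placeholder transformation step -- the real job applies its own business logic here
curated = raw.dropDuplicates()

# Write the curated output as Parquet under curated/<JOB_DATE>/ (snappy by default)
(curated.write
 .mode('overwrite')
 .parquet(f's3://{S3_BUCKET}/curated/{JOB_DATE}/'))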

As we move into the next phase of Data Engineering and Data Science, automation of data pipelines is becoming a critical operation. If done correctly, it has the potential to streamline not only operations but resource costs as well.

I hope you gained some valuable insights from the article above. Feel free to contact me if you need further clarification or advice.

Translated from: https://towardsdatascience.com/dataops-fully-automated-low-cost-data-pipelines-using-aws-lambda-and-amazon-emr-c4d94fdbea97
