DataOps: Fully Automated, Low-Cost Data Pipelines Using AWS Lambda and Amazon EMR
Progression is continuous. Looking back over my 25-year career in information technology, I have experienced several phases of progression and adaptation.
I went from a newly hired recruit who carefully watched every single SQL command run to completion, to a confident DBA who scripted hundreds of SQL statements and ran them together as batch jobs using the cron scheduler. In the modern era I adapted to DAG tools like Oozie and Airflow, which not only provide job scheduling but can run a series of jobs as data pipelines in an automated fashion.
Lately, the adoption of the cloud has changed the whole meaning of automation.
STORAGE is cheap, COMPUTE is expensive
In the cloud era, we can design automation methods that were previously unheard of. I admit that cloud storage resources are getting cheaper by the day, but compute resources (high CPU and memory) are still relatively expensive. Keeping that in mind, wouldn't it be super cool if DataOps could help us save on compute costs? Let's find out how this can be done.
Typically, we run data pipelines as follows:
Data is collected at regular time intervals (daily, hourly, or by the minute) and saved to storage like S3. This is usually followed by data processing jobs that run on permanently provisioned distributed computing clusters like EMR.
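In a setup like this, the processing job is often wired to cron on the long-running cluster. A minimal sketch of such an entry (the schedule, script name, and log location are illustrative assumptions):

# Run the Spark processing job at the top of every hour on the permanent cluster
0 * * * * spark-submit s3://<BUCKET>/spark/hourly-processing.py >> /var/log/pipeline.log 2>&1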
Pros: Processing jobs run on a schedule. The permanent cluster can be utilized for other purposes, like querying with Hive, streaming workloads, etc.
Cons: There can be a delay between the time data arrives and when it gets processed. Compute resources may not be optimally utilized; there may be underutilization at times, wasting expensive $$$.
Here is an alternative that can help achieve the right balance between operations and costs. This method may not apply to all use cases, but where it does, rest assured it will save you a lot of $$$.
In this method the storage layer stays pretty much the same, except that an event notification is added to the storage which invokes a Lambda function when new data arrives. In turn, the Lambda function triggers the creation of a transient EMR cluster for data processing. A transient EMR cluster is a special type of cluster that deploys itself, runs the data processing job, and then self-destructs.
Pros: Processing jobs can start as soon as the data is available. No waits. Compute resources are optimally utilized. You only pay for what you use and save $$$.
Cons: The cluster cannot be utilized for other purposes, like querying with Hive, streaming workloads, etc.
Here is how the entire process is handled technically:
Assume the data files are delivered to s3://<BUCKET>/raw/files/renewable/hydropower-consumption/<DATE>_<HOUR>
Clone my git repo:
$ git clone https://github.com/mkukreja1/blogs.git
Create a new S3 bucket for running the demo. Remember to change the bucket name, since S3 bucket names are globally unique.
$ S3_BUCKET=lambda-emr-pipeline   # Edit as per your bucket name
$ REGION='us-east-1'              # Edit as per your AWS region
$ JOB_DATE='2020-08-07_2PM'       # Do not edit this
$ aws s3 mb s3://$S3_BUCKET
$ aws s3 cp blogs/lambda-emr/emr.sh s3://$S3_BUCKET/bootstrap/
$ aws s3 cp blogs/lambda-emr/hydropower-processing.py s3://$S3_BUCKET/spark/
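hydropower-processing.py is the Spark job the transient cluster will run. At its core, a job like this reads the raw data and writes curated Parquet; here is a minimal sketch (the CSV input format and exact paths are assumptions; see the repo for the real script):

from pyspark.sql import SparkSession

# Minimal sketch: read raw data from S3 and write Parquet to curated/
spark = SparkSession.builder.appName('hydropower-processing').getOrCreate()
df = spark.read.csv('s3://lambda-emr-pipeline/data/', header=True, inferSchema=True)  # assumed CSV input
df.write.mode('overwrite').parquet('s3://lambda-emr-pipeline/curated/2020-08-07_2PM/')  # snappy is Spark's default codec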
Create a role for the Lambda Function
$ aws iam create-role --role-name trigger-pipeline-role --assume-role-policy-document file://blogs/lambda-emr/lambda-policy.json
$ aws iam attach-role-policy --role-name trigger-pipeline-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
$ aws iam attach-role-policy --role-name trigger-pipeline-role --policy-arn arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess
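For reference, lambda-policy.json supplies the trust policy that lets the Lambda service assume this role. A standard Lambda trust policy looks like this (the repo's file may differ in minor details):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}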
Create the Lambda Function in AWS
$ ROLE_ARN=`aws iam get-role --role-name trigger-pipeline-role | grep Arn | sed 's/"Arn"://' | sed 's/,//' | sed 's/"//g'`; echo $ROLE_ARN
$ cat blogs/lambda-emr/trigger-pipeline.py | sed "s/YOUR_BUCKET/$S3_BUCKET/g" | sed "s/YOUR_REGION/'$REGION'/g" > lambda_function.py
$ zip trigger-pipeline.zip lambda_function.py
$ aws lambda delete-function --function-name trigger-pipeline
$ LAMBDA_ARN=`aws lambda create-function --function-name trigger-pipeline --runtime python3.6 --role $ROLE_ARN --handler lambda_function.lambda_handler --timeout 60 --zip-file fileb://trigger-pipeline.zip | grep FunctionArn | sed -e 's/"//g' -e 's/,//g' -e 's/FunctionArn//g' -e 's/: //g'`; echo $LAMBDA_ARN
$ aws lambda add-permission --function-name trigger-pipeline --statement-id 1 --action lambda:InvokeFunction --principal s3.amazonaws.com
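The actual handler ships as trigger-pipeline.py in the repo. As a rough sketch, a handler that launches a transient EMR cluster via boto3 might look like the following (the release label, instance types, and instance count are illustrative assumptions, not the repo's exact values):

import boto3

REGION = 'us-east-1'              # assumption: same region as the demo
S3_BUCKET = 'lambda-emr-pipeline' # assumption: the demo bucket

def lambda_handler(event, context):
    # Launch a transient EMR cluster; it terminates itself once all steps finish
    emr = boto3.client('emr', region_name=REGION)
    emr.run_job_flow(
        Name='transient-data-processing',
        ReleaseLabel='emr-6.0.0',            # illustrative release label
        Applications=[{'Name': 'Spark'}],
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'KeepJobFlowAliveWhenNoSteps': False,  # this is what makes the cluster transient
        },
        BootstrapActions=[{
            'Name': 'bootstrap',
            'ScriptBootstrapAction': {'Path': 's3://%s/bootstrap/emr.sh' % S3_BUCKET},
        }],
        Steps=[{
            'Name': 'spark-processing',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', 's3://%s/spark/hydropower-processing.py' % S3_BUCKET],
            },
        }],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )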
Finally, let's create the S3 event notification. This notification will invoke the Lambda function above.
$ cat blogs/lambda-emr/notification.json | sed "s/YOUR_LAMBDA_ARN/$LAMBDA_ARN/g" | sed "s/\ arn/arn/" > notification.json
$ aws s3api put-bucket-notification-configuration --bucket $S3_BUCKET --notification-configuration file://notification.json
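For context, an S3 notification configuration for this setup typically takes the following shape; the Id and the data/ prefix filter below are assumptions (the repo's notification.json has the real values):

{
  "LambdaFunctionConfigurations": [
    {
      "Id": "trigger-pipeline-on-new-data",
      "LambdaFunctionArn": "YOUR_LAMBDA_ARN",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": { "FilterRules": [{ "Name": "prefix", "Value": "data/" }] }
      }
    }
  ]
}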
Let's kick off the process by copying data to S3
$ aws s3 rm s3://$S3_BUCKET/curated/ --recursive
$ aws s3 rm s3://$S3_BUCKET/data/ --recursive
$ aws s3 sync blogs/lambda-emr/data/ s3://$S3_BUCKET/data/
If everything ran OK, you should be able to see a running cluster in EMR with Status=Starting.
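You can also confirm this from the command line (assuming your default region is configured):

$ aws emr list-clusters --active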
After some time the EMR cluster should change to Status=Terminated.
To check whether the Spark program was successful, check the S3 folder as below:
$ aws s3 ls s3://$S3_BUCKET/curated/2020-08-07_2PM/
2020-08-10 17:10:36 0 _SUCCESS
2020-08-10 17:10:35 18206 part-00000-12921d5b-ea28-4e7f-afad-477aca948beb-c000.snappy.parquet
Moving into the next phase of Data Engineering and Data Science, automation of data pipelines is becoming a critical operation. If done correctly, it has the potential to streamline not only operations but resource costs as well.
Hope you gained some valuable insights from the article above. Feel free to contact me if you need further clarification or advice.
Translated from: https://towardsdatascience.com/dataops-fully-automated-low-cost-data-pipelines-using-aws-lambda-and-amazon-emr-c4d94fdbea97