Delta Lake with Spark: What and Why?

Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:

  1. Data “overwrite” on the same path causing data loss in case of job failure.
  2. Updates in the data.

Sometimes I solved the above with design changes, sometimes by introducing another layer like Aerospike, and sometimes by maintaining historical incremental data.

Maintaining historical data is usually the most immediate solution, but I don’t like dealing with historical incremental data when it isn’t really required: at least for me, it introduces the pain of backfilling after failures, which may be unlikely but are inevitable.

The above two problems are “problems” because Apache Spark does not really support ACID. I know working with transactions was never Spark’s use case (hello, you can’t have everything), but sometimes there is a scenario, like my two problems above, where ACID compliance would come in handy.

When I read about Delta Lake and its ACID compliance, I saw it as a possible solution to both problems. Read on to find out how the two problems relate to ACID failures in Spark, and how Delta Lake can be seen as a savior.

What is Delta Lake?

The Delta Lake documentation introduces Delta Lake as:

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake key points:

  • Supports ACID
  • Enables Time travel
  • Enables UPSERT

How does Spark fail ACID?

Consider the following piece of code to remove duplicates from a dataset:

# Read from HDFS
df = spark.read.parquet("/path/on/hdfs") # Line 1
# Remove duplicates
df = df.distinct() # Line 2
# Cache the result, since the source path is about to be overwritten
df.cache() # Line 3
# Overwrite the data on the same path
df.write.parquet("/path/on/hdfs", mode="overwrite") # Line 4

For my Spark application running the above piece of code, consider a scenario where it fails on Line 4, that is, while writing the data. This may or may not lead to data loss [Problem #1, as mentioned above]. You can replicate the scenario by creating a test dataset and killing the job while it is in the write stage.

Let us try to understand ACID failure in Spark with the above scenario.

A in ACID stands for Atomicity

  • What is Atomicity: either all changes take place or none; the system is never in a halfway state.

  • How Spark fails: while writing data (at Line 4 above), if a failure occurs at a stage where the old data has been removed and the new data is not yet written, data loss occurs. We have lost the old data and were unable to write the new data due to the job failure, so atomicity fails. [This can vary according to the file output committer used; please read about file output committers to see how data writing takes place. The scenario I explained is for v2.]
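The delete-then-write behavior described above can be sketched in a few lines of plain Python (a toy model, not Spark itself; the directory layout, function name, and crash flag are invented for illustration). Once the old data has been removed, a crash before the new data lands loses both versions.

```python
import os
import shutil
import tempfile

def overwrite_delete_then_write(path, new_rows, crash_after_delete=False):
    """Toy model of a non-atomic 'overwrite': the old data is removed
    before the new data is written. A crash in between loses everything."""
    shutil.rmtree(path)  # old data is gone at this point...
    if crash_after_delete:
        raise RuntimeError("job killed during write")
    os.makedirs(path)
    with open(os.path.join(path, "part-00000"), "w") as f:
        f.writelines(r + "\n" for r in new_rows)

# Set up a directory with "old" data.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "part-00000"), "w") as f:
    f.write("old-row\n")

# Simulate a failure mid-overwrite: neither old nor new data survives.
try:
    overwrite_delete_then_write(data_dir, ["new-row"], crash_after_delete=True)
except RuntimeError:
    pass

print(os.path.exists(data_dir))  # False: the old data is lost
```

This is exactly the halfway state that atomicity is supposed to rule out.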

C in ACID stands for Consistency

  • What is Consistency: data must be consistent and valid in the system at all times.

  • How Spark fails: as seen above, in the case of failure and data loss, we are left with invalid data in the system, so consistency fails.

I in ACID stands for Isolation

  • What is Isolation: multiple transactions occur in isolation.

  • How Spark fails: consider two jobs running in parallel, one as described above and another that is also using the same dataset. If one job overwrites the dataset while the other is still using it, a failure might happen, so isolation fails.

D in ACID stands for Durability

  • What is Durability: changes, once made, are never lost, even in the case of system failure.

  • How Spark might fail: Spark doesn’t really affect durability, which is mainly governed by the storage layer, but since we are losing data in case of job failures, in my opinion it is a durability failure.

How does Delta Lake support ACID?

Delta Lake maintains a delta log in the path where data is written. The delta log maintains details like:

  • Metadata like

    - Paths added in the write operation.

    - Paths removed in the write operation.

    - Data size

    - Changes in data

  • Data Schema

  • Commit information like

    - Number of output rows

    - Output bytes

    - Timestamp

Sample log file in the _delta_log_ directory created after some operations:

[Image: sample JSON log file in the _delta_log_ directory]

After successful execution, a log file is created in the _delta_log_ directory. The important thing to note is that when you save your data as delta, no files, once written, are removed. The concept is similar to versioning.

By keeping track of paths removed, paths added, and other metadata in the _delta_log_, Delta Lake is ACID-compliant.

Versioning enables the time travel property of Delta Lake: I can go back to any state of the data, because all this information is maintained in the _delta_log_.
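The log-plus-versioning idea can be sketched in plain Python (a toy model, not the real Delta implementation; the class, file names, and log format are invented for illustration). Data files are only ever added, and a write becomes visible only when its JSON log entry is committed, so a crash before the commit leaves the previous version intact and any earlier version can still be read back.

```python
import json
import os
import tempfile

class ToyDeltaTable:
    """Toy sketch of the _delta_log idea: each write adds new data files
    plus a JSON log entry listing the files added/removed. Old files are
    never deleted, so old versions remain readable."""

    def __init__(self, path):
        self.path = path
        self.log_dir = os.path.join(path, "_delta_log")
        os.makedirs(self.log_dir)
        self.version = -1

    def write(self, rows, mode="overwrite"):
        # 1. Write the new data file first; old files stay in place.
        fname = f"part-{self.version + 1:05d}.txt"
        with open(os.path.join(self.path, fname), "w") as f:
            f.writelines(r + "\n" for r in rows)
        # 2. Committing = adding one log entry. If the job dies before
        #    this point, readers never see the new file.
        entry = {"add": [fname],
                 "remove": self._live_files() if mode == "overwrite" else []}
        self.version += 1
        with open(os.path.join(self.log_dir, f"{self.version:020d}.json"), "w") as f:
            json.dump(entry, f)

    def _live_files(self, as_of=None):
        # Replay the log up to a version to find the live data files.
        live = []
        upto = self.version if as_of is None else as_of
        for v in range(upto + 1):
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                entry = json.load(f)
            live = [p for p in live if p not in entry["remove"]] + entry["add"]
        return live

    def read(self, version_as_of=None):
        rows = []
        for fname in self._live_files(version_as_of):
            with open(os.path.join(self.path, fname)) as f:
                rows += [line.strip() for line in f]
        return rows

table = ToyDeltaTable(tempfile.mkdtemp())
table.write(["a", "b"])             # version 0
table.write(["a"])                  # version 1: an "overwrite"
print(table.read())                 # → ['a']
print(table.read(version_as_of=0))  # → ['a', 'b'] (time travel)
```

Because the overwrite only added a file and a log entry, version 0 is still fully recoverable, which is the essence of time travel.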

How does Delta Lake solve my two problems mentioned above?

  • With the support for ACID, if my job fails during the “overwrite” operation, data is not lost, because the changes won’t be committed to the log file in the _delta_log_ directory. Also, since Delta Lake does not remove old files during the “overwrite” operation, the old state of my data is maintained and there is no data loss. (Yes, I have tested it.)

  • Delta Lake supports the update operation as mentioned above, which makes dealing with updates in the data easier.
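Delta Lake exposes updates through its MERGE/UPSERT API (in delta-spark, roughly `DeltaTable.forPath(...).merge(...).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()`). The semantics can be sketched in plain Python (a toy model; the table contents and `key` handling are invented for illustration): matching rows are updated, new keys are inserted.

```python
def merge_upsert(target, updates, key):
    """Toy sketch of MERGE/UPSERT semantics: rows in `updates` replace
    matching rows in `target`, and rows with new keys are appended."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # matched: update; not matched: insert
    return sorted(merged.values(), key=lambda r: r[key])

users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bob"}]
changes = [{"id": 2, "name": "bobby"}, {"id": 3, "name": "cam"}]
print(merge_upsert(users, changes, "id"))
# → [{'id': 1, 'name': 'ana'}, {'id': 2, 'name': 'bobby'}, {'id': 3, 'name': 'cam'}]
```

With plain Parquet, achieving this requires rewriting the whole dataset yourself; with Delta, the merge is a single transactional operation.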

Until next time, ciao.

Translated from: https://towardsdatascience.com/delta-lake-with-spark-what-and-why-6d08bef7b963
