Delta Lake With Spark: What and Why?
Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:
- Data “overwrite” on the same path causing data loss in case of job failure.
- Updates in the data.
Sometimes I solved the above with design changes, sometimes by introducing another layer like Aerospike, and sometimes by maintaining historical incremental data.
Maintaining historical data is usually the most immediate solution, but I don’t really like dealing with historical incremental data when it isn’t truly required, because (at least for me) it introduces the pain of backfills after failures, which may be unlikely but are inevitable.
The above two problems are “problems” because Apache Spark does not really support ACID. I know working with transactions was never Spark’s use case (hello, you can’t have everything), but sometimes there is a scenario (like my two problems above) where ACID compliance would come in handy.
When I read about Delta Lake and its ACID compliance, I saw it as one of the possible solutions to my two problems. Please read on to find out how the two problems relate to ACID compliance failures and how Delta Lake can be seen as a savior.
What is Delta Lake?
The Delta Lake documentation introduces Delta Lake as:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake key points:
- Supports ACID
- Enables time travel
- Enables UPSERT
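Before looking at the failure modes, here is a minimal sketch of what Delta Lake usage looks like from Spark, assuming the delta-core package is on the classpath (for example, launching with --packages io.delta:delta-core_2.11:0.6.1 on Spark 2.4); the Delta table path is a hypothetical stand-in:

# Write a dataframe out as a Delta table instead of plain parquet
df = spark.read.parquet("/path/on/hdfs")
df.write.format("delta").mode("overwrite").save("/delta/events")  # hypothetical path

# Read it back; "delta" is just another data source format
df2 = spark.read.format("delta").load("/delta/events")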
How does Spark fail ACID?
Consider the following piece of code, which removes duplicates from a dataset:
# Read from HDFS
df = spark.read.parquet("/path/on/hdfs") # Line 1
# Remove duplicates
df = df.distinct() # Line 2
# Cache the data
df.cache() # Line 3
# Overwrite the data on the same path
df.write.parquet("/path/on/hdfs", mode="overwrite") # Line 4
Consider a scenario where my Spark application running the above piece of code fails at Line 4, that is, while writing the data. This may or may not lead to data loss [Problem #1, as mentioned above]. You can replicate the scenario by creating a test dataset and killing the job while it is in the write stage.
Let us try to understand ACID failure in Spark with the above scenario.
A in ACID stands for Atomicity
What is Atomicity: Either all changes take place or none; the system is never in a halfway state.
How Spark fails: While writing data (at Line 4 above), if a failure occurs at a point where the old data has been removed and the new data is not yet written, data loss occurs: we have lost the old data and could not write the new data because the job failed. Atomicity fails. [This can vary with the file output committer used; please do read about file output committers to see how data writing takes place. The scenario I explained applies to v2.]
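As an aside, the committer algorithm version is an ordinary Hadoop property that can be pinned when building the Spark session. A minimal sketch follows (the app name is arbitrary); note that this only narrows the failure window, it does not make the overwrite transactional:

from pyspark.sql import SparkSession

# v1 promotes task output on job commit (slower renames, smaller failure window);
# v2 moves task output into the destination as each task commits (faster, but a
# mid-job failure can leave partially written files behind).
spark = (
    SparkSession.builder
    .appName("committer-demo")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
    .getOrCreate()
)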
C in ACID stands for Consistency
What is Consistency: Data must be consistent and valid in the system at all times.
How Spark fails: As seen above, in the case of failure and data loss, we are left with invalid data in the system. Consistency fails.
I in ACID stands for Isolation
What is Isolation: Multiple transactions occur in isolation, without interfering with one another.
How Spark fails: Consider two jobs running in parallel, one as described above and another that uses the same dataset. If one job overwrites the dataset while the other is still reading it, the reader may fail. Isolation fails.
D in ACID stands for Durability
What is Durability: Changes once made are never lost, even in the case of system failure.
How Spark might fail: Spark doesn’t really affect durability; it is mainly governed by the storage layer. But since we lose data in case of job failures, in my opinion this is a durability failure.
How does Delta Lake support ACID?
Delta Lake maintains a delta log in the path where the data is written. The delta log records details like:
Metadata, like:
- Paths added in the write operation
- Paths removed in the write operation
- Data size
- Changes in data
- Data schema
Commit information, like:
- Number of output rows
- Output bytes
- Timestamp
Sample log file in the _delta_log_ directory, created after some operations:
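Each commit is a numbered JSON file of actions (for example, 00000000000000000001.json). The values below are made up for illustration, but the shape of the actions (commitInfo, add, remove) is what the log contains:

{"commitInfo":{"timestamp":1596188225966,"operation":"WRITE","operationParameters":{"mode":"Overwrite"},"operationMetrics":{"numFiles":"1","numOutputRows":"1000","numOutputBytes":"51234"}}}
{"add":{"path":"part-00000-new.snappy.parquet","size":51234,"modificationTime":1596188225000,"dataChange":true}}
{"remove":{"path":"part-00000-old.snappy.parquet","deletionTimestamp":1596188225966,"dataChange":true}}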
After a successful execution, a log file is created in the _delta_log_ directory. The important thing to note is that when you save your data as Delta, no files, once written, are removed. The concept is similar to versioning.
By keeping track of the paths removed, the paths added, and other metadata in the _delta_log_, Delta Lake is ACID-compliant.
Versioning enables the time travel property of Delta Lake: I can go back to any previous state of the data, because all of this information is maintained in the _delta_log_.
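Here is what time travel looks like in practice, a minimal sketch assuming /delta/events is an existing Delta table; the path, version, and timestamp are hypothetical:

# Load the table as it was at a specific version...
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")

# ...or as it was at a specific point in time
df_then = spark.read.format("delta").option("timestampAsOf", "2020-08-01").load("/delta/events")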
How does Delta Lake solve the two problems mentioned above?
- With the support for ACID, if my job fails during the “overwrite” operation, data is not lost, as the changes won’t be committed to the log file in the _delta_log_ directory. Also, since Delta Lake does not remove old files during the “overwrite” operation, the old state of my data is maintained and there is no data loss. (Yes, I have tested it.)
- Delta Lake supports the update operation as mentioned above, which makes dealing with updates in the data much easier; see the upsert sketch below.
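For the update problem, here is a minimal upsert sketch using the Delta Lake merge API. The table path, the updates_df dataframe, and the id join key are all hypothetical stand-ins:

from delta.tables import DeltaTable

# updates_df is assumed to hold the new and changed rows, keyed by "id"
target = DeltaTable.forPath(spark, "/delta/events")  # hypothetical existing Delta table
(
    target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()       # rows that already exist get the new values
    .whenNotMatchedInsertAll()    # rows that don't exist yet are inserted
    .execute()
)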
Until next time, ciao.
Source: https://towardsdatascience.com/delta-lake-with-spark-what-and-why-6d08bef7b963