Delta Lake and Data Lakes: Getting Started

Data lakes are being adopted by more and more companies looking for an efficient way to store their data assets. The idea behind them is quite simple compared with the industry-standard data warehouse. This post explains the reasoning behind data lakes and walks through a practical use case with a tool called Delta Lake. Enjoy!

What is a data lake?

A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Amazon Web Services

At first glance, the rationale behind a data lake is similar to that of the widely used data warehouse. Although they fall into the same category, the logic behind them is quite different. Data in a warehouse is pre-processed before it is stored; in other words, the reason for storing it must be known and the data model well defined up front. A data lake takes a different approach: neither the purpose of the data nor its data model has to be defined before storing it. The two can be compared as follows:

+-----------+----------------------+-----------------------------+
|           | Data Warehouse       | Data Lake                   |
+-----------+----------------------+-----------------------------+
| Data      | Structured           | Structured and unstructured |
| Schema    | Schema on write      | Schema on read              |
| Storage   | High-cost storage    | Low-cost storage            |
| Users     | Business analysts    | Data scientists             |
| Analytics | BI and visualization | Data science                |
+-----------+----------------------+-----------------------------+
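
The "Schema" row is the least obvious one, so here is a minimal sketch of the difference in plain Scala (no Spark required). The names `Order`, `Warehouse`, and `Lake` are hypothetical and exist only for this illustration: the warehouse validates each record against a model before storing it, while the lake stores raw lines and applies a structure only when they are read.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

// Hypothetical data model, used only for this illustration.
final case class Order(id: Int, amount: Double)

object SchemaDemo {
  // Parse a raw "id,amount" line; None when the line does not fit the model.
  def parse(line: String): Option[Order] = line.split(",").map(_.trim) match {
    case Array(id, amount) => Try(Order(id.toInt, amount.toDouble)).toOption
    case _                 => None
  }

  // Schema on write: a record is validated before it is stored.
  final class Warehouse {
    private val rows = ArrayBuffer.empty[Order]
    def write(line: String): Boolean = parse(line) match {
      case Some(order) => rows += order; true
      case None        => false // rejected at write time
    }
    def read: Seq[Order] = rows.toList
  }

  // Schema on read: everything is stored as-is and interpreted later.
  final class Lake {
    private val raw = ArrayBuffer.empty[String]
    def write(line: String): Unit = raw += line
    def size: Int = raw.size
    def readAs[A](parser: String => Option[A]): Seq[A] =
      raw.toList.flatMap(line => parser(line).toList)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq("1,9.99", "not-a-valid-row", "2,20.0")
    val warehouse = new Warehouse
    val lake = new Lake
    input.foreach { line => warehouse.write(line); lake.write(line) }
    println(s"warehouse rows: ${warehouse.read.size}, lake raw lines: ${lake.size}")
    println(s"orders parsed from the lake: ${lake.readAs(parse)}")
  }
}
```

The lake keeps the malformed line, so a different parser could still make use of it later; the warehouse has already discarded it.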

Creating a data lake with Delta Lake OSS

Now let's apply that theory using Delta Lake OSS. Delta Lake is an open-source storage layer built on Apache Spark that is used to ingest, manage, and transform data in a data lake. Getting started is quite simple: you will need an Apache Spark project (use this link for more guidance). First, add Delta Lake as an SBT dependency:

libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
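
For context, a complete `build.sbt` might look like the sketch below. The project name and the Scala and Spark versions are assumptions, not taken from the original post; Delta Lake 0.5.0 targets Spark 2.4.x (2.4.2 or later), so a matching 2.4 release built for Scala 2.11 or 2.12 should work.

```scala
// build.sbt (sketch; everything except the delta-core line is an assumption)
name := "delta-lake-getting-started"
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.4",
  "io.delta"         %% "delta-core" % "0.5.0"
)
```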

Saving data to Delta

Next, let's create our first table. For this you need a Spark DataFrame, which can be an arbitrary dataset or data read from another format such as JSON or Parquet.

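// `spark` is the active SparkSession (predefined in spark-shell).
val data = spark.range(0, 50) // one `id` column with values 0 to 49
data.write.format("delta").save("/data/delta-table")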
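
The snippet above assumes a ready-made `spark` session, as in `spark-shell`. As a rough sketch of the "data read from another format" case in a standalone program, the example below builds its own local session, writes a small JSON dataset, and converts it to a Delta table. The object name `JsonToDelta`, the `convert` helper, and the temporary paths are hypothetical; running locally with `local[*]` is also an assumption.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object JsonToDelta {
  // Reads JSON from `jsonPath`, saves it as a Delta table at `deltaPath`,
  // and returns the number of rows found in the new table.
  def convert(spark: SparkSession, jsonPath: String, deltaPath: String): Long = {
    val df = spark.read.json(jsonPath)
    df.write.format("delta").save(deltaPath)
    spark.read.format("delta").load(deltaPath).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-delta")
      .master("local[*]") // assumption: a local run, not a cluster
      .getOrCreate()

    // Hypothetical locations inside a temporary directory.
    val base = Files.createTempDirectory("delta-demo")
    val jsonPath = base.resolve("events-json").toString
    val deltaPath = base.resolve("events-delta").toString

    // Produce some sample JSON so the example does not depend on existing files.
    spark.range(0, 50).write.json(jsonPath)

    println(s"rows written: ${convert(spark, jsonPath, deltaPath)}")
    spark.stop()
  }
}
```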

Reading data from Delta

Reading the data is as simple as writing it. Just specify the path and the correct format, the same as you would with CSV or JSON data.

val df = spark.read.format("delta").load("/data/delta-table")
df.show()

Updating the data in Delta

Delta Lake OSS supports a range of update options thanks to its ACID transactions. Let's use one of them to run a batch update that overwrites the existing data, using the following code:

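// Named `updated` so the snippet also compiles outside the REPL, where `data` is already defined.
val updated = spark.range(0, 100)
updated.write.format("delta").mode("overwrite").save("/data/delta-table")
// Re-read the table so the output reflects the overwritten contents.
spark.read.format("delta").load("/data/delta-table").show()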
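
Overwrite is only one of the standard Spark save modes that can be used with the Delta format. As a hedged sketch of another one, the example below appends rows to an existing table instead of replacing them. The object name `DeltaAppend`, the `appendRange` helper, and the temporary table location are hypothetical and not part of the original post.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object DeltaAppend {
  // Appends the ids [from, until) to the Delta table at `path`
  // and returns the table's total row count afterwards.
  def appendRange(spark: SparkSession, path: String, from: Long, until: Long): Long = {
    spark.range(from, until).write.format("delta").mode("append").save(path)
    spark.read.format("delta").load(path).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-append")
      .master("local[*]") // assumption: a local run, not a cluster
      .getOrCreate()

    // Hypothetical table location inside a temporary directory.
    val path = Files.createTempDirectory("delta-demo").resolve("delta-table").toString

    // Start from the same contents as the overwritten table above (ids 0 to 99).
    spark.range(0, 100).write.format("delta").mode("overwrite").save(path)

    println(s"rows after append: ${appendRange(spark, path, 100, 150)}")
    spark.stop()
  }
}
```

Because each write is committed as a single transaction, a reader sees either the table before the append or the table after it, never a partial result.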

Summary

I hope you found this post useful. If so, don't hesitate to like or share it. You can also follow me on social media if you fancy :)

Sources:
https://docs.delta.io/latest/quick-start.html
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Original article: https://medium.com/swlh/delta-lake-and-data-lakes-getting-started-41ce957ed0da
