【数据中台】开源项目(5)-Amoro

介绍

        Amoro is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and self-managed features for Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture。
Amoro定位是一个搭建在 Apache Iceberg之上的流式湖仓服务,流式强调向实时能力的拓展,服务则强调管理、标准化度量,以及其他可以抽象到基础软件中的湖仓一体能力。
通过 Amoro,用户可以在 Flink、Spark、Trino 等引擎上实现更加优化的 CDC、流式更新、OLAP 等功能, 结合数据湖高效的离线处理能力,Arctic 能够服务于更多流批混用的场景;同时,Arctic 的结构自优化、并发冲突解决以及标准化的湖仓管理功能,将有效减少用户在数据湖管理和优化上的负担。
开源地址: GitHub - NetEase/amoro: Amoro is a Lakehouse management system built on open data lake formats.

Amoro架构

The architecture of Amoro is as follows:
The core components of Amoro include:
  • AMS: Amoro Management Service provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all computing engines, which can also be combined with existing metadata services.
  • Plugins: Amoro provides a wide selection of external plugins to meet different scenarios.
  • Optimizers: The self-optimizing execution engine plugin asynchronously performs merging, sorting, deduplication, layout optimization, and other operations on all type table format tables.
  • Terminal: SQL command-line tools, provide various implementations like local Spark and Kyuubi.
  • LogStore: Provide millisecond to second level SLAs for real-time data processing based on message queues like Kafka and Pulsar.

支持的格式

Amoro can manage tables of different table formats, similar to how MySQL/ClickHouse can choose different storage engines. Amoro meets diverse user needs by using different table formats. Currently, Amoro supports three table formats:
  • Iceberg format: means using the native table format of the Apache Iceberg, which has all the features and characteristics of Iceberg.
  • Mixed-Iceberg format: built on top of Iceberg format, which can accelerate data processing using LogStore and provides more efficient query performance and streaming read capability in CDC scenarios.
  • Mixed-Hive format: has the same features as the Mixed-Iceberg tables but is compatible with a Hive table. Support upgrading Hive tables to Mixed-Hive tables, and allow Hive’s native read and write methods after upgrading.

支持的引擎

Iceberg format

Iceberg format tables use the engine integration method provided by the Iceberg community. For details, please refer to: Iceberg Docs.

Paimon format

Paimon format tables use the engine integration method provided by the Paimon community. For details, please refer to: Paimon Docs.

Mixed format

Amoro support multiple processing engines for Mixed format as below:
Processing Engine
Version
Batch Read
Batch Write
Batch Overwrite
Streaming Read
Streaming Write
Create Table
Alter Table
Flink
1.15.x, 1.16.x and 1.17.x
Spark
3.1, 3.2, 3.3
Hive
2.x, 3.x
Trino
406

应用场景

Self-managed streaming Lakehouse

Amoro makes it easier for users to handle the challenges of writing to a real-time data lake, such as ingesting append-only event logs or CDC data from databases. In these scenarios, the rapid increase of fragment and redundant files cannot be ignored. To address this issue, Amoro provides a pluggable streaming data self-optimizing mechanism that automatically compacts fragment files and removes expired data, ensuring high-quality table queries while reducing system costs.

Stream-and-batch-fused data pipeline

Whether in the AI or BI business field , the requirement for real-time analysis is becoming increasingly high. The traditional approach of using one streaming job to complete all data processing from the source to the end is no longer applicable. There is an increasing demand for layered construction of streaming data pipeline, and the traditional layered construction approach based on message queues can cause a inconsistency problem between the streaming and batch data processing. Building a unified stream-and-batch-fused pipeline based on new data lake formats is the future direction for solving these problems. Amoro fully leverages the characteristics of the new data lake table formats about unified streaming and batch processing, not only ensuring the quality of data in the streaming pileline but also enhancing critical features such as incremental reading for CDC data and streaming dimension table association, helping users to build a stream-and-batch-fused data pipeline.

Cloud-native Lakehouse

Currently, most data platforms and products are tightly coupled with their underlying infrastructure(such as the storage layer). The migration of infrastructure, such as switching to cloud-native OSS, may require extensive adaptation efforts or even be impossible. However, Amoro provides an infra-decoupled, lake-native architecture built on top of the infrastructure. This allows products based on Amoro to interact with the underlying infrastructure through a unified interface (Amoro Catalog service), protecting upper-layer products from the impact of infrastructure switch.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/192118.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

SQL优化的面试题

1. **针对慢查询进行性能优化**: - 使用数据库提供的工具(如MySQL的EXPLAIN语句)分析查询计划,找出潜在的性能问题。 - 优化查询语句的结构,确保索引被充分利用。 - 对于大表,考虑分页或缓存部分结…

6-2 统计大于等于平均分人数

函数 fun 的功能是:从m个学生的成绩中统计出高于和等于平均分的学生人数, 此人数由函数值返回。平均分通过形参传回,输入学生成绩时, 用-1结束输入,由程序自动统计学生人数。 函数接口定义: int fun ( fl…

【无标题】AttributeError: module ‘gradio‘ has no attribute ‘outputs‘

问题描述 AttributeError: module gradio has no attribute outputs 不知道作者用的是哪个gradio版本,最新的版本报错AttributeError: module gradio has no attribute outputs , 换一个老一点的版本会报错AttributeError: module gradio has no attribu…

软件工程(9-10章、11章、12章、13章小测)参考答案

软件工程9-10章小测 一 单项选择题(8分) 1、软件体系结构定义为()(1分) {component, connector, configuration} {models, connector} {object, collaboration, message, } 正确答案:{component, connector, config…

指针常量和常量指针的区别

文章目录 指针常量常量指针即是指针常量又是常量指针 指针常量 指针常量的本质是常量,表示的是 这个指针所指向的地址不能发生改变。即指针变量的值(即地址值)不能发生修改。但是指针所指向的那块内存里的值是可以修改的。 注意:…

canvas基础:绘制圆弧、圆形

canvas实例应用100 专栏提供canvas的基础知识,高级动画,相关应用扩展等信息。 canvas作为html的一部分,是图像图标地图可视化的一个重要的基础,学好了canvas,在其他的一些应用上将会起到非常重要的帮助。 文章目录 arc…

【大数据】HBase 中的列和列族

😊 如果您觉得这篇文章有用 ✔️ 的话,请给博主一个一键三连 🚀🚀🚀 吧 (点赞 🧡、关注 💛、收藏 💚)!!!您的支持 &#x…

K8S客户端二 使用Rancher部署服务

Rancher容器云管理平台 本博客中使用了四台服务器,如下 rancher服务器k8s-masterk8s-worker01k8s-worker02 一、主机硬件说明 序号硬件操作及内核1CPU 4 Memory 4G Disk 100GCentOS72CPU 4 Memory 4G Disk 100GCentOS73CPU 4 Memory 4G Disk 100GCentOS74CPU 4 …

Linux的基本指令(五)

目录 前言 tar指令(重要) 再次思考,为什么要打包和压缩呢? 实例:基于xshell进行压缩包在Windows与Linux之间的互传 实例:实现两个Linux系统之间的文件互传 bc指令 uname -r指令 重要的热键 关机与开机 扩展命令 shell及…

国际语音群呼系统的产品优势有哪些?为什么要使用国际语音群呼系统?

一、国际语音群呼系统的产品优势: 1.巨量群呼 支持大容量并发群呼,呼叫不受限制,充裕的线路保障造就百万级平台容量,可以短时间内同时拨打大量电话,让语音快速到达,大大提高发送效率; 2.自主…

WSL2 Linux22.04 图形化界面配置

Windows10/11 三步安装wsl2 Ubuntu20.04(任意盘) - 知乎 22.04需要将appxbundle改后缀名为zip,选出x64解压再改后缀名,再按上面文章操作 ubuntu 22.04国内镜像阿里云/163源/清华大学/中科大 WSL2 Ubuntugnome图形界面的安装血泪史&#xf…

Android12蓝牙框架

参考: https://evilpan.com/2021/07/11/android-bt/ https://source.android.com/docs/core/connect/bluetooth?hlzh-cn https://developer.android.com/guide/topics/connectivity/bluetooth?hlzh-cn https://developer.android.com/guide/components/intents-fi…

FL Studio Producer Edition21.0.3中文版安装详解(附下载链接)

fl studio 21中文版是Image-Line公司继20版本之后更新的水果音乐制作软件,很多用户不太理解,为什么新版本不叫fl studio 21或fl studio2024,非得直接跳到21.2版本,其实该版本是为了纪念该公司22周年,所以该版本也是推出…

点云从入门到精通技术详解100篇-基于三维点云的工件曲面轮廓检测与机器人打磨轨迹规划(下)

目录 4.3 机器人打磨轨迹规划 4.3.1 机器人运动学建模与分析 4.3.2 机器人轨迹规划算法

C++生成静态库和动态库

什么是静态库和动态库 在项目开发中,或多或少地需要使用到第三方(非编译器提供)的程序库,使用第三方的程序库能够减少重复造轮子的工作,提高开发效率。本文将介绍如何把自己的写的程序制作为程序库提供给他人使用&…

Android 13 - Media框架(18)- CodecBase

从这一节开始我们会回到上层来看ACodec的实现,在这之前我们会先了解ACodec的基类CodecBase。CodecBase.h 中除了声明有自身接口外,还定义有内部类 CodecCallback、BufferCallback,以及另一个基类 BufferChannelBase,接下来我们会一…

mysql基本命令

MySQL 是一个流行的关系型数据库管理系统,以下是一些常用的 MySQL 基本命令: 连接到 MySQL 服务器: mysql -u [username] -p [password]其中 [username] 是你的 MySQL 用户名,[password] 是对应的密码。执行该命令后&#xff0c…

使用gcloud SDK 管理和部署 Cloud run service

查看cloud run 上的service 列表: gcloud run services list > gcloud run services listSERVICE REGION URL LAST DEPLOYED BY LAST DEPL…

小米秒享3--非小米电脑

小米妙享中心是小米最新推出的一款功能,能够为用户们提供更加舒适便利的操作体验。简单的说可以让你的笔记本和你的小米手机联动,比如你在手机的文档,连接小米共享后,可以通过电脑进行操作。 对于非小米电脑想要体验终版秒享AIOT…

使用java批量生成Xshell session(*.xsh)文件

背景 工作中需要管理多套环境, 有时需要同时登陆多个节点, 且每个环境用户名密码都一样, 因此需要一个方案来解决动态的批量登录问题. XShell Xshell有session管理功能: 提供了包括记住登录主机、用户名、密码及登录时执行命令或脚本(js,py,vbs)的功能 session被存储在xsh文…