Use these open-source tools for Data Warehousing

by Simon Späti

These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?

For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.

I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.

Druid — the data store

Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.

Why use Druid?

Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost effectiveness.

With the comparison of modern OLAP technologies in mind, I chose Druid over ClickHouse, Pinot, and Apache Kylin. Recently, Microsoft announced that it will add Druid to Azure HDInsight 4.0.

Why not Druid?

Carter Shanklin wrote a detailed post about Druid’s limitations at Hortonworks.com. The main issues are its limited support for SQL joins and advanced SQL capabilities.

The Architecture of Druid

Druid is scalable due to its cluster architecture. There are three different node types: the Middle Manager node, the Historical node, and the Broker.

The great thing is that you can add as many nodes as you want in the specific area that fits you best. If you have many queries to run, you can add more Brokers. Or, if a lot of data needs to be batch-ingested, you would add more Middle Managers, and so on.

A simple architecture is shown below. You can read more about Druid’s design here.

Apache Superset — the UI

The easiest way to query against Druid is through a lightweight, open-source tool called Apache Superset.

It is easy to use and has all common chart types like Bubble Chart, Word Count, Heatmaps, Boxplot and many more.

Druid provides a REST API, and in the newest version also a SQL query API. This makes it easy to use from any tool, whether a standard SQL client, an existing BI tool, or a custom application.

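For illustration, here is a minimal sketch of calling the SQL query API from Python with the requests library. The broker address (localhost:8082 is Druid’s default Broker port) and the wikipedia datasource are assumptions, not part of the article:

```python
import requests

# Minimal sketch: POST a SQL statement to Druid's SQL endpoint.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"  # broker address is a placeholder

sql = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia  -- placeholder datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": sql})
response.raise_for_status()

# Druid answers with a JSON array of result rows.
for row in response.json():
    print(row)
```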

Apache Airflow — the Orchestrator

As mentioned in Orchestrators — scheduling and monitoring workflows, choosing the orchestrator is one of the most critical decisions.

In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.

In more modern architectures, these tools aren’t enough anymore.

Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company.

I highly recommend you read Maxime Beauchemin’s blog post about Functional Data Engineering — a modern paradigm for batch data processing. It goes much deeper into how modern data pipelines should look.

Also, consider reading The Downfall of the Data Engineer, where Max explains breaking the “data silo” and much more.

Why use Airflow?

Apache Airflow is a very popular tool for task orchestration. Airflow is written in Python, and workflows are defined as Directed Acyclic Graphs (DAGs), which are also written in Python.

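To make that concrete, here is a minimal sketch of a DAG, assuming a stock Airflow 1.x installation. The DAG id, schedule, and task bodies are illustrative placeholders, not code from the article:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# DAG id, start date, and schedule are placeholders.
dag = DAG(
    dag_id="dwh_daily_load",
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

def extract():
    print("pull event data from the source system")

def load():
    print("hand the extracted data to Druid for ingestion")

# Plain Python functions become tasks via the PythonOperator.
extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # load runs only after extract succeeds
```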

Instead of encapsulating your critical transformation logic somewhere in a tool, you place it where it belongs: inside the orchestrator.

Another advantage is using plain Python. There is no need to encapsulate other dependencies or requirements, like fetching from an FTP server, copying data from A to B, or writing a batch file. You do that, and everything else, in the same place.

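Such a task is just ordinary Python, as in this sketch; the FTP host, credentials, and paths are placeholders:

```python
import shutil
from ftplib import FTP

def fetch_and_copy():
    """Fetch a file from an FTP server, then copy it from A to B.
    Host, credentials, and paths are placeholders."""
    with FTP("ftp.example.com") as ftp:
        ftp.login(user="user", passwd="secret")
        with open("/tmp/export.csv", "wb") as local_file:
            ftp.retrbinary("RETR export.csv", local_file.write)

    # Copying data from A to B needs nothing beyond the standard library.
    shutil.copy("/tmp/export.csv", "/data/staging/export.csv")
```

Wired into a DAG with a PythonOperator, as in the sketch above, this runs, retries, and logs like any other Airflow task.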

Features of Airflow

Moreover, you get a fully functional overview of all current tasks in one place.

More relevant features of Airflow: you write workflows as if you were writing programs, and external jobs like Databricks, Spark, etc. are no problem.

Job testing goes through Airflow itself (the airflow test CLI command, for example, runs a single task locally). That includes passing parameters to other jobs downstream, verifying what is running on Airflow, and seeing the actual code. The log files and other metadata are accessible through the web GUI.

(Re)running only parts of the workflow and its dependent tasks is a crucial feature that comes out of the box when you create your workflows with Airflow. Jobs/tasks run in a context: the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.

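As a sketch of that context passing, building on the hypothetical DAG above: in Airflow 1.x, setting provide_context=True hands each task the details of its run, so an idempotent task can (re)process exactly one slice of data:

```python
from airflow.operators.python_operator import PythonOperator

def load_partition(**context):
    # The scheduler injects the run's execution date; an idempotent task
    # can use it to (re)process exactly one partition of data.
    ds = context["ds"]  # e.g. "2018-11-29"
    print("loading partition %s into Druid" % ds)

load_partition_task = PythonOperator(
    task_id="load_partition",
    python_callable=load_partition,
    provide_context=True,  # Airflow 1.x; in 2.x the context is passed automatically
    dag=dag,  # the DAG object from the earlier sketch
)
```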

For many more features, visit the full list.

ETL with Apache Airflow

If you want to start with Apache Airflow as your new ETL tool, start with the ETL best practices with Airflow shared there. It has simple ETL examples with plain SQL, with HIVE, with Data Vault, Data Vault 2, and Data Vault with Big Data processes. It gives you an excellent overview of what’s possible and how you would approach it.

At the same time, there is a Docker container that you can use, meaning you don’t even have to set up any infrastructure. You can pull the container from here.

For the GitHub repo, follow the link to etl-with-airflow.

Conclusion

If you’re searching for an open-source data architecture, you cannot ignore Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy-to-use dashboard tool like Apache Superset.

My experience so far is that Druid is bloody fast and a perfect fit for replacing OLAP cubes in the traditional sense, but it still needs a smoother getting-started experience for installing clusters, ingesting data, viewing logs, and so on. If you need that, have a look at Imply, which was created by the founders of Druid. It builds all the services you need around Druid. Unfortunately, though, it’s not open-source.

Apache Airflow and its features as an orchestrator are something that has not yet seen much adoption in traditional Business Intelligence environments. I believe this change comes very naturally once you start using open-source and newer technologies.

And Apache Superset is an easy and fast way to get up and running and to show data from Druid. There are better tools, such as Tableau, but they are not free. That’s why Superset fits the ecosystem well if you’re already using the above open-source technologies. As an enterprise company, though, you might want to spend some money in that category, because the dashboards are what the users see at the end of the day.

Related Links:

  • Understanding Apache Airflow’s key concepts

  • How Druid enables analytics at Airbnb

  • Google launches Cloud Composer, a new workflow automation tool for developers

  • A fully managed workflow orchestration service built on Apache Airflow

  • Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark

  • ETL with Apache Airflow

  • What is Data Engineering and the future of Data Warehousing

  • Imply — Managed Druid platform (closed-source)

  • Ultra-fast OLAP Analytics with Apache Hive and Druid

Originally published at www.sspaeti.com on November 29, 2018.
