Open-Source Data Warehousing
by Simon Späti
Use these open-source tools for Data Warehousing
These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?
For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.
I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.
Druid — the data store
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
Why use Druid?
Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost effectiveness.
With the comparison of modern OLAP Technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. Recently, Microsoft announced they will add Druid to their Azure HDInsight 4.0.
Why not Druid?
Carter Shanklin wrote a detailed post about Druid’s limitations at Hortonworks.com. The main issue is with its support for SQL joins and advanced SQL capabilities.
The Architecture of Druid
Druid is scalable due to its cluster architecture. You have three different node types: the MiddleManager node, the Historical node, and the Broker.
The great thing is that you can add as many nodes as you want in the specific area that fits best for you. If you have many queries to run, you can add more Brokers. Or, if a lot of data needs to be batch-ingested, you would add middle managers and so on.
A simple architecture is shown below. You can read more about Druid’s design here.
Apache Superset — the UI
The easiest way to query against Druid is through a lightweight, open-source tool called Apache Superset.
It is easy to use and has all common chart types like Bubble Chart, Word Count, Heatmaps, Boxplot and many more.
Druid provides a REST API, and in the newest version also a SQL Query API. This makes it easy to use with any tool, whether it is standard SQL, an existing BI tool, or a custom application.
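To give a flavour of that SQL API, here is a minimal sketch of querying Druid from Python with the requests library. The broker address (localhost:8082 is Druid’s default Broker port), the "events" datasource, and the "channel" column are assumptions for illustration only; adjust them to your own cluster and schema.

```python
import requests

# Druid's SQL endpoint lives on the Broker; localhost:8082 is the default port.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"

# A hypothetical datasource called "events" with a "channel" dimension.
payload = {
    "query": """
        SELECT channel, COUNT(*) AS edits
        FROM "events"
        GROUP BY channel
        ORDER BY edits DESC
        LIMIT 10
    """
}

response = requests.post(DRUID_SQL_URL, json=payload)
response.raise_for_status()

# By default, Druid returns a JSON array with one object per result row.
for row in response.json():
    print(row["channel"], row["edits"])
```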
Apache Airflow — the Orchestrator
As mentioned in Orchestrators — Scheduling and monitoring workflows, choosing an orchestrator is one of the most critical decisions.
In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.
In more modern architectures, these tools aren’t enough anymore.
Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company when they are not locked away inside such a tool.
I highly recommend you read a blog post from Maxime Beauchemin about Functional Data Engineering — a modern paradigm for batch data processing. It goes much deeper into how modern data pipelines should be designed.
Also, consider reading The Downfall of the Data Engineer, where Max writes about breaking the “data silo” and much more.
Why use Airflow?
Apache Airflow is a very popular tool for this kind of task orchestration. Airflow is written in Python, and workflows are defined as Directed Acyclic Graphs (DAGs), also in Python.
Instead of encapsulating your critical transformation logic somewhere inside a tool, you place it where it belongs: inside the orchestrator.
Another advantage is using plain Python. There is no need to encapsulate other dependencies or requirements, like fetching from an FTP server, copying data from A to B, or writing a batch file. You do that and everything else in the same place.
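A minimal sketch of such a DAG is shown below, assuming Airflow 1.x-style imports (current at the time of writing). The task names, callables and schedule are hypothetical placeholders; the point is that the fetch-then-load logic lives as plain Python inside the orchestrator.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                          # retry a failed task twice
    "retry_delay": timedelta(minutes=5),
}

def fetch_from_ftp(**context):
    # plain Python: fetch a file from an FTP server, copy data from A to B, ...
    print("fetching source files ...")

def load_into_druid(**context):
    # e.g. kick off a Druid batch-ingestion task here
    print("loading data into Druid ...")

with DAG(
    dag_id="example_dwh_etl",
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
) as dag:
    fetch = PythonOperator(
        task_id="fetch_from_ftp",
        python_callable=fetch_from_ftp,
        provide_context=True,
    )
    load = PythonOperator(
        task_id="load_into_druid",
        python_callable=load_into_druid,
        provide_context=True,
    )

    fetch >> load  # fetch must finish before load starts
```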
Features of Airflow
Moreover, you get a fully functional overview of all current tasks in one place.
Other relevant features of Airflow are that you write workflows as if you are writing programs. External jobs like Databricks, Spark, etc. are no problem.
Job testing goes through Airflow itself. That includes passing parameters to other jobs downstream, or verifying what is running on Airflow and seeing the actual code. The log files and other metadata are accessible through the web GUI.
Being able to (re)run only parts of the workflow, together with its dependent tasks, is a crucial feature which comes out of the box when you create your workflows with Airflow. The jobs/tasks are run in a context: the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.
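As a hedged illustration of passing parameters downstream, here is a sketch using Airflow’s XCom mechanism, again with Airflow 1.x-style imports. The task names and the row count are made up; the idea is that a value returned by one task can be pulled by the next, and a failing task can be cleared and re-run on its own rather than re-running the whole DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract(**context):
    row_count = 42    # pretend we counted rows in a staging table
    return row_count  # returned values are pushed to XCom automatically

def validate(**context):
    # pull the value the upstream "extract" task pushed to XCom
    row_count = context["ti"].xcom_pull(task_ids="extract")
    if not row_count:
        # if this raises, only this task fails; it can be cleared and
        # re-run on its own without re-running the whole DAG
        raise ValueError("no rows extracted")
    print("validated %d rows" % row_count)

with DAG(
    dag_id="example_xcom",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
) as dag:
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract,
        provide_context=True,
    )
    validate_task = PythonOperator(
        task_id="validate",
        python_callable=validate,
        provide_context=True,
    )

    extract_task >> validate_task
```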
For many more features, visit the full list.
ETL with Apache Airflow
If you want to start with Apache Airflow as your new ETL tool, please start with the ETL best practices with Airflow guide. It has simple ETL examples with plain SQL, with HIVE, with Data Vault, Data Vault 2, and Data Vault with Big Data processes. It gives you an excellent overview of what’s possible and also how you would approach it.
At the same time, there is a Docker container that you can use, meaning you don’t even have to set up any infrastructure. You can pull the container from here.
For the GitHub repo, follow the link on etl-with-airflow.
Conclusion
If you’re searching for an open-source data architecture, you cannot ignore Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy-to-use dashboard tool like Apache Superset.
My experience so far is that Druid is bloody fast and a perfect fit for OLAP cube replacements in a traditional way, but it still needs a smoother way to set up clusters, ingest data, view logs, etc. If you need that, have a look at Imply, which was created by the founders of Druid. It provides all the services around Druid that you need. Unfortunately, though, it’s not open-source.
Apache Airflow and its features as an orchestrator are something that has not caught on much yet in traditional Business Intelligence environments. I believe this change comes very naturally when you start using open-source and newer technologies.
And Apache Superset is an easy and fast way to be up and running and showing data from Druid. There are better tools like Tableau, etc., but not for free. That’s why Superset fits well into the ecosystem if you’re already using the above open-source technologies. As an enterprise company, though, you might want to spend some money in that category, because the dashboard is what users see at the end of the day.
Related Links:
Understanding Apache Airflow’s key concepts
How Druid enables analytics at Airbnb
Google launches Cloud Composer, a new workflow automation tool for developers
A fully managed workflow orchestration service built on Apache Airflow
Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
ETL with Apache Airflow
What is Data Engineering and the future of Data Warehousing
Imply — Managed Druid platform (closed-source)
Ultra-fast OLAP Analytics with Apache Hive and Druid
Originally published at www.sspaeti.com on November 29, 2018.
Translated from: https://www.freecodecamp.org/news/open-source-data-warehousing-druid-apache-airflow-superset-f26d149c9b7/