The term “data warehouse” is widely used in the data analytics world, yet people who are new to data analytics often ask what a data warehouse actually is.
This post explains what a data warehouse is, and when and why to consider setting one up.
P.S. This is a section of a guidebook our team is writing, The Analytics Setup Guidebook. If you are interested in learning more about the high-level concepts or best practices of modern BI stacks, feel free to check out the link to see our progress.
What is a data warehouse?
A data warehouse is a type of analytics database that stores and processes your data for the purpose of analytics. Your data warehouse will handle two main functions of your analytics: store your analytical data & process your analytical data.
Why do you need one? There are two main reasons:
- First, you can’t easily combine data from multiple business functions if it sits in different sources.
- Second, your source systems are not designed to run heavy analytics, and doing so might jeopardize your business operations because it increases the load on those systems.
Your data warehouse is the centerpiece of every step of your analytics pipeline, and it serves three main purposes:
- Storage: In the consolidate (Extract & Load) step, your data warehouse receives and stores data coming from multiple sources.
- Process: In the process (Transform & Model) step, your data warehouse handles most (if not all) of the intensive processing generated by the transform step.
- Access: In the reporting (Visualize & Deliver) step, the data for each report is first gathered within the data warehouse, then visualized and delivered to end-users.
At the moment, most data warehouses use SQL as their primary querying language.
When is the right time to get a data warehouse?
The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.
At an early stage, you can probably get by without a data warehouse, and connect a business intelligence (BI) tool directly to your production database (A simple BI setup for people just starting out).
However, if you are still not sure whether a data warehouse is the right thing for your company, consider the following questions:
First, do you need to analyze data from different sources?
At some point in your company’s life, you will need to combine data from different internal tools in order to make better, more informed business decisions.
For instance, if you run a restaurant and want to analyze the ratio of orders to wait staff (which hours of the week the staff are busiest vs. most free), you need to combine your sales data (from the POS system) with your staff duty data (from the HR system).
Those analyses are a lot easier to do if your data sits in one central location.
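To make the restaurant example concrete, here is a minimal sketch of what that combined analysis could look like once both datasets live in one database. It uses Python’s built-in `sqlite3` module as a stand-in for a warehouse; the table names (`pos_orders`, `staff_duty`), their columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical tables: in practice these would be loaded from the POS and HR systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pos_orders (order_id INTEGER PRIMARY KEY, hour_of_week INTEGER);
    CREATE TABLE staff_duty (hour_of_week INTEGER PRIMARY KEY, staff_on_duty INTEGER);
    INSERT INTO pos_orders (hour_of_week) VALUES (12), (12), (12), (12), (19), (19);
    INSERT INTO staff_duty VALUES (12, 2), (19, 3);
""")

# Because both tables sit in the same place, one join answers the question.
orders_per_staff = conn.execute("""
    SELECT s.hour_of_week,
           COUNT(o.order_id) AS num_orders,
           s.staff_on_duty,
           1.0 * COUNT(o.order_id) / s.staff_on_duty AS orders_per_staff
    FROM staff_duty AS s
    LEFT JOIN pos_orders AS o ON o.hour_of_week = s.hour_of_week
    GROUP BY s.hour_of_week, s.staff_on_duty
    ORDER BY s.hour_of_week
""").fetchall()

for hour, num_orders, staff, ratio in orders_per_staff:
    print(f"hour {hour}: {num_orders} orders / {staff} staff = {ratio:.2f}")
```

If the two datasets stayed in separate systems, you would have to export and stitch them together by hand before you could run anything like this.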
Second, do you need to separate your analytical data from your transactional data?
As mentioned, your transactional systems are not designed for analytical purposes. So if you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly.
It’s a much better idea to get a data warehouse (one that’s designed for complex querying) and transfer the analytical data there. That way, the performance of your app isn’t affected by your analytics work.
Third, is your original data source not suitable for querying?
For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse, in order for data analysts to work effectively with it.
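Part of what “transferring” document data to a warehouse involves is reshaping nested documents into flat, table-like rows. The sketch below shows one simple way this might be done in plain Python. The `flatten` helper and the sample `order_doc` are hypothetical, and the sketch deliberately ignores arrays, which real pipelines usually have to handle too.

```python
def flatten(document, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined key names."""
    row = {}
    for key, value in document.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents, prefixing their keys with the parent name.
            row.update(flatten(value, column, sep))
        else:
            row[column] = value
    return row


# Hypothetical document shaped like something an app might store in MongoDB.
order_doc = {
    "_id": "a1",
    "customer": {"name": "Ann", "address": {"city": "Hanoi"}},
    "total": 42.5,
}

row = flatten(order_doc)
print(row)
```

Each key of the resulting `row` maps naturally to a column in a warehouse table, which is the shape SQL-based BI tools expect.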
Fourth, do you want to increase the performance of your analytical queries?
If your transactional data consists of hundreds of thousands of rows or more, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Without them, analytical queries can become very slow, and they also put an unnecessary burden on your database.
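As a rough illustration of a summary table, the sketch below aggregates a raw `orders` table into a `daily_sales` table with one row per day. It runs on Python’s built-in `sqlite3` module purely for convenience; the table names, columns, and sample data are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    INSERT INTO orders (order_date, amount) VALUES
        ('2024-01-01', 10.0), ('2024-01-01', 15.0), ('2024-01-02', 7.5);

    -- Summary table: one row per day instead of one row per order.
    CREATE TABLE daily_sales AS
    SELECT order_date, COUNT(*) AS num_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date;
""")

# Reports now read the small summary table rather than scanning every order.
daily_sales = conn.execute(
    "SELECT order_date, num_orders, revenue FROM daily_sales ORDER BY order_date"
).fetchall()
print(daily_sales)
```

In a real setup the summary table would be rebuilt or updated on a schedule, so reports stay fast while the raw table keeps growing.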
If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.
That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.
Which Data Warehouse Should I Pick?
Here are some common data warehouses that you may pick from:
- Amazon Redshift
- Google BigQuery
- Snowflake
- ClickHouse (self-hosted)
- Presto (self-hosted)
If you’re just getting started and don’t have a strong preference, we suggest that you go with Google BigQuery for the following reasons:
- BigQuery is free for the first 10 GB of storage and the first 1 TB of queries each month. After that, it’s pay-per-usage.
- BigQuery is fully managed (serverless): there is no physical (or virtual) server to spin up or manage.
- As a result of its architecture, BigQuery auto-scales: it automatically determines the right amount of computing resources to allocate to each query, depending on the query’s complexity and the amount of data scanned, so you don’t have to fine-tune it manually.
(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).
However, if you have a rapidly increasing volume of data, or if you have complex/special use cases, you will need to carefully evaluate your options.
Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.
But it is, we think, a good start:
What makes a data warehouse different from a normal SQL database?
At this point some of you might be asking:
“Hey, isn’t a data warehouse just a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”
The short answer is: yes, you can.
The long answer is: it depends. First, we need to understand a few concepts.
Transactional Workloads vs Analytical Workloads
It is important to understand the difference between two kinds of database workloads: transactional workloads and analytical workloads.
A transactional workload is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the database to fetch this product, and return the result to the application for processing.
SELECT * FROM products WHERE id = 123
(The query above retrieves information for a single product with ID 123.)
Here are several common attributes of transactional workloads:
- Each query usually retrieves a single record or a small number of records (e.g. get the first 10 blog posts in a category).
- Queries are typically simple and take a very short time to run (less than 1 second).
- There are lots of concurrent queries at any point in time, limited by the number of concurrent visitors of the application. For big websites, this can reach the thousands or hundreds of thousands.
- Queries are usually interested in the whole data record (e.g. every column in the product table).
Analytical workloads, on the other hand, are workloads run for analytical purposes, the kind of workload that this book talks about. When a data report is run, a query is sent to the database to calculate the results, which are then displayed to end-users.
SELECT category_name, count(*) as num_products FROM products GROUP BY 1
(The query above scans the entire products table to count how many products there are in each category.)
Analytical workloads have the following attributes:
- Each query typically scans a large number of rows in the table.
- Each query is heavy and takes a long time (minutes, or even hours) to finish.
- There are not many concurrent queries, since they are limited by the number of reports or internal staff members using the analytics system.
- Queries are usually interested in just a few columns of data.
Below is a comparison table between transactional vs analytical workload/databases.
Transactional workloads have many simple queries, whereas analytical workloads have few heavy queries.
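To see the two query shapes side by side, the sketch below runs both of the SQL statements from this section against a tiny in-memory `products` table, using Python’s built-in `sqlite3` module. The three sample rows are invented for illustration, and an `ORDER BY` is added to the second query only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_name TEXT);
    INSERT INTO products VALUES
        (122, 'Kettle', 'Kitchen'), (123, 'Toaster', 'Kitchen'), (124, 'Lamp', 'Lighting');
""")

# Transactional: look up one whole record by its primary key.
product = conn.execute("SELECT * FROM products WHERE id = 123").fetchone()

# Analytical: scan the whole table, but only care about one column, and aggregate.
counts = conn.execute(
    "SELECT category_name, count(*) AS num_products FROM products GROUP BY 1 ORDER BY 1"
).fetchall()

print(product)
print(counts)
```

With three rows the difference is invisible, but the first query touches one row however large the table gets, while the second has to read every row.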
The Backend for Analytics Databases is Different
Because of the drastic difference between the two workloads above, the underlying backend design of the database for each workload is very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.
What are the differences in architecture, you ask? That will take a dedicated section to explain, but the gist is that analytical databases use the following techniques to achieve better performance on analytical queries:
- Columnar storage engine: Instead of storing data row by row on disk, analytical databases store the values of each column together.
- Compression of columnar data: Data within each column is compressed for smaller storage and faster retrieval.
- Parallelization of query execution: Modern analytical databases typically run on clusters of many machines, sometimes thousands. Each analytical query can thus be split into multiple smaller queries that are executed in parallel across those machines (a divide-and-conquer strategy).
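The three techniques above can be illustrated with a toy sketch in plain Python. This is not how any particular warehouse is implemented: run-length encoding is just one simple compression scheme chosen for the example, threads stand in for separate machines, and the helper names (`run_length_encode`, `parallel_sum`) and sample data are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "category": "Kitchen", "price": 20},
    {"id": 2, "category": "Kitchen", "price": 35},
    {"id": 3, "category": "Lighting", "price": 15},
    {"id": 4, "category": "Lighting", "price": 30},
]

# Column-oriented layout: all values of one column are stored together,
# so a query that needs only `price` never touches `id` or `category`.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def parallel_sum(values, num_chunks=2):
    """Split a column into chunks, sum each chunk in a worker, then combine."""
    size = max(1, -(-len(values) // num_chunks))  # ceiling division, at least 1
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


encoded_category = run_length_encode(columns["category"])
total_price = parallel_sum(columns["price"])

print(encoded_category)
print(total_price)
```

Similar values sitting next to each other in a column is what makes the compression step effective, and summing chunks independently before combining them is the divide-and-conquer idea in miniature.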
As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.
So, can I use a normal SQL database as my data warehouse?
As we said earlier: yes, you can, but it depends.
If you’re just starting out with a small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (the most popular ones are MySQL, PostgreSQL, MSSQL and Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuration.
That said, with the advent of low-cost data warehouses like BigQuery and Redshift, mentioned above, we would recommend that you go ahead with a data warehouse.
However, if you must choose a normal SQL-based database (for example, because your business only allows you to host it on-premises, within your own network), we recommend going with PostgreSQL, as it has the most features supported for analytics. We’ve also written a detailed blog post discussing this topic here: Why you should use PostgreSQL over MySQL for analytics purpose.
Summary
In this post, we zoomed in on the data warehouse and covered:
- What a data warehouse is: the central analytics database that stores and processes your data for analytics
- The 4 trigger points for when you should get a data warehouse
- A simple list of data warehouse technologies you can choose from
- How a data warehouse is optimized for analytical workloads, whereas a traditional database is optimized for transactional workloads
Originally published at The Analytics Setup Guidebook by Holistics: Understanding The Data Warehouse.
Translated from: https://towardsdatascience.com/what-is-a-data-warehouse-when-and-why-to-consider-one-2e826be68e95