What Is a Data Warehouse, and When and Why Should You Consider One?

The term “data warehouse” is widely used in the data analytics world. However, people who are new to data analytics often ask the question above.

This post explains what a data warehouse is, and when and why to consider setting one up.

PS: This is a section of a guidebook our team is writing, The Analytics Setup Guidebook. If you are interested in learning more about the high-level concepts and best practices of modern BI stacks, feel free to check out the link to see our progress.

What is a data warehouse?

A data warehouse is a type of analytics database that stores and processes your data for the purpose of analytics. Your data warehouse handles two main functions of your analytics: storing your analytical data and processing it.

Why do you need one? There are two main reasons:

  1. First, you can’t easily combine data from multiple business functions if they sit in different sources.
  2. Second, your source systems are not designed to run heavy analytics, and doing so might jeopardize your business operations because it increases the load on those systems.

Your data warehouse is the centerpiece of every step of your analytics pipeline, and it serves three main purposes:

  • Storage: In the consolidate (Extract & Load) step, your data warehouse will receive and store data coming from multiple sources.

  • Process: In the process (Transform & Model) step, your data warehouse will handle most (if not all) of the intensive processing generated from the transform step.

  • Access: In the reporting (Visualize & Delivery) step, report data is first gathered within the data warehouse, then visualized and delivered to end users.
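
As a rough illustration of the three roles above, the sketch below uses Python's built-in sqlite3 module as a stand-in for a data warehouse. The table names (raw_orders, raw_customers, orders_by_country) and the sample rows are hypothetical, and a real warehouse would normally be loaded by an extract-and-load tool rather than by hand.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (an assumption for illustration).
warehouse = sqlite3.connect(":memory:")

# Storage: land raw data from two hypothetical sources in the warehouse.
warehouse.execute("CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
warehouse.execute("CREATE TABLE raw_customers (customer_id INTEGER, country TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
)
warehouse.executemany(
    "INSERT INTO raw_customers VALUES (?, ?)",
    [(10, "SG"), (11, "VN")],
)

# Process: transform the raw tables into a modeled table inside the warehouse.
warehouse.execute("""
    CREATE TABLE orders_by_country AS
    SELECT c.country, COUNT(*) AS num_orders, SUM(o.amount) AS revenue
    FROM raw_orders o
    JOIN raw_customers c ON c.customer_id = o.customer_id
    GROUP BY c.country
""")

# Access: a report reads the modeled table, not the source systems.
report = warehouse.execute(
    "SELECT country, num_orders, revenue FROM orders_by_country ORDER BY country"
).fetchall()
print(report)
```

The point of the sketch is only the shape of the flow: raw data lands first, heavy SQL runs inside the warehouse, and reports query the result.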

At the moment, most data warehouses use SQL as their primary querying language.

When is the right time to get a data warehouse?

The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.

At an early stage, you can probably get by without a data warehouse and connect a business intelligence (BI) tool directly to your production database (see “A simple BI setup for people just starting out”).

However, if you are still not sure whether a data warehouse is the right thing for your company, consider the pointers below:

First, do you need to analyze data from different sources?

At some point in your company’s life, you will need to combine data from different internal tools in order to make better, more informed business decisions.

For instance, if you’re a restaurant and want to analyze the ratio of orders to wait staff (which hours of the week the staff are busiest versus least busy), you need to combine your sales data (from the POS system) with your staff duty data (from the HR system).

Those analyses are a lot easier to do if your data sits in one central location.
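
To make the restaurant example concrete, here is a minimal sketch that assumes both data sets have already been loaded into one database (SQLite stands in for the warehouse). The tables pos_orders and hr_shifts, the hour_of_week column, and all rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical copies of the POS and HR data, loaded into one central place.
db.execute("CREATE TABLE pos_orders (order_id INTEGER, hour_of_week INTEGER)")
db.execute("CREATE TABLE hr_shifts (staff_id INTEGER, hour_of_week INTEGER)")
db.executemany(
    "INSERT INTO pos_orders VALUES (?, ?)",
    [(1, 12), (2, 12), (3, 12), (4, 12), (5, 15), (6, 15)],
)
db.executemany(
    "INSERT INTO hr_shifts VALUES (?, ?)",
    [(100, 12), (101, 12), (100, 15), (101, 15), (102, 15)],
)

# Once both tables live side by side, the ratio is a single join.
rows = db.execute("""
    WITH orders AS (
        SELECT hour_of_week, COUNT(*) AS num_orders
        FROM pos_orders GROUP BY hour_of_week
    ),
    staff AS (
        SELECT hour_of_week, COUNT(*) AS num_staff
        FROM hr_shifts GROUP BY hour_of_week
    )
    SELECT o.hour_of_week, o.num_orders, s.num_staff,
           1.0 * o.num_orders / s.num_staff AS orders_per_staff
    FROM orders o
    JOIN staff s ON s.hour_of_week = o.hour_of_week
    ORDER BY orders_per_staff DESC
""").fetchall()

for hour, num_orders, num_staff, ratio in rows:
    print(hour, num_orders, num_staff, round(ratio, 2))
```

If the two tables stayed in separate systems, the same question would require exporting and matching the data by hand.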

Second, do you need to separate your analytical data from your transactional data?

As mentioned, your transactional systems are not designed for analytical purposes. So if you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly.

It’s a much better idea to get a data warehouse, one that’s designed for complex querying, and transfer the analytical data there. That way, the performance of your app isn’t affected by your analytics work.

Third, is your original data source not suitable for querying?

For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse so that data analysts can work with it effectively.
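
The sketch below shows one way such a transfer might look: nested documents are flattened into rows and loaded into a SQL table. The documents are plain Python dictionaries shaped like what a document store could return (no MongoDB driver is used here), and the field names and the user_events table are hypothetical.

```python
import sqlite3

# Hypothetical nested documents, as an application might store them in a document database.
documents = [
    {"_id": "a1", "user": {"name": "Ann", "plan": "pro"},
     "events": [{"type": "login"}, {"type": "purchase"}]},
    {"_id": "b2", "user": {"name": "Bob", "plan": "free"},
     "events": [{"type": "login"}]},
]


def flatten(doc):
    """Yield one flat row per event: (doc_id, user_name, plan, event_type)."""
    for event in doc["events"]:
        yield (doc["_id"], doc["user"]["name"], doc["user"]["plan"], event["type"])


# SQLite stands in for the warehouse (an assumption for illustration).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE user_events (doc_id TEXT, user_name TEXT, plan TEXT, event_type TEXT)"
)
warehouse.executemany(
    "INSERT INTO user_events VALUES (?, ?, ?, ?)",
    [row for doc in documents for row in flatten(doc)],
)

# Analysts can now use ordinary SQL, which most BI tools speak.
events_per_plan = warehouse.execute(
    "SELECT plan, COUNT(*) FROM user_events GROUP BY plan ORDER BY plan"
).fetchall()
print(events_per_plan)
```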

Fourth, do you want to increase the performance of your analytical queries?

If your transactional data consists of hundreds of thousands of rows, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Not doing so can make queries incredibly slow, not to mention an unnecessary burden on your database.
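
A summary table can be as simple as the result of a GROUP BY stored under its own name. The sketch below assumes a hypothetical orders table and builds a daily_sales summary from it; SQLite is used only so the example runs anywhere.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical transactional table: one row per order.
db.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2020-01-01", 10.0), (2, "2020-01-01", 20.0), (3, "2020-01-02", 5.0)],
)

# Summary table: one row per day, computed once and reused by every report.
db.execute("""
    CREATE TABLE daily_sales AS
    SELECT order_date, COUNT(*) AS num_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Reports read the small summary table instead of rescanning every order.
daily = db.execute(
    "SELECT order_date, num_orders, revenue FROM daily_sales ORDER BY order_date"
).fetchall()
print(daily)
```

In practice the summary table has to be refreshed as new orders arrive, which is one of the jobs usually scheduled inside the warehouse.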

If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.

That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.

Which Data Warehouse Should I Pick?

Here are some common data warehouses that you may pick from:

  • Amazon Redshift
  • Google BigQuery
  • Snowflake
  • ClickHouse (self-hosted)
  • Presto (self-hosted)

If you’re just getting started and don’t have a strong preference, we suggest that you go with Google BigQuery for the following reasons:

  • BigQuery is free for the first 10GB of storage and the first 1TB of queries each month. After that, it’s pay-per-usage.

  • BigQuery is fully managed (serverless): There is no physical (or virtual) server to spin up or manage.

  • As a result of its architecture, BigQuery auto-scales: BigQuery will automatically determine the right amount of computing resources to allocate to each query, depending on the query’s complexity and the amount of data you scan, without you having to manually fine-tune it.

(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).

However, if you have a rapidly increasing volume of data, or if you have complex/special use cases, you will need to carefully evaluate your options.

Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.

But it is, we think, a good start:

[Image: table comparing the most popular data warehouses]

What makes a data warehouse different from a normal SQL database?

At this point some of you might be asking:

“Hey, isn’t a data warehouse just like a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”

The short answer is: yes you can.

The long answer is: it depends. First, we need to understand a few concepts.

Transactional Workloads vs Analytical Workloads

It is important to understand the difference between two kinds of database workloads: transactional workloads and analytical workloads.

A transactional workload is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the database to fetch this product, and return the result to the application for processing.

SELECT * FROM products WHERE id = 123

(The query above retrieves information for a single product with ID 123.)

Here are several common attributes of transactional workloads:

  • Each query usually retrieves a single record or a small number of records (e.g. get the first 10 blog posts in a category)

  • Transactional workloads typically involve simple queries that take a very short time to run (less than 1 second)

  • Lots of concurrent queries at any point in time, limited by the number of concurrent visitors of the application. For big websites, this can go to the thousands or hundreds of thousands.

  • Usually interested in the whole data record (e.g. every column in the product table).

Analytical workloads, on the other hand, are workloads that serve analytical purposes, which is the kind of workload this book talks about. When a data report is run, a query is sent to the database to calculate the results, which are then displayed to end users.

SELECT category_name, count(*) as num_products FROM products GROUP BY 1

(The query above scans the entire products table to count how many products there are in each category.)

Analytical workloads have the following attributes:

  • Each query typically scans a large number of rows in the table.

  • Each query is heavy and takes a long time (minutes, or even hours) to finish

  • Not a lot of concurrent queries happen, limited by the number of reports or internal staff members using the analytics system.

  • Usually interested in just a few columns of data.

Below is a comparison table of transactional versus analytical workloads and databases.

[Image: comparison table of transactional vs analytical workloads and databases]

In short, transactional workloads consist of many simple queries, whereas analytical workloads consist of a few heavy ones.
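
The two query shapes can be compared side by side in a small sketch. It runs the article's two queries against a hypothetical three-row products table in SQLite and asks SQLite for its query plan. The exact plan wording varies between SQLite versions, so the sketch only assumes that the plan description is the last field of each plan row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_name TEXT)")
db.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(123, "Espresso cup", "kitchen"), (124, "Teapot", "kitchen"), (125, "Desk lamp", "office")],
)

LOOKUP_SQL = "SELECT * FROM products WHERE id = ?"
REPORT_SQL = (
    "SELECT category_name, COUNT(*) AS num_products "
    "FROM products GROUP BY category_name ORDER BY category_name"
)


def plan(sql, params=()):
    """Return SQLite's query-plan description for a statement as one string."""
    plan_rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in plan_rows)


# Transactional shape: fetch one whole record by its key.
product = db.execute(LOOKUP_SQL, (123,)).fetchone()
lookup_plan = plan(LOOKUP_SQL, (123,))

# Analytical shape: read every row, but only one column matters.
counts = db.execute(REPORT_SQL).fetchall()
report_plan = plan(REPORT_SQL)

print(product, "<-", lookup_plan)
print(counts, "<-", report_plan)
```

On this table the first plan is a keyed search and the second is a full table scan, which is the difference the comparison above describes.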

The Backend for Analytics Databases is Different

Because of the drastic difference between the two workloads above, the underlying backend design of the database for each workload is very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.

What are the differences in architecture, you ask? This will take a dedicated section to explain, but the gist of it is that analytical databases use the following techniques to achieve better performance on analytical queries:

  • Columnar storage engine: Instead of storing data row by row on disk, analytical databases group columns of data together and store them.

  • Compression of columnar data: Data within each column is compressed for smaller storage and faster retrieval.

  • Parallelization of query executions: Modern analytical databases typically run on clusters of many machines, in some cases thousands. Each analytical query can thus be split into multiple smaller queries that are executed in parallel across those machines (a divide-and-conquer strategy).
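
The three techniques above can be sketched in plain Python. This is a toy model under stated assumptions, not how any particular warehouse is implemented: the sample rows are invented, run-length encoding stands in for the (more elaborate) compression schemes real systems use, and two threads stand in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

# Row-oriented layout: each record is stored together (hypothetical sample data).
rows = [
    (1, "kitchen", 9.5),
    (2, "kitchen", 12.0),
    (3, "office", 30.0),
    (4, "office", 8.0),
    (5, "office", 15.0),
]

# Columnar layout: values of the same column are stored together,
# so a query that needs only "category" never touches "id" or "price".
columns = {
    "id": [r[0] for r in rows],
    "category": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}


def run_length_encode(values):
    """Compress consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Compression: similar values sit next to each other in a column, so they compress well.
encoded_category = run_length_encode(columns["category"])


def count_chunk(chunk):
    """Partial aggregation over one slice of the column."""
    return Counter(chunk)


# Parallelization: split the column, count each slice separately, then merge.
chunks = [columns["category"][:3], columns["category"][3:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))
total_counts = sum(partial_counts, Counter())

print(encoded_category)
print(dict(total_counts))
```

The merge step works because counting is an aggregation whose partial results can be added together, which is what makes this kind of query easy to spread across machines.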

As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.

So, can I use a normal SQL database as my data warehouse?

As we said earlier, yes you can, but it depends.

If you’re just starting out with a small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (the most popular ones are MySQL, PostgreSQL, MSSQL, and Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuration.

That said, with the advent of low-cost data warehouses like BigQuery and Redshift, mentioned above, we would recommend going ahead with a dedicated data warehouse.

However, if you must choose a normal SQL-based database (for example, because your business only allows you to host it on-premises, within your own network), we recommend going with PostgreSQL, as it has the most features supported for analytics. We’ve also written a detailed blog post discussing this topic: “Why you should use PostgreSQL over MySQL for analytics purpose”.

Summary

In this post, we zoomed in on the data warehouse and covered:

  • A data warehouse is the central analytics database that stores and processes your data for analytics
  • The 4 trigger points for when you should get a data warehouse
  • A simple list of data warehouse technologies you can choose from
  • How a data warehouse is optimized for analytical workloads, whereas a traditional database is optimized for transactional workloads

Originally published at The Analytics Setup Guidebook by Holistics: Understanding The Data Warehouse.

Translated from: https://towardsdatascience.com/what-is-a-data-warehouse-when-and-why-to-consider-one-2e826be68e95
