The Startup Data Stack Starter Pack (2020)
I advise a lot of people on how to build out their data stack, from tiny startups to enterprise companies that are moving to the cloud or away from legacy solutions. There are many choices out there, and navigating them all can be tricky. Here's a breakdown of your options, trade-offs and pricing, some thinking points around which you can make your decision, and some personal thoughts on the options.
My background
I'm CTO and co-founder at Dataform, and I was previously an engineer at Google, where I spent most of my 6 years building big data pipelines with internal tools similar to what is now Apache Beam. Dataform is a data modelling platform for cloud data warehouses, and while it's only one small part of the overall data stack, it is often the glue that ties many things together. As a result, we spend a lot of time talking about overall data architecture with customers and prospective clients.
Product recommendation methodology
It’s impossible for me to give a completely fair trial to every product in this space. In general, I’ve chosen to highlight products that:
- Have generally high adoption and awareness amongst startups
- We generally hear our customers speak highly of
- Fit into the ELT model of a data stack
- Are innovative new products that may not tick the above boxes, but that I personally believe are worth a mention
Where I have significant experience with a product, I'll let you know and provide more detail on why. Similarly, in one or two cases I've shared my reasons for not recommending them.
Overview
There is a prevailing model of a data stack that we consistently see the world moving toward, which is probably best summed up by this diagram. This is an ELT architecture (extract, load, transform), as opposed to a more traditional ETL architecture, and it can support companies of all sizes (perhaps with the exception of extremely large enterprises).
Event data collection
How do you collect event data from across all of your different applications, web, app and backend services, and send it to other systems or your data warehouse?
Conceptually straightforward, so not much to say here! Event-based analytics is usually the easiest place to start, and most off-the-shelf solutions are built around this.
Tracking everything that you want to use for analytics as events avoids needing to join in other data sources at analysis time, and lends itself well to product analytics, where the ordering of events is important to consider.
Data integration
How do you move data between databases and services? There is some overlap with collection here. Typically you need to move data between various places such as:
- SaaS services > Data warehouse
- Production DBs > Data warehouse
- Event collection > Data warehouse / SaaS tools / CRMs
- Data warehouse > SaaS tools / CRMs
For the rest of the article we’ll consider these as two different data integration problems:
- Data integration to the warehouse
- Data integration to other SaaS products
Data warehouses
Where you move all your data to so you can query it together.
A lot about data warehousing has changed over the last 10 years; data warehouses now scale to unprecedented levels. Before Snowflake and BigQuery, organizations with truly massive data would have avoided warehouses due to their limited scale, instead opting for solutions such as Apache Spark, Dataflow, or Hadoop MapReduce-like systems.
Warehouses and SQL have many benefits, and the scalability limits are (mostly) gone. Additionally, with the rise of engineering-inspired data modelling tools (such as Dataform), it's possible to manipulate data via SQL in a well-tested, reproducible way.
We've written about this change, if you'd like more information on why we think the shift towards SQL-based warehousing is the right one and how it can help you move quickly, especially as a startup!
Data modelling
How do you actually transform data from many different sources into a set of clean, well-tested datasets?
ELT introduces a new problem: you end up with a data warehouse full of messy datasets from your newly set up data integration tools and no idea how to use them. This is where data modelling comes in, and if you are building a stack with a data warehouse at the center, it needs to be addressed.
Data visualization & analytics
Once you sort out all of the above, how do you actually use that data to answer business questions or do advanced analytics?
For any company, and particularly for startups, understanding how your users use your products, how much time they spend in your app, and what your signup and activation rates look like is obviously important.
I'll mostly cover the business intelligence and product analytics side of this, and avoid more advanced ML and data science applications, as these usually come afterward.
Do you need a data warehouse?
Once you hit a certain level of complexity, or need complete control over how you join and mutate data before sending it to other systems, you probably want to move to a model where the data warehouse becomes your source of truth for business data. This gives you the most power: once all your data is there, you can do pretty much anything with it.
For very early-stage startups, I recommend that you avoid this initially. Off-the-shelf product analytics tools will provide you with the insights you need without the extra work during the early stages.
In this model, you typically require just two components:
- Data collection (events)
- Visualization / analytics
A concrete example of this would be something like:
Segment > Mixpanel / Amplitude / Heap / Google Analytics
Using a product like Segment gives you the flexibility to move that data to your warehouse in the future, if not immediately.
As this is probably something you'll need to do at some point, I'd recommend using something that can do this from day one. Moving from an all-in-one tool such as Google Analytics to something custom built around a warehouse can be difficult or expensive (Google will charge you a lot of money to move your raw GA data into BigQuery).
Data collection
Data collection products allow you to track events from various apps, and capture user activity, pageviews, clicks, sessions etc. This is not about collecting data from e.g. your production DB (see data integration).
Segment
The market leader, with some good out-of-the-box data integration solutions.
Segment is great, and we use it ourselves. Segment is more than just data collection. Their somewhat open-source analytics.js provides unified APIs for tracking events in pretty much any system.
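To make the tracking model concrete, here is a minimal sketch using Segment's Python library, which exposes the same unified identify/track API that analytics.js does in the browser. The write key, user id, event names and properties are all placeholders.

```python
# A minimal server-side tracking sketch with Segment's Python library
# (`pip install analytics-python`). Keys, ids and event names are placeholders.
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"

# Attach traits to a user once...
analytics.identify("user_123", {"email": "jane@example.com", "plan": "trial"})

# ...then track the events you care about, with structured properties.
analytics.track("user_123", "Signed Up", {"source": "landing_page"})
analytics.track("user_123", "Report Created", {"report_type": "funnel"})

# Events are queued and sent in the background; flush before the process exits.
analytics.flush()
```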
Events from Segment can be sent to your warehouse, but they can also be sent straight to other systems, for example Google/Twitter/Facebook/Quora Ads and most major CRMs.
Segment can get really expensive really quickly, particularly for B2C companies with high numbers of monthly tracked users. However, as their core APIs are open-source, it's possible to migrate to your own infrastructure.
RudderStack is definitely worth mentioning here: an open-source, host-it-yourself alternative to Segment that uses the same open-source APIs. When your Segment costs start to exceed an engineering salary, that is probably the time to consider such alternatives.
Snowplow
Event data collection done right, and it’s open-source.
Snowplow excels at event data collection, period. It lacks the out-of-the-box data integration solutions of Segment, but arguably has a much richer feature set when it comes to actual event tracking, such as validation of event schemas.
It’s open-source, so you can run and manage it yourself if you want to.
Snowplow won't help you push data to your CRM or other SaaS products, but there are other options for that, which we discuss below.
All-in-one solutions
Segment and Snowplow are primarily designed for data collection. There are other tools that will help you collect data too, but they are part of what I’ll call “all in one” data analytics packages. I won’t cover these here, but have mentioned them below as part of Data visualization and analytics.
Data integration to the warehouse
Data integration products allow you to move data from one source to another. This section covers products that help you move data to your warehouse.
Some of these products also act as data transformation tools (traditional ETL). We don’t recommend this approach, preferring a more software engineering (SQL / code based) approach that will scale as you grow out your data team (see data modelling). As a result, our recommendations are for data integration products that are designed with an ELT model in mind.
These products move data from other data sources, such as CRMs, Stripe and the most popular databases (Mongo, MySQL, Postgres, etc.), into your warehouse, so you have everything in one place and can join it all together and perform advanced analytics queries on the results.
Stitch
The best option for early startups, teams who want to write their own integrations, or those who value open source.
Recently acquired by Talend, though I've seen little impact from that so far, for better or worse. We use them ourselves, and they serve the majority of our integration needs very well.
Simple self-service onboarding and reasonable usage-based pricing. The core is open source: you can write your own Singer taps and run them yourself.
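For a sense of what writing your own Singer tap involves, here is a minimal sketch: a tap is just a program that prints SCHEMA, RECORD and STATE messages as JSON lines to stdout, which Stitch (or an open-source Singer target) then loads into the warehouse. The `orders` stream and its fields are hypothetical; in practice you would likely use the singer-python helper library rather than hand-rolling the messages.

```python
# A minimal, hand-rolled Singer tap sketch using only the standard library.
# The "orders" stream and its fields are hypothetical.
import json
import sys
from datetime import datetime, timezone

def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")
    sys.stdout.flush()

# 1. Describe the stream's schema and primary key.
emit({
    "type": "SCHEMA",
    "stream": "orders",
    "schema": {
        "properties": {
            "id": {"type": "integer"},
            "amount": {"type": "number"},
            "created_at": {"type": "string", "format": "date-time"},
        }
    },
    "key_properties": ["id"],
})

# 2. Emit one RECORD message per row pulled from the source API or database.
for order in [{"id": 1, "amount": 42.0, "created_at": "2020-01-01T00:00:00Z"}]:
    emit({"type": "RECORD", "stream": "orders", "record": order})

# 3. Emit STATE so the next run can resume incrementally.
emit({"type": "STATE", "value": {"orders_last_synced": datetime.now(timezone.utc).isoformat()}})
```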
Fivetran
The best option for those who are willing to pay a little more, or who have certain data sources.
Historically Fivetran had a more enterprise sales model; it recently changed to variable/volume-based pricing and is moving toward self-service, and will build adapters for you if you pay them. My understanding is that Fivetran still nets out as a bit more expensive than Stitch, but they certainly build extremely high quality integrations.
Fivetran makes considerable efforts to normalize data coming from source systems into a more friendly format, whereas Stitch integrations are arguably a bit less intelligent.
Data integration to SaaS products
There is another aspect of data integration, and that is how you get data to your SaaS services rather than from them. For example, showing recent activity or orders in your CRM to help your sales or support teams.
As mentioned above, Segment (and RudderStack) have support for this. Segment also recently added support for transforming data before sending it, but this is only available to some customer tiers and is somewhat limited in what it can do.
Typically I've seen the majority of our customers (including ourselves) building custom solutions here, using products like Zapier, AWS Lambda, Google Cloud Functions or Pub/Sub-like setups where data can be transformed and sent elsewhere, either from sources directly or via the warehouse.
We've written about how we do this ourselves in this blog post on sending data from BigQuery to Intercom, and this general approach could be applied to most destinations, or built in other tools like Zapier too.
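The general shape of these custom solutions is simple: run a query against the warehouse, then push each row to the SaaS tool's HTTP API, usually from a scheduled Cloud Function or Lambda. Below is a hedged sketch of that pattern (not the exact approach from the blog post); the table, query, endpoint URL and payload shape are placeholders you would swap for the destination's real API.

```python
# A rough sketch of a scheduled job (e.g. a Cloud Function) that pushes
# warehouse data to a SaaS tool's HTTP API. The table name, query and
# endpoint/payload below are placeholders, not a real integration.
import os
import requests
from google.cloud import bigquery

def sync_active_users(request=None):
    client = bigquery.Client()
    rows = client.query("""
        SELECT user_id, email, weekly_active_days
        FROM `my_project.analytics.user_activity_summary`   -- hypothetical table
        WHERE weekly_active_days > 0
    """).result()

    headers = {"Authorization": f"Bearer {os.environ['SAAS_API_TOKEN']}"}
    for row in rows:
        # The payload shape depends entirely on the destination's API.
        resp = requests.post(
            "https://api.example-crm.com/v1/contacts",  # placeholder endpoint
            json={
                "email": row["email"],
                "custom_attributes": {"weekly_active_days": row["weekly_active_days"]},
            },
            headers=headers,
            timeout=10,
        )
        resp.raise_for_status()
```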
Census
The only dedicated warehouse-to-SaaS integration tool (that I know of) on the market.
While we haven't used it ourselves, Census has a dedicated solution for this which fits in well with the rest of the ELT architecture proposed in this post. It's definitely worth checking out if you are thinking of transitioning from what e.g. Segment provides out of the box to something more custom.
Data warehouses
There's a lot to discuss about the options here, but right now there are just two market leaders that I would recommend. I've worked directly with all of these options myself, as well as speaking to many customers who use them, so I've provided slightly stronger opinions on which ones I think are best.
When it comes to pricing for warehouses, it’s worth noting that what seems to matter in practice here is the cost of compute, not storage. Storage is generally cheap, and not the first thing to start hurting.
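As a rough back-of-the-envelope illustration, using BigQuery's published on-demand list prices at the time of writing (roughly $0.02 per GB-month of active storage and $5 per TB scanned; these rates may have changed, so check current pricing):

```python
# Storage vs compute for 1 TB of data on on-demand pricing
# (approximate 2020 BigQuery list prices, for illustration only).
storage_gb = 1_000
storage_cost = storage_gb * 0.02                 # ~$20/month to keep 1 TB stored
full_scans_per_month = 100                       # e.g. dashboards re-querying raw tables
compute_cost = full_scans_per_month * 1 * 5.0    # ~$500/month in query costs
print(f"storage ~${storage_cost:.0f}/month vs compute ~${compute_cost:.0f}/month")
```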
BigQuery
The best option for early stage startups, or for enterprises that are willing to adopt Google Cloud, are OK with a more self-service experience, and don't have custom requirements around security or e.g. running on-premise.
Pay-as-you-go pricing: for startups it will likely be a while before you incur any cost on BigQuery, thanks to a free tier that allows you to process up to 1TB/month at no cost.
In my opinion, having written a lot of SQL against all of the warehouses listed here, BigQuery has the best SQL experience. BigQuery's standard SQL is elegant and powerful, and they are rolling out improvements continuously.
Unprecedented, on-demand scale. BigQuery scales extremely well, and with products such as BI Engine it can provide blazing fast query performance.
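Because on-demand costs are driven by bytes scanned, it's worth getting into the habit of dry-running queries to see what they would process before scheduling them. A small sketch with the official Python client (the table name is a placeholder):

```python
# Estimate how much data a query would scan (and roughly what it would cost
# on-demand) without actually running it. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT user_id, event, timestamp FROM `my_project.analytics.events`",
    job_config=job_config,
)
tb_scanned = job.total_bytes_processed / 1e12
print(f"~{tb_scanned:.3f} TB scanned, ~${tb_scanned * 5:.2f} at on-demand rates")
```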
Snowflake
The best choice for enterprises with custom requirements such as SSO or on-premise deployment, or who are tied in to AWS/Azure. Also a good option for startups, thanks to a partially pay-as-you-go, on-demand pricing model.
Snowflake separates storage and compute, like BigQuery, meaning it has the capacity for unbounded enterprise scale, unlike e.g. Redshift.
The pricing model is hybrid pay-as-you-go: you pay for resources per minute, but clusters can be automatically turned down when inactive.
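That automatic turn-down is configured per virtual warehouse. A minimal sketch using the Snowflake Python connector, with placeholder credentials and warehouse name:

```python
# Create a virtual warehouse that suspends itself after 60s of inactivity
# and resumes automatically when queried. Credentials and names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
)
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS analytics_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60      -- seconds of inactivity before compute stops billing
      AUTO_RESUME = TRUE
""")
conn.close()
```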
Supports structured JSON data (like BigQuery), where e.g. Redshift does not. Generally a nice SQL experience, with a couple of quirks to get used to.
Redshift
The choice if you are heavily invested in AWS and don’t want to add something new to your stack.
I’ll be frank here, I would not personally recommend Redshift unless you have no other choice, but it’s so popular it needs to be included. Redshift was one of the first “modern” warehouses in GA, but it’s built on a foundation that is in my opinion fundamentally limiting. While Amazon is working to correct some of these issues, some fixes still appear to be a long way off.
- Limited support for working with unstructured data, and a limited SQL dialect based on Postgres.
- Requires much more management than Snowflake or BigQuery, which are more hands-off operationally.
- Has scale limits, although some new features coming out address this.
Other mentions
Azure / Synapse SQL data warehouse — similar issues to Redshift; great if you already know how to work with SQL Server and its variants, but it can get very expensive due to limited on-demand pricing options.
Presto / Athena — powerful, distributed queries, but not a general-purpose warehouse, and as a result it can be hard to operationalize. It's not easy to create new datasets with Athena.
Data modelling
It’s impossible for me to write an impartial comparison of the options here as this is the product vertical Dataform is in, so I’ll refrain from doing so. However, as this is a fairly new part of the stack that arises from the shift to ELT, I’ll explain a bit more about what it is and why we think it’s important.
Data modelling is what your data team probably spends 50% of their time doing — turning your raw data into reliable, tested, accurate and up-to-date assets that can power your company's analytics.
When data lands in your warehouse it's usually a bit of a mess — you will have hundreds of source tables with different schema structures, different data formats and types, different primary keys, and so on. Writing a query to join this all together is tricky, especially if you have to do it every time you want to answer any question.
When data modelling is done right, you should end up with a clear, well-defined set of tables that can be used for analytics and visualization, encapsulating all of your business logic in a clean and well-tested schema that can be consumed by visualization tools or sent to other applications.
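In practice, each "model" is usually just a SQL SELECT that turns raw loaded tables into one clean, documented table, materialized on a schedule by a tool like Dataform or run directly against the warehouse. Here is a hedged sketch against BigQuery; every table and column name is hypothetical:

```python
# One data modelling step: deduplicate raw, integration-loaded orders and join
# them to users to produce a clean, analysis-ready table. All table and column
# names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE TABLE `my_project.analytics.orders_clean` AS
    SELECT
      o.id AS order_id,
      u.email AS customer_email,
      CAST(o.amount AS NUMERIC) AS amount,
      TIMESTAMP(o.created_at) AS ordered_at
    FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) AS rn
      FROM `my_project.raw_shop.orders`
    ) AS o
    JOIN `my_project.raw_crm.users` AS u ON u.id = o.user_id
    WHERE o.rn = 1  -- keep only the latest copy of each order loaded by the integration tool
""").result()
```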
Data visualization and analytics
There are a lot of options here too, and your mileage may vary. I’ve tried to summarize the position they capture in the space where I can, and I am generally less able to provide strong opinions here as there are so many.
For our startup customers building a stack around a data warehouse, we probably mostly see: Looker, Metabase, Google Data Studio, Chartio, Tableau.
Despite so many options, it's fairly easy to play around and experiment with them, particularly if you have a good semantic data modelling layer maintained in your warehouse, or use tools like Segment for the out-of-the-box solutions.
Most of these products fit into a few different categories:
Chart builders — select some dimensions, choose a chart type, customize the visualizations, put them in some dashboards (where they will eventually break, go out of date, and never be updated).
Self service BI solutions — tell these tools about how your data is structured and how to interpret it, and they’ll try to make it easy for anyone to quickly answer questions.
Out of the box product analytics — generally these tools run on event data, have an opinionated schema and aren’t easily customized or generic, but they do what they do well and are generally self-service.
Looker
The (expensive) market leader. Best in class self-service, fully customizable BI.
Looker users seem to have nothing but praise for the platform. It stands out because of a few things:
- It's designed to help you deliver a self-service portal to your entire company, enabling anyone to answer business questions.
- While you can build charts in Looker, that's not what you're supposed to do. You teach Looker how to understand your data, and it makes answering questions a breeze.
- It adopts engineering best practices such as version control. Your data team can collaborate using git based workflows, allowing you to scale to even hundreds of analysts.
Looker requires an investment of both time and money, but what you get out at the end is something few other solutions provide.
Metabase
The open-source self-service BI leader.
Adopts the Looker concept of modelling data to make answering questions easy, though it doesn't adopt a git-based workflow. It's open source, so you'll need to run it yourself on your own infrastructure. It seems to be a favourite amongst engineers.
If you want a Looker-like experience but the price tag is too high, then this is probably your next best bet.
Like Looker, it makes it easy for non-SQL users to answer questions without relying on the data team.
Data Studio
A powerful and free chart builder.
Primarily a chart builder, but you can make it work for self-service dashboards up to a point. While it's worth a mention, it's unlikely to serve as your primary BI portal as your team grows, but it's a great place to start if you are already in the Google stack.
Tableau
The chart builder incumbent, extremely powerful, loved by many.
Tableau has limited data modelling capabilities, but it is extremely powerful and able to build a huge range of visualizations, dashboards, and so on. For better or for worse, you can do pretty much anything with it.
If you don't have existing Tableau experience then this probably isn't the place to start; consider Metabase, Looker or Chartio instead.
Chartio
Reasonably priced, self service BI and chart building.
Chartio makes it easy to build SQL queries without actually writing SQL. This makes it great for putting data in the hands of your whole team, and while it has some data modelling capabilities, it's not quite in the same camp as Looker.
You can build complex SQL pipelines (multiple joins etc.) through a UI, which can be great for those who aren't as comfortable writing SQL themselves.
Redash
Open source chart builder, recently acquired by Databricks.
One of the first tools we used ourselves. Redash is open source and works with modern warehouses, but it is primarily a chart builder. While it may serve you well at first, you'll probably spend a lot of time fixing queries unless you heavily invest in a data modelling tool too.
Heap / Mixpanel / Amplitude / Indicative
Out-of-the-box, event-based customer and product analytics.
I've put all these together, as they really all do a similar thing. They all have pros and cons, but ultimately they exist to solve the same problem. Send them events, and you'll be able to quickly answer questions about your users' behaviour and what actions they took in what order, and with some of these tools you can do experimentation, optimization or even personalization.
Think Google Analytics but on steroids, and with a focus on user behaviour and engagement.
If you want to start off warehouse-less, these are probably your best options to begin with, but consider tagging with Segment from day one.
Indicative is a little different to the others, in that it's primarily designed to be pointed at your data warehouse. If you want to transform event data in your warehouse and then analyse it in a UI, this is where Indicative shines, and we hear great things about it, but it's probably not the place to start if you want an all-in-one, out-of-the-box solution.
Google Analytics
The out-of-the-box, all-in-one web and app analytics incumbent, basically free.
Despite having more advanced, custom solutions in place now, we still send data to GA because there are some questions that are just a lot easier to answer with it. Acquisition for example — data is enriched with geo locations, UTM tags are automatically grouped and categorized.
The fundamental challenge with GA is that it doesn’t give you access to raw data. As we’ve said above, consider tagging with Segment instead to make sure you retain the raw event data.
The Microsoft / Azure stack
Worth a mention, Microsoft has their own entire set of solutions for most steps we’ve described above.
We see it less with startups, and it also doesn’t really follow the ELT model so I haven’t included it in the above. If you’re just starting out and want to get set up quickly, this probably isn’t the place to start.
You’ll probably only want to go down this route if you’re already heavily invested in Azure and Microsoft products as an organization.
As part of their offering, they have the following tools:
Azure Data Factory — highly customizable, ETL-like workflows for data integration
Azure / Synapse data warehouse — a more scalable SQL Server-based warehouse with some support for on-demand pricing
Power BI — a powerful data analytics product with some support for data modelling (but no Git-based workflows) and some support for warehouses such as BigQuery and Snowflake.
Conclusion
Hopefully this overview of the product options and the core parts of the stack helps you and your team make a decision when setting up a data stack, one that will be able to scale as your data team grows and the complexity of the data problems you face increases.
I believe that regardless of the product options you choose above, following the ELT architecture outlined in this post should ensure that you are able to cope with new requirements as your stack evolves, without hitting any major roadblocks.
Originally published at https://dataform.co.
Source: https://medium.com/dataform/the-startup-data-stack-starter-pack-2020-47fcb34aeb09