bigquery数据类型_将BigQuery与TB数据一起使用后的成本和性能课程

bigquery数据类型

I’ve used BigQuery every day with small and big datasets querying tables, views, and materialized views. During this time I’ve learned some things, I would have liked to know since the beginning. The goal of this article is to give you some tips and recommendations for optimizing your costs and performance.

我每天都使用BigQuery处理大小数据集,用于查询表,视图和实例化视图。 在这段时间里,我学到了一些东西,从一开始我就想知道。 本文的目的是为您提供一些技巧和建议,以优化成本和性能。

基本概念 (Basic concepts)

BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility [Google Cloud doc].

BigQuery :专为业务敏捷性而设计的无服务器,高度可扩展且经济高效的多云数据仓库[ Google Cloud doc ]。

Image for post
BigQuery logo
BigQuery徽标

GCP Project: Google Cloud projects form the basis for creating, enabling, and using all Google Cloud services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for Google Cloud resources [Google Cloud doc]. Each project could have more than one BigQuery Dataset.

GCP项目 :Google Cloud项目构成了创建,启用和使用所有Google Cloud服务的基础,包括管理API,启用计费,添加和删除协作者以及管理Google Cloud资源的权限[ Google Cloud文档 ]。 每个项目可能有多个BigQuery数据集。

Image for post
GCP Project
GCP项目

Dataset: It’s similar to the Databases or Schema concepts in other RDBMS. There are located tables, views, and materialized views.

数据集 :它类似于其他RDBMS中的数据库或架构概念。 存在表,视图和实例化视图。

Image for post
BigQuery’s Datasets
BigQuery的数据集

Slot: It’s a virtual CPU used by BigQuery to execute SQL queries. BigQuery automatically calculates how many slots are required by each query, depending on query size and complexity [Google Cloud doc]. This is the principal indicator if you are in flat-rate pricing because you make a slot commitment (minimum 500 slots).

插槽:这是BigQuery用于执行SQL查询的虚拟CPU。 BigQuery会根据查询的大小和复杂程度自动计算每个查询需要多少个广告位[ Google Cloud doc ]。 如果您采用固定费用定价,因为这是您承诺的广告位承诺(最少500个广告位),则这是主要指标。

Slot/ms: It’s the total amount of slots per millisecond used by a query. That’s the total number of slots consumed by the query over its entire execution time [Taking a practical approach to BigQuery slot usage analysis]. Notice that a query uses a different quantity of slots during all execution, for example, the number of slots when start making a join is higher than making a filter and also is the duration time.

插槽/毫秒:这是查询使用的每毫秒的插槽总数。 这是查询在整个执行时间内所消耗的插槽总数[ 采用实用的方法进行BigQuery插槽使用率分析 ]。 请注意,查询在所有执行过程中使用不同数量的插槽,例如,开始建立连接时的插槽数量要比创建过滤器的数量高,并且持续时间也要多。

Numer of byte processed: It’s the total bytes read when you run a query. This is the principal cost if you are in on-demand pricing.

已处理字节数 :运行查询时读取的总字节数。 如果您按需定价,这是主要成本。

Image for post
Getaway of “Select * “
逃脱“选择*”

Partition: A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data [BigQuery partitioned table].

分区 :分区表是一种特殊的表,分为几部分,称为分区,这使管理和查询数据更加容易[ BigQuery分区表 ]。

Image for post
A partitioned and clustered table
分区和集群表

Clustering: A clustered table data is automatically organized based on the contents of one or more columns in the table’s schema [BigQuery clustered table].

集群 :集群表数据是根据表架构[ BigQuery集群表 ]中一个或多个列的内容自动组织的。

Resuming this two important concepts, I’ve made a simple image to understand visually how a table get organized.

总结这两个重要概念,我制作了一个简单的图像以直观地了解表的组织方式。

Image for post

费用建议 (Cost recommendations)

There is not a magic recipe, however, It’s important to understand the BigQuery’s cost structure and then what exactly I’m spending on it.

但是,这并不是一个神奇的秘诀,重要的是要了解BigQuery的成本结构,然后确切地讲我要花多少钱。

Cost Structure

成本结构

  1. Storage [pricing]

    储存 [ 定价 ]

  • Active: A monthly charge for data stored in tables or in partitions that have been modified in the last 90 days. $0.020 per GB

    活跃:按月收费,用于存储在过去90天内已修改的表或分区中的数据。 每GB $ 0.020

  • Long-term: A lower monthly charge for data stored in tables or in partitions that have not been modified in the last 90 days. $0.010 per GB

    长期:对于过去90天内未修改的表或分区中存储的数据,每月收取的费用较低。 每GB $ 0.010

👨🏻‍💻 Cost TipPartition your tables whenever is possible. Image this scenario you have a 10 Terabytes DataWarehouse and only use your last month’s data. If all your tables are not partitioned this could mean $200 monthly or for a partitioned slightly more than $100 (without counting the compute cost savings!).

成本提示尽可能对表进行分区。 在这种情况下,您有一个10 TB的DataWarehouse,并且仅使用上个月的数据。 如果您的所有表都没有分区,则意味着每月200美元,或者分区的价格略高于100美元(不计算节省的计算费用!)。

Since the BigQuery partition limit for each table is 4000 an interesting approach could be to set a ‘maximum history availability’ in this case could be 3 years and the older data migrate to a cheaper storage li GCS nearline ($0.004 per GB), so with this, your DataWarehouse will not grow exponentially.

由于每个表的BigQuery分区限制为4000,因此一种有趣的方法是在这种情况下将“最大历史可用性”设置为3年,并且较旧的数据迁移到价格便宜的GCS近线存储(每GB 0.004美元),因此这样,您的DataWarehouse不会成倍增长。

2. Compute

2.计算

BigQuery has two pricing models and one hybrid model:

BigQuery有两种定价模型和一种混合模型:

  • on-demand: pay for the number of bytes processed $5 per TB.

    按需:支付处理的字节数,每TB 5美元。
  • flat-rate: you purchase a monthly Slot commitment (minimum commitment is 500 slots, $10 000 monthly), this means you’ll have a maximum compute capacity (500 CPUs) for your GCP Project, the advantage here is you pay a fixed amount doesn’t matter the number of TB processed.

    固定费用:您按月购买插槽承诺(最低承诺为500个插槽,每月支付10,000美元),这意味着您将为GCP项目拥有最大的计算容量(500个CPU),这是您需要支付固定金额的处理的TB数量无关紧要。
  • flat-slot: is similar to flat-rate the main difference is that the commitment is as little as 60 seconds. The cost here is $20 per hour.

    flat-slot:类似于固定速率,主要区别在于承诺时间仅为60秒。 每小时费用为$ 20。

👨🏻‍💻 Cost Tip

💻 费用提示

To decide which strategy is the best for your team you need to follow your expenses and slot utilization. Let’s check a real case and identify which pricing model is the best and why.

要确定哪种策略最适合您的团队,您需要了解支出和插槽利用率。 让我们检查一个实际案例,并确定哪种定价模型是最佳的以及为什么。

Cloud Monitoring

云监控

Activate this important tool.

激活此重要工具。

Image for post
The first time It needs to create the workstation
第一次需要创建工作站

Then select the BigQuery dashboard

然后选择BigQuery仪表板

Image for post
Select BigQuery Dashboard
选择BigQuery仪表板

Now Let’s focus on the ‘Slot utilization’ chart. Here we observe a few peaks over 500 that could be caused by intense not optimized queries. This could tell me that an on-demand model or flat-rate could work, to define let’s check bytes processed.

现在,让我们关注“插槽利用率”图表。 在这里,我们观察到500多个峰值可能是由强烈的未经优化的查询引起的。 这可以告诉我按需模型或固定费率可以工作,以定义让我们检查已处理的字节。

Image for post

Second check the bytes processed and the daily and monthly cost. This query results in the last 30 days we processed 115 Terabytes with a total cost of $572 this means that we are far from the $10,000 fix-rate and we keep with the on-demand price model. Another insight is that on July 16 was the highest peak in terabytes processed so only on this day a good idea would have been to use flex-slot.

其次检查已处理的字节以及每日和每月成本。 该查询的结果是,在过去30天内,我们处理了115 TB数据, 总成本为572美元,这意味着我们与10,000美元的固定费用相距甚远,并且我们保持按需价格模型。 另一个见解是,7月16日是处理的TB的最高峰,因此只有在这一天,才可以使用flex-slot。

Image for post
Image for post

绩效建议 (Performance recommendations)

The best partition field (Not is always day!)

最佳分区字段(并非总是一天!)

Each time I need to figure out which is the best partition strategy, I rely on the Google Cloud recommendation. At least each partition should be 1GB. This brings me that the majority of my tables are now partitioned by month.

每当我需要确定哪种是最佳分区策略时,我都会依靠Google Cloud的建议。 至少每个分区应为1GB。 这给我带来了我的大多数表现在按月分区。

👨🏻‍💻 Performance Tip

💻 性能提示

Evaluate to activate the ‘require_partition_filter’ so each time a person needs to query a table It will be required to add a Where clause filtering with the partition field.

评估以激活“ require_partition_filter”,以便每次有人需要查询表时,都需要在分区字段中添加一个Where子句过滤。

ALTER TABLE mydataset.mypartitionedtable
SET OPTIONS (require_partition_filter=true)

The best clustering strategy

最佳集群策略

Clustering is another great tool for optimizing your cost and performance. I always think clustering like a ‘box inside a box’ so to access an internal box I need to interact with the upper box.

群集是用于优化成本和性能的另一种出色工具。 我一直认为群集就像“ 盒子里的盒子 ”一样,因此要访问内部盒子,我需要与上面的盒子进行交互。

👨🏻‍💻 Performance Tip

💻 性能提示

To understand this idea let’s compare the following three queries. The last one only retrieves the data within the partition, and giving the two levels of partitioning means that the query wouldn’t need to ‘open’ each box until finding the ‘city box’. All this will be reflected in a short execute time and fewer bytes consumed.

为了理解这个想法,让我们比较以下三个查询。 最后一个仅检索分区内的数据,并且提供两个分区级别意味着查询无需先打开每个框,直到找到“城市框”为止。 所有这些都将在较短的执行时间和更少的字节消耗中得到反映。

Avoid windows functions

避免Windows功能

Operations that need to see all the data in the resulting table at once have to operate on a single node. Un-partitioned window functions like RANK() OVER() or ROW_NUMBER() OVER() will operate on a single node. [Looker question]. This means that it doesn’t matter that you have the best price model with thousands of slots, still, all the data will go to a single node if you use a window function.

需要立即查看结果表中所有数据的操作必须在单个节点上进行。 未分区的窗口函数(例如RANK() OVER()ROW_NUMBER() OVER()将在单个节点上运行。 [ Looker问题 ]。 这意味着,具有数千个插槽的最佳价格模型并不重要,但是,如果您使用窗口功能,所有数据将转到单个节点。

👨🏻‍💻 “Resources exceeded during query execution” Tip

💻 “查询执行期间超出了资源”提示

There are a few strategies you could use here. First for example if you need to use OVER()you need to PARTITION the window function by date and build a string as the primary key [Looker question].

您可以在此处使用一些策略。 首先,例如,如果需要使用OVER() ,则需要按日期对window函数进行分区,并构建一个字符串作为主键[ Looker问题 ]。

CONCAT(CAST(ROW_NUMBER() OVER(PARTITION BY event_date) AS STRING), ‘|’,(CAST(event_date AS STRING)) as id

If your query contains an ORDER BY clause It may cause the “Resources exceeded” message, for the same reason, all the data is passed to one node. In this situation, avoid ORDER BY if your result is just for creating a new table.

如果您的查询包含ORDER BY子句,则可能由于相同的原因而导致“超出资源”消息,所有数据都传递到一个节点。 在这种情况下,如果结果只是用于创建新表,请避免使用ORDER BY

Exists some strategies to increase the resources during the execution time. Please refer to the Google Cloud documentation for more details

存在一些在执行期间增加资源的策略。 有关更多详细信息,请参阅Google Cloud文档 。

结论 (Conclusions)

BigQuery is a fantastic Data Warehouse with a challenging price model. To keep your cost-controlled without losing the performance, I recommend you to keep reading more articles and useful information I’m sure exists many excellent tips out there 👨🏻‍💻 👩🏻‍💻.

BigQuery是一个出色的数据仓库,具有挑战性的价格模型。 为了在不损失性能的情况下保持成本控制,我建议您继续文章和有用的信息,我相信这里确实存在许多出色的技巧👨🏻‍💻 👩🏻‍💻。

PS This month Google Cloud introduced BigQuery Omni,

PS本月Google Cloud推出了BigQuery Omni ,

It’s a flexible, multi-cloud analytics solution that lets you cost-effectively access and securely analyze data across Google Cloud, Amazon Web Services (AWS), and Azure.

它是一种灵活的多云分析解决方案,可让您经济高效地访问和安全地跨Google Cloud,Amazon Web Services(AWS)和Azure进行数据分析。

I hope to be writing about my experience in the following weeks! Also, keep an eye on Materialized Views, now is in Beta that is why I didn’t add too much information here, however, I recommend you to give it a try.

我希望在接下来的几周内写下我的经历! 另外,请注意Materialized Views (现在处于Beta版),这就是为什么我在此处未添加太多信息的原因,但是,我建议您尝试一下。

PS 2 if you have any questions, or would like something clarified, ping me on Twitter or LinkedIn I like having a data conversation 😊 If you want to know about Apache Arrow and Apache Spark I had an article A gentle introduction to Apache Arrow with Apache Spark and Pandas with some examples.

PS 2如果您有任何疑问或想澄清一些问题,请在Twitter或LinkedIn上ping我,我想进行数据对话😊如果您想了解Apache Arrow和Apache Spark,我有一篇文章 通过Apache Spark和Pandas轻松介绍Apache Arrow 有一些例子。

有用的链接 (Useful links)

Thanks to all these persons behind each link.

感谢每个链接后面的所有这些人。

  • https://cloud.google.com/blog/products/data-analytics/monitoring-resource-usage-in-a-cloud-data-warehouse

    https://cloud.google.com/blog/products/data-analytics/monitoring-resource-usage-in-a-cloud-data-warehouse

  • https://cloud.google.com/bigquery/docs/reservations-workload-management#choosing_between_on-demand_and_flat-rate_billing_models

    https://cloud.google.com/bigquery/docs/reservations-workload-management#choosing_between_on-demand_and_flat-rate_billing_models

  • https://github.com/GoogleCloudPlatform/professional-services/tree/master/tools/bq-visualizer

    https://github.com/GoogleCloudPlatform/professional-services/tree/master/tools/bq-visualizer

  • https://cloud.google.com/bigquery/pricing#on_demand_pricing

    https://cloud.google.com/bigquery/pricing#on_demand_pricing

  • https://cloud.google.com/bigquery/docs/clustered-tables

    https://cloud.google.com/bigquery/docs/clustered-tables

  • https://cloud.google.com/bigquery/docs/partitioned-tables

    https://cloud.google.com/bigquery/docs/partitioned-tables

  • https://cloud.google.com/bigquery/pricing#active_storage

    https://cloud.google.com/bigquery/pricing#active_storage

  • https://cloud.google.com/bigquery/pricing#flat_rate_pricing

    https://cloud.google.com/bigquery/pricing#flat_rate_pricing

  • https://cloud.google.com/bigquery/docs/estimate-costs

    https://cloud.google.com/bigquery/docs/estimate-costs

  • https://cloud.google.com/storage/pricing

    https://cloud.google.com/storage/pricing

  • https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language

    https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language

  • https://medium.com/enkel-digital/googles-big-query-resources-exceeded-during-query-execution-%EF%B8%8F-13c20b6693e2

    https://medium.com/enkel-digital/googles-big-query-resources-exceeded-during-query-execution-%EF%B8%8F-13c20b6693e2

  • https://discourse.looker.com/t/resources-exceeded-during-query-execution-when-building-derived-table-in-bigquery/4414

    https://discourse.looker.com/t/resources-exceeded-during-query-execution-when-building-derived-table-in-bigquery/4414

  • https://cloud.google.com/bigquery/docs/writing-results#large-results

    https://cloud.google.com/bigquery/docs/writing-results#large-results

翻译自: https://medium.com/dataseries/costs-and-performance-lessons-after-using-bigquery-with-terabytes-of-data-54a5809ac912

bigquery数据类型

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/388360.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

中国计算机学科建设,计算机学科建设战略研讨会暨“十四五”规划务虚会召开...

4月15日下午,信息学院计算机系举办了计算机科学与技术学科建设战略研讨会暨“十四五”规划务虚会。本次会议的主旨是借第五轮学科评估的契机,总结计算机学科发展的优劣势,在强调保持优势的同时,更着眼于短板和不足,在未…

服务器被攻击怎么修改,服务器一直被攻击怎么办?

原标题:服务器一直被攻击怎么办?有很多人问说,网站一直被攻击,什么被挂马,什么被黑,每天一早打开网站,总是会出现各种各样的问题,这着实让站长们揪心。从修改服务器管理账号开始&…

脚本 api_从脚本到预测API

脚本 apiThis is the continuation of my previous article:这是我上一篇文章的延续: From Jupyter Notebook To Scripts从Jupyter Notebook到脚本 Last time we discussed how to convert Jupyter Notebook to scripts, together with all sorts of basic engine…

Iphone代码创建视图

要想以编程的方式创建视图,需要使用视图控制器中定义的viewDidLoad方法,只有在运行期间生成UI时才需要实现该方法。 在此只贴出viewDidLoad方法的代码,因为只需要在这个方法里面编写代码: [cpp] view plaincopyprint?- (void)vi…

binary masks_Python中的Masks概念

binary masksAll men are sculptors, constantly chipping away the unwanted parts of their lives, trying to create their idea of a masterpiece … Eddie Murphy所有的人都是雕塑家,不断地消除生活中不必要的部分,试图建立自己的杰作理念……埃迪墨…

Iphone在ScrollView下点击TextField使文本筐不被键盘遮住

1.拖一个Scroll View视图填充View窗口&#xff0c;将Scroll View视图拖大一些&#xff0c;使其超出屏幕。 2.向Scroll View拖&#xff08;添加&#xff09;多个Label视图和Text View视图。 3.在.h头文件中添加如下代码&#xff1a; [cpp] view plaincopyprint?#import <U…

python 仪表盘_如何使用Python刮除仪表板

python 仪表盘Dashboard scraping is a useful skill to have when the only way to interact with the data you need is through a dashboard. We’re going to learn how to scrape data from a dashboard using the Selenium and Beautiful Soup packages in Python. The S…

VS2015 定时服务及控制端

一. 服务端 如下图—新建项目—经典桌面—Windows服务—起名svrr2. 打到server1 改名为svrExecSqlInsert 右击对应的设计界面&#xff0c;添加安装服务目录结构如图 3. svrExecSqlInsert里有打到OnStart()方法开始写代码如下 /// <summary>/// 服务开启操作/// </su…

Iphone表视图的简单操作

1.创建一个Navigation—based—Application项目&#xff0c;这样Interface Builder中会自动生成一个Table View&#xff0c;然后将Search Bar拖放到表示图上&#xff0c;以我们要给表示图添加搜索功能&#xff0c;不要忘记将Search Bar的delegate连接到File‘s Owner项&#xf…

aws emr 大数据分析_DataOps —使用AWS Lambda和Amazon EMR的全自动,低成本数据管道

aws emr 大数据分析Progression is continuous. Taking a flashback journey through my 25 years career in information technology, I have experienced several phases of progression and adaptation.进步是连续的。 在我25年的信息技术职业生涯中经历了一次闪回之旅&…

先进的NumPy数据科学

We will be covering some of the advanced concepts of NumPy specifically functions and methods required to work on a realtime dataset. Concepts covered here are more than enough to start your journey with data.我们将介绍NumPy的一些高级概念&#xff0c;特别是…

lsof命令详解

基础命令学习目录首页 原文链接&#xff1a;https://www.cnblogs.com/ggjucheng/archive/2012/01/08/2316599.html 简介 lsof(list open files)是一个列出当前系统打开文件的工具。在linux环境下&#xff0c;任何事物都以文件的形式存在&#xff0c;通过文件不仅仅可以访问常规…

统计和冰淇淋

Photo by Irene Kredenets on UnsplashIrene Kredenets在Unsplash上拍摄的照片 摘要 (Summary) In this article, you will learn a little bit about probability calculations in R Studio. As it is a Statistical language, R comes with many tests already built in it, …

信息流服务器哪种好,选购存储服务器需要注意六大关键因素,你知道几个?

原标题&#xff1a;选购存储服务器需要注意六大关键因素&#xff0c;你知道几个&#xff1f;信息技术的飞速发展带动了整个信息产业的发展。越来越多的电子商务平台和虚拟化环境出现在企业的日常应用中。存储服务器作为企业建设环境的核心设备&#xff0c;在整个信息流中承担着…

t3 深入Tornado

3.1 Application settings 前面的学习中&#xff0c;在创建tornado.web.Application的对象时&#xff0c;传入了第一个参数——路由映射列表。实际上Application类的构造函数还接收很多关于tornado web应用的配置参数。 参数&#xff1a; debug&#xff0c;设置tornado是否工作…

对数据仓库进行数据建模_确定是否可以对您的数据进行建模

对数据仓库进行数据建模Some data sets are just not meant to have the geospatial representation that can be clustered. There is great variance in your features, and theoretically great features as well. But, it doesn’t mean is statistically separable.某些数…

15 并发编程-(IO模型)

一、IO模型介绍 1、阻塞与非阻塞指的是程序的两种运行状态 阻塞&#xff1a;遇到IO就发生阻塞&#xff0c;程序一旦遇到阻塞操作就会停在原地&#xff0c;并且立刻释放CPU资源 非阻塞&#xff08;就绪态或运行态&#xff09;&#xff1a;没有遇到IO操作&#xff0c;或者通过某种…

不提拔你,就是因为你只想把工作做好

2019独角兽企业重金招聘Python工程师标准>>> 我有个朋友&#xff0c;他30出头&#xff0c;在500强公司做技术经理。他戴无边眼镜&#xff0c;穿一身土黄色的夹克&#xff0c;下面是一条常年不洗的牛仔裤加休闲皮鞋&#xff0c;典型技术高手范。 三 年前&#xff0c;…

python内置函数多少个_每个数据科学家都应该知道的10个Python内置函数

python内置函数多少个Python is the number one choice of programming language for many data scientists and analysts. One of the reasons of this choice is that python is relatively easier to learn and use. More importantly, there is a wide variety of third pa…

C#使用TCP/IP与ModBus进行通讯

C#使用TCP/IP与ModBus进行通讯1. ModBus的 Client/Server模型 2. 数据包格式及MBAP header (MODBUS Application Protocol header) 3. 大小端转换 4. 事务标识和缓冲清理 5. 示例代码 0. MODBUS MESSAGING ON TCP/IP IMPLEMENTATION GUIDE 下载地址&#xff1a;http://www.modb…