Chaining Multiple SQL Queries in BigQuery

BigQuery is a fantastic tool! It lets you do really powerful analytics work using SQL-like syntax.

But it lacks a way to chain SQL queries: we cannot run one query right after another completes. In many real-life applications the execution of one query depends on the output of another, and we need to run multiple queries in sequence to get the result.

Here is one scenario. Suppose you are doing RFM analysis using BigQuery ML. First you have to calculate the RFM values for all users, then apply k-means clustering to the result of the first query, and then merge the output of the first and second queries to generate the final data table.

In the above scenario, each query depends on the output of the previous one, and the output of each query also needs to be stored in a table for other uses.

In this guide I will show how to execute as many SQL queries as you want in BigQuery, one after another, creating a chaining effect to get the desired results.

Methods

I will demonstrate two approaches to chaining queries:

  1. The first uses Cloud Pub/Sub and Cloud Functions. This is the more sophisticated method, as it ensures that the current query has finished executing before the next one starts. It also requires a bit of programming experience, so it may be better to reach out to someone with a technical background in your company.

  2. The second uses BigQuery’s own scheduler. However, the query scheduler cannot ensure that one query has completed before the next is triggered, so we will have to hack around it using query execution time. More on this later.

And if you want to get your hands dirty yourself, then here is an excellent course to start with.

Note: We will continue with the RFM example discussed above to give you an idea of the process, but the same approach can be applied to any scenario where triggering multiple SQL queries is needed.

Method 1

Method 1 uses a combination of Cloud Functions and Pub/Sub to chain the entire flow. The process is started by the query scheduler, which, after executing the first query, sends a message to a Pub/Sub topic. That message triggers a cloud function responsible for running the second query; once it completes, the function sends a message to another Pub/Sub topic to start yet another cloud function. The process continues until the last query has been executed by a cloud function.

Let’s understand the process with our RFM analysis use case.

Suppose we have three queries that need to be run one after another to perform RFM analysis. The first calculates the RFM values; we will call it RFM Values. The second creates the model; we will call it RFM Model. The third merges the model output with the users' RFM values; we will call it RFM Final.
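
To make the three steps concrete, here is a rough sketch of what the queries could look like, collected in one JavaScript array in the order they must run. The article does not show the actual SQL, so everything in the strings is an assumption for illustration: the dataset `my_dataset`, the source table `orders` with its columns `user_id`, `order_date` and `amount`, the output table names, and the choice of four clusters. The statements use BigQuery ML's `CREATE MODEL` with `model_type = 'kmeans'` and `ML.PREDICT`.

```javascript
// Hypothetical SQL for the three RFM steps, in execution order.
// Dataset, table, and column names are placeholders, not the author's.
const RFM_STEPS = [
  {
    name: 'RFM Values',
    sql: `
      CREATE OR REPLACE TABLE my_dataset.rfm_values AS
      SELECT
        user_id,
        DATE_DIFF(CURRENT_DATE(), MAX(DATE(order_date)), DAY) AS recency,
        COUNT(*) AS frequency,
        SUM(amount) AS monetary
      FROM my_dataset.orders
      GROUP BY user_id`,
  },
  {
    name: 'RFM Model',
    // Depends on the table written by the first step.
    sql: `
      CREATE OR REPLACE MODEL my_dataset.rfm_model
      OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
      SELECT recency, frequency, monetary
      FROM my_dataset.rfm_values`,
  },
  {
    name: 'RFM Final',
    // Depends on both the model and the RFM values table.
    sql: `
      CREATE OR REPLACE TABLE my_dataset.rfm_final AS
      SELECT * EXCEPT (nearest_centroids_distance)
      FROM ML.PREDICT(MODEL my_dataset.rfm_model,
                      TABLE my_dataset.rfm_values)`,
  },
];
```

Each step reads what the previous one wrote, which is why the order cannot be changed and why a step must not start before its predecessor has finished.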

Here is what the data pipeline looks like:

Figure: Chaining queries in a BigQuery data pipeline, by Muffaddal

Note: I will assume that the tables for all three queries have already been created.

1- We start by creating a Pub/Sub topic, as it will be needed when creating the RFM Values query scheduler. I have named it RFM_Model_Topic, as it will trigger the cloud function responsible for executing our model query (i.e. RFM Model).

Figure: RFM_Model_Topic Pub/Sub topic, by Muffaddal

Copy the topic name; it is needed when creating the RFM Values scheduler.

2- Next, go to BigQuery, paste the RFM Values query, which calculates the RFM values for our users, into the query editor, and click the ‘Schedule query’ button to create a new query scheduler.

Figure: Create a scheduled query, by Muffaddal

3- Enter the required values in the scheduler creation menu to create the scheduler.

Figure: Query scheduler creation menu, by Muffaddal

This scheduler will execute at the specified time to calculate users' recency, frequency, and monetary values and store them in the specified BigQuery table. Once the scheduler has finished executing the query, it will send a message to our RFM_Model_Topic, which will trigger a cloud function that runs our model query. So next, let’s create that cloud function.
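
Before starting the next query, the receiving function may want to confirm that the scheduled query actually succeeded. The helper below is a minimal sketch of that check. It assumes that the message arrives the way Pub/Sub-triggered background functions receive it (a base64-encoded string in `message.data`) and that the decoded payload is a JSON description of the scheduled run with a `state` field whose success value is `SUCCEEDED`. Both the payload shape and the helper name `transferRunSucceeded` are assumptions to verify against your own messages.

```javascript
// Decode a Pub/Sub message as delivered to a background cloud function
// and report whether the scheduled query run behind it succeeded.
// Assumption: the payload is JSON with a `state` field.
function transferRunSucceeded(message) {
  const json = Buffer.from(message.data, 'base64').toString('utf8');
  const run = JSON.parse(json);
  return run.state === 'SUCCEEDED';
}
```

A function could return early when this helper yields `false`, so that a failed first query does not start the rest of the chain.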

4- Go to the RFM_Model_Topic Pub/Sub topic and click the ‘Trigger Cloud Function’ button at the top of the screen.

Figure: Create a cloud function from a Pub/Sub topic, by Muffaddal

5- Enter the settings as shown below and name the cloud function RFM_Model_Function.

Figure: Cloud function settings, by Muffaddal

6- Paste the code below into the index.js file.

Cloud function to trigger the RFM Model query, by Muffaddal

Once the query has executed, the cloud function publishes a message to a new Pub/Sub topic named RFM_Final, which triggers the cloud function responsible for the last query, the one that combines the RFM values and the model results into one data set.

7- Therefore, next, create the RFM_Final topic in Pub/Sub and a cloud function for it, as we did in the previous steps. Copy-paste the code below into the cloud function so that it can run the last query.

Cloud function to trigger the RFM Final query, by Muffaddal

And that is it!

We can use as many Pub/Sub topics and cloud functions as we need to chain as many SQL queries as we want.
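
One way to keep a longer chain manageable is to describe it as data and derive, for each step, the topic it has to publish to. The sketch below does only that bookkeeping. The helper `wireChain` and the field names `query`, `triggerTopic` and `nextTopic` are hypothetical; the topic names are the ones used in this walkthrough.

```javascript
// For each step, work out which topic it must publish to when it finishes:
// the topic that triggers the following step, or null for the last step.
function wireChain(steps) {
  return steps.map((step, i) => ({
    ...step,
    nextTopic: i + 1 < steps.length ? steps[i + 1].triggerTopic : null,
  }));
}

const chain = wireChain([
  { query: 'RFM Values', triggerTopic: null }, // started by the query scheduler
  { query: 'RFM Model', triggerTopic: 'RFM_Model_Topic' },
  { query: 'RFM Final', triggerTopic: 'RFM_Final' },
]);
// chain[0].nextTopic is 'RFM_Model_Topic', chain[1].nextTopic is 'RFM_Final',
// and chain[2].nextTopic is null because nothing follows the last query.
```

Adding a fourth query then means adding one entry to the list, one more topic, and one more cloud function.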

Method 2

The first approach is robust but requires a bit of programming background. If that is not your strong suit, you can use Method 2 to chain the BigQuery queries.

BigQuery’s query scheduler can be used to run the queries one after another.

The idea is that we start the process the same way as in Method 1, i.e. trigger the first query using a scheduler, and estimate how long it takes to complete. Let’s say the first query takes 5 minutes. We then trigger the second query 10 minutes after the first query's start time. This way the second query starts only after the first has finished executing, as long as the estimate holds.

Let’s understand this with an example.

Figure: Chaining queries using the query scheduler, by Muffaddal

Suppose we schedule the first query at 12:30 am and it takes 10 minutes to complete, so we expect it to be done by 12:40 am. We set the second query's scheduler to execute at 12:50 am (keeping a 10-minute gap between the two, just in case). We then trigger the third query at 1:10 am, and so on.
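
The arithmetic behind these start times can be written down as a small helper. This is only an illustration of the calculation; the function name `staggeredStarts` is made up, and the run times passed in are estimates you have to supply yourself. The example assumes the second query also takes about 10 minutes, which is what the 1:10 am start of the third query implies.

```javascript
// Given the first start time (minutes after midnight), each query's
// estimated run time, and a safety buffer, return every query's
// start time as "HH:MM" on a 24-hour clock.
function staggeredStarts(firstStartMin, durationsMin, bufferMin) {
  const starts = [];
  let t = firstStartMin;
  for (const duration of durationsMin) {
    starts.push(t);
    t += duration + bufferMin; // next query waits for run time plus buffer
  }
  return starts.map((m) => {
    const hh = String(Math.floor(m / 60) % 24).padStart(2, '0');
    const mm = String(m % 60).padStart(2, '0');
    return `${hh}:${mm}`;
  });
}

// First query at 12:30 am, 10-minute run times, 10-minute buffer.
console.log(staggeredStarts(30, [10, 10, 10], 10));
// prints [ '00:30', '00:50', '01:10' ]
```

If a query ever runs longer than its estimate plus the buffer, the next one starts on incomplete data, which is the weakness of this method compared with Method 1.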

Note: Since the query scheduler doesn’t work with BigQuery ML, Method 2 won’t work for our RFM analysis case, but it should give you an idea of how to use the scheduler to chain queries.

Summary

Executing queries one after another is especially useful when the result of one query depends on the output of another and all the query results also need to be kept as tables. BigQuery doesn’t support this out of the box, but using GCP’s components we can streamline the process to achieve the result.

In this article, we went through two methods to do this: the first using Cloud Pub/Sub and Cloud Functions, and the second using BigQuery’s own query scheduler.

With this article, I hope I was able to convey the idea of the process so that you can pick it up and tailor it to your particular business case.

Similar Reads You Would Like:

  1. Automate the RFM analysis using BigQuery ML.

  2. Store Standard Google Analytics Hit Level Data in BigQuery.

  3. Automate Data Import to Google Analytics using GCP.

Translated from: https://towardsdatascience.com/chaining-multiple-sql-queries-in-bigquery-8498d8885da5
