bigquery_如何在BigQuery中进行文本相似性搜索和文档聚类

bigquery

BigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.

BigQuery可以加载TensorFlow SavedModel并执行预测。 此功能是在数据仓库之上添加基于文本的相似性和群集的一种好方法。

Follow along by copy-pasting queries from my notebook in GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

然后在GitHub中从我的笔记本复制粘贴查询。 您可以在BigQuery控制台或AI Platform Jupyter笔记本中尝试查询。

风暴报告数据 (Storm reports data)

As an example, I’ll use a dataset consisting of wind reports phoned into National Weather Service offices by “storm spotters”. This is a public dataset in BigQuery and it can be queried as follows:

举例来说,我将使用由“风暴发现者”致电国家气象局办公室的风报告组成的数据集。 这是BigQuery中的公共数据集,可以按以下方式查询:

SELECT 
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
ST_GeogPoint(longitude, latitude) AS location,
comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
LIMIT 10

The result looks like this:

结果看起来像这样:

Image for post

Let’s say that we want to build a SQL query to search for comments that look like “power line down on a home”.

假设我们要构建一个SQL查询来搜索看起来像“家中的电源线”的注释。

Steps:

脚步:

  • Load a machine learning model that creates an embedding (essentially a compact numerical representation) of some text.

    加载一个机器学习模型,该模型创建一些文本的嵌入(本质上是紧凑的数字表示形式)。
  • Use the model to generate the embedding of our search term.

    使用该模型生成搜索词的嵌入。
  • Use the model to generate the embedding of every comment in the wind reports table.

    使用该模型可将每个评论嵌入风报告表中。
  • Look for rows where the two embeddings are close to each other.

    查找两个嵌入彼此靠近的行。

将文本嵌入模型加载到BigQuery中 (Loading a text embedding model into BigQuery)

TensorFlow Hub has a number of text embedding models. For best results, you should use a model that has been trained on data that is similar to your dataset and which has a sufficient number of dimensions so as to capture the nuances of your text.

TensorFlow Hub具有许多文本嵌入模型。 为了获得最佳结果,您应该使用经过训练的模型,该数据类似于您的数据集,并且具有足够的维数,以捕获文本的细微差别。

For this demonstration, I’ll use the Swivel embedding which was trained on Google News and has 20 dimensions (i.e., it is pretty coarse). This is sufficient for what we need to do.

在此演示中,我将使用在Google新闻上接受训练的Swivel嵌入,它具有20个维度(即,非常粗略)。 这足以满足我们的需求。

The Swivel embedding layer is already available in TensorFlow SavedModel format, so we simply need to download it, extract it from the tarred, gzipped file, and upload it to Google Cloud Storage:

Swivel嵌入层已经可以使用TensorFlow SavedModel格式,因此我们只需要下载它,从压缩后的压缩文件中提取出来,然后将其上传到Google Cloud Storage:

FILE=swivel.tar.gz
wget --quiet -O tmp/swivel.tar.gz https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed
cd tmp
tar xvfz swivel.tar.gz
cd ..
mv tmp swivel
gsutil -m cp -R swivel gs://${BUCKET}/swivel

Once the model files on GCS, we can load it into BigQuery as an ML model:

将模型文件保存到GCS后,我们可以将其作为ML模型加载到BigQuery中:

CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://BUCKET/swivel/*')

尝试在BigQuery中嵌入模型 (Try out embedding model in BigQuery)

To try out the model in BigQuery, we need to know its input and output schema. These would be the names of the Keras layers when it was exported. We can get them by going to the BigQuery console and viewing the “Schema” tab of the model:

要在BigQuery中试用模型,我们需要了解其输入和输出架构。 这些将是导出时Keras图层的名称。 我们可以通过转到BigQuery控制台并查看模型的“架构”标签来获得它们:

Image for post

Let’s try this model out by getting the embedding for a famous August speech, calling the input text as sentences and knowing that we will get an output column named output_0:

让我们通过获得著名的August演讲的嵌入,将输入文本称为句子并知道我们将得到一个名为output_0的输出列来试用该模型:

SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

The result has 20 numbers as expected, the first few of which are shown below:

结果有20个预期的数字,其中前几个显示如下:

Image for post

文件相似度搜寻 (Document similarity search)

Define a function to compute the Euclidean squared distance between a pair of embeddings:

定义一个函数来计算一对嵌入之间的欧几里德平方距离:

CREATE TEMPORARY FUNCTION td(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>, idx INT64) AS (
(a[OFFSET(idx)] - b[OFFSET(idx)]) * (a[OFFSET(idx)] - b[OFFSET(idx)])
);CREATE TEMPORARY FUNCTION term_distance(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
SELECT SQRT(SUM( td(a, b, idx))) FROM UNNEST(GENERATE_ARRAY(0, 19)) idx
));

Then, compute the embedding for our search term:

然后,为我们的搜索词计算嵌入:

WITH search_term AS (
SELECT output_0 AS term_embedding FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(SELECT "power line down on a home" AS sentences))
)

and compute the distance between each comment’s embedding and the term_embedding of the search term (above):

并计算每个评论的嵌入与搜索词的term_embedding之间的距离(如上):

SELECT
term_distance(term_embedding, output_0) AS termdist,
comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT comments, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
)), search_term
ORDER By termdist ASC
LIMIT 10

The result is:

结果是:

Image for post

Remember that we searched for “power line down on home”. Note that the top two results are “power line down on house” — the text embedding has been helpful in recognizing that home and house are similar in this context. The next set of top matches are all about power lines, the most unique pair of words in our search term.

请记住,我们搜索的是“家中的电源线”。 请注意,最上面的两个结果是“房屋上的电源线断开”-文本嵌入有助于识别房屋和房屋在这种情况下是相似的。 下一组热门匹配项都是关于电源线的,这是我们搜索词中最独特的词对。

文件丛集 (Document Clustering)

Document clustering involves using the embeddings as an input to a clustering algorithm such as K-Means. We can do this in BigQuery itself, and to make things a bit more interesting, we’ll use the location and day-of-year as additional inputs to the clustering algorithm.

文档聚类涉及将嵌入用作聚类算法(例如K-Means)的输入。 我们可以在BigQuery本身中做到这一点,并使事情变得更加有趣,我们将位置和年份作为聚类算法的其他输入。

CREATE OR REPLACE MODEL advdata.storm_reports_clustering
OPTIONS(model_type='kmeans', NUM_CLUSTERS=10) ASSELECT
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))

The embedding (output_0) is an array, but BigQuery ML currently wants named inputs. The work around is to convert the array to a struct:

嵌入(output_0)是一个数组,但是BigQuery ML当前需要命名输入。 解决方法是将数组转换为结构:

CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64,
p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64,
p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>AS (
STRUCT(
arr[OFFSET(0)]
, arr[OFFSET(1)]
, arr[OFFSET(2)]
, arr[OFFSET(3)]
, arr[OFFSET(4)]
, arr[OFFSET(5)]
, arr[OFFSET(6)]
, arr[OFFSET(7)]
, arr[OFFSET(8)]
, arr[OFFSET(9)]
, arr[OFFSET(10)]
, arr[OFFSET(11)]
, arr[OFFSET(12)]
, arr[OFFSET(13)]
, arr[OFFSET(14)]
, arr[OFFSET(15)]
, arr[OFFSET(16)]
, arr[OFFSET(17)]
, arr[OFFSET(18)]
, arr[OFFSET(19)]
));

The resulting ten clusters can visualized in the BigQuery console:

可以在BigQuery控制台中看到生成的十个集群:

Image for post

What do the comments in cluster #1 look like? The query is:

第1组中的注释是什么样的? 查询是:

SELECT sentences 
FROM ML.PREDICT(MODEL `ai-analytics-solutions.advdata.storm_reports_clustering`,
(
SELECT
sentences,
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))))
WHERE centroid_id = 1

The result shows that these are mostly short, uninformative comments:

结果表明,这些大多是简短的,无用的评论:

Image for post

How about cluster #3? Most of these reports seem to have something to do with verification by radar!!!

第3组如何? 这些报告大多数似乎与雷达验证有关!!!

Image for post

Enjoy!

请享用!

链接 (Links)

TensorFlow Hub has several text embedding models. You don’t have to use Swivel, although Swivel is a good general-purpose choice.

TensorFlow Hub具有多个文本嵌入模型。 尽管Swivel是一个不错的通用选择,但您不必使用Swivel 。

Full queries are in my notebook on GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

完整查询在我的GitHub笔记本上 。 您可以在BigQuery控制台或AI Platform Jupyter笔记本中尝试查询。

翻译自: https://towardsdatascience.com/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65

bigquery

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/390616.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

leetcode 1600. 皇位继承顺序(dfs)

题目 一个王国里住着国王、他的孩子们、他的孙子们等等。每一个时间点&#xff0c;这个家庭里有人出生也有人死亡。 这个王国有一个明确规定的皇位继承顺序&#xff0c;第一继承人总是国王自己。我们定义递归函数 Successor(x, curOrder) &#xff0c;给定一个人 x 和当前的继…

vlookup match_INDEX-MATCH — VLOOKUP功能的升级

vlookup match电子表格/索引匹配 (SPREADSHEETS / INDEX-MATCH) In a previous article, we discussed about how and when to use VLOOKUP functions and what are the issues that we might face while using them. This article, on the other hand, will take you to a jou…

PAT——1018. 锤子剪刀布

大家应该都会玩“锤子剪刀布”的游戏&#xff1a;两人同时给出手势&#xff0c;胜负规则如图所示&#xff1a; 现给出两人的交锋记录&#xff0c;请统计双方的胜、平、负次数&#xff0c;并且给出双方分别出什么手势的胜算最大。 输入格式&#xff1a; 输入第1行给出正整数N&am…

leetcode 1239. 串联字符串的最大长度

题目 二进制手表顶部有 4 个 LED 代表 小时&#xff08;0-11&#xff09;&#xff0c;底部的 6 个 LED 代表 分钟&#xff08;0-59&#xff09;。每个 LED 代表一个 0 或 1&#xff0c;最低位在右侧。 例如&#xff0c;下面的二进制手表读取 “3:25” 。 &#xff08;图源&am…

flask redis_在Flask应用程序中将Redis队列用于异步任务

flask redisBy: Content by Edward Krueger and Josh Farmer, and Douglas Franklin.作者&#xff1a; 爱德华克鲁格 ( Edward Krueger) 和 乔什法默 ( Josh Farmer )以及 道格拉斯富兰克林 ( Douglas Franklin)的内容 。 When building an application that performs time-co…

CentOS7下分布式文件系统FastDFS的安装 配置 (单节点)

背景 FastDFS是一个开源的轻量级分布式文件系统&#xff0c;为互联网量身定制&#xff0c;充分考虑了冗余备份、负载均衡、线性扩容等机制&#xff0c;并注重高可用、高性能等指标&#xff0c;解决了大容量存储和负载均衡的问题&#xff0c;特别适合以文件为载体的在线服务&…

剑指 Offer 38. 字符串的排列

题目 输入一个字符串&#xff0c;打印出该字符串中字符的所有排列。 你可以以任意顺序返回这个字符串数组&#xff0c;但里面不能有重复元素。 示例: 输入&#xff1a;s “abc” 输出&#xff1a;[“abc”,“acb”,“bac”,“bca”,“cab”,“cba”] 限制&#xff1a; 1…

前馈神经网络中的前馈_前馈神经网络在基于趋势的交易中的有效性(1)

前馈神经网络中的前馈This is a preliminary showcase of a collaborative research by Seouk Jun Kim (Daniel) and Sunmin Lee. You can find our contacts at the bottom of the article.这是 Seouk Jun Kim(Daniel) 和 Sunmin Lee 进行合作研究的初步展示 。 您可以在文章底…

hadoop将消亡_数据科学家:适应还是消亡!

hadoop将消亡Harvard Business Review marked the boom of Data Scientists in their famous 2012 article “Data Scientist: Sexiest Job”, followed by untenable demand in the past decade. [3]《哈佛商业评论 》在2012年著名的文章“数据科学家&#xff1a;最性感的工作…

剑指 Offer 15. 二进制中1的个数 and leetcode 1905. 统计子岛屿

题目 请实现一个函数&#xff0c;输入一个整数&#xff08;以二进制串形式&#xff09;&#xff0c;输出该数二进制表示中 1 的个数。例如&#xff0c;把 9 表示成二进制是 1001&#xff0c;有 2 位是 1。因此&#xff0c;如果输入 9&#xff0c;则该函数输出 2。 示例 1&…

httpd2.2的配置文件常见设置

目录 1、启动报错&#xff1a;提示没有名字fqdn2、显示服务器版本信息3、修改监听的IP和Port3、持久连接4 、MPM&#xff08; Multi-Processing Module &#xff09;多路处理模块5 、DSO&#xff1a;Dynamic Shared Object6 、定义Main server &#xff08;主站点&#xff09; …

leetcode 149. 直线上最多的点数

题目 给你一个数组 points &#xff0c;其中 points[i] [xi, yi] 表示 X-Y 平面上的一个点。求最多有多少个点在同一条直线上。 示例 1&#xff1a; 输入&#xff1a;points [[1,1],[2,2],[3,3]] 输出&#xff1a;3 示例 2&#xff1a; 输入&#xff1a;points [[1,1],[3,…

静态代理设计与动态代理设计

静态代理设计模式 代理设计模式最本质的特质&#xff1a;一个真实业务主题只完成核心操作&#xff0c;而所有与之辅助的功能都由代理类来完成。 例如&#xff0c;在进行数据库更新的过程之中&#xff0c;事务处理必须起作用&#xff0c;所以此时就可以编写代理设计模式来完成。…

6.3 遍历字典

遍历所有的键—值对 遍历字典时&#xff0c;键—值对的返回顺序也与存储顺序不同。 6.3.2 遍历字典中的所有键 在不需要使用字典中的值时&#xff0c;方法keys() 很有用。 6.3.3 按顺序遍历字典中的所有键 要以特定的顺序返回元素&#xff0c;一种办法是在for 循环中对返回的键…

Google Guava新手教程

以下资料整理自网络 一、Google Guava入门介绍 引言 Guavaproject包括了若干被Google的 Java项目广泛依赖 的核心库&#xff0c;比如&#xff1a;集合 [collections] 、缓存 [caching] 、原生类型支持 [primitives support] 、并发库 [concurrency libraries] 、通用注解 [comm…

数据科学领域有哪些技术_领域知识在数据科学中到底有多重要?

数据科学领域有哪些技术Jeremie Harris: “In a way, it’s almost like a data scientist or a data analyst has to be like a private investigator more than just a technical person.”杰里米哈里斯(Jeremie Harris) &#xff1a;“ 从某种意义上说&#xff0c;这就像是数…

初创公司怎么做销售数据分析_为什么您的初创企业需要数据科学来解决这一危机...

初创公司怎么做销售数据分析The spread of coronavirus is delivering a massive blow to the global economy. The lockdown and work from home restrictions have forced thousands of startups to halt expansion plans, cancel services, and announce layoffs.冠状病毒的…

leetcode 909. 蛇梯棋

题目 N x N 的棋盘 board 上&#xff0c;按从 1 到 N*N 的数字给方格编号&#xff0c;编号 从左下角开始&#xff0c;每一行交替方向。 例如&#xff0c;一块 6 x 6 大小的棋盘&#xff0c;编号如下&#xff1a; r 行 c 列的棋盘&#xff0c;按前述方法编号&#xff0c;棋盘格…

Python基础之window常见操作

一、window的常见操作&#xff1a; cd c:\ #进入C盘d: #从C盘切换到D盘 cd python #进入目录cd .. #往上走一层目录dir #查看目录文件列表cd ../.. #往上上走一层目录 二、常见的文件后缀名&#xff1a; .txt 记事本文本文件.doc word文件.xls excel文件.ppt PPT文件.exe 可执行…

WPF效果(GIS三维篇)

二维的GIS已经被我玩烂了&#xff0c;紧接着就是三维了&#xff0c;哈哈&#xff01;先来看看最简单的效果&#xff1a; 转载于:https://www.cnblogs.com/OhMonkey/p/8954626.html