How to do text similarity search and document clustering in BigQuery

BigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.

Follow along by copy-pasting queries from my notebook in GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

Storm reports data

As an example, I’ll use a dataset consisting of wind reports phoned into National Weather Service offices by “storm spotters”. This is a public dataset in BigQuery and it can be queried as follows:

SELECT 
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  ST_GeogPoint(longitude, latitude) AS location,
  comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
LIMIT 10

The result looks like this:

Let’s say that we want to build a SQL query to search for comments that look like “power line down on a home”.

Steps:

Load a machine learning model that creates an embedding (essentially a compact numerical representation) of some text.
Use the model to generate the embedding of our search term.
Use the model to generate the embedding of every comment in the wind reports table.
Look for rows where the two embeddings are close to each other.
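
The four steps can be sketched outside BigQuery in a few lines of Python. This is only an illustration: `toy_embed` and `VOCAB` are hypothetical stand-ins (word counts over a hand-picked vocabulary), not the Swivel model used later, and the comments are made-up examples rather than rows from the dataset.

```python
import math

# Hypothetical stand-in for a real embedding model: counts of a few
# hand-picked words. A trained model such as Swivel learns its dimensions.
VOCAB = ["power", "line", "down", "tree", "home", "house", "roof"]

def toy_embed(text):
    # Step 1: turn text into a fixed-length numeric vector.
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(term, comments, limit=3):
    term_embedding = toy_embed(term)                        # step 2
    scored = [(distance(term_embedding, toy_embed(c)), c)   # steps 3 and 4
              for c in comments]
    return sorted(scored)[:limit]

# Made-up comments, for illustration only.
comments = [
    "tree down on roof",
    "power line down on house",
    "large hail reported",
]
results = search("power line down on a home", comments)
for dist, comment in results:
    print(round(dist, 3), comment)
```

Unlike a learned embedding, this toy version treats home and house as unrelated words; closing that gap is what a trained model is for.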

Loading a text embedding model into BigQuery

TensorFlow Hub has a number of text embedding models. For best results, you should use a model that has been trained on data similar to your dataset and that has enough dimensions to capture the nuances of your text.

For this demonstration, I’ll use the Swivel embedding which was trained on Google News and has 20 dimensions (i.e., it is pretty coarse). This is sufficient for what we need to do.

The Swivel embedding layer is already available in TensorFlow SavedModel format, so we simply need to download it, extract it from the tarred, gzipped file, and upload it to Google Cloud Storage:

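# Create the working directory that wget writes into.
mkdir -p tmp
# Quote the URL so the shell does not treat "?" as a glob pattern.
wget --quiet -O tmp/swivel.tar.gz "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed"
cd tmp
tar xvfz swivel.tar.gz
cd ..
mv tmp swivel
# BUCKET must be set to the name of your Cloud Storage bucket.
gsutil -m cp -R swivel gs://${BUCKET}/swivel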

Once the model files are on GCS, we can load the model into BigQuery as an ML model:

CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://BUCKET/swivel/*')

Try out the embedding model in BigQuery

To try out the model in BigQuery, we need to know its input and output schema. These are the names the Keras layers had when the model was exported. We can get them by going to the BigQuery console and viewing the “Schema” tab of the model:

Let’s try this model out by getting the embedding for a famous August speech, aliasing the input text as sentences and knowing that we will get an output column named output_0:

SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

The result has 20 numbers as expected, the first few of which are shown below:

Document similarity search

Define a function to compute the Euclidean distance between a pair of embeddings (the square root of the sum of squared per-dimension differences):

-- Squared difference between the two embeddings at one index.
CREATE TEMPORARY FUNCTION td(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>, idx INT64) AS (
   (a[OFFSET(idx)] - b[OFFSET(idx)]) * (a[OFFSET(idx)] - b[OFFSET(idx)])
);

-- Euclidean distance over all 20 dimensions (offsets 0 through 19).
CREATE TEMPORARY FUNCTION term_distance(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
   SELECT SQRT(SUM( td(a, b, idx))) FROM UNNEST(GENERATE_ARRAY(0, 19)) idx
));
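
To check the arithmetic by hand, the same two functions can be mirrored in Python. This is a sketch for verification only, not part of the BigQuery workflow. Because the sum of squared differences is passed through a square root, the value returned is the ordinary Euclidean distance. Taking the number of dimensions from the input rather than hard-coding 0 to 19 is a choice made for this sketch.

```python
import math

def td(a, b, idx):
    # Squared difference of one dimension, like the SQL td() function.
    diff = a[idx] - b[idx]
    return diff * diff

def term_distance(a, b):
    # Assumption: both embeddings have the same length. The range comes
    # from len(a) here instead of the fixed 0..19 used in the SQL version.
    return math.sqrt(sum(td(a, b, idx) for idx in range(len(a))))

# A 3-4-5 right triangle makes the result easy to check by hand.
print(term_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```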

Then, compute the embedding for our search term:

WITH search_term AS (
  SELECT output_0 AS term_embedding FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(SELECT "power line down on a home" AS sentences))
)

and compute the distance between each comment’s embedding and the term_embedding of the search term (above):

SELECT
  term_distance(term_embedding, output_0) AS termdist,
  comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT comments, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
)), search_term
ORDER BY termdist ASC
LIMIT 10

The result is:

Remember that we searched for “power line down on a home”. Note that the top two results are “power line down on house”: the text embedding has been helpful in recognizing that home and house are similar in this context. The next set of top matches are all about power lines, the most distinctive pair of words in our search term.

Document clustering

Document clustering involves using the embeddings as an input to a clustering algorithm such as K-Means. We can do this in BigQuery itself, and to make things a bit more interesting, we’ll use the location and day-of-year as additional inputs to the clustering algorithm.

CREATE OR REPLACE MODEL advdata.storm_reports_clustering
OPTIONS(model_type='kmeans', NUM_CLUSTERS=10) AS
SELECT
  arr_to_input_20(output_0) AS comments_embed,
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
))
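
As a reminder of what K-Means does with these inputs, the sketch below shows only the assignment step: each row goes to the cluster whose centroid is nearest. It is an illustrative Python sketch, not BigQuery ML's implementation. The points, the centroids, and the helper `nearest_centroid` are hypothetical, and the full algorithm also iterates to update the centroids.

```python
import math

def nearest_centroid(point, centroids):
    # Return the 1-based index of the closest centroid (Euclidean distance).
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances)) + 1

# Hypothetical 2-D points and centroids, chosen only for illustration.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]

assignments = [nearest_centroid(p, centroids) for p in points]
print(assignments)  # [1, 2, 1]
```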

The embedding (output_0) is an array, but BigQuery ML currently wants named inputs. The workaround is to convert the array to a struct:

CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS 
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
       p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64, 
       p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64, 
       p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
       p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64> AS (
STRUCT(
    arr[OFFSET(0)]
    , arr[OFFSET(1)]
    , arr[OFFSET(2)]
    , arr[OFFSET(3)]
    , arr[OFFSET(4)]
    , arr[OFFSET(5)]
    , arr[OFFSET(6)]
    , arr[OFFSET(7)]
    , arr[OFFSET(8)]
    , arr[OFFSET(9)]
    , arr[OFFSET(10)]
    , arr[OFFSET(11)]
    , arr[OFFSET(12)]
    , arr[OFFSET(13)]
    , arr[OFFSET(14)]
    , arr[OFFSET(15)]
    , arr[OFFSET(16)]
    , arr[OFFSET(17)]
    , arr[OFFSET(18)]
    , arr[OFFSET(19)]    
));
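
The function only renames positions: element 0 becomes p1, element 1 becomes p2, and so on up to p20. A minimal Python sketch of the same mapping, using a dictionary in place of a STRUCT (the helper name `arr_to_input` and the sample values are hypothetical):

```python
def arr_to_input(arr):
    # Name the elements p1, p2, ... in order, like the STRUCT fields above.
    return {f"p{i}": value for i, value in enumerate(arr, start=1)}

# Hypothetical 20-dimensional embedding, for illustration only.
embedding = [0.1 * i for i in range(20)]
named = arr_to_input(embedding)
print(list(named)[:3])  # ['p1', 'p2', 'p3']
```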

The resulting ten clusters can be visualized in the BigQuery console:

What do the comments in cluster #1 look like? The query is:

SELECT sentences 
FROM ML.PREDICT(MODEL `ai-analytics-solutions.advdata.storm_reports_clustering`, 
(
SELECT
  sentences,
  arr_to_input_20(output_0) AS comments_embed,
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
))))
WHERE centroid_id = 1

The result shows that these are mostly short, uninformative comments:
