Introduction

This article will show you one of the ways you can process stock price data using Google Cloud Platform’s BigQuery, and build a simple dashboard on the processed data using Google Data Studio.

This is especially useful if you want to automate how you derive insights from stock prices and are looking for a fast, efficient way to host the whole process on a cloud platform.

This article is a continuation, or “part 2”, of a previous article in which I showed how to automate financial data collection with Python using APIs and Google Cloud. Feel free to read it if you are interested in the upstream data import and script automation side of this workflow. If not, just read on.

Step 1: Identify BigQuery’s Data Sources

Google BigQuery is one of Google Cloud’s data warehousing solutions and is well suited to working with relational data such as the tables in this tutorial.

In part 1, I illustrated how you can automate the data feeds into BigQuery using Cloud Functions. In this next step, you are going to use the same data sources (daily stock price data for S&P500 firms, plus related mapping tables that let us enrich the data with some categorical variables) to build a simple, neat data processing and visualization pipeline.

N.B. The following screenshots will be taken from my own GCP Console (which I have set up with Italian as a default language). I have documented each screenshot with explanations so that everyone is able to follow along in English.

To get started, log into BigQuery’s editor. Provided you have set up a dataset, you can identify the uploaded data sources by clicking on the “Resources” tab on the left side of the editor’s page.

This allows you to immediately get a list of your datasets and the tables included in each dataset.

For this exercise, my data warehouse structure is as follows (I am going to ignore the USA_SectorViews table shown in the above screenshot):

Dataset: csm

Tables:

1. SPcomponents: A table listing all S&P500 member firms (Source: List of S&P 500 companies). The majority of the columns in this table are exactly as reported in the above source, so you can use that web page directly as a reference.

2. SPhistorical: A table containing daily stock price information for all S&P500 member firms, from 2000 until ~June 2020.

Step 2: Calculate Stock Metrics and Merge Categorical Variables in a Saved SQL Query

Using the above two tables, let’s process the data using BigQuery’s SQL editor to derive a comprehensive table featuring both stock price indicators and categorical variables.

For the purpose of this example, the final output table will feature the following columns:

Timestamp: the date key of each row and stock

Symbol: the stock identifier for each S&P500 firm

GICS_Sector: the industry sector of each firm (Health Care, Consumer, etc.)

Headquarters: the HQ location of each firm

Percentage_Daily_Return: the daily return of each stock (based on the Close price)

MA_5_days: the stock’s moving average of the Close price over the 5 trading days ending on (and including) the current row’s date

MA_10_days: the stock’s moving average of the Close price over the 10 trading days ending on (and including) the current row’s date

MA_15_days: the stock’s moving average of the Close price over the 15 trading days ending on (and including) the current row’s date

The choice of periods for the moving averages is arbitrary and was made just for the sake of this walk-through. You can learn more about moving averages online, as there are plenty of valuable tutorials.

Looking at our two tables, you can see that most columns (Timestamp, Symbol, GICS_Sector, Headquarters) are ready to go already.

Using the Close price column in the SPhistorical table, you can calculate the remaining columns (Percentage_Daily_Return and the stock’s moving averages across 5-, 10-, and 15-day periods).

First, let’s calculate the daily return for each stock. A stock’s daily return is calculated as the difference in Close prices between two consecutive trading days, expressed as a percentage of the previous day’s Close.

ROUND((CAST(Close AS FLOAT64) - LAG(CAST(Close AS FLOAT64), 1) OVER (PARTITION BY p.Symbol ORDER BY Timestamp)) / LAG(CAST(Close AS FLOAT64), 1) OVER (PARTITION BY p.Symbol ORDER BY Timestamp) * 100, 2) AS Percentage_Daily_Return

You can use the LAG function to identify the previous day’s Close (within each ticker symbol only, as you do not want to calculate returns across different stocks; hence the use of OVER(PARTITION BY p.Symbol)), subtract it from the current day’s Close, and divide the difference by the previous day’s Close to calculate the return. The first row of each symbol has no previous Close, so its return is NULL.

This is the essence of the calculation. The CAST function converts text data types to floats so that the values can be subtracted and divided, and the ROUND function rounds returns to two decimals.

You can skip the CAST function if your columns already have the correct data types (e.g. a numeric type for prices).
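To make the return logic concrete outside of BigQuery, here is a minimal Python sketch of the same idea: cast text prices to floats, look up the previous Close within each symbol, and express the change as a percentage of that previous Close. The function name `daily_returns` and the tickers in the sample are hypothetical and used only for illustration; this is not the query itself.

```python
from collections import defaultdict


def daily_returns(rows):
    """Compute the daily percentage return per symbol.

    rows: iterable of (symbol, iso_date, close), where close may be text.
    Returns a dict {(symbol, iso_date): return_in_percent or None}.
    """
    by_symbol = defaultdict(list)
    for symbol, date, close in rows:
        # Equivalent of CAST(Close AS FLOAT64)
        by_symbol[symbol].append((date, float(close)))

    result = {}
    for symbol, series in by_symbol.items():
        series.sort()  # ISO dates sort chronologically (ORDER BY Timestamp)
        previous = None  # like LAG(Close, 1): nothing before the first row
        for date, close in series:
            if previous is None or previous == 0:
                result[(symbol, date)] = None
            else:
                result[(symbol, date)] = round((close - previous) / previous * 100, 2)
            previous = close
    return result


# Hypothetical sample data: two made-up tickers, prices stored as text
sample = [
    ("AAA", "2020-06-01", "100.0"),
    ("AAA", "2020-06-02", "102.0"),
    ("BBB", "2020-06-01", "50.0"),
    ("BBB", "2020-06-02", "49.0"),
]
returns = daily_returns(sample)
```

Grouping by symbol before looking up the previous value plays the role of PARTITION BY: the first row of BBB is never compared against the last row of AAA.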

The stock’s moving averages across 5-day periods are then calculated as:

AVG(CAST(Close AS FLOAT64)) OVER (PARTITION BY p.Symbol ORDER BY Timestamp ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS MA_5_days

You can use the AVG function to compute the mean Close price for each Symbol over a row window made up of the current day’s Close and the previous 4 days’ Closes, which together form the 5-day period.

The same logic is replicated across 10 and 15 day periods.

AVG(CAST(Close AS FLOAT64)) OVER (PARTITION BY p.Symbol ORDER BY Timestamp ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS MA_10_days

AVG(CAST(Close AS FLOAT64)) OVER (PARTITION BY p.Symbol ORDER BY Timestamp ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS MA_15_days
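One detail of a ROWS BETWEEN n PRECEDING AND CURRENT ROW frame is worth seeing in action: near the start of each symbol’s history the frame holds fewer rows, so the early averages are computed over fewer than 5, 10, or 15 values. The Python sketch below mimics that behavior for a single symbol; the helper name `moving_average` and the sample prices are made up for illustration.

```python
def moving_average(closes, window):
    """Trailing mean over the current value and up to window - 1 preceding ones.

    Mirrors ROWS BETWEEN (window - 1) PRECEDING AND CURRENT ROW for one symbol
    whose closes are already sorted by date.
    """
    averages = []
    for i in range(len(closes)):
        start = max(0, i - window + 1)  # the frame is clipped at the first row
        frame = closes[start:i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


# Made-up closing prices for one symbol, oldest first
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
ma_5 = moving_average(closes, 5)
```

Only from the fifth row onward does the 5-day average use a full window. If you would rather not plot partial-window values, you can filter out the first rows of each symbol in the dashboard.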

Putting everything together, you obtain the following SQL query:

In summary, the query pulls from both the SPhistorical table (aliased as “p”) and the SPcomponents table (aliased as “c”). The two tables are joined using their shared Symbol column as a key.

I am using SPhistorical as the main reference table of the two. Thus I define a LEFT JOIN through which I bring in the categorical variables I am interested in (GICS_Sector and Headquarters) from SPcomponents.

Timestamp, Symbol, and the daily Close are pulled from p. The categorical variables GICS_Sector and Headquarters are pulled from c. The calculations explained above are computed on the daily Close.

The table is then grouped by the relevant variables and ordered by Symbol and Timestamp. You launch the query, wait for BigQuery to execute it, and obtain the results shortly after.
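To try the join-plus-window pattern described above without a BigQuery project, here is a small sketch that runs a similar query locally with Python’s built-in sqlite3 module. It is a stand-in under stated assumptions, not the saved BigQuery query: SQLite’s dialect differs (REAL instead of FLOAT64), the tickers, sector, and headquarters values are invented, only one moving average is computed, and the GROUP BY step is left out because each symbol and date appears once in the sample. It needs an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Tiny made-up versions of the two tables; 'ZZZ' has no match in SPcomponents
conn.executescript("""
CREATE TABLE SPhistorical (Timestamp TEXT, Symbol TEXT, Close TEXT);
CREATE TABLE SPcomponents (Symbol TEXT, GICS_Sector TEXT, Headquarters TEXT);
INSERT INTO SPhistorical VALUES
  ('2020-06-01', 'AAA', '100.0'),
  ('2020-06-02', 'AAA', '102.0'),
  ('2020-06-03', 'AAA', '101.0'),
  ('2020-06-01', 'ZZZ', '50.0');
INSERT INTO SPcomponents VALUES ('AAA', 'Industrials', 'Springfield');
""")

query = """
SELECT
  p.Timestamp,
  p.Symbol,
  c.GICS_Sector,
  c.Headquarters,
  AVG(CAST(p.Close AS REAL)) OVER (
    PARTITION BY p.Symbol
    ORDER BY p.Timestamp
    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
  ) AS MA_5_days
FROM SPhistorical AS p
LEFT JOIN SPcomponents AS c
  ON p.Symbol = c.Symbol
ORDER BY p.Symbol, p.Timestamp
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because SPhistorical sits on the left side of the LEFT JOIN, the price row for the unmatched ticker is kept and its categorical columns come back empty rather than the row being dropped.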

You can then refine your query, compute new metrics, and re-run it as many times as you like until you get the output you want. Also, do not forget to save your query as a “Saved Query” by clicking on the “Save Query” button next to the “Execute” button. Find more info here.

Once done, you are ready to use the query’s results as the data source layer of your Data Studio dashboard.

Step 3: From Data Studio, Connect to the Saved Query and Pull Data onto the Dashboard

Option 1:

You can now hop onto your Data Studio account and click on the plus sign to start a new report.

From here, you can choose among a variety of data connectors:

After selecting BigQuery, you can simply click on Import Personalized Query and paste in the Saved Query you built in step 2. Once done, click Add.

Provided that you are connected, the data will be pulled in and you will then be presented with a blank Report view from which you can start building your dashboard. Notice how BigQuery is selected as my data origin.

Option 2:

Within BigQuery, you can click on “Explore Data” > “Explore with Data Studio”.

Clicking on it opens a Data Studio data exploration pane, where you can start plotting and visualizing the data. This is especially useful for running a quick analysis and getting an immediate visual sense of your query results.

Step 4: Explore the Data with Google Data Studio

At this step, you may prefer to go with option 1 if your end goal is to build out a full dashboard or visualization report.

I will also go with option 1 to gain a bit more flexibility when exploring the data (the data exploration functionality is still in beta at the time of this writing).

To get more insight into the data and give an overview of Data Studio’s functionality, I will answer the following questions:

1) What are the main industry sectors making up the S&P500 in 2020?

2) How are these firms distributed geographically (in terms of HQ base)?

3) How has the SPY* performed historically?

4) What does the latest trend in rolling averages look like for the SPY?

*An index fund tracking the S&P500 as a whole. In part 1, I included the SPY in addition to each individual S&P500 member, and I am going to make use of it here.

Overall, Data Studio works as an intuitive drag-and-drop interface: you choose among different chart types and style or format them according to the data at your disposal.

At this link, you can also reference a great guide which illustrates their entire interface in great detail.

Let’s now see how different chart types can help with answering the above questions.

1) What are the main industry sectors making up the S&P500 in 2020?

To tackle question one, a pie chart seems like a good choice. Refer to this quick guide for details on adding charts to your blank report on Data Studio.

The S&P500 is well balanced across industry sectors. The top 3 in order of prevalence are Industrials, Information Technology and Financials, each making up around 13–14% of the total.
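The sector shares shown in a pie chart are just counts of firms per GICS_Sector divided by the total number of firms. The short Python sketch below shows that arithmetic on an invented list of sector labels; the helper name `sector_shares` and the sample counts are hypothetical and do not reproduce the actual S&P500 breakdown.

```python
from collections import Counter


def sector_shares(sectors):
    """Return each sector's share of firms in percent, largest first."""
    counts = Counter(sectors)
    total = sum(counts.values())
    return {
        sector: round(count / total * 100, 1)
        for sector, count in counts.most_common()
    }


# Invented sample: one GICS_Sector label per firm
sample_sectors = ["Industrials"] * 3 + ["Financials"] * 2 + ["Energy"]
shares = sector_shares(sample_sectors)
```

Running the same count on the GICS_Sector column of SPcomponents, with one row per firm, would give the figures behind the chart.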

2) How are these firms distributed geographically (in terms of HQ base)?

In terms of HQ location, one can see a prevalence of US firms, as expected, with some European firms present. When using a map, you can use the toolbar to zoom in and out, enter full-screen mode, and adjust the map view.

3) How has the SPY* performed historically?

Next, I plot the SPY daily Close over the years using a line chart, to get a sense of its trend.

Over the long term, the SPY rises from around 150 to over 300, a steady increase.

4) What does the latest trend in rolling averages look like for the SPY?

I then plot the 5-, 10-, and 15-day moving averages over a more recent period to see how they stack up against each other. Having chosen similar time scales for the three metrics, you can see that in general they track each other quite closely, with the 15-day average showing a bit more variability around the general trend line.

Putting everything together, you can then add a headline and title to your report, and obtain something along the lines of:

This gives you a complete snapshot of your data and indicators of interest.

As you can see, Data Studio is quite simple to use and provides great data connectors and interactivity.

Next steps

Your workflow is officially set up. You can save all of your BigQuery SQL queries and your Data Studio report, and refresh/extend resources as new data comes in.

I hope to have shown you something useful. You can get started building your own Google Cloud solution with your own data using GCP’s free tier account.

Thanks very much for reading!
