4 SQL Tips for Data Scientists and Data Engineers

SQL has become a common skill requirement across industries and job profiles over the last decade.

Companies like Amazon and Google often require their data analysts, data scientists, and product managers to be at least familiar with SQL. This is because SQL remains the language of data. So, in order to be data-driven, people need to know how to access and analyze data.

With so many people looking at, slicing, manipulating, and analyzing data, we wanted to provide some tips to help improve your SQL.

These are tips and tricks we have picked up along the way while writing SQL. Some of them are do’s and don’ts; others are just best practices. Overall, we hope they will help bring your SQL to the next level.

Some of the tips cover things you shouldn’t do, even when you might be tempted, and others are best practices that help ensure you can trust your data. All of them are meant to be informative and to reduce future headaches.

Don’t Use Avg() on an Average: It’s Not the Same

A common mistake we see in people’s queries is averaging averages. Some people may think it’s obvious that you shouldn’t average averages. However, across the web, there are discussions and whole articles explaining why it is bad to do so.

Why is it bad to average averages, both in SQL and in general? Because the result gives every group equal weight, so it can be skewed by averages that were based on only a few underlying records.

For example, look at the table below:

[Image: table of average cost per claim and number of claims by county]

Here we’ve already averaged the cost per claim at the county level. We can also see that one county’s average is based on 100 claims and the other’s is based on 2 claims. In a real-life situation, this table would not include the total number of claims; we’re using it to illustrate how easily you can skew an average.

What if we wanted to find the average across all the counties? If you were to average the averages, you would get $525. That doesn’t seem right.

If 100 claims averaged $50 and only 2 averaged $1,000, then the average of all those claims should be much closer to $50 than to $525. In fact, the true average is (100 × $50 + 2 × $1,000) / 102, or about $68.63. Averaging the averages gives a number more than seven times greater.
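
The arithmetic above can be checked with a small sketch. It runs the SQL through Python's built-in sqlite3 module so it is self-contained; the table name county_claims, its column names, and the county labels are assumptions made for illustration, not the original table.

```python
import sqlite3

# Hypothetical two-row table mirroring the numbers in the text:
# one county averages $50 over 100 claims, the other $1,000 over 2 claims.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE county_claims (county TEXT, avg_cost REAL, claim_count INTEGER)"
)
conn.executemany(
    "INSERT INTO county_claims VALUES (?, ?, ?)",
    [("County A", 50.0, 100), ("County B", 1000.0, 2)],
)

naive, weighted = conn.execute(
    """
    SELECT
        AVG(avg_cost) AS average_of_averages,
        -- weight each county's average by the number of claims behind it
        SUM(avg_cost * claim_count) / SUM(claim_count) AS weighted_average
    FROM county_claims
    """
).fetchone()

print(f"average of averages: {naive:.2f}")
print(f"weighted average:    {weighted:.2f}")
```

Weighting by claim_count only works when the count is available, which is why keeping the counts (or the raw rows) around matters.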

So why do people even ask if it’s OK to average the average? Well, sometimes averaging the averages lands close to the expected output.

Let’s look at a SQL example:

In this case, we’ll be using a table that has the average cost per patient and average visits per patient by county and age. However, we would like to find the average cost per patient and visits per patient at the county level.

If we simply apply AVG() to the averages in that table, we get the following output:

[Image: query output of the averaged averages by county]

But we could also correctly write a query that recalculates the average at the county granularity, as shown below:
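
The original queries are not reproduced here, so what follows is only a sketch of the two approaches, run through Python's sqlite3 module. The table patient_averages, its column names, the patient_count column, and all of the data are assumptions; the numbers are illustrative and do not match the article's output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patient_averages (
        county TEXT,
        age_group TEXT,
        avg_cost_per_patient REAL,
        avg_visits_per_patient REAL,
        patient_count INTEGER  -- assumed: needed to re-weight the averages
    )
    """
)
conn.executemany(
    "INSERT INTO patient_averages VALUES (?, ?, ?, ?, ?)",
    [
        ("King", "0-17", 300.0, 2.0, 50),
        ("King", "18-64", 500.0, 2.0, 400),
        ("King", "65+", 1000.0, 4.0, 50),
        ("Pierce", "0-17", 200.0, 1.0, 100),
        ("Pierce", "18-64", 400.0, 3.0, 100),
    ],
)

# Average of averages: every age group counts equally, whatever its size.
naive_sql = """
    SELECT county, AVG(avg_cost_per_patient), AVG(avg_visits_per_patient)
    FROM patient_averages
    GROUP BY county
    ORDER BY county
"""

# Recalculated at county granularity: weight each group by its patient count.
weighted_sql = """
    SELECT county,
           SUM(avg_cost_per_patient * patient_count) / SUM(patient_count),
           SUM(avg_visits_per_patient * patient_count) / SUM(patient_count)
    FROM patient_averages
    GROUP BY county
    ORDER BY county
"""

naive = {c: (cost, visits) for c, cost, visits in conn.execute(naive_sql)}
weighted = {c: (cost, visits) for c, cost, visits in conn.execute(weighted_sql)}

for county in naive:
    print(county, "naive:", naive[county], "weighted:", weighted[county])
```

In this made-up data the two methods agree for Pierce, where the age groups are the same size, and diverge for King, where they are not. That is the situation in which an average of averages looks deceptively safe.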

Let’s compare this query’s output to the previous output:

[Image: query output of the recalculated county-level averages]

You will notice a few differences in the King County output. If we compare the average visits, they seem quite similar: 2.4 versus 2.6. This is probably why some people fall for the average of averages. It can sometimes be close to the actual output, so it can be tempting to use this method.

However, when we look at the average cost per claim, we’ll notice a difference of nearly $58 between about $560 and about $620, which is almost 10%. When you’re talking about cost savings, that’s a huge difference.

So although the gap between 2.4 and 2.6 may seem small, the same method can lead to massive differences in other metrics.

You Can Use a Case Statement Inside Sum

Another great tip when writing SQL is learning how to use a CASE statement inside SUM(). This can be very useful when you are writing a metric that is a ratio, where the numerator counts only the rows that meet a condition.

For example, a common first attempt hits the claims table twice: once to count the rows that match the filter and once to count the total number of rows. However, we can reduce this to a single pass.

We can write a case statement to count the total values where the condition is true and then divide by the total count, as in the query below.
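
As a rough sketch of both versions, the snippet below runs the two-pass query and the SUM(CASE ...) query against a tiny in-memory SQLite table. The status column and the 'denied' filter are hypothetical stand-ins for whatever condition the metric needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, status TEXT)")  # status is assumed
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [(1, "paid"), (2, "denied"), (3, "paid"), (4, "paid")],
)

# Two passes: one subquery for the filtered count, one for the total.
two_pass = conn.execute(
    """
    SELECT (SELECT COUNT(*) FROM claims WHERE status = 'denied') * 1.0
         / (SELECT COUNT(*) FROM claims)
    """
).fetchone()[0]

# One pass: the CASE expression yields 1 for matching rows and 0 otherwise,
# so SUM() counts the matches while COUNT(*) counts every row.
single_pass = conn.execute(
    """
    SELECT SUM(CASE WHEN status = 'denied' THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
    FROM claims
    """
).fetchone()[0]

print(two_pass, single_pass)
```

The multiplication by 1.0 is there because many engines, SQLite included, would otherwise perform integer division.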

You’ll notice that we don’t need to hit the table twice to get both numbers. This is also simpler to read. From my experience, most SQL developers pick up this trick somewhere in their first year or two of using SQL.

It’s extremely helpful for writing code that calculates the percentage of nulls in a column, or for calculating metrics for dashboards. This is why many analysts and data engineers become familiar with this trick, as long as they write a decent amount of SQL and don’t just use drag-and-drop solutions.
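
A null-percentage check of this kind might look like the sketch below, again using sqlite3 for a runnable example. The provider_id column is a hypothetical nullable column chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, provider_id INTEGER)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    # Python's None is stored as SQL NULL
    [(1, 10), (2, None), (3, 11), (4, None), (5, 12)],
)

pct_null = conn.execute(
    """
    SELECT 100.0 * SUM(CASE WHEN provider_id IS NULL THEN 1 ELSE 0 END) / COUNT(*)
    FROM claims
    """
).fetchone()[0]

print(f"provider_id is null in {pct_null:.1f}% of rows")
```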

Understanding Arrays and How to Manipulate Them

Arrays and maps inside database tables aren’t too common. However, I’ve noticed more and more teams relying on unstructured data, which often ends up stored in data structures like arrays and maps and queried with array functions.

This is because databases like Postgres and SQL engines like Presto allow you to handle arrays in your query.

Although arrays and maps are not new concepts, they are somewhat new to analysts and data scientists who perhaps aren’t as familiar with programming. This means you may occasionally need to learn a few array and map functions to extract data.

Let’s start by learning how to unnest a map in Presto. A map is a data structure that provides a key:value relationship. This means each value is stored under a unique key that describes it, like "first_name": "George". A map can also contain multiple key-value pairs, like the one in the image below.

In this case, we have two keys, dob and friend_ids, that we would like to access:

[Image: example map containing the keys dob and friend_ids]

So how do we access that data? Let’s check out the query below.

As you can see, you can define a row for both the key and value. So when we pull out the data you can get the specific data types.
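
The Presto query itself is not reproduced here. In Presto this kind of expansion is usually written with CROSS JOIN UNNEST(map_column) AS t(key, value), which emits one row per key-value pair. Since that cannot run without a Presto cluster, the sketch below is only a plain-Python analogy of what unnesting a map produces; the users rows, the details column, and the values are invented for illustration.

```python
# Hypothetical rows: each user carries a map-like "details" column
# holding the two keys mentioned in the text, dob and friend_ids.
users = [
    {"user_id": 1, "details": {"dob": "1990-01-01", "friend_ids": [2, 3]}},
    {"user_id": 2, "details": {"dob": "1985-06-15", "friend_ids": [1]}},
]


def unnest_map(rows, map_column):
    """Yield one (user_id, key, value) tuple per key-value pair in the map."""
    for row in rows:
        for key, value in row[map_column].items():
            yield row["user_id"], key, value


unnested = list(unnest_map(users, "details"))
for record in unnested:
    print(record)
```

Each input row fans out into as many output rows as its map has keys, which is the same shape change an UNNEST performs.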

The output will look like this:

[Image: query output with one row per key and value]

You can also check the length of arrays, find specific keys, and much more (see the Presto documentation on array functions). I recommend that you don’t use maps and arrays as replacements for good data modeling. However, they can come in handy when you’re working with data that you might not want a specific schema for.

Lead and Lag to Avoid Self Joins

Finally, let’s talk about using the Lead and Lag window functions to avoid self joins.

When you’re doing analytics, you will often need to compare the output of two events or calculate the amount of time between two events.

One way to do this is to join a table to itself and connect the two rows. However, the Lag and Lead functions are a nifty alternative.

These allow a user to reference a specified lagging or leading value. You can also specify the level of granularity at which the lagging and leading values are calculated.

For example, in the query below we are partitioning the lagging and leading values by patient_id. This means we are only looking at lagging and leading claim_dates and claim_costs at the patient level:
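
A query along these lines might look like the sketch below. It runs on SQLite through Python's sqlite3 module, which needs SQLite 3.25 or newer for window functions. The column names patient_id, claim_date, and claim_cost come from the text; the table name claims, the output aliases, and the data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (patient_id INTEGER, claim_date TEXT, claim_cost REAL)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (1, "2020-01-05", 100.0),
        (1, "2020-02-10", 250.0),
        (1, "2020-03-15", 80.0),
        (2, "2020-01-20", 500.0),
        (2, "2020-04-01", 300.0),
    ],
)

rows = conn.execute(
    """
    SELECT
        patient_id,
        claim_date,
        claim_cost,
        LAG(claim_date)  OVER w AS prev_claim_date,
        LAG(claim_cost)  OVER w AS prev_claim_cost,
        LEAD(claim_date) OVER w AS next_claim_date
    FROM claims
    -- each patient's claims form their own window, ordered by date
    WINDOW w AS (PARTITION BY patient_id ORDER BY claim_date)
    ORDER BY patient_id, claim_date
    """
).fetchall()

for row in rows:
    print(row)
```

Because the window is partitioned by patient_id, a patient's first claim has no lagging value and the last claim has no leading value, so those columns come back as NULL (None in Python) rather than borrowing from another patient.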

The output of this query will look like this:

[Image: query output with lagging and leading claim dates and costs per patient]

You will notice that for the first date of every patient, the lagging claim_date and cost are null. This is because there’s no prior cost or claim date.

Overall, the lag and lead functions can make an SQL developer's life much simpler.

The Details Matter With SQL

SQL remains the language of data. Learning these tips and tricks can help ensure that your next dashboard or analysis is that much better. Whether you avoid averaging averages or write data quality checks, these small improvements make a huge difference. Some of these mistakes have caused serious problems and long discussions in companies, so we hope this helps bring many of you up to speed.

In addition, if you follow these SQL tips, your data analysis will be more accurate and you can be more confident in the numbers you provide.

Thanks for reading.

Source: https://medium.com/better-programming/4-sql-tips-for-data-scientist-and-data-engineers-56c41487752f
