SQL has become a common skill requirement across industries and job profiles over the last decade.
Companies like Amazon and Google often require their data analysts, data scientists, and product managers to be at least familiar with SQL. This is because SQL remains the language of data: to be data-driven, people need to know how to access and analyze data.
With so many people looking at, slicing, manipulating, and analyzing data, we wanted to provide some tips to help improve your SQL.
We picked up these tips and tricks along the way while writing SQL. Some are do's and don'ts; others are simply best practices. Overall, we hope they help bring your SQL to the next level.
Some of the tips cover things you shouldn't do, even when you might be tempted; others are best practices that help ensure you can trust your data. All of them are meant to be informative and to reduce future headaches.
Don't Use Avg() on an Average: It's Not the Same
A common mistake we see in people’s queries is averaging averages. Some people may think it’s obvious to not average averages. However, across the web, there are discussions and whole articles explaining why it is bad to average averages.
Why is it bad to average averages, both in SQL and in general? Because every average gets the same weight regardless of how many records it summarizes, so an average based on only a handful of records can skew the result.
For example, look at the table below:
Here we’ve already averaged the cost per claim at the county level. What we also can see is that one county’s average is based on 100 claims and the other is based on 2 claims. In a real-life situation, this table would not include the total number of claims — we’re using it to illustrate how easily you can skew an average.
What if we wanted to find the average across all the counties? If you averaged the two averages, you would get $525. That doesn't seem right.
If 100 claims averaged $50 and only 2 averaged $1,000, then the average of all those claims should be much closer to $50 than to $500. In fact, the true average is (100 × $50 + 2 × $1,000) / 102, or about $68.63. Averaging the averages gives a number more than seven times greater.
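The arithmetic above can be checked with a small sketch. It uses Python's built-in sqlite3 module only so the SQL can be run as-is; the table name county_claims, its columns, and the county labels are assumptions made for illustration, not the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical summary table: one row per county, already averaged.
conn.execute(
    "CREATE TABLE county_claims (county TEXT, claim_count INTEGER, avg_cost REAL)"
)
conn.executemany(
    "INSERT INTO county_claims VALUES (?, ?, ?)",
    [("County A", 100, 50.0), ("County B", 2, 1000.0)],
)

# Naive: each county counts equally, no matter how many claims it has.
naive = conn.execute("SELECT AVG(avg_cost) FROM county_claims").fetchone()[0]

# Weighted: rebuild total cost and total claims, then divide.
weighted = conn.execute(
    "SELECT SUM(avg_cost * claim_count) / SUM(claim_count) FROM county_claims"
).fetchone()[0]

print(f"average of averages: {naive:.2f}")
print(f"weighted average:    {weighted:.2f}")
```

The weighted version only works when the row counts behind each average are available, which is exactly the information a pre-averaged table often drops.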
So why do people even ask if it’s OK to average the average? Well, sometimes averaging the average can feel close to the expected output.
Let’s look at a SQL example:
In this case, we’ll be using a table that has the average cost per patient and average visits per patient by county and age. However, we would like to find the average cost per patient and visits per patient at the county level.
If we average the averages from the table using the query above, it will give us the following output:
But we could also correctly write a query that recalculates the average at the county granularity, as shown below:
Let’s compare this query’s output to the previous output:
You will notice a few differences in the King County output. If we compare the average visits they seem quite similar — 2.4 versus 2.6. This is probably why some people fall for the average of averages — they can sometimes be close to the actual output, so it can be tempting to use this method.
However, when we look at the average cost per patient, we'll notice a difference of roughly $58 (about $560 versus $620), which is almost 10%. When you're talking about cost savings, that's a huge difference.
So although a difference like 2.4 versus 2.6 may seem small, the same shortcut can lead to some massive differences in other metrics.
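To make the comparison concrete, here is a minimal sketch of both approaches, run through Python's sqlite3 module. The patients table, its columns, and every number in it are made up for illustration (they are not the figures discussed above), and the sketch assumes the patient-level rows are still available to recalculate from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical patient-level data: one row per patient.
conn.execute(
    "CREATE TABLE patients (county TEXT, age_group TEXT, cost REAL, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [
        ("King", "18-40", 100.0, 1),
        ("King", "18-40", 200.0, 2),
        ("King", "18-40", 300.0, 3),
        ("King", "65+", 1000.0, 6),
        ("Pierce", "18-40", 150.0, 2),
        ("Pierce", "18-40", 250.0, 2),
        ("Pierce", "65+", 400.0, 4),
        ("Pierce", "65+", 600.0, 4),
    ],
)

# Average of averages: average per county and age group first, then again per county.
avg_of_avgs = conn.execute(
    """
    SELECT county, AVG(avg_cost), AVG(avg_visits)
    FROM (
        SELECT county, age_group, AVG(cost) AS avg_cost, AVG(visits) AS avg_visits
        FROM patients
        GROUP BY county, age_group
    )
    GROUP BY county
    ORDER BY county
    """
).fetchall()

# Recalculated: average the patient rows directly at the county level.
recalculated = conn.execute(
    """
    SELECT county, AVG(cost), AVG(visits)
    FROM patients
    GROUP BY county
    ORDER BY county
    """
).fetchall()

print(avg_of_avgs)
print(recalculated)
```

In this made-up data the two methods agree for Pierce, where both age groups hold the same number of patients, and disagree for King, where one group holds a single expensive patient. That is why the shortcut can look right on one slice of data and be badly wrong on another.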
You Can Use a Case Statement Inside Sum
Another great tip when writing SQL is learning how to use a case statement inside the SUM function. This is very useful when you are writing ratio metrics, where the numerator counts only the rows that meet some condition.
For example, take a look at the query below. You will see that we need to hit the claims table twice: once to get the count of the values we are filtering for, and once to get the total number of rows. However, we could reduce this.
We can write a case statement to count the total values where the condition is true and then divide by the total count, as in the query below.
You’ll notice that we don’t need to hit the table twice to get both numbers. This is also simpler to read. From my experience, this trick is usually picked up by most SQL developers somewhere in their first year or two of using SQL.
It's extremely helpful for writing code that calculates the percentage of nulls in a column, or for calculating metrics for dashboards. This is why many analysts and data engineers become familiar with the trick, as long as they write a decent amount of SQL and don't rely only on drag-and-drop solutions.
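As a rough sketch of the pattern, the example below computes the same ratio both ways, using Python's sqlite3 module to run the SQL. The claims table here, with its status and provider_id columns, is a hypothetical stand-in for whatever condition you are filtering on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical claims table; status and provider_id are illustrative columns.
conn.execute("CREATE TABLE claims (claim_id INTEGER, status TEXT, provider_id INTEGER)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (1, "approved", 10),
        (2, "denied", 11),
        (3, "approved", None),
        (4, "denied", 10),
        (5, "approved", 12),
    ],
)

# Two passes: one scan for the filtered count, another for the total.
two_pass = conn.execute(
    """
    SELECT (SELECT COUNT(*) FROM claims WHERE status = 'denied') * 1.0
         / (SELECT COUNT(*) FROM claims)
    """
).fetchone()[0]

# One pass: the CASE expression turns each row into 1 or 0 before summing.
denied_ratio, null_ratio = conn.execute(
    """
    SELECT SUM(CASE WHEN status = 'denied' THEN 1 ELSE 0 END) * 1.0 / COUNT(*),
           SUM(CASE WHEN provider_id IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
    FROM claims
    """
).fetchone()

print(two_pass, denied_ratio, null_ratio)
```

The multiplication by 1.0 is there to force floating-point division; without it, many engines would perform integer division and return 0.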
Understanding Arrays and How to Manipulate Them
Arrays and maps inside database tables aren't too common. However, I've noticed more and more teams relying on unstructured data, which is often stored in data structures like arrays and maps and queried with array functions.
This is possible because databases like Postgres and SQL engines like Presto let you handle arrays in your queries.
Although arrays and maps are not new concepts, they are somewhat new to analysts and data scientists who aren't as familiar with programming. This means you may occasionally need to learn a few array and map functions to extract data.
Let's start by learning how to unnest a map in Presto. A map is a data structure that provides a key:value relationship. This means each value is stored under a unique key that describes it, like "first_name": "George". A map can also contain multiple key-value pairs, like the image below.
In this case, we have two keys, dob and friend_ids, that we would like to access:
So how do we access that data? Let’s check out the query below.
As you can see, you can define a row for both the key and value. So when we pull out the data you can get the specific data types.
The output will look like this:
You can also check the length of arrays, find specific keys, and much more (the Presto documentation covers its array functions in more detail). I recommend you don't use maps and arrays as replacements for good data modeling. However, they can come in handy when you're working with data that you might not want a specific schema for.
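The idea of unnesting can be tried without a Presto cluster. The sketch below is not Presto syntax: it swaps in SQLite's json_each table-valued function (available in SQLite builds that include the JSON functions, which is an assumption about your installation) to turn a map-like JSON column into one row per key, and an array into one row per element. The users table and its attributes column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with a map-like JSON column holding dob and friend_ids.
conn.execute("CREATE TABLE users (user_id INTEGER, attributes TEXT)")
conn.execute(
    "INSERT INTO users VALUES (1, ?)",
    ('{"dob": "1990-01-01", "friend_ids": [2, 3]}',),
)

# One output row per key in the map.
pairs = conn.execute(
    """
    SELECT u.user_id, kv.key, kv.type
    FROM users AS u, json_each(u.attributes) AS kv
    ORDER BY kv.key
    """
).fetchall()

# One output row per element of the friend_ids array.
friends = conn.execute(
    """
    SELECT u.user_id, f.value AS friend_id
    FROM users AS u, json_each(u.attributes, '$.friend_ids') AS f
    ORDER BY friend_id
    """
).fetchall()

print(pairs)
print(friends)
```

In Presto the equivalent step would use UNNEST in the FROM clause, typically with CROSS JOIN, but the shape of the result is the same: nested values become ordinary rows that you can filter, join, and aggregate.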
Lead and Lag to Avoid Self Joins
Finally, let's talk about using the Lead and Lag window functions to avoid self joins.
When you're doing analytics, you will often need to compare the output of two events or calculate the amount of time between two events.
One way to do this is to join a table to itself and connect the two rows. However, the Lag and Lead functions are a nifty alternative.
These allow a user to reference a value from a preceding (lagging) or following (leading) row. You can also specify the level of granularity at which the lagging and leading values are calculated.
For example, in the query below we are partitioning the lagging and leading values by patient_id. This means we are only looking at lagging and leading claim_dates and claim_costs at the patient level:
The output of this query will look like this:
You will notice that for the first date of every patient, the lagging claim_date and cost are null. This is because there's no prior cost or claim date.
Overall, the lag and lead functions can make a SQL developer's life much simpler.
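Here is a minimal sketch of the pattern, run through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The claims table reuses the patient_id, claim_date, and claim_cost names from the text, but the rows and the derived days_since_prev column are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical claims data: several claims per patient.
conn.execute("CREATE TABLE claims (patient_id INTEGER, claim_date TEXT, claim_cost REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 100.0),
        (1, "2020-01-15", 250.0),
        (1, "2020-03-01", 80.0),
        (2, "2020-02-10", 500.0),
        (2, "2020-02-20", 40.0),
    ],
)

# Each row sees its neighbors within the same patient, ordered by claim date.
rows = conn.execute(
    """
    SELECT patient_id,
           claim_date,
           LAG(claim_date)  OVER (PARTITION BY patient_id ORDER BY claim_date) AS prev_claim_date,
           LEAD(claim_date) OVER (PARTITION BY patient_id ORDER BY claim_date) AS next_claim_date,
           LAG(claim_cost)  OVER (PARTITION BY patient_id ORDER BY claim_date) AS prev_claim_cost,
           CAST(
               julianday(claim_date)
               - julianday(LAG(claim_date) OVER (PARTITION BY patient_id ORDER BY claim_date))
               AS INTEGER
           ) AS days_since_prev
    FROM claims
    ORDER BY patient_id, claim_date
    """
).fetchall()

for row in rows:
    print(row)
```

Each patient's first claim has nulls in the lag columns and each patient's last claim has a null in the lead column, since there is no neighboring row inside the partition. The date arithmetic uses SQLite's julianday; other engines have their own date-difference functions.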
The Details Matter With SQL
SQL remains the language of data. Learning these tips and tricks can help ensure that your next dashboard or analysis is that much better. Whether you avoid averaging averages or write data quality checks, these small improvements make a huge difference. Some of these mistakes have caused major problems and discussions in companies, so we hope this helps bring many of you up to speed.
In addition, if you follow these SQL tips, your data analysis will be more accurate and you can be more confident in the numbers you provide.
Thanks for reading.
Translated from: https://medium.com/better-programming/4-sql-tips-for-data-scientist-and-data-engineers-56c41487752f