sql 左联接 全联接
The last couple of blogs that I have written have been great for beginners ( Data Concepts Without Learning To Code or Developing A Data Scientist’s Mindset). But, I would really like to push myself to create content for other members of my audience as well. So today, we are going to take a step into more intermediate data analysis territory *dun dun dun*…. and discuss self joins: what they are and how to use them to take your analysis to the next level.
我写的最后两个博客对初学者非常有用( 无需学习编码或发展数据科学家思维方式的 数据概念 )。 但是,我真的很想推动自己为观众的其他成员创建内容。 因此,今天,我们将迈入更中间的数据分析领域* dun dun dun *...。 并讨论自我联接:它们是什么以及如何使用它们将您的分析提高到一个新的水平。
Disclaimer: This post assumes that you already understand how joins work in SQL. If you are not familiar with this concept yet, no worries at all! Save this article for later because I think it’ll definitely be useful as you master SQL in the future.
免责声明:本文假定您已经了解联接在SQL中的工作方式。 如果您还不熟悉这个概念,那就不用担心了! 请保存本文以供以后使用,因为我认为将来掌握SQL肯定会很有用。
什么是“自”联接? (What is a “Self” Join?)
A self join is actually as literal as it gets — it is joining a database table to itself. You can use any kind of join you want to perform a self join (left, right, inner, etc.) — what makes it the self join is that you use the same table on both sides. Just make sure that you select the correct join type for your specific scenario and desired outcome.
自我连接实际上就是获得的字面量-它是将数据库表连接到自身。 您可以使用任何类型的联接来执行自联接(左,右,内部等)—使之成为自联接的原因是您在两边都使用相同的表。 只需确保为您的特定方案和所需结果选择正确的联接类型即可。
我应该何时使用自我加入? (When Should I Use a Self Join?)
If you’ve been working or studying in the field of data analytics and data science for more than, say, 5 minutes, you’ll know that there are always 27 ways to solve a problem. Some are better than others of course, but sometimes the differences are almost indiscernible.
如果您在数据分析和数据科学领域从事了超过5分钟的工作或学习,那么您将知道总有27种方法可以解决问题。 当然,有些比其他的要好,但是有时差异几乎是看不到的。
That being said, there is probably never going to be one exact case where you MUST HAVE a self join or your analysis will shrivel up and die with nowhere to turn. *drop me a scenario in the comments below if you’ve got one, of course*
话虽这么说,可能永远不会有一个确切的案例,您必须进行自我加入,否则您的分析将崩溃而死,无处可去。 *如果有,请在下面的评论中给我一个方案,*
But, I do at least have some scenarios where I have used self joins to solve my analytics problems, at work or in personal analysis. Here’s my own spin on two of the best (AKA the ones that I remember and can think of a good example for).
但是,至少在某些情况下,我在工作中或个人分析中使用了自我联接来解决我的分析问题。 这是我自己选出的两个最好的(也就是我记得并可以想到的一个很好的例子)。
方案1:消息/响应 (Scenario #1: Message/Response)
Suppose that there exists a database table called Chats that holds all of the chat messages that have been sent or received by an online clothing store business.
假设存在一个名为Chats的数据库表,其中包含在线服装店业务已发送或接收的所有聊天消息。
It would be extremely beneficial for the clothing store owner to know how long it usually takes her to respond to messages from her customers.
对于服装店的老板来说,知道她通常需要多长时间才能响应来自客户的消息,这将是非常有益的。
But, the messages from her customers and messages to her customers are in the same data source. So, we can use a self join to query the data and provide this analysis to the store owner. We will need one copy of the Chats table to get the initial message from the customer and one copy of the Chats table to get the response from the owner. Then, we can do some date math on the dates associated with those events to figure out how long the store owner is taking to respond.
但是,来自她的客户的消息和发给她的客户的消息位于同一数据源中。 因此,我们可以使用自我联接来查询数据并将此分析提供给商店所有者。 我们将需要一个Chats表副本来获取来自客户的初始消息,并需要一个Chats表副本来获取所有者的响应。 然后,我们可以对与这些事件相关的日期进行一些日期数学运算,以确定商店所有者需要花多长时间进行响应。
I would write this hypothetical self join query as the following:
我将这个假设的自我联接查询编写如下:
SELECT
msg.MessageDateTime AS CustomerMessageDateTime,
resp.MessageDateTime AS ResponseDateTime,
DATEDIFF(day, msg.MessageDateTime, resp.MessageDateTime)
AS DaysToRespond
FROM
Chats msg
INNER JOIN resp ON msg.MsgId = resp.RespondingTo
Note: This SQL query is written using Transact-SQL. Use whatever date functions work for your database at hand.
注意:此SQL查询是使用Transact-SQL编写的。 使用适用于您的数据库的任何日期函数。
This query is relatively straightforward, since the RespondingTo column gives us a one-to-one mapping of which original message to join back to.
该查询相对简单,因为RespondingTo列为我们提供了将原始消息加入其中的一对一映射。
方案2:开启/关闭 (Scenario #2: On/Off)
Let’s say this time you are presented a database table AccountActivity that holds a log of events that can occur on a yoga subscription site. The yoga site offers certain “premium trial periods” where customers can get a discounted membership rate for some period when they first join. The trials starting and ending date are tracked in this table with the Premium_Start and PremiumEnd event types.
假设这次是为您提供一个数据库表AccountActivity,该表包含一个瑜伽预订网站上可能发生的事件的日志。 瑜伽网站提供某些“高级试用期”,在此期间,客户在首次加入时可以享受一定的折扣会员价。 在此表中,使用Premium_Start和PremiumEnd事件类型跟踪审判的开始和结束日期。
Suppose that some employees on the business side at this yoga subscription company are asking 1. how many people have the premium trial period currently active, and 2. how many used to have the premium trial period active but now they don’t.
假设这家瑜伽订阅公司的业务方面的一些员工在问:1.有多少人当前处于保费试用期,并且2.有多少人以前有保费试用期,但现在却没有。
Again, we’ve got the event for the premium period being started and the premium period being ended in the same database table (along with the other account activity as well).
再次,我们在同一数据库表中(同时还有其他帐户活动)开始了保费期开始和保费期结束的事件。
分析请求A:高级试用期内的帐户 (Analysis Request A: Accounts in Premium Trial Period)
To answer the first question, we need to find events where a premium membership was started but has not been ended yet. So, we need to join the AccountActivity table to itself to look for premium start and premium end event matches. But, we can’t use an inner join this time. We need the null rows in the end table… so left join it is.
要回答第一个问题,我们需要找到开始高级会员资格但尚未结束的活动。 因此,我们需要将AccountActivity表自身连接起来,以查找高级开始事件和高级结束事件匹配项。 但是,这次我们不能使用内部联接。 我们需要终端表中的空行…因此需要左连接。
SELECT
t_start.UserId,
t_start.EventDateTime AS PremiumTrialStart,
DATEDIFF(day, t_start.EventDateTime, GETDATE()) AS DaysInTrial
FROM
AccountActivity t_start
LEFT JOIN AccountActivity t_end ON t_start.UserId = t_end.UserId
AND t_start.EventType = 'Premium_Start'
AND t_end.EventType = 'Premium_End'
WHERE
t_end.EventDateTime IS NULL
Notice how we also check and make sure that the events we are joining are in the right order. We want the premium trial start on the left side of the join, and the premium trial end on the right side of the join. We also make sure that the User Id matches on both sides. We wouldn’t want to join events from two different customers!
请注意,我们还如何检查并确保我们加入的事件的顺序正确。 我们希望高级试用版在联接的左侧开始,而高级试用版在联接的右侧结束。 我们还确保用户ID在两侧都匹配。 我们不想参加来自两个不同客户的活动!
分析请求B:曾经处于高级试用期的帐户 (Analysis Request B: Accounts Who Used to Be in Premium Trial Period)
Regarding the second question, we want to find the customers whose sweet premium trial has come to an end. We are going to need to self join AccountActivity again, but this time we can switch it up to be a little stricter. We want matches from both the left and right, since, in this population, the trial has ended. So, we can choose an inner join this time.
关于第二个问题,我们想找到甜蜜溢价试用期已结束的客户。 我们将需要再次自行加入AccountActivity,但是这次我们可以将其更改为更严格一些。 我们需要左右匹配,因为在此人群中,审判已经结束。 因此,这次我们可以选择一个内部联接。
SELECT
t_start.UserId,
t_start.EventDateTime AS PremiumTrialStart,
DATEDIFF(day, t_start.EventDateTime, t_end.EventDateTime)
AS DaysInTrial
FROM
AccountActivity t_start
INNER JOIN AccountActivity t_end ON t_start.UserId = t_end.UserId
AND t_start.EventType = 'Premium_Start'
AND t_end.EventType = 'Premium_End'
See, self joins are pretty fun. They can be pretty useful in cases where you have events that are related to each other in the same database table. Thanks for reading, and happy querying. 🙂
看到,自我加入很有趣。 在同一数据库表中具有彼此相关的事件的情况下,它们非常有用。 感谢您的阅读和查询。 🙂
Originally published at https://datadreamer.io on August 13, 2020.
最初于 2020年8月13日 发布在 https://datadreamer.io 。
翻译自: https://towardsdatascience.com/take-your-sql-skills-to-the-next-level-by-understanding-the-self-join-75f1d52f2322
sql 左联接 全联接
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/388107.shtml
如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!