题目列表：

数据统计:
- 2082. The Number of Rich Customers
- 1173. Immediate Food Delivery I
- 1907. Count Salary Categories
数据分组
- 1741. Find Total Time Spent by Each Employee
- 511. Game Play Analysis I
- 2356. Number of Unique Subjects Taught by Each Teacher
- 596. Classes More Than 5 Students
- 586. Customer Placing the Largest Number of Orders
- 1484. Group Sold Products By The Date（好题）

数据统计:

2082. The Number of Rich Customers

原题链接：2082. The Number of Rich Customers

Table: Store

+-------------+------+
| Column Name | Type |
+-------------+------+
| bill_id     | int  |
| customer_id | int  |
| amount      | int  |
+-------------+------+
bill_id is the primary key (column with unique values) for this table.
Each row contains information about the amount of one bill and the customer associated with it.

Write a solution to report the number of customers who had at least one bill with an amount strictly greater than 500 .

The result format is in the following example.

Example 1:

Input:

Store table:
+---------+-------------+--------+
| bill_id | customer_id | amount |
+---------+-------------+--------+
| 6       | 1           | 549    |
| 8       | 1           | 834    |
| 4       | 2           | 394    |
| 11      | 3           | 657    |
| 13      | 3           | 257    |
+---------+-------------+--------+

Output:

+------------+
| rich_count |
+------------+
| 2          |
+------------+

Explanation:
Customer 1 has two bills with amounts strictly greater than 500.
Customer 2 does not have any bills with an amount strictly greater than 500.
Customer 3 has one bill with an amount strictly greater than 500.

题目大意：
统计一下amount大于500的customer的数目，一个customer会有多张单据，需要去重后统计

pandas思路：
nunique() 方法能直接返回去重后的个数

pandas实现：

import pandas as pddef count_rich_customers(store: pd.DataFrame) -> pd.DataFrame:store = store[store['amount'] > 500]return pd.DataFrame({'rich_count': [store['customer_id'].nunique()]})

MySQL思路：
用一下 count 和 distinct

MySQL实现：

# Write your MySQL query statement below
SELECTCOUNT(DISTINCT customer_id) AS rich_count
FROM Store
WHERE amount > 500

1173. Immediate Food Delivery I

原题链接：1173. Immediate Food Delivery I

Table: Delivery

+-----------------------------+---------+
| Column Name                 | Type    |
+-----------------------------+---------+
| delivery_id                 | int     |
| customer_id                 | int     |
| order_date                  | date    |
| customer_pref_delivery_date | date    |
+-----------------------------+---------+
delivery_id is the primary key (column with unique values) of this table.
The table holds information about food delivery to customers that make orders at some date and specify a preferred delivery date (on the same order date or after it).

If the customer’s preferred delivery date is the same as the order date, then the order is called immediate; otherwise, it is called scheduled.

Write a solution to find the percentage of immediate orders in the table, rounded to 2 decimal places.

The result format is in the following example.

Example 1:

Input:

Delivery table:
+-------------+-------------+------------+-----------------------------+
| delivery_id | customer_id | order_date | customer_pref_delivery_date |
+-------------+-------------+------------+-----------------------------+
| 1           | 1           | 2019-08-01 | 2019-08-02                  |
| 2           | 5           | 2019-08-02 | 2019-08-02                  |
| 3           | 1           | 2019-08-11 | 2019-08-11                  |
| 4           | 3           | 2019-08-24 | 2019-08-26                  |
| 5           | 4           | 2019-08-21 | 2019-08-22                  |
| 6           | 2           | 2019-08-11 | 2019-08-13                  |
+-------------+-------------+------------+-----------------------------+

Output:

+----------------------+
| immediate_percentage |
+----------------------+
| 33.33                |
+----------------------+

Explanation: The orders with delivery id 2 and 3 are immediate while the others are scheduled.

题目大意：
计算一个两个日期相等的条目在整表中所占的比例，保留两位小数

pandas思路：
保留两位小数用 round()

pandas实现：

import pandas as pddef food_delivery(delivery: pd.DataFrame) -> pd.DataFrame:tmp = delivery[delivery['order_date'] == delivery['customer_pref_delivery_date']]return pd.DataFrame({'immediate_percentage': [round((tmp.shape[0] / delivery.shape[0] * 100) , 2)]})

MySQL思路：
用 avg() 计算占比，用round() 保留两位小数

MySQL实现：

SELECTround( 100 * avg( order_date = customer_pref_delivery_date ), 2 ) AS immediate_percentage 
FROMdelivery

1907. Count Salary Categories

原题链接：1907. Count Salary Categories

Table: Accounts

+-------------+------+
| Column Name | Type |
+-------------+------+
| account_id  | int  |
| income      | int  |
+-------------+------+
account_id is the primary key (column with unique values) for this table.
Each row contains information about the monthly income for one bank account.

Write a solution to calculate the number of bank accounts for each salary category. The salary categories are:

“Low Salary”: All the salaries strictly less than $20000.
“Average Salary”: All the salaries in the inclusive range [$20000, $50000].
“High Salary”: All the salaries strictly greater than $50000.
The result table must contain all three categories. If there are no accounts in a category, return 0.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Accounts table:
+------------+--------+
| account_id | income |
+------------+--------+
| 3          | 108939 |
| 2          | 12747  |
| 8          | 87709  |
| 6          | 91796  |

±-----------±-------+
Output:

+----------------+----------------+
| category       | accounts_count |
+----------------+----------------+
| Low Salary     | 1              |
| Average Salary | 0              |
| High Salary    | 3              |
+----------------+----------------+

Explanation:
Low Salary: Account 2.
Average Salary: No accounts.
High Salary: Accounts 3, 6, and 8.

题目大意：
按照规则统计低收入中等收入高收入三类人的人数

pandas思路：
统计数目比较容易，按行筛选后的行数即可。怎么组成题目要求返回的形式是本题的重点，用到 pd.DataFrame() 函数

pandas实现：

import pandas as pddef count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:low_salary = accounts[accounts['income'] < 20000].shape[0]average_salary = accounts[(accounts['income'] >= 20000)&(accounts['income'] <= 50000)].shape[0]high_salary = accounts[accounts['income'] > 50000].shape[0]ans = pd.DataFrame({'category': ['Low Salary', 'Average Salary', 'High Salary'],'accounts_count': [low_salary, average_salary, high_salary]})return ans

MySQL思路：
case wen来按条件筛选，三个sql的结果union一下

MySQL实现：

SELECT 'Low Salary' AS category,SUM(CASE WHEN income < 20000 THEN 1 ELSE 0 END) AS accounts_count
FROM AccountsUNION
SELECT  'Average Salary' category,SUM(CASE WHEN income >= 20000 AND income <= 50000 THEN 1 ELSE 0 END) AS accounts_count
FROM AccountsUNION
SELECT 'High Salary' category,SUM(CASE WHEN income > 50000 THEN 1 ELSE 0 END) AS accounts_count
FROM Accounts

数据分组

1741. Find Total Time Spent by Each Employee

原题链接：1741. Find Total Time Spent by Each Employee

Table: Employees

+-------------+------+
| Column Name | Type |
+-------------+------+
| emp_id      | int  |
| event_day   | date |
| in_time     | int  |
| out_time    | int  |
+-------------+------+
(emp_id, event_day, in_time) is the primary key (combinations of columns with unique values) of this table.
The table shows the employees' entries and exits in an office.
event_day is the day at which this event happened, in_time is the minute at which the employee entered the office, and out_time is the minute at which they left the office.
in_time and out_time are between 1 and 1440.
It is guaranteed that no two events on the same day intersect in time, and in_time < out_time.

Write a solution to calculate the total time in minutes spent by each employee on each day at the office. Note that within one day, an employee can enter and leave more than once. The time spent in the office for a single entry is out_time - in_time.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Employees table:
+--------+------------+---------+----------+
| emp_id | event_day  | in_time | out_time |
+--------+------------+---------+----------+
| 1      | 2020-11-28 | 4       | 32       |
| 1      | 2020-11-28 | 55      | 200      |
| 1      | 2020-12-03 | 1       | 42       |
| 2      | 2020-11-28 | 3       | 33       |
| 2      | 2020-12-09 | 47      | 74       |
+--------+------------+---------+----------+

Output:

+------------+--------+------------+
| day        | emp_id | total_time |
+------------+--------+------------+
| 2020-11-28 | 1      | 173        |
| 2020-11-28 | 2      | 30         |
| 2020-12-03 | 1      | 41         |
| 2020-12-09 | 2      | 27         |
+------------+--------+------------+

Explanation:
Employee 1 has three events: two on day 2020-11-28 with a total of (32 - 4) + (200 - 55) = 173, and one on day 2020-12-03 with a total of (42 - 1) = 41.
Employee 2 has two events: one on day 2020-11-28 with a total of (33 - 3) = 30, and one on day 2020-12-09 with a total of (74 - 47) = 27.

题目大意：
给出每个员工的进场和离场表，要求统计每个员工在场的总时长

pandas思路：
一个时间段的值作差就可得到，每个员工的总时长可以通过groupby之后sum一下得到

pandas实现：

import pandas as pddef total_time(employees: pd.DataFrame) -> pd.DataFrame:employees['total_time'] = employees['out_time'] - employees['in_time']ans = employees.groupby(by=['event_day', 'emp_id'], as_index=False).agg('sum')ans.rename(columns={'event_day':'day'}, inplace=True)ans = ans[['day', 'emp_id', 'total_time']]return ans

MySQL思路：
groupby一下，查询sum就可以了

MySQL实现：

select event_day as day, emp_id, sum(out_time - in_time) as total_time
from Employees
group by event_day, emp_id

511. Game Play Analysis I

原题链接：511. Game Play Analysis I

Table: Activity

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |
+--------------+---------+
(player_id, event_date) is the primary key (combination of columns with unique values) of this table.
This table shows the activity of players of some games.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.

Write a solution to find the first login date for each player.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Activity table:
+-----------+-----------+------------+--------------+
| player_id | device_id | event_date | games_played |
+-----------+-----------+------------+--------------+
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-05-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |
+-----------+-----------+------------+--------------+

Output:

+-----------+-------------+
| player_id | first_login |
+-----------+-------------+
| 1         | 2016-03-01  |
| 2         | 2017-06-25  |
| 3         | 2016-03-02  |
+-----------+-------------+

题目大意：
输出员工的最早打卡时间

pandas思路：
排序+去重保留第一条就可以实现题目要求

pandas实现：

import pandas as pddef game_analysis(activity: pd.DataFrame) -> pd.DataFrame:activity.sort_values(by='event_date', inplace=True) # 按照登录时间排序activity.drop_duplicates(subset='player_id', keep='first', inplace=True) # 去重activity.rename(columns={'event_date':'first_login'}, inplace=True)return activity[['player_id', 'first_login']]

2356. Number of Unique Subjects Taught by Each Teacher

原题链接：2356. Number of Unique Subjects Taught by Each Teacher

Table: Teacher

+-------------+------+
| Column Name | Type |
+-------------+------+
| teacher_id  | int  |
| subject_id  | int  |
| dept_id     | int  |
+-------------+------+
(subject_id, dept_id) is the primary key (combinations of columns with unique values) of this table.
Each row in this table indicates that the teacher with teacher_id teaches the subject subject_id in the department dept_id.

Write a solution to calculate the number of unique subjects each teacher teaches in the university.

Return the result table in any order.

The result format is shown in the following example.

Example 1:

Input:

Teacher table:
+------------+------------+---------+
| teacher_id | subject_id | dept_id |
+------------+------------+---------+
| 1          | 2          | 3       |
| 1          | 2          | 4       |
| 1          | 3          | 3       |
| 2          | 1          | 1       |
| 2          | 2          | 1       |
| 2          | 3          | 1       |
| 2          | 4          | 1       |
+------------+------------+---------+

Output:

+------------+-----+
| teacher_id | cnt |
+------------+-----+
| 1          | 2   |
| 2          | 4   |
+------------+-----+

Explanation:
Teacher 1:

They teach subject 2 in departments 3 and 4.
They teach subject 3 in department 3.

Teacher 2:

They teach subject 1 in department 1.
They teach subject 2 in department 1.
They teach subject 3 in department 1.
They teach subject 4 in department 1.

题目大意：
统计一下每个老师所交的课程的数目，同一个课程不同教室只算一门

pandas思路：
用groupby按教师id进行分组，nunique() 用于统计去重后的科目数

pandas实现：

import pandas as pddef count_unique_subjects(teacher: pd.DataFrame) -> pd.DataFrame:ans = teacher.groupby(by='teacher_id')['subject_id'].nunique().reset_index()ans.rename(columns={'subject_id':'cnt'}, inplace=True)return ans

MySQL思路：
group by，然后用count() 计数

MySQL实现：

select teacher_id, count(distinct subject_id) as cnt
from Teacher
group by teacher_id

596. Classes More Than 5 Students

原题链接：596. Classes More Than 5 Students

Table: Courses

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| student     | varchar |
| class       | varchar |
+-------------+---------+
(student, class) is the primary key (combination of columns with unique values) for this table.
Each row of this table indicates the name of a student and the class in which they are enrolled.

Write a solution to find all the classes that have at least five students.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Courses table:
+---------+----------+
| student | class    |
+---------+----------+
| A       | Math     |
| B       | English  |
| C       | Math     |
| D       | Biology  |
| E       | Math     |
| F       | Computer |
| G       | Math     |
| H       | Math     |
| I       | Math     |
+---------+----------+

Output:

+---------+
| class   |
+---------+
| Math    |
+---------+

Explanation:

Math has 6 students, so we include it.
English has 1 student, so we do not include it.
Biology has 1 student, so we do not include it.
Computer has 1 student, so we do not include it.

题目大意：
返回学生数大于等于5的课程名

pandas思路：
使用 groupby() 按班级分组， size() 计算每个班级的出现次数（也就是学生数），然后进行行筛选即可

pandas实现：

import pandas as pddef find_classes(courses: pd.DataFrame) -> pd.DataFrame:df = courses.groupby('class').size().reset_index(name='count') # size()计算每个值出现的次数df = df[df['count'] >= 5]return df[['class']]

MySQL思路：
where关键字无法和聚合函数一起使用，having子句可以筛选分组后的各组数据

MySQL实现：

select class
from courses
group by class
having count(class) >= 5

586. Customer Placing the Largest Number of Orders

原题链接：586. Customer Placing the Largest Number of Orders

Table: Orders

+-----------------+----------+
| Column Name     | Type     |
+-----------------+----------+
| order_number    | int      |
| customer_number | int      |
+-----------------+----------+
order_number is the primary key (column with unique values) for this table.
This table contains information about the order ID and the customer ID.

Write a solution to find the customer_number for the customer who has placed the largest number of orders.

The test cases are generated so that exactly one customer will have placed more orders than any other customer.

The result format is in the following example.

Example 1:

Input:

Orders table:
+--------------+-----------------+
| order_number | customer_number |
+--------------+-----------------+
| 1            | 1               |
| 2            | 2               |
| 3            | 3               |
| 4            | 3               |
+--------------+-----------------+

Output:

+-----------------+
| customer_number |
+-----------------+
| 3               |
+-----------------+

Explanation:
The customer with number 3 has two orders, which is greater than either customer 1 or 2 because each of them only has one order.
So the result is customer_number 3.

Follow up:
What if more than one customer has the largest number of orders, can you find all the customer_number in this case?

题目大意：
返回点单数最多的顾客的编号

pandas思路：
通过 groupby() 进行分组，通过 size() 来统计出现次数，对出现次数进行降序排列，最后输出第一行即可

pandas实现：

import pandas as pddef largest_orders(orders: pd.DataFrame) -> pd.DataFrame:df = orders.groupby('customer_number').size().reset_index(name='count') # 统计次数df.sort_values(by='count', ascending=False, inplace=True) # 根据次数降序排列return df[['customer_number']].head(1)

MySQL思路：
一样也是 group by 聚合，然后 order by 排序，用 limit 1 来返回第一条

MySQL实现：

select customer_number
from orders
group by customer_number
order by count(*) desc
limit 1

1484. Group Sold Products By The Date（好题）

原题链接：1484. Group Sold Products By The Date

Table Activities :

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| sell_date   | date    |
| product     | varchar |
+-------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
Each row of this table contains the product name and the date it was sold in a market.

Write a solution to find for each date the number of different products sold and their names.

The sold products names for each date should be sorted lexicographically（字典序）.

Return the result table ordered by sell_date .

The result format is in the following example.

Example 1:

Input:

Activities table:
+------------+------------+
| sell_date  | product     |
+------------+------------+
| 2020-05-30 | Headphone  |
| 2020-06-01 | Pencil     |
| 2020-06-02 | Mask       |
| 2020-05-30 | Basketball |
| 2020-06-01 | Bible      |
| 2020-06-02 | Mask       |
| 2020-05-30 | T-Shirt    |
+------------+------------+

Output:

+------------+----------+------------------------------+
| sell_date  | num_sold | products                     |
+------------+----------+------------------------------+
| 2020-05-30 | 3        | Basketball,Headphone,T-shirt |
| 2020-06-01 | 2        | Bible,Pencil                 |
| 2020-06-02 | 1        | Mask                         |
+------------+----------+------------------------------+

Explanation:
For 2020-05-30, Sold items were (Headphone, Basketball, T-shirt), we sort them lexicographically and separate them by a comma.
For 2020-06-01, Sold items were (Pencil, Bible), we sort them lexicographically and separate them by a comma.
For 2020-06-02, the Sold item is (Mask), we just return it.

题目大意：
按照日期将多行合并为一行，并统计数目

pandas思路：
使用 groupby() 进行分组，然后 agg() 进行聚合操作，重点在于将多行合并到一行，这个时不时会遇到，我觉得一定要掌握！！！

pandas实现：

import pandas as pddef categorize_products(activities: pd.DataFrame) -> pd.DataFrame:groups = activities.groupby('sell_date') # groupby的结果是一个DataFrameGroupBy对象ans = groups.agg(num_sold = ('product', 'nunique'),products = ('product', lambda x : ','.join(sorted(set(x)))) # 去重后按字典序排列, 用逗号分隔组成字符串).reset_index()ans.sort_values('sell_date', inplace=True) # 按照日期排序return ans

MySQL思路：
重点也是多行合并到一行

MySQL实现：

selectsell_date,count(distinct product) as num_sold,group_concat(distinct product order by product separator ',') as products -- 重点
fromactivities
group bysell_date
order bysell_date asc