与SQL的比较#
由于许多潜在的Pandas用户对 SQL ,本页面旨在提供一些示例,说明如何使用Pandas执行各种SQL操作。
如果你是Pandas的新手,你可能想先通读一下 10 Minutes to pandas 让你自己熟悉类库。
按照惯例,我们进口Pandas和NumPy如下:
In [1]: import pandas as pd
In [2]: import numpy as np
大多数示例都将利用 tips
在Pandas测试中发现的数据集。我们将把数据读入一个名为 tips
假设我们有一个名称和结构相同的数据库表。
In [3]: url = (
...: "https://raw.github.com/pandas-dev"
...: "/pandas/main/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [4]: tips = pd.read_csv(url)
In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[244 rows x 7 columns]
复印件与就地操作#
大多数大Pandas的操作都会返回 Series
/DataFrame
。要使更改“生效”,您需要为一个新变量赋值:
sorted_df = df.sort_values("col1")
或覆盖原始文件:
df = df.sort_values("col1")
备注
您将看到一个 inplace=True
某些方法可用的关键字参数:
df.sort_values("col1", inplace=True)
它的使用是不鼓励的。 More information.
SELECT#
在SQL中,选择是使用您想要选择的列的逗号分隔列表(或 *
要选择所有列):
SELECT total_bill, tip, smoker, time
FROM tips;
对于PANDA,列选择是通过将列名列表传递给DataFrame来完成的:
In [6]: tips[["total_bill", "tip", "smoker", "time"]]
Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
.. ... ... ... ...
239 29.03 5.92 No Dinner
240 27.18 2.00 Yes Dinner
241 22.67 2.00 Yes Dinner
242 17.82 1.75 No Dinner
243 18.78 3.00 No Dinner
[244 rows x 4 columns]
调用不带列名列表的DataFrame将显示所有列(类似于SQL的 *
)。
在SQL中,您可以添加计算列:
SELECT *, tip/total_bill as tip_rate
FROM tips;
对于Pandas,您可以使用 DataFrame.assign()
用于追加新列的DataFrame方法:
In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
Out[7]:
total_bill tip sex smoker day time size tip_rate
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
.. ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744
[244 rows x 8 columns]
WHERE#
SQL中的过滤是通过WHERE子句完成的。
SELECT *
FROM tips
WHERE time = 'Dinner';
可以通过多种方式过滤DataFrame;其中最直观的是使用 boolean indexing 。
In [8]: tips[tips["total_bill"] > 10]
Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[227 rows x 7 columns]
上面的语句只是将一个 Series
of True
/False
对象绑定到DataFrame,并使用 True
。
In [9]: is_dinner = tips["time"] == "Dinner"
In [10]: is_dinner
Out[10]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool
In [11]: is_dinner.value_counts()
Out[11]:
True 176
False 68
Name: time, dtype: int64
In [12]: tips[is_dinner]
Out[12]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[176 rows x 7 columns]
Just like SQL's OR
and AND
, multiple conditions can be passed to a DataFrame using |
(OR
) and &
(AND
).
晚餐时5美元以上的小费:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
Out[13]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
至少5名用餐者或账单总额超过45美元的派对小费:
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[14]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
In [15]: frame = pd.DataFrame(
....: {"col1": ["A", "B", np.NaN, "C", "D"], "col2": ["F", np.NaN, "G", "H", "I"]}
....: )
....:
In [16]: frame
Out[16]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
假设我们有一个与上面的DataFrame具有相同结构的表。我们只能看到记录中 col2
使用以下查询时为空:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]:
col1 col2
1 B NaN
获取项目的位置 col1
IS NOT NULL可以用 notna()
。
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
分组依据#
在Pandas身上,SQL的 GROUP BY
操作是使用名称类似的 groupby()
方法。 groupby()
通常是指我们想要将数据集分割成组,应用一些函数(通常是聚合),然后将组组合在一起的过程。
一个常见的SQL操作是获取整个数据集中每个组中的记录计数。例如,一个按性别统计小费数量的查询:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
大Pandas的等价物是:
In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female 87
Male 157
dtype: int64
请注意,在我们使用的Pandas代码中 size()
而不是 count()
。这是因为 count()
将该函数应用于每一列,返回 NOT NULL
每条中的记录。
In [20]: tips.groupby("sex").count()
Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
或者,我们也可以应用 count()
方法绑定到单个列:
In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
还可以同时应用多种功能。例如,假设我们想要查看小费金额在一周中的不同天是如何变化的- agg()
允许您将字典传递给分组的DataFrame,以指示要将哪些函数应用于特定列。
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thu 2.771452 62
*/
In [22]: tips.groupby("day").agg({"tip": np.mean, "day": np.size})
Out[22]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
按多个列分组的方法是将列列表传递给 groupby()
方法。
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thu 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thu 17 3.030000
*/
In [23]: tips.groupby(["smoker", "day"]).agg({"tip": [np.size, np.mean]})
Out[23]:
tip
size mean
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
JOIN#
JOIN
s可使用以下命令执行 join()
或 merge()
。默认情况下, join()
将加入DataFrame的索引。每种方法都有允许您指定要执行的连接类型的参数 (LEFT
, RIGHT
, INNER
, FULL
)或要联接的列(列名或索引)。
警告
如果两个键列都包含键为空值的行,则将对这些行进行相互匹配。这与通常的SQL连接行为不同,可能会导致意外的结果。
In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
假设我们有两个名称和结构与DataFrames相同的数据库表。
现在让我们来看看各种类型的 JOIN
s.
内连接#
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
merge()
还提供了用于连接一个DataFrame的列和另一个DataFrame的索引的情况的参数。
In [27]: indexed_df2 = df2.set_index("key")
In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)
Out[28]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209
左向外连接#
显示来自的所有记录 df1
。
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
In [29]: pd.merge(df1, df2, on="key", how="left")
Out[29]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
右连接#
显示来自的所有记录 df2
。
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
In [30]: pd.merge(df1, df2, on="key", how="right")
Out[30]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236
完全加入#
Pandas还允许 FULL JOIN
显示数据集的两侧,而不管联接的列是否找到匹配项。在撰写本文时, FULL JOIN
并非所有RDBMS(MySQL)都支持。
显示两个表中的所有记录。
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
In [31]: pd.merge(df1, df2, on="key", how="outer")
Out[31]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
UNION#
UNION ALL
可以使用以下工具执行 concat()
。
In [32]: df1 = pd.DataFrame(
....: {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
....: )
....:
In [33]: df2 = pd.DataFrame(
....: {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
....: )
....:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In [34]: pd.concat([df1, df2])
Out[34]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
SQL的 UNION
类似于 UNION ALL
然而, UNION
将删除重复行。
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
在Pandas身上,你可以使用 concat()
与 drop_duplicates()
。
In [35]: pd.concat([df1, df2]).drop_duplicates()
Out[35]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
LIMIT#
SELECT * FROM tips
LIMIT 10;
In [36]: tips.head(10)
Out[36]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
某些SQL分析函数和聚合函数的PANDA等效项#
带偏移量的前n行#
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [37]: tips.nlargest(10 + 5, columns="tip").tail(10)
Out[37]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
每组前n行#
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [38]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
Out[38]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2
使用相同的方法 rank(method='first')
功能
In [39]: (
....: tips.assign(
....: rnk=tips.groupby(["day"])["total_bill"].rank(
....: method="first", ascending=False
....: )
....: )
....: .query("rnk < 3")
....: .sort_values(["day", "rnk"])
....: )
....:
Out[39]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
让我们找出每个性别组(提示<2)的提示(排名<3)。请注意,在使用 rank(method='min')
功能 rnk_min
保持原样不变 tip
(作为甲骨文的 RANK()
功能)
In [40]: (
....: tips[tips["tip"] < 2]
....: .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
....: .query("rnk_min < 3")
....: .sort_values(["sex", "rnk_min"])
....: )
....:
Out[40]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0
UPDATE#
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [41]: tips.loc[tips["tip"] < 2, "tip"] *= 2
DELETE#
DELETE FROM tips
WHERE tip > 9;
在Pandas中,我们选择应该保留的行,而不是删除它们:
In [42]: tips = tips.loc[tips["tip"] <= 9]