Contents
Data Wrangling: Join, Combine, and Reshape
8.1 => Hierarchical Indexing
8.2 => Combining and Merging Datasets
--------> Database-Style DataFrame Joins
--------> Merging on Index
--------> Concatenating Along an Axis
8.3 => Reshaping and Pivoting
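Concatenating Along an Axis
Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function does this with NumPy arrays: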
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: arr = np.arange(12).reshape((3, 4))
In [4]: arr
Out[4]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [5]: np.concatenate([arr, arr], axis=1)
Out[5]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
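To make the role of the axis argument concrete, here is a small sketch comparing the two directions on the same array. The variable names stacked_rows and stacked_cols are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 appends rows: the arrays are stacked vertically.
stacked_rows = np.concatenate([arr, arr], axis=0)

# axis=1 appends columns: the arrays are placed side by side.
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```

The shapes must agree on every axis except the one being concatenated along.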
Suppose we have three Series with no index overlap:
In [6]: s1 = pd.Series([0, 1], index=['a', 'b'])
In [7]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
In [8]: s3 = pd.Series([5, 6], index=['f', 'g'])
Calling concat with these objects in a list glues together the values and indexes:
In [9]: pd.concat([s1, s2, s3])
Out[9]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
By default, concat works along axis=0 and produces another Series. If you pass axis=1, the result is a DataFrame instead (axis=1 is the columns):
In [10]: pd.concat([s1, s2, s3], axis=1)
Out[10]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
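The output above shows two side effects worth checking in code: the row index becomes the union of the input indexes, and the integer values turn into floats because NaN has to fill the gaps. A minimal sketch, with wide as an illustrative variable name:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

wide = pd.concat([s1, s2, s3], axis=1)

# One row per distinct label, one column per input Series.
print(wide.shape)        # (7, 3)
print(list(wide.index))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# Missing positions are filled with NaN, so every column is float64.
print(wide.dtypes)
```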
Passing join='inner' yields the intersection of the indexes instead of the default union:
In [11]: s4 = pd.concat([s1, s3])
In [12]: s4
Out[12]:
a    0
b    1
f    5
g    6
dtype: int64
In [13]: pd.concat([s1, s4], axis=1)
Out[13]:
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6
In [14]: pd.concat([s1, s4], axis=1, join='inner')
Out[14]:
   0  1
a  0  0
b  1  1
Because join='inner' was passed, the rows labeled f and g have disappeared.
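The difference between the two join modes can also be checked programmatically. The sketch below rebuilds s1 and s4 and compares the resulting indexes; outer and inner are illustrative names.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

outer = pd.concat([s1, s4], axis=1)                # default: join='outer'
inner = pd.concat([s1, s4], axis=1, join='inner')

print(list(outer.index))  # ['a', 'b', 'f', 'g']
print(list(inner.index))  # ['a', 'b']

# Only the outer join needs NaN padding (for f and g in the first column).
print(int(outer.isna().sum().sum()))  # 2
print(int(inner.isna().sum().sum()))  # 0
```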
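Older pandas versions let you specify the labels to use on the other axis with join_axes. That argument was deprecated in pandas 0.25 and removed in 1.0, so on current versions reindex the result instead:
In [16]: pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
Out[16]:
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  1.0
e  NaN  NaN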
One potential issue: if we concatenate s1, s1, and s3, the repeated index labels make it impossible to tell which input each row came from.
In [35]: pd.concat([s1,s1,s3])
Out[35]:
a    0
b    1
a    0
b    1
f    5
g    6
dtype: int64
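If duplicate labels should be treated as an error rather than silently accepted, concat has a verify_integrity option that raises on overlap. The sketch below is one way to detect the situation; the combined and raised names are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

combined = pd.concat([s1, s1, s3])

# The index now contains repeated labels, so a lookup returns several rows.
print(combined.index.is_unique)  # False
print(len(combined.loc['a']))    # 2

# verify_integrity=True makes concat raise ValueError on overlapping labels.
raised = False
try:
    pd.concat([s1, s1, s3], verify_integrity=True)
except ValueError as exc:
    raised = True
    print("concat rejected the duplicate labels:", exc)
```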
The fix is to create a hierarchical index on the concatenation axis, which is what the keys argument does:
In [17]: result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
In [18]: result
Out[18]:
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64
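Once the outer level identifies the source, each piece can be pulled back out, or the whole result can be pivoted into a table. A short sketch, assuming the same s1 and s3 as above; first, value, and table are illustrative names.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

# The index has two levels: the key, then the original label.
print(result.index.nlevels)  # 2

# Select everything that came from the first input.
first = result.loc['one']
print(list(first.index))  # ['a', 'b']

# Select a single value with a (key, label) tuple.
value = result.loc[('two', 'b')]
print(value)  # 1

# unstack() moves the inner level into columns, one row per key.
table = result.unstack()
print(table.shape)  # (3, 4)
```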
When combining Series along axis=1, the keys become the DataFrame column headers:
In [20]: pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
Out[20]:
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
The same logic extends to DataFrame objects:
In [21]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
In [22]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])
In [23]: df1
Out[23]:
   one  two
a    0    1
b    2    3
c    4    5
In [24]: df2
Out[24]:
   three  four
a      5     6
c      7     8
In [25]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[25]:
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
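With hierarchical columns, the outer key selects a whole input frame and a (key, column) tuple selects a single column. A minimal sketch using the same df1 and df2; left and cell are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

result = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

# Selecting the outer key returns the columns that came from df1.
left = result['level1']
print(list(left.columns))  # ['one', 'two']

# A (key, column) tuple addresses one cell; df2 has no row 'b', so it is NaN there.
cell = result.loc['c', ('level2', 'four')]
print(cell)  # 8.0
print(pd.isna(result.loc['b', ('level2', 'three')]))  # True
```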
If you pass a dict of objects instead of a list, the dict's keys are used for the keys option:
In [28]: pd.concat({'level1':df1, 'level2':df2}, axis=1)
Out[28]:
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
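As a quick sanity check, the dict form and the list-plus-keys form can be compared directly. The sketch below assumes the same df1 and df2 as above; by_dict and by_keys are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

by_dict = pd.concat({'level1': df1, 'level2': df2}, axis=1)
by_keys = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

# Both spellings build the same two-level column index and the same values.
print(by_dict.columns.equals(by_keys.columns))  # True
print(by_dict.equals(by_keys))                  # True
```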
We can also name the axis levels that are created, using the names argument:
In [29]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])
Out[29]:
upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
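Named levels are mostly useful because later operations can refer to a level by name instead of by position. A small sketch, again with df1 and df2 as above; named and sub are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

named = pd.concat([df1, df2], axis=1,
                  keys=['level1', 'level2'], names=['upper', 'lower'])

print(list(named.columns.names))  # ['upper', 'lower']

# Levels can now be addressed by name.
print(list(named.columns.get_level_values('lower')))
# ['one', 'two', 'three', 'four']

# xs() takes a cross-section: all columns under 'level2' in the 'upper' level.
sub = named.xs('level2', axis=1, level='upper')
print(list(sub.columns))  # ['three', 'four']
```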
A last consideration concerns DataFrames whose row index does not contain any relevant data:
In [30]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
In [31]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
In [32]: df1
Out[32]:
          a         b         c         d
0 -0.402664 -0.861283  0.419665 -0.010507
1  0.244767 -1.023330 -0.509473  0.773439
2 -0.073538  0.966644  0.583360  0.593361
In [33]: df2
Out[33]:
          b         d         a
0 -0.353377 -1.607486  0.207777
1  0.858092  0.795677  0.639189
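By default, the row labels of each input are carried over, so the labels 0 and 1 repeat in the result. Passing ignore_index=True discards the indexes along the concatenation axis and produces a new default index instead: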
In [36]: pd.concat([df1, df2])
Out[36]:
          a         b         c         d
0 -0.402664 -0.861283  0.419665 -0.010507
1  0.244767 -1.023330 -0.509473  0.773439
2 -0.073538  0.966644  0.583360  0.593361
0  0.207777 -0.353377       NaN -1.607486
1  0.639189  0.858092       NaN  0.795677
In [37]: pd.concat([df1, df2], ignore_index=True)
Out[37]:
          a         b         c         d
0 -0.402664 -0.861283  0.419665 -0.010507
1  0.244767 -1.023330 -0.509473  0.773439
2 -0.073538  0.966644  0.583360  0.593361
3  0.207777 -0.353377       NaN -1.607486
4  0.639189  0.858092       NaN  0.795677
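The sketch below repeats the comparison with a seeded random generator so it can be rerun with the same numbers. The seed and the names kept and fresh are assumptions made for illustration, and the values will differ from the ones printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df1 = pd.DataFrame(rng.standard_normal((3, 4)), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(rng.standard_normal((2, 3)), columns=['b', 'd', 'a'])

kept = pd.concat([df1, df2])                       # row labels carried over
fresh = pd.concat([df1, df2], ignore_index=True)   # row labels regenerated

print(list(kept.index))   # [0, 1, 2, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3, 4]

# Columns are aligned by name, not position; df2 has no 'c', so it gets NaN.
print(list(fresh.columns))           # ['a', 'b', 'c', 'd']
print(int(fresh['c'].isna().sum()))  # 2
```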