Python Data Analysis Notes #8.2.3: Concatenating Along an Axis

Yuan's Study Notes, 2022-02-04

Contents

Data Wrangling: Join, Combine, and Reshape

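8.1 => Hierarchical Indexing
8.2 => Combining and Merging Datasets

--------> Database-Style DataFrame Joins

--------> Merging on Index

--------> Concatenating Along an Axis

8.3 => Reshaping and Pivoting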

Concatenating Along an Axis

Another kind of data-combination operation is known as concatenation, binding, or stacking. NumPy's concatenate function does this with NumPy arrays:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: arr = np.arange(12).reshape((3, 4))

In [4]: arr
Out[4]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [5]: np.concatenate([arr, arr], axis=1)
Out[5]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
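
To make the role of the axis argument concrete, here is a minimal sketch that concatenates the same array along both axes and compares the resulting shapes. The variable names stacked_rows and stacked_cols are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 stacks the arrays vertically: rows are appended, column counts must match.
stacked_rows = np.concatenate([arr, arr], axis=0)

# axis=1 places them side by side: columns are appended, row counts must match.
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```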

Suppose we have three Series with no index overlap:

In [6]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [7]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [8]: s3 = pd.Series([5, 6], index=['f', 'g'])

Calling concat with these objects in a list glues together the values and indexes:

In [9]: pd.concat([s1, s2, s3])
Out[9]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64   

By default, concat works along axis=0 and produces another Series. If you pass axis=1, the result is a DataFrame instead (axis=1 is the columns):

In [10]: pd.concat([s1, s2, s3], axis=1)
Out[10]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
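
The difference between the two axes can also be checked programmatically. The sketch below rebuilds the three Series and inspects the type, shape, and dtypes of each result; the names along_rows and along_cols are only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

along_rows = pd.concat([s1, s2, s3])          # axis=0 is the default
along_cols = pd.concat([s1, s2, s3], axis=1)  # one column per input Series

print(type(along_rows).__name__, along_rows.shape)  # Series (7,)
print(type(along_cols).__name__, along_cols.shape)  # DataFrame (7, 3)

# Each column only has values at its own labels; the gaps are NaN,
# so the integer inputs end up stored as floats.
print(along_cols.dtypes.tolist())
```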

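In the result above, the other axis is the union of the input indexes (the default, join='outer'). Passing join='inner' takes their intersection instead: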

In [11]: s4 = pd.concat([s1, s3])

In [12]: s4
Out[12]:
a    0
b    1
f    5
g    6
dtype: int64

In [13]: pd.concat([s1, s4], axis=1)
Out[13]:
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6

In [14]: pd.concat([s1,s4], axis=1, join='inner')
Out[14]:
   0  1
a  0  0
b  1  1

Because we passed join='inner', the rows labeled f and g have disappeared: those labels do not occur in s1.
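
As a quick check on the outer/inner behavior, the following sketch compares the row labels of both results with a plain index intersection. The names outer and inner are only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

outer = pd.concat([s1, s4], axis=1)                # default join='outer'
inner = pd.concat([s1, s4], axis=1, join='inner')

print(list(outer.index))  # ['a', 'b', 'f', 'g']
print(list(inner.index))  # ['a', 'b']

# The inner result keeps exactly the labels shared by both inputs.
print(list(s1.index.intersection(s4.index)))  # ['a', 'b']
```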

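Older pandas versions let us specify the index to use through join_axes. That argument was deprecated in pandas 0.25 and removed in 1.0, so in current versions we reindex the concatenated result instead:

In [16]: pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
Out[16]:
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  1.0
e  NaN  NaN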

One potential problem: if we concatenate s1, s1, and s3, the duplicated index labels make it impossible to tell which piece each value came from.

In [35]: pd.concat([s1,s1,s3])
Out[35]:
a    0
b    1
a    0
b    1
f    5
g    6
dtype: int64
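
The duplicate labels can be detected after the fact, or rejected up front. The sketch below uses Index.is_unique and concat's verify_integrity option; the variable names combined and raised are only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

combined = pd.concat([s1, s1, s3])
print(combined.index.is_unique)    # False
print(combined.loc['a'].tolist())  # [0, 0]: both copies come back

# verify_integrity=True makes concat raise instead of returning duplicate labels.
try:
    pd.concat([s1, s1, s3], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print('concat refused:', exc)
```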

The fix is to create a hierarchical index on the concatenation axis, which is what the keys argument does:

In [17]: result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

In [18]: result
Out[18]:
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64
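
With the outer level in place, each piece can be pulled back out by its key. The following sketch shows selection by key and unstack, which moves the inner level to the columns.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

# The outer level identifies the source piece.
print(result.loc['two'])           # the second copy of s1
print(result.loc[('three', 'g')])  # 6

# unstack moves the inner level to the columns, giving one row per piece.
print(result.unstack())
```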

When Series are combined along axis=1, the keys become the DataFrame column headers:

In [20]: pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
Out[20]:
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0

The same logic extends to DataFrame objects:

In [21]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])

In [22]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])

In [23]: df1
Out[23]:
   one  two
a    0    1
b    2    3
c    4    5

In [24]: df2
Out[24]:
   three  four
a      5     6
c      7     8

In [25]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[25]:
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
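
The keys form the outer level of the column MultiIndex, so they can be used to get each input back. A minimal sketch, with combined as an illustrative name:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

combined = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

# Selecting an outer key returns the columns contributed by that input.
print(combined['level1'])

# An (outer, inner) tuple selects a single column as a Series.
print(combined[('level2', 'three')])
```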

If you pass a dict of objects instead of a list, the dict's keys are used for the keys option:

In [28]: pd.concat({'level1':df1, 'level2':df2}, axis=1)
Out[28]:
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
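
The list-plus-keys form and the dict form should give identical results. A short sketch comparing them, with from_list and from_dict as illustrative names:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

from_list = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
from_dict = pd.concat({'level1': df1, 'level2': df2}, axis=1)

# DataFrame.equals treats NaNs in the same positions as equal.
print(from_list.equals(from_dict))  # True
```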

We can also name the created axis levels with the names argument:

In [29]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])
Out[29]:
upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
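
One benefit of naming the levels is that later code can refer to them by name rather than by position. The sketch below reads the names back and uses one of them in xs; this is just one possible use.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

combined = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                     names=['upper', 'lower'])

print(list(combined.columns.names))  # ['upper', 'lower']

# Take a cross-section of the columns by level name instead of level number.
second = combined.xs('level2', axis=1, level='upper')
print(second)
```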

The last DataFrame consideration in this post is the case where the row index does not contain any relevant data:

In [30]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

In [31]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [32]: df1
Out[32]:
          a         b         c         d
0 -0.402664 -0.861283  0.419665 -0.010507
1  0.244767 -1.023330 -0.509473  0.773439
2 -0.073538  0.966644  0.583360  0.593361

In [33]: df2
Out[33]:
          b         d         a
0 -0.353377 -1.607486  0.207777
1  0.858092  0.795677  0.639189

In this case, we can pass ignore_index=True, which discards the index along the concatenation axis and generates a fresh one (0 through n-1):

In [36]: pd.concat([df1, df2])
Out[36]:
          a         b         c         d
0 -0.402664 -0.861283  0.419665 -0.010507
1  0.244767 -1.023330 -0.509473  0.773439
2 -0.073538  0.966644  0.583360  0.593361
0  0.207777 -0.353377       NaN -1.607486
1  0.639189  0.858092       NaN  0.795677

In [37]: pd.concat([df1, df2], ignore_index=True)
Out[37]:
          a         b         c         d
0 -0.402664 -0.861283  0.419665 -0.010507
1  0.244767 -1.023330 -0.509473  0.773439
2 -0.073538  0.966644  0.583360  0.593361
3  0.207777 -0.353377       NaN -1.607486
4  0.639189  0.858092       NaN  0.795677
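
The same behavior can be reproduced with seeded random data. In the sketch below the seed and the names kept and fresh are arbitrary choices, so the values differ from the output above; only the index and the alignment by column name are the point.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df1 = pd.DataFrame(rng.standard_normal((3, 4)), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(rng.standard_normal((2, 3)), columns=['b', 'd', 'a'])

kept = pd.concat([df1, df2])
fresh = pd.concat([df1, df2], ignore_index=True)

print(list(kept.index))   # [0, 1, 2, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3, 4]

# Columns are aligned by name, not by position; df2 has no 'c',
# so its rows are NaN in that column.
print(fresh['c'].isna().tolist())  # [False, False, False, True, True]
```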

-END-

Stay hungry, stay foolish
