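Contents
Data Wrangling: Join, Combine, and Reshape
8.1 => Hierarchical Indexing
8.2 => Combining and Merging Datasets
--------> Database-Style DataFrame Joins
--------> Merging on Index
--------> Concatenating Along an Axis
--------> Combining Data with Overlap
8.3 => Reshaping and Pivoting
Reshaping with Hierarchical Indexing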
It's Valentine's Day today, and my WeChat Moments feed is full of people showing off their flowers......
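This is my last note on Chapter 8 of the book. The chapter opened with hierarchical indexing, and this post covers how to rearrange a DataFrame that has a hierarchical index: how to pivot row labels into column labels, and column labels back into row labels.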
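Methods covered in this post:
stack: "pivots" the columns of the data into rows
unstack: "pivots" the rows of the data into columns
Let's start with an example and create a DataFrame: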
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
   ...:                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
   ...:                     columns=pd.Index(['one', 'two', 'three'], name='number'))
In [4]: data
Out[4]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
Calling stack on this DataFrame pivots the columns into rows and produces a Series:
In [5]: result = data.stack()
In [6]: result
Out[6]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
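To make the shape change concrete, here is a small sketch that rebuilds the same frame and inspects what stack returns. The variable name `stacked` is mine, not the book's.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

stacked = data.stack()

# The column labels have become the inner level of a two-level row index.
print(type(stacked).__name__)        # Series
print(stacked.index.nlevels)         # 2
print(list(stacked.index.names))     # ['state', 'number']

# Each value is now addressed by a (state, number) pair.
print(stacked.loc[('Colorado', 'two')])   # 4
```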
Given a hierarchically indexed Series like the one above, unstack rearranges it back into the original DataFrame:
In [7]: result.unstack()
Out[7]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
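A quick way to convince yourself of the round trip is to compare the restored frame with the original cell by cell. This is only a sketch; `restored` and `same_values` are names I made up for the check.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

# stack moves the columns into the row index, unstack moves them back.
restored = data.stack().unstack()

# Compare by label so the check does not depend on row or column order.
same_values = all(
    restored.loc[state, number] == data.loc[state, number]
    for state in data.index
    for number in data.columns
)
print(same_values)
print(restored.index.name, restored.columns.name)
```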
By default, both unstack and stack operate on the innermost level. Passing a level number or a level name selects a different level to stack or unstack:
In [8]: result.unstack(0)
Out[8]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
In [9]: result.unstack('state')
Out[9]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
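The two calls above really are the same operation, since level 0 and the level named 'state' are the same level. A minimal sketch of that check, which also shows that level -1 is the innermost level used by default:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

by_position = result.unstack(0)        # outer level, chosen by number
by_name = result.unstack('state')      # same level, chosen by name
print(by_position.equals(by_name))     # True

# Level -1 is the innermost level, which is what a bare unstack() uses.
innermost = result.unstack(-1)
print(innermost.equals(result.unstack()))   # True
```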
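If not every value of the level appears in each group, unstack may introduce missing data.
In the example below, pivoting the second-level row labels 'a', 'b', 'c', 'd', 'e' of data2 into columns with unstack produces missing values: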
In [10]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
In [11]: s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
In [12]: data2 = pd.concat([s1, s2], keys=['one', 'two'])
In [13]: data2
Out[13]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
In [14]: data2.unstack()
Out[14]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
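Notice that the integers turned into floats: NaN cannot live in an integer column, so pandas upcasts. If a placeholder suits your data better than NaN, unstack accepts a fill_value argument. The sketch below uses -1 as the placeholder, which is purely my choice for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

wide = data2.unstack()
# Three (row, column) pairs have no source value: (one, e), (two, a), (two, b).
print(int(wide.isna().sum().sum()))   # 3

# fill_value replaces the holes that unstack would otherwise fill with NaN.
filled = data2.unstack(fill_value=-1)
print(filled.loc['one', 'e'])             # -1
print(int(filled.isna().sum().sum()))     # 0
```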
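In the pandas version used for these notes, stack filters out missing data by default, but passing dropna=False stops it from dropping the missing values: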
In [15]: data2.unstack().stack()
Out[15]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
In [16]: data2.unstack().stack(dropna=False)
Out[16]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
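As far as I know, newer pandas releases change how stack treats missing values (see the future_stack option in the docs), so the default may not behave the same everywhere. One sketch that should not depend on the version is to drop the NaN entries explicitly after stacking; the names `wide` and `long` are mine.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

wide = data2.unstack()     # 2 x 5 frame containing three NaN cells

# Stack back to long form, then drop missing values explicitly so the
# result does not rely on stack's own NaN handling.
long = wide.stack().dropna()

print(len(long))                        # 7
print(long.astype('int64').tolist())    # [0, 1, 2, 3, 4, 5, 6]
```

The values come back as floats because of the NaN upcast during unstack, which is why the sketch casts to int64 before printing.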
Also, when you unstack a DataFrame, the level being unstacked becomes the lowest level of the columns in the result:
In [17]: df = pd.DataFrame({'left':result, 'right':result+5}, columns=pd.Index(['left', 'right'], name='side'))
In [18]: df
Out[18]:
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
In [19]: df.unstack('state')
Out[19]:
side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
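"Lowest level" is easy to verify by looking at the names of the column levels after the unstack. A small sketch, rebuilding df from scratch (the name `wide` is mine):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

wide = df.unstack('state')

# 'state' sits below the existing 'side' level in the columns.
print(list(wide.columns.names))                  # ['side', 'state']
# Columns are addressed by (side, state) tuples, rows by number.
print(wide.loc['two', ('right', 'Colorado')])    # 9
```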
Similarly, when calling stack, we can name the column level to stack:
In [21]: df.unstack('state').stack('side')
Out[21]:
state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
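Here is a sketch of the same chain with a few label-based lookups, so the check holds regardless of how the columns end up ordered. The variable name `long` is mine.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

# Move 'state' into the columns, then move 'side' down into the rows.
long = df.unstack('state').stack('side')

print(list(long.index.names))                      # ['number', 'side']
print(long.columns.name)                           # state
print(long.loc[('one', 'right'), 'Colorado'])      # 8
```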
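The next chapter is plotting and visualization. At my pace of studying for a day and resting for a week, I've finally made it this far, and that chapter is bound to be much more fun than what came before.
Bye-bye!!!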
Stay hungry, stay foolish