Computing Dummy Variables with pandas

漫谈大数据与数据分析 2020-04-20

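In statistical modeling and machine learning, a categorical variable often has to be converted into "dummy variables", also called an "indicator matrix": one 0/1 column per distinct category.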

pandas provides the get_dummies function, which does this conversion in a single call:

In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})


In [110]: pd.get_dummies(df['key'])
Out[110]: 
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
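The transcript above shows 0/1 integers. Since pandas 2.0, get_dummies returns boolean columns by default, so a recent installation will print True/False instead. A minimal sketch, assuming a pandas version that supports the dtype argument, that asks for integer indicators explicitly:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int requests integer 0/1 indicators instead of the newer boolean default
indicators = pd.get_dummies(df['key'], dtype=int)

# Each row belongs to exactly one category, so every row sums to 1
row_sums = indicators.sum(axis=1)
print(indicators)
```

Because each row has a single key, exactly one column in that row is 1.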

The prefix argument of get_dummies adds a prefix to the generated column names:

In [111]: dummies = pd.get_dummies(df['key'], prefix='key')
In [112]: df_with_dummy = df[['data1']].join(dummies)
In [113]: df_with_dummy
Out[113]: 
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
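The split-then-join steps above can also be written as one call by passing the whole DataFrame together with the columns argument. This is a sketch of an alternative, not the approach used in the transcript. It assumes the column name itself is an acceptable prefix, which is what get_dummies uses when columns is given and prefix is omitted.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only the 'key' column; other columns pass through unchanged
encoded = pd.get_dummies(df, columns=['key'], dtype=int)

# The two-step version from the text, for comparison
joined = df[['data1']].join(pd.get_dummies(df['key'], prefix='key', dtype=int))

print(encoded)
```

Both forms produce the columns data1, key_a, key_b and key_c.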

If a row in a DataFrame belongs to several categories at once, the conversion takes a little more work:

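mnames = ['movie_id', 'title', 'genres']
# A multi-character separator such as '::' is only supported by the Python parsing engine
movies = pd.read_table('movies.dat',
                       sep='::',
                       header=None,
                       names=mnames,
                       engine='python')
movies.head()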
Out[116]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

First, collect every genre and reduce the result to a list of unique values:

In [117]: all_genres = []
In [118]: for x in movies.genres:
   .....:     all_genres.extend(x.split('|'))


In [119]: genres = pd.unique(all_genres)
In [120]: genres
Out[120]: 
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller','Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
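The explicit loop above can be replaced by a chain of Series methods. The following is a sketch under the assumption that Series.explode is available (pandas 0.25 or later). The three-row movies frame is a small stand-in for movies.dat, not the real file.

```python
import pandas as pd

# Stand-in for the first rows of movies.dat
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split each string into a list, flatten to one genre per row, then deduplicate
genres = movies['genres'].str.split('|').explode().unique()
print(genres)
```

unique() keeps the values in order of first appearance, which matches the order of the array shown above.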

Next, build a DataFrame to hold the indicators, with every entry initialized to 0:

In [121]: zero_matrix = np.zeros((len(movies), len(genres)))
In [122]: dummies = pd.DataFrame(zero_matrix, columns=genres)

dummies.columns can be used to look up the column position of each genre:

In [123]: gen = movies.genres[0]
In [124]: gen.split('|')
Out[124]: ['Animation', "Children's", 'Comedy']


In [125]: dummies.columns.get_indexer(gen.split('|'))
Out[125]: array([0, 1, 2])
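One detail of get_indexer is worth knowing before using its result with .iloc: a label that is not in the index yields -1 rather than an error, and -1 as a position refers to the last column. A small sketch with a hypothetical four-genre index:

```python
import pandas as pd

# Hypothetical column index holding only four genres
columns = pd.Index(['Animation', "Children's", 'Comedy', 'Adventure'])

# Labels that exist map to their integer positions
found = columns.get_indexer(['Comedy', 'Animation'])

# 'Western' is not in the index, so its position is reported as -1
missing = columns.get_indexer(['Comedy', 'Western'])
has_missing = (missing == -1).any()

print(found, missing, has_missing)
```

In this article the columns are built from the genres present in the data, so every lookup succeeds. A check like has_missing matters mainly when the genre list comes from somewhere else.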

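Now iterate over the movies and, for each one, set the entries for its genres to 1 in the matching row of dummies. The positions come from get_indexer, so the values are assigned with .iloc: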

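# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions.
# i is the row label, which equals the row position here because movies
# has the default integer index.
for i, gen in movies.genres.items():
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1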

Then join the result back onto the original movies data:

In [127]: movies_windic = movies.join(dummies.add_prefix('Genre_'))
In [128]: movies_windic.iloc[0]
Out[128]: 
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
                                ...             
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, Length: 21, dtype: object
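For delimiter-separated strings like these, pandas also offers Series.str.get_dummies, which covers the whole procedure above in one call. The sketch below is an alternative to the manual construction, not the method the article walks through. The three-row movies frame is again a stand-in for the real file.

```python
import pandas as pd

# Stand-in for the first rows of movies.dat
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split on '|' and build one 0/1 column per distinct genre
dummies = movies['genres'].str.get_dummies(sep='|')

movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])
```

The column order may differ from the manual approach, which keeps genres in order of first appearance. The manual loop remains useful when the indicator matrix needs custom values or a fixed, predefined set of columns.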

A useful recipe for statistical applications is to combine get_dummies with a discretization function such as cut:

In [129]: np.random.seed(12345)
In [130]: values = np.random.rand(10)


In [131]: values
Out[131]: 
array([ 0.9296,  0.3164,  0.1839,  0.2046,  0.5677,  0.5955,  0.9645,
        0.6532,  0.7489,  0.6536])


In [132]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]


In [133]: pd.get_dummies(pd.cut(values, bins))
Out[133]: 
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
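Interval labels such as (0.0, 0.2] make awkward column names. A sketch of the same idea using the labels argument of cut to name the bins. The label strings are made up for illustration, and the values are copied from the rounded output above instead of being regenerated from the random seed.

```python
import pandas as pd

# Rounded values copied from the transcript above
values = [0.9296, 0.3164, 0.1839, 0.2046, 0.5677,
          0.5955, 0.9645, 0.6532, 0.7489, 0.6536]
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# Hypothetical bin names, one per interval
labels = ['very_low', 'low', 'medium', 'high', 'very_high']

binned = pd.cut(values, bins, labels=labels)
dummies = pd.get_dummies(binned, dtype=int)

# Number of values that fell into each bin
counts = dummies.sum()
print(dummies)
```

Each row still contains a single 1, and the column sums give the number of values per bin.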
