
Computing Dummy Variables with pandas

漫谈大数据与数据分析 2020-04-20

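Statistical modeling and machine learning often require converting a categorical variable into a "dummy" or "indicator" matrix: a column with k distinct values becomes k columns of 0s and 1s.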

pandas provides the get_dummies function, which does this in one call:

In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})


In [110]: pd.get_dummies(df['key'])
Out[110]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
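The output above shows 0/1 integers. Since pandas 2.0, get_dummies returns boolean columns by default, so a minimal sketch that reproduces the integer output on any recent version passes dtype explicitly (the variable name dummies is only illustrative here):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int requests 0/1 integers instead of relying on the version's default
dummies = pd.get_dummies(df['key'], dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column that matches the row's original key.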

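The prefix argument of get_dummies adds a prefix to the names of the generated columns, which makes it easy to join them back onto the other data: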

In [111]: dummies = pd.get_dummies(df['key'], prefix='key')
In [112]: df_with_dummy = df[['data1']].join(dummies)
In [113]: df_with_dummy
Out[113]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
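The same result can probably be reached in a single call by passing the whole DataFrame together with the columns argument; get_dummies then uses the column name as the prefix. This is a sketch of that alternative, not the approach used in the article, and the name encoded is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only the 'key' column; the remaining columns pass through unchanged.
# The column name ('key') is used as the prefix by default.
encoded = pd.get_dummies(df, columns=['key'], dtype=int)
print(encoded)
```

This avoids the separate join step when the dummy columns are meant to replace the original column.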



If a row in a DataFrame belongs to several categories at once, things get a little more involved:

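mnames = ['movie_id', 'title', 'genres']
# A multi-character separator is parsed by the Python engine;
# naming it explicitly avoids a ParserWarning.
movies = pd.read_table('movies.dat',
                       sep='::',
                       header=None,
                       names=mnames,
                       engine='python')
movies.head()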
Out[116]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
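For readers without movies.dat at hand, the following sketch builds a small in-memory stand-in from the rows printed above. It assumes the file stores each record as movie_id::title::genres, which is what the sep and names arguments suggest; the sample variable is hypothetical:

```python
import io
import pandas as pd

# Assumed layout of movies.dat: movie_id::title::genres, one movie per line
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
    "3::Grumpier Old Men (1995)::Comedy|Romance\n"
)
mnames = ['movie_id', 'title', 'genres']

# The two-character separator is handled by the Python parsing engine
movies = pd.read_csv(sample, sep='::', header=None, names=mnames,
                     engine='python')
print(movies)
```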

First, collect the list of unique genres that appear in the dataset:

In [117]: all_genres = []
In [118]: for x in movies.genres:
   .....:     all_genres.extend(x.split('|'))


In [119]: genres = pd.unique(all_genres)
In [120]: genres
Out[120]:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
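The explicit loop can likely be replaced by pandas string methods. The sketch below works on a three-row stand-in Series (genres_col is a hypothetical name, with values taken from the movies output above) rather than the full dataset:

```python
import pandas as pd

genres_col = pd.Series(["Animation|Children's|Comedy",
                        "Adventure|Children's|Fantasy",
                        "Comedy|Romance"])

# split: one list of genres per row; explode: one genre per row;
# unique: distinct values in order of first appearance
genres = genres_col.str.split('|').explode().unique()
print(list(genres))
```

Like pd.unique, this keeps the genres in first-seen order rather than sorting them.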

Build the indicator DataFrame, starting from a matrix of zeros with one row per movie and one column per genre:

In [121]: zero_matrix = np.zeros((len(movies), len(genres)))
In [122]: dummies = pd.DataFrame(zero_matrix, columns=genres)

The get_indexer method of dummies.columns returns the column positions for a movie's genres:

In [123]: gen = movies.genres[0]
In [124]: gen.split('|')
Out[124]: ['Animation', "Children's", 'Comedy']


In [125]: dummies.columns.get_indexer(gen.split('|'))
Out[125]: array([0, 1, 2])
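One detail worth knowing: get_indexer reports -1 for a label that is not in the index, and -1 used as a position would silently address the last column. A small sketch with a hypothetical three-genre index shows how such entries could be filtered out:

```python
import pandas as pd

columns = pd.Index(['Animation', "Children's", 'Comedy'])

# 'Western' is not in the index, so its position comes back as -1
positions = columns.get_indexer(['Comedy', 'Western', 'Animation'])

# Keep only the labels that were actually found
found = positions[positions >= 0]
print(positions, found)
```

In the article's data this cannot happen, because the columns were built from the very genres being looked up.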

Now iterate over the movies and set the matching entries in each row of dummies to 1, using .iloc to assign by position:

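# enumerate yields row positions, which is what .iloc expects
# (Series.iteritems was removed in pandas 2.0)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1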

Then join the result back onto the original movies data:

In [127]: movies_windic = movies.join(dummies.add_prefix('Genre_'))
In [128]: movies_windic.iloc[0]
Out[128]:
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
...
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, Length: 21, dtype: object
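The zero matrix, loop, and join above can probably be condensed with Series.str.get_dummies, which splits each string on a separator and returns one 0/1 column per distinct value. This is a sketch of that alternative on a three-row stand-in for movies, not the method the article uses; note that it orders the columns alphabetically rather than by first appearance:

```python
import pandas as pd

movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# One indicator column per distinct genre, split on '|'
dummies = movies['genres'].str.get_dummies(sep='|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])
```

For large datasets this is usually more convenient than a Python-level loop, though the row-by-row version makes the mechanics easier to follow.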


An example that combines get_dummies with the discretization function cut:

In [129]: np.random.seed(12345)
In [130]: values = np.random.rand(10)


In [131]: values
Out[131]:
array([ 0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645,
0.6532, 0.7489, 0.6536])


In [132]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]


In [133]: pd.get_dummies(pd.cut(values, bins))
Out[133]:
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
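Interval labels such as (0.0, 0.2] are awkward as column names. A possible variation, sketched here with fixed input values and hypothetical labels q1 to q5 instead of the random data above, passes labels to cut and a prefix to get_dummies:

```python
import pandas as pd

values = [0.05, 0.2, 0.45, 0.65, 0.85]
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['q1', 'q2', 'q3', 'q4', 'q5']  # hypothetical names for the five bins

# Bins are closed on the right by default, so 0.2 falls into the first bin
binned = pd.cut(values, bins, labels=labels)
dummies = pd.get_dummies(binned, prefix='bin', dtype=int)
print(dummies)
```

Because cut returns a categorical, a bin with no observations (q2 here) still gets its own all-zero column.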







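This article is reposted from 漫谈大数据与数据分析. If you believe it infringes your rights, email contact@modb.pro with supporting evidence; once verified, 墨天轮 will remove the content immediately.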
