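In statistical modeling and machine learning, you often need to convert a categorical variable into a "dummy" or "indicator" matrix. If a column has k distinct values, the result is a matrix with k columns of 1s and 0s.
pandas provides a get_dummies function that does this conveniently: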
In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})

In [110]: pd.get_dummies(df['key'])
Out[110]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
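The 0/1 output above depends on the pandas version: in pandas 2.0 and later, get_dummies returns boolean columns unless a dtype is given. A minimal sketch, assuming a recent pandas, that requests integer indicators explicitly:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int asks for 0/1 integers instead of the newer True/False default
dummies = pd.get_dummies(df['key'], dtype=int)
print(dummies)
```

Passing dtype explicitly keeps the result the same across versions, which matters if the matrix is fed into numeric code afterwards.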
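The prefix parameter of get_dummies adds a prefix to the column names of the indicator DataFrame, which can then be joined with the rest of the data: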
In [111]: dummies = pd.get_dummies(df['key'], prefix='key')

In [112]: df_with_dummy = df[['data1']].join(dummies)

In [113]: df_with_dummy
Out[113]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
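The same result can usually be obtained in one call by passing the whole DataFrame and naming the columns to encode. This is a sketch of that alternative, not the approach used in the text; by default the column name itself becomes the prefix:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only 'key'; other columns are kept and placed first in the result
df_with_dummy = pd.get_dummies(df, columns=['key'], dtype=int)
print(df_with_dummy)
```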
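If a row in a DataFrame belongs to several categories at once, things get a bit more complicated. In the movie data below, each genres value is a pipe-separated string that may name more than one genre: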
In [114]: mnames = ['movie_id', 'title', 'genres']

In [115]: movies = pd.read_table('movies.dat', sep='::',
   .....:                        header=None, names=mnames)

In [116]: movies.head()
Out[116]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
First, build a de-duplicated list of all the genres that appear in the data:
In [117]: all_genres = []

In [118]: for x in movies.genres:
   .....:     all_genres.extend(x.split('|'))

In [119]: genres = pd.unique(all_genres)

In [120]: genres
Out[120]:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
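The explicit loop can also be written with pandas string methods. The sketch below is an alternative to the loop, not the method used in the text, and it uses a small inline sample (the five rows shown earlier) instead of movies.dat:

```python
import pandas as pd

# Inline stand-in for the genres column of movies.dat (first five rows only)
sample_genres = pd.Series([
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama",
    "Comedy",
])

# Split each string into a list, flatten to one genre per row, de-duplicate
genres = sample_genres.str.split('|').explode().unique()
print(list(genres))
```

Like pd.unique, this keeps genres in order of first appearance rather than sorting them.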
Next, build a DataFrame of zeros to hold the indicators, with one row per movie and one column per genre:
In [121]: zero_matrix = np.zeros((len(movies), len(genres)))

In [122]: dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.columns can then be used to look up the column positions of each genre of a movie:
In [123]: gen = movies.genres[0]

In [124]: gen.split('|')
Out[124]: ['Animation', "Children's", 'Comedy']

In [125]: dummies.columns.get_indexer(gen.split('|'))
Out[125]: array([0, 1, 2])
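One detail worth knowing about get_indexer is that it returns -1 for labels it cannot find instead of raising an error. A small sketch on a made-up Index (the four column names are just an illustrative subset):

```python
import pandas as pd

columns = pd.Index(['Animation', "Children's", 'Comedy', 'Adventure'])

# Positions follow the order of the labels passed in, not the Index order
positions = columns.get_indexer(['Comedy', 'Animation'])

# A label that is not in the Index maps to -1
missing = columns.get_indexer(['Western'])

print(positions, missing)
```

Because -1 is also a valid position for .iloc (the last column), a misspelled genre would silently set the wrong column, so it can be worth checking for -1 before assigning.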
Now iterate over the movies and set the matching entries in each row of dummies to 1, using .iloc to assign by position:
In [126]: for i, gen in enumerate(movies.genres):
   .....:     indices = dummies.columns.get_indexer(gen.split('|'))
   .....:     dummies.iloc[i, indices] = 1
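Since movies.dat may not be at hand, the steps so far can be sketched end to end on a few inline rows. The three-row sample and the integer dtype of the zero matrix are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

# Small inline stand-in for movies.dat
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# 1. Collect every genre, then de-duplicate in order of first appearance
all_genres = []
for x in movies['genres']:
    all_genres.extend(x.split('|'))
genres = pd.unique(pd.Series(all_genres))

# 2. All-zero indicator frame: one row per movie, one column per genre
dummies = pd.DataFrame(np.zeros((len(movies), len(genres)), dtype=int),
                       columns=genres)

# 3. Set the columns that match each movie's genres to 1, by position
for i, gen in enumerate(movies['genres']):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

print(dummies)
```

enumerate yields row positions, which is what .iloc expects, so the loop does not rely on movies having a default integer index.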
Then combine the result with the original movies data:
In [127]: movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [128]: movies_windic.iloc[0]
Out[128]:
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
                                ...
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, Length: 21, dtype: object
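For delimiter-separated strings like these, pandas also has a Series.str.get_dummies method that builds the indicator matrix in a single call. The following is a sketch of that shortcut on an inline sample, offered as an alternative to the manual construction above:

```python
import pandas as pd

# Inline stand-in for movies.dat (first three rows only)
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split on '|' and produce one 0/1 column per distinct genre
dummies = movies['genres'].str.get_dummies('|')

movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])
```

One visible difference from the loop version is column order: str.get_dummies sorts the genre columns alphabetically instead of keeping order of first appearance.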
A useful recipe for statistical applications is to combine get_dummies with a discretization function such as cut:
In [129]: np.random.seed(12345)

In [130]: values = np.random.rand(10)

In [131]: values
Out[131]:
array([ 0.9296,  0.3164,  0.1839,  0.2046,  0.5677,  0.5955,  0.9645,
        0.6532,  0.7489,  0.6536])

In [132]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [133]: pd.get_dummies(pd.cut(values, bins))
Out[133]:
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
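Interval names such as (0.0, 0.2] are awkward as column names, so it can help to pass labels to cut before encoding. A minimal sketch that reuses the ten values printed above; the label names q1 through q5 are hypothetical and chosen only for this example:

```python
import numpy as np
import pandas as pd

# The ten values shown in the session above, typed in directly
values = np.array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677,
                   0.5955, 0.9645, 0.6532, 0.7489, 0.6536])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['q1', 'q2', 'q3', 'q4', 'q5']  # hypothetical bin names

# cut assigns each value to a labelled bin; get_dummies one-hot encodes the bins
binned = pd.cut(values, bins, labels=labels)
dummies = pd.get_dummies(binned, dtype=int)

# Column sums give the number of values that fell into each bin
print(dummies.sum())
```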
Reposted from 漫谈大数据与数据分析.




