Replacing Category Data (pandas) -
i have large files several category columns. category kind of generous word because these descriptions/partial sentences.
here unique values per category:
category 1 = 15 category 2 = 94 category 3 = 294 category 4 = 401 location 1 = 30 location 2 = 60 then there users recurring data (first name, last name, ids etc).
i thinking of following solutions make file size smaller:
1) create file matches each category unique integer
2) create map (is there way reading file? create .csv , load dataframe , match it? or literally have type out initially?)
or
3) join (vlookup) , del old column long object names
pd.merge(df1, categories, on = 'category1', how = 'left') del df1['category1'] what people in case? these files pretty huge. 60 columns , of data long, repeating categories , timestamps. literally no numerical data @ all. it's fine me, sharing files impossible due shared drive space allocations more few months.
to benefit categorical dtype when saving csv might want follow process:
- extract category definitions separate dataframes / files
- convert categorical data int codes
- save converted dataframe csv along definitions dataframes
when need use them again:
- restore dataframes csv files
- map dataframe int codes category definitions
- convert mapped columns categorical
to illustrate process:
make sample dataframe:
df = pd.dataframe(index=pd.np.arange(0,100000)) df.index.name = 'index' df['categories'] = 'category' df['locations'] = 'location' n1 = pd.np.tile(pd.np.arange(1,5), df.shape[0]/4) n2 = pd.np.tile(pd.np.arange(1,3), df.shape[0]/2) df['categories'] = df['categories'] + pd.series(n1).astype(str) df['locations'] = df['locations'] + pd.series(n2).astype(str) print df.info() <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories 100000 non-null object locations 100000 non-null object dtypes: object(2) memory usage: 2.3+ mb none note size: 2.3+ mb - size of csv file. convert these data categorical:
df['categories'] = df['categories'].astype('category') df['locations'] = df['locations'].astype('category') print df.info() <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories 100000 non-null category locations 100000 non-null category dtypes: category(2) memory usage: 976.6 kb none note drop in memory usage down 976.6 kb if save csv now:
df.to_csv('test1.csv') ...you see inside file:
index,categories,locations 0,category1,location1 1,category2,location2 2,category3,location1 3,category4,location2 which means 'categorical' has been converted strings saving in csv. let's rid of labels in categorical data after save definitions:
categories_details = pd.dataframe(df.categories.drop_duplicates(), columns=['categories']) print categories_details categories index 0 category1 1 category2 2 category3 3 category4 locations_details = pd.dataframe(df.locations.drop_duplicates(), columns=['locations']) print locations_details index 0 location1 1 location2 now covert categorical int dtype:
for col in df.select_dtypes(include=['category']).columns: df[col] = df[col].cat.codes print df.head() categories locations index 0 0 0 1 1 1 2 2 0 3 3 1 4 0 0 print df.info() <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories 100000 non-null int8 locations 100000 non-null int8 dtypes: int8(2) memory usage: 976.6 kb none save converted data csv , note file has numbers without labels. file size reflect change.
df.to_csv('test2.csv') index,categories,locations 0,0,0 1,1,1 2,2,0 3,3,1 save definitions well:
categories_details.to_csv('categories_details.csv') locations_details.to_csv('locations_details.csv') when need restore files, load them csv files:
df2 = pd.read_csv('test2.csv', index_col='index') print df2.head() categories locations index 0 0 0 1 1 1 2 2 0 3 3 1 4 0 0 print df2.info() <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories 100000 non-null int64 locations 100000 non-null int64 dtypes: int64(2) memory usage: 2.3 mb none categories_details2 = pd.read_csv('categories_details.csv', index_col='index') print categories_details2.head() categories index 0 category1 1 category2 2 category3 3 category4 print categories_details2.info() <class 'pandas.core.frame.dataframe'> int64index: 4 entries, 0 3 data columns (total 1 columns): categories 4 non-null object dtypes: object(1) memory usage: 64.0+ bytes none locations_details2 = pd.read_csv('locations_details.csv', index_col='index') print locations_details2.head() locations index 0 location1 1 location2 print locations_details2.info() <class 'pandas.core.frame.dataframe'> int64index: 2 entries, 0 1 data columns (total 1 columns): locations 2 non-null object dtypes: object(1) memory usage: 32.0+ bytes none now use map replace int coded data categories descriptions , convert them categorical:
df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category') df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category') print df2.head() categories locations index 0 category1 location1 1 category2 location2 2 category3 location1 3 category4 location2 4 category1 location1 print df2.info() <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories 100000 non-null category locations 100000 non-null category dtypes: category(2) memory usage: 976.6 kb none note memory usage when first converted data categorical. should not hard automate process if need repeat many time.
Comments
Post a Comment