Replacing Category Data (pandas) -


i have large files several category columns. category kind of generous word because these descriptions/partial sentences.

here unique values per category:

category 1 = 15  category 2 = 94 category 3 = 294 category 4 = 401  location 1 = 30 location 2 = 60  

then there users recurring data (first name, last name, ids etc).

i thinking of following solutions make file size smaller:

1) create file matches each category unique integer

2) create map (is there way reading file? create .csv , load dataframe , match it? or literally have type out initially?)

or

3) join (vlookup) , del old column long object names

pd.merge(df1, categories, on = 'category1', how = 'left')  del df1['category1'] 

what people in case? these files pretty huge. 60 columns , of data long, repeating categories , timestamps. literally no numerical data @ all. it's fine me, sharing files impossible due shared drive space allocations more few months.

to benefit categorical dtype when saving csv might want follow process:

  1. extract category definitions separate dataframes / files
  2. convert categorical data int codes
  3. save converted dataframe csv along definitions dataframes

when need use them again:

  1. restore dataframes csv files
  2. map dataframe int codes category definitions
  3. convert mapped columns categorical

to illustrate process:

make sample dataframe:

df = pd.dataframe(index=pd.np.arange(0,100000)) df.index.name = 'index' df['categories'] = 'category' df['locations'] = 'location' n1 = pd.np.tile(pd.np.arange(1,5), df.shape[0]/4) n2 = pd.np.tile(pd.np.arange(1,3), df.shape[0]/2) df['categories'] = df['categories'] + pd.series(n1).astype(str) df['locations'] = df['locations'] + pd.series(n2).astype(str) print df.info()      <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories    100000 non-null object locations     100000 non-null object dtypes: object(2) memory usage: 2.3+ mb none 

note size: 2.3+ mb - size of csv file. convert these data categorical:

df['categories'] = df['categories'].astype('category') df['locations'] = df['locations'].astype('category') print df.info()  <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories    100000 non-null category locations     100000 non-null category dtypes: category(2) memory usage: 976.6 kb none 

note drop in memory usage down 976.6 kb if save csv now:

df.to_csv('test1.csv') 

...you see inside file:

index,categories,locations 0,category1,location1 1,category2,location2 2,category3,location1 3,category4,location2 

which means 'categorical' has been converted strings saving in csv. let's rid of labels in categorical data after save definitions:

categories_details = pd.dataframe(df.categories.drop_duplicates(), columns=['categories']) print categories_details        categories index            0      category1 1      category2 2      category3 3      category4  locations_details = pd.dataframe(df.locations.drop_duplicates(), columns=['locations']) print locations_details         index            0      location1 1      location2 

now covert categorical int dtype:

for col in df.select_dtypes(include=['category']).columns:     df[col] = df[col].cat.codes print df.head()         categories  locations index                        0               0          0 1               1          1 2               2          0 3               3          1 4               0          0  print df.info()  <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories    100000 non-null int8 locations     100000 non-null int8 dtypes: int8(2) memory usage: 976.6 kb none 

save converted data csv , note file has numbers without labels. file size reflect change.

df.to_csv('test2.csv')  index,categories,locations 0,0,0 1,1,1 2,2,0 3,3,1 

save definitions well:

categories_details.to_csv('categories_details.csv') locations_details.to_csv('locations_details.csv') 

when need restore files, load them csv files:

df2 = pd.read_csv('test2.csv', index_col='index') print df2.head()         categories  locations index                        0               0          0 1               1          1 2               2          0 3               3          1 4               0          0  print df2.info()  <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories    100000 non-null int64 locations     100000 non-null int64 dtypes: int64(2) memory usage: 2.3 mb none  categories_details2 = pd.read_csv('categories_details.csv', index_col='index') print categories_details2.head()        categories index            0      category1 1      category2 2      category3 3      category4  print categories_details2.info()  <class 'pandas.core.frame.dataframe'> int64index: 4 entries, 0 3 data columns (total 1 columns): categories    4 non-null object dtypes: object(1) memory usage: 64.0+ bytes none  locations_details2 = pd.read_csv('locations_details.csv', index_col='index') print locations_details2.head()         locations index            0      location1 1      location2  print locations_details2.info()  <class 'pandas.core.frame.dataframe'> int64index: 2 entries, 0 1 data columns (total 1 columns): locations    2 non-null object dtypes: object(1) memory usage: 32.0+ bytes none 

now use map replace int coded data categories descriptions , convert them categorical:

df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category') df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category') print df2.head()        categories  locations index                       0      category1  location1 1      category2  location2 2      category3  location1 3      category4  location2 4      category1  location1  print df2.info()  <class 'pandas.core.frame.dataframe'> int64index: 100000 entries, 0 99999 data columns (total 2 columns): categories    100000 non-null category locations     100000 non-null category dtypes: category(2) memory usage: 976.6 kb none 

note memory usage when first converted data categorical. should not hard automate process if need repeat many time.


Comments

Popular posts from this blog

IF statement in MySQL trigger -

c++ - What does MSC in "// appease MSC" comments mean? -

javascript - Blogger related post gadget image Resize s72-c [ Need Expert Help ] -