I know this might be easy to do. I can do it in Stata but I'm trying to move to Python.
I have a big dataset that it's unbalance. It looks like this:
And I need to get a dataset as follows:
Any guidance it's welcome. Thanks a lot!
解决方案
one way is to set 'year' as another level of index with set_index, reindex using pd.MultiIndex.from_product and reset_index the data from 'year' as a column.
Example dataframe with the same structure:
import pandas as pd
df = pd.DataFrame( {'year':[2003,2004,2002,2004,2005,2006],
'city_code':['a']*2+['b']*4,
'total_tax':pd.np.random.randint(100,1000,6)},
index=pd.Index(data=[9]*2+[54]*4,name='id_inf'))
print(df)
city_code total_tax year
id_inf
9 a 417 2003
9 a 950 2004
54 b 801 2002
54 b 218 2004
54 b 886 2005
54 b 855 2006
Now you can create the df_balanced with the method:
df_balanced = (df.set_index('year',append=True)
.reindex(pd.MultiIndex.from_product([df.index.unique(),
range(df.year.min(),df.year.max()+1)],
names=['id_inf','year']))
.reset_index(level=1))
And you get:
print (df_balanced)
year city_code total_tax
id_inf
9 2002 NaN NaN
9 2003 a 417.0
9 2004 a 950.0
9 2005 NaN NaN
9 2006 NaN NaN
54 2002 b 801.0
54 2003 NaN NaN
54 2004 b 218.0
54 2005 b 886.0
54 2006 b 855.0
To fill the NaN, different methods but here two ways. For the column 'city_code', you can use groupby and transform with max to get the value and for the column 'total_tax', just fillna with 0 such as:
df_balanced['city_code'] = df_balanced.groupby(level=0)['city_code'].transform(max)
df_balanced['total_tax'] = df_balanced['total_tax'].fillna(0)
print (df_balanced)
year city_code total_tax
id_inf
9 2002 a 0.0
9 2003 a 417.0
9 2004 a 950.0
9 2005 a 0.0
9 2006 a 0.0
54 2002 b 801.0
54 2003 b 0.0
54 2004 b 218.0
54 2005 b 886.0
54 2006 b 855.0