import pandas as pd
Multi-index for more dimensions¶
Pandas is not limited to a 2-dimensional array. It can have as many dimensions as desired bot can display it is only in 2 dimensions. Therefore sub-indices (subindex or sub-column) are used to increase the number of dimensions. So a value can be for incomes
France
and 1960
which is 3 dimensions.
To make our table with 3 dimensions, we will merge the 2 following tables which will generate a double index.
# incomes per person, inflation adjusted, US $ 2000, source World Bank
incomes = pd.DataFrame({'Argentina': [5251, 6611, 7540, 5581, 7695, 10749],
'Egypt': [430, 565, 856, 1153, 1475, 1975],
'France': [7499, 11572, 15641, 18731, 21774, 22758],
'Sweden': [11360, 16570, 19470, 23492, 27869, 32631],
'USA': [13723, 18228, 22630, 28298, 35081, 37329]},
index=[1960,1970,1980,1990,2000,2010])
# children per woman
children = pd.DataFrame({'Argentina': [3.11, 3.07, 3.33, 2.99, 2.48, 2.22],
'Egypt': [6.63, 5.94, 5.37, 4.35, 3.31, 2.88],
'France': [2.77, 2.49, 1.83, 1.75, 1.82, 1.98],
'Sweden': [2.2, 1.92, 1.68, 2.14, 1.56, 1.99],
'USA': [3.67, 2.46, 1.82, 2.07, 2.05, 1.93]},
index=[1960,1970,1980,1990,2000,2010])
# concat is explain in lesson "Pandas 6 -- operations with 2 dataframes"
df = pd.concat({'incomes': incomes, 'children': children})
df
Argentina | Egypt | France | Sweden | USA | ||
---|---|---|---|---|---|---|
incomes | 1960 | 5251.00 | 430.00 | 7499.00 | 11360.00 | 13723.00 |
1970 | 6611.00 | 565.00 | 11572.00 | 16570.00 | 18228.00 | |
1980 | 7540.00 | 856.00 | 15641.00 | 19470.00 | 22630.00 | |
1990 | 5581.00 | 1153.00 | 18731.00 | 23492.00 | 28298.00 | |
2000 | 7695.00 | 1475.00 | 21774.00 | 27869.00 | 35081.00 | |
2010 | 10749.00 | 1975.00 | 22758.00 | 32631.00 | 37329.00 | |
children | 1960 | 3.11 | 6.63 | 2.77 | 2.20 | 3.67 |
1970 | 3.07 | 5.94 | 2.49 | 1.92 | 2.46 | |
1980 | 3.33 | 5.37 | 1.83 | 1.68 | 1.82 | |
1990 | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 | |
2000 | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 | |
2010 | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |
Le résultat est un tableau avec un multi-index.
df.index
MultiIndex([( 'incomes', 1960), ( 'incomes', 1970), ( 'incomes', 1980), ( 'incomes', 1990), ( 'incomes', 2000), ( 'incomes', 2010), ('children', 1960), ('children', 1970), ('children', 1980), ('children', 1990), ('children', 2000), ('children', 2010)], )
Retrieve the data¶
As always, the data underlying the DataFrames are the columns so here the countries.
df.Egypt
incomes 1960 430.00 1970 565.00 1980 856.00 1990 1153.00 2000 1475.00 2010 1975.00 children 1960 6.63 1970 5.94 1980 5.37 1990 4.35 2000 3.31 2010 2.88 Name: Egypt, dtype: float64
We can indicate the index or the indexes that interest us those in different ways:
df.Egypt['incomes']
1960 430.0 1970 565.0 1980 856.0 1990 1153.0 2000 1475.0 2010 1975.0 Name: Egypt, dtype: float64
print("Children per women in the USA in 1970 = ", df.USA['children'][1970])
print("Children per women in the USA in 1990 = ", df.USA['children',1990])
Children per women in the USA in 1970 = 2.46 Children per women in the USA in 1990 = 2.07
display('France data in 2000:', df['France'][:, 2000])
'France data in 2000:'
incomes 21774.00 children 1.82 Name: France, dtype: float64
Extracting a block¶
With loc
we get a sub-array by specifying the values of the 2 indexes and the values of the columns that we want.
You can also specify only the level 0 index or slices by specifying each subindex.
df.loc[('incomes', [1960,2010]), :]
Argentina | Egypt | France | Sweden | USA | ||
---|---|---|---|---|---|---|
incomes | 1960 | 5251.0 | 430.0 | 7499.0 | 11360.0 | 13723.0 |
2010 | 10749.0 | 1975.0 | 22758.0 | 32631.0 | 37329.0 |
df.loc['children']
Argentina | Egypt | France | Sweden | USA | |
---|---|---|---|---|---|
1960 | 3.11 | 6.63 | 2.77 | 2.20 | 3.67 |
1970 | 3.07 | 5.94 | 2.49 | 1.92 | 2.46 |
1980 | 3.33 | 5.37 | 1.83 | 1.68 | 1.82 |
1990 | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 |
2000 | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 |
2010 | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |
df.loc[('incomes',1970):('incomes', 2000), 'Egypt':'Sweden']
Egypt | France | Sweden | ||
---|---|---|---|---|
incomes | 1970 | 565.0 | 11572.0 | 16570.0 |
1980 | 856.0 | 15641.0 | 19470.0 | |
1990 | 1153.0 | 18731.0 | 23492.0 | |
2000 | 1475.0 | 21774.0 | 27869.0 |
All the elements of an index¶
Python syntax forbides to use the :
sign between parentheses. Therefore
we cannot do df.loc[(:, 1960),:]
to retreive all the values of 1960. We must use at choice:
slice(None)
instead of:
pd.IndexSlice [:, 1960]
Attention df.loc['children', 'France']
does not work because we do not specify the year. On the other hand, df.loc['children']['France']
works because the sub-array of children is extracted and on this result, the column France is requested.
df.loc[(slice(None), 1960), :]
Argentina | Egypt | France | Sweden | USA | ||
---|---|---|---|---|---|---|
incomes | 1960 | 5251.00 | 430.00 | 7499.00 | 11360.0 | 13723.00 |
children | 1960 | 3.11 | 6.63 | 2.77 | 2.2 | 3.67 |
df.loc[pd.IndexSlice[:, 1990:2010], :]
Argentina | Egypt | France | Sweden | USA | ||
---|---|---|---|---|---|---|
incomes | 1990 | 5581.00 | 1153.00 | 18731.00 | 23492.00 | 28298.00 |
2000 | 7695.00 | 1475.00 | 21774.00 | 27869.00 | 35081.00 | |
2010 | 10749.00 | 1975.00 | 22758.00 | 32631.00 | 37329.00 | |
children | 1990 | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 |
2000 | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 | |
2010 | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |
Multi-columns¶
Multi-index can also be multi-column. Everything is symmetrical. We can also have multi-index and multi-columns with as much level of depth as we want.
df.T # transpose
incomes | children | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
1960 | 1970 | 1980 | 1990 | 2000 | 2010 | 1960 | 1970 | 1980 | 1990 | 2000 | 2010 | |
Argentina | 5251.0 | 6611.0 | 7540.0 | 5581.0 | 7695.0 | 10749.0 | 3.11 | 3.07 | 3.33 | 2.99 | 2.48 | 2.22 |
Egypt | 430.0 | 565.0 | 856.0 | 1153.0 | 1475.0 | 1975.0 | 6.63 | 5.94 | 5.37 | 4.35 | 3.31 | 2.88 |
France | 7499.0 | 11572.0 | 15641.0 | 18731.0 | 21774.0 | 22758.0 | 2.77 | 2.49 | 1.83 | 1.75 | 1.82 | 1.98 |
Sweden | 11360.0 | 16570.0 | 19470.0 | 23492.0 | 27869.0 | 32631.0 | 2.20 | 1.92 | 1.68 | 2.14 | 1.56 | 1.99 |
USA | 13723.0 | 18228.0 | 22630.0 | 28298.0 | 35081.0 | 37329.0 | 3.67 | 2.46 | 1.82 | 2.07 | 2.05 | 1.93 |
df.T.children[1970]
Argentina 3.07 Egypt 5.94 France 2.49 Sweden 1.92 USA 2.46 Name: 1970, dtype: float64
Column to index and vice versa¶
If set_levels
allows you to modify an index level, [set_index
](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set _index.html) allows to modify an index either by giving
directly its values, either by choosing a column to becomes the index (most common case).
The inverse is possible with reset_index
.
df2 = df.reset_index(level=1) # index to column
print(df2.columns)
df2.rename(columns={'level_1':'year'}, inplace=True)
df2
Index(['level_1', 'Argentina', 'Egypt', 'France', 'Sweden', 'USA'], dtype='object')
year | Argentina | Egypt | France | Sweden | USA | |
---|---|---|---|---|---|---|
incomes | 1960 | 5251.00 | 430.00 | 7499.00 | 11360.00 | 13723.00 |
incomes | 1970 | 6611.00 | 565.00 | 11572.00 | 16570.00 | 18228.00 |
incomes | 1980 | 7540.00 | 856.00 | 15641.00 | 19470.00 | 22630.00 |
incomes | 1990 | 5581.00 | 1153.00 | 18731.00 | 23492.00 | 28298.00 |
incomes | 2000 | 7695.00 | 1475.00 | 21774.00 | 27869.00 | 35081.00 |
incomes | 2010 | 10749.00 | 1975.00 | 22758.00 | 32631.00 | 37329.00 |
children | 1960 | 3.11 | 6.63 | 2.77 | 2.20 | 3.67 |
children | 1970 | 3.07 | 5.94 | 2.49 | 1.92 | 2.46 |
children | 1980 | 3.33 | 5.37 | 1.83 | 1.68 | 1.82 |
children | 1990 | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 |
children | 2000 | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 |
children | 2010 | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |
df2.set_index('year', append=True) # without append it would remove the previous index (try it!)
Argentina | Egypt | France | Sweden | USA | ||
---|---|---|---|---|---|---|
year | ||||||
incomes | 1960 | 5251.00 | 430.00 | 7499.00 | 11360.00 | 13723.00 |
1970 | 6611.00 | 565.00 | 11572.00 | 16570.00 | 18228.00 | |
1980 | 7540.00 | 856.00 | 15641.00 | 19470.00 | 22630.00 | |
1990 | 5581.00 | 1153.00 | 18731.00 | 23492.00 | 28298.00 | |
2000 | 7695.00 | 1475.00 | 21774.00 | 27869.00 | 35081.00 | |
2010 | 10749.00 | 1975.00 | 22758.00 | 32631.00 | 37329.00 | |
children | 1960 | 3.11 | 6.63 | 2.77 | 2.20 | 3.67 |
1970 | 3.07 | 5.94 | 2.49 | 1.92 | 2.46 | |
1980 | 3.33 | 5.37 | 1.83 | 1.68 | 1.82 | |
1990 | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 | |
2000 | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 | |
2010 | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |
When adding an index level it can be usefull to sort the indexes with sort_index
to improve the display and some algorithms that will apply on the DataFrame.
Stack & Unstack¶
It is possible to stack
the structure with less dimension and more depth. Thus our 3-dimensional table, 2 depths for the indexes and a depth for the columns can become a Serie (thus 1 dimension) at 3 depths.
# let use a smaller DataFrame
df_small =df.loc[(['incomes','children'], [1970,1990,2010]), ['Argentina','Sweden']]
df_small
Argentina | Sweden | ||
---|---|---|---|
incomes | 1970 | 6611.00 | 16570.00 |
1990 | 5581.00 | 23492.00 | |
2010 | 10749.00 | 32631.00 | |
children | 1970 | 3.07 | 1.92 |
1990 | 2.99 | 2.14 | |
2010 | 2.22 | 1.99 |
df_small.stack()
incomes 1970 Argentina 6611.00 Sweden 16570.00 1990 Argentina 5581.00 Sweden 23492.00 2010 Argentina 10749.00 Sweden 32631.00 children 1970 Argentina 3.07 Sweden 1.92 1990 Argentina 2.99 Sweden 2.14 2010 Argentina 2.22 Sweden 1.99 dtype: float64
We can see that our 3-dimensional dataset.
To unstack we use unstack
:
Direction¶
stack
moves columns to subindex until you have a Seriesunstack
moves indexes to sub-columns until you have a Series
df_small.unstack()
Argentina | Sweden | |||||
---|---|---|---|---|---|---|
1970 | 1990 | 2010 | 1970 | 1990 | 2010 | |
incomes | 6611.00 | 5581.00 | 10749.00 | 16570.00 | 23492.00 | 32631.00 |
children | 3.07 | 2.99 | 2.22 | 1.92 | 2.14 | 1.99 |
df_small.unstack().unstack()
Argentina 1970 incomes 6611.00 children 3.07 1990 incomes 5581.00 children 2.99 2010 incomes 10749.00 children 2.22 Sweden 1970 incomes 16570.00 children 1.92 1990 incomes 23492.00 children 2.14 2010 incomes 32631.00 children 1.99 dtype: float64
Là encore on arrive à une séries mais avec l'ordre des index différent du premier cas puisqu'il s'agit des colonnes, sous-colonnes et sous-sous-colonnes (qui deviennent l'index de la séries).
Direct operations on multi-indexes¶
- Reorder To have indexes in the order of your choice you can use:
swaplevel
which allows to invert 2 indexes (useaxis = 'columns'
if multi-columns)reorder_levels
which allows global reordering of indexes
- Rename
set_levels
changes values of a multi-index level - Remove
droplevel
removes specified level of multi-index
Columns can also be reordered by specifying the axis:
- on indexes (
axis = 0
oraxis = 'index'
) - on columns (
axis = 1
oraxis = 'columns'
)
df_small.reorder_levels([1,0], axis='index').sort_index() # swaplevel(0,1) would have done the same
Argentina | Sweden | ||
---|---|---|---|
1970 | children | 3.07 | 1.92 |
incomes | 6611.00 | 16570.00 | |
1990 | children | 2.99 | 2.14 |
incomes | 5581.00 | 23492.00 | |
2010 | children | 2.22 | 1.99 |
incomes | 10749.00 | 32631.00 |
To get a better grasp, let's look with 3 indexes:
dfss = df_small.stack()
dfss
incomes 1970 Argentina 6611.00 Sweden 16570.00 1990 Argentina 5581.00 Sweden 23492.00 2010 Argentina 10749.00 Sweden 32631.00 children 1970 Argentina 3.07 Sweden 1.92 1990 Argentina 2.99 Sweden 2.14 2010 Argentina 2.22 Sweden 1.99 dtype: float64
dfss.swaplevel(0,1)
1970 incomes Argentina 6611.00 Sweden 16570.00 1990 incomes Argentina 5581.00 Sweden 23492.00 2010 incomes Argentina 10749.00 Sweden 32631.00 1970 children Argentina 3.07 Sweden 1.92 1990 children Argentina 2.99 Sweden 2.14 2010 children Argentina 2.22 Sweden 1.99 dtype: float64
Ce n'est pas très joli d'avoir plusieurs fois la même année dans l'index (simple problème d'affichage). Pour éviter cela on va trier l'index dans le cas suivant après avoir totalement réordonné l'index.
dfss.reorder_levels([2,1,0]).sort_index()
Argentina 1970 children 3.07 incomes 6611.00 1990 children 2.99 incomes 5581.00 2010 children 2.22 incomes 10749.00 Sweden 1970 children 1.92 incomes 16570.00 1990 children 2.14 incomes 23492.00 2010 children 1.99 incomes 32631.00 dtype: float64
# Renomer
df2 = df.copy()
df2.index = df2.index.set_levels(['enfants', 'revenus'], level=0)
df2
Argentina | Egypt | France | Sweden | USA | ||
---|---|---|---|---|---|---|
enfants | 1960 | 5251.00 | 430.00 | 7499.00 | 11360.00 | 13723.00 |
1970 | 6611.00 | 565.00 | 11572.00 | 16570.00 | 18228.00 | |
1980 | 7540.00 | 856.00 | 15641.00 | 19470.00 | 22630.00 | |
1990 | 5581.00 | 1153.00 | 18731.00 | 23492.00 | 28298.00 | |
2000 | 7695.00 | 1475.00 | 21774.00 | 27869.00 | 35081.00 | |
2010 | 10749.00 | 1975.00 | 22758.00 | 32631.00 | 37329.00 | |
revenus | 1960 | 3.11 | 6.63 | 2.77 | 2.20 | 3.67 |
1970 | 3.07 | 5.94 | 2.49 | 1.92 | 2.46 | |
1980 | 3.33 | 5.37 | 1.83 | 1.68 | 1.82 | |
1990 | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 | |
2000 | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 | |
2010 | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |
# Effacer
df2.droplevel(1)
Argentina | Egypt | France | Sweden | USA | |
---|---|---|---|---|---|
enfants | 5251.00 | 430.00 | 7499.00 | 11360.00 | 13723.00 |
enfants | 6611.00 | 565.00 | 11572.00 | 16570.00 | 18228.00 |
enfants | 7540.00 | 856.00 | 15641.00 | 19470.00 | 22630.00 |
enfants | 5581.00 | 1153.00 | 18731.00 | 23492.00 | 28298.00 |
enfants | 7695.00 | 1475.00 | 21774.00 | 27869.00 | 35081.00 |
enfants | 10749.00 | 1975.00 | 22758.00 | 32631.00 | 37329.00 |
revenus | 3.11 | 6.63 | 2.77 | 2.20 | 3.67 |
revenus | 3.07 | 5.94 | 2.49 | 1.92 | 2.46 |
revenus | 3.33 | 5.37 | 1.83 | 1.68 | 1.82 |
revenus | 2.99 | 4.35 | 1.75 | 2.14 | 2.07 |
revenus | 2.48 | 3.31 | 1.82 | 1.56 | 2.05 |
revenus | 2.22 | 2.88 | 1.98 | 1.99 | 1.93 |