In [1]:
import pandas as pd

Multi-index for more dimensions¶

Pandas is not limited to a 2-dimensional array. It can have as many dimensions as desired bot can display it is only in 2 dimensions. Therefore sub-indices (subindex or sub-column) are used to increase the number of dimensions. So a value can be for incomes France and 1960 which is 3 dimensions.

To make our table with 3 dimensions, we will merge the 2 following tables which will generate a double index.

In [2]:
# incomes per person, inflation adjusted, US $ 2000, source World Bank
incomes = pd.DataFrame({'Argentina': [5251, 6611, 7540, 5581, 7695, 10749],
                        'Egypt': [430, 565, 856, 1153, 1475, 1975],
                        'France': [7499, 11572, 15641, 18731, 21774, 22758],
                        'Sweden': [11360, 16570, 19470, 23492, 27869, 32631],
                        'USA': [13723, 18228, 22630, 28298, 35081, 37329]},
                       index=[1960,1970,1980,1990,2000,2010])
In [3]:
# children per woman
children = pd.DataFrame({'Argentina': [3.11, 3.07, 3.33, 2.99, 2.48, 2.22],
                        'Egypt': [6.63, 5.94, 5.37, 4.35, 3.31, 2.88],
                        'France': [2.77, 2.49, 1.83, 1.75, 1.82, 1.98],
                        'Sweden': [2.2, 1.92, 1.68, 2.14, 1.56, 1.99],
                        'USA': [3.67, 2.46, 1.82, 2.07, 2.05, 1.93]},
                       index=[1960,1970,1980,1990,2000,2010])
In [4]:
# concat is explain in lesson "Pandas 6 -- operations with 2 dataframes"
df = pd.concat({'incomes': incomes, 'children': children})
df
Out[4]:
Argentina Egypt France Sweden USA
incomes 1960 5251.00 430.00 7499.00 11360.00 13723.00
1970 6611.00 565.00 11572.00 16570.00 18228.00
1980 7540.00 856.00 15641.00 19470.00 22630.00
1990 5581.00 1153.00 18731.00 23492.00 28298.00
2000 7695.00 1475.00 21774.00 27869.00 35081.00
2010 10749.00 1975.00 22758.00 32631.00 37329.00
children 1960 3.11 6.63 2.77 2.20 3.67
1970 3.07 5.94 2.49 1.92 2.46
1980 3.33 5.37 1.83 1.68 1.82
1990 2.99 4.35 1.75 2.14 2.07
2000 2.48 3.31 1.82 1.56 2.05
2010 2.22 2.88 1.98 1.99 1.93

Le résultat est un tableau avec un multi-index.

In [5]:
df.index
Out[5]:
MultiIndex([( 'incomes', 1960),
            ( 'incomes', 1970),
            ( 'incomes', 1980),
            ( 'incomes', 1990),
            ( 'incomes', 2000),
            ( 'incomes', 2010),
            ('children', 1960),
            ('children', 1970),
            ('children', 1980),
            ('children', 1990),
            ('children', 2000),
            ('children', 2010)],
           )

Retrieve the data¶

As always, the data underlying the DataFrames are the columns so here the countries.

In [6]:
df.Egypt
Out[6]:
incomes   1960     430.00
          1970     565.00
          1980     856.00
          1990    1153.00
          2000    1475.00
          2010    1975.00
children  1960       6.63
          1970       5.94
          1980       5.37
          1990       4.35
          2000       3.31
          2010       2.88
Name: Egypt, dtype: float64

We can indicate the index or the indexes that interest us those in different ways:

In [7]:
df.Egypt['incomes']
Out[7]:
1960     430.0
1970     565.0
1980     856.0
1990    1153.0
2000    1475.0
2010    1975.0
Name: Egypt, dtype: float64
In [8]:
print("Children per women in the USA in 1970 = ", df.USA['children'][1970])
print("Children per women in the USA in 1990 = ", df.USA['children',1990])
Children per women in the USA in 1970 =  2.46
Children per women in the USA in 1990 =  2.07
In [9]:
display('France data in 2000:',                   df['France'][:, 2000])
'France data in 2000:'
incomes     21774.00
children        1.82
Name: France, dtype: float64

Extracting a block¶

With loc we get a sub-array by specifying the values of the 2 indexes and the values of the columns that we want.

You can also specify only the level 0 index or slices by specifying each subindex.

In [10]:
df.loc[('incomes', [1960,2010]), :]
Out[10]:
Argentina Egypt France Sweden USA
incomes 1960 5251.0 430.0 7499.0 11360.0 13723.0
2010 10749.0 1975.0 22758.0 32631.0 37329.0
In [11]:
df.loc['children'] 
Out[11]:
Argentina Egypt France Sweden USA
1960 3.11 6.63 2.77 2.20 3.67
1970 3.07 5.94 2.49 1.92 2.46
1980 3.33 5.37 1.83 1.68 1.82
1990 2.99 4.35 1.75 2.14 2.07
2000 2.48 3.31 1.82 1.56 2.05
2010 2.22 2.88 1.98 1.99 1.93
In [12]:
df.loc[('incomes',1970):('incomes', 2000), 'Egypt':'Sweden']
Out[12]:
Egypt France Sweden
incomes 1970 565.0 11572.0 16570.0
1980 856.0 15641.0 19470.0
1990 1153.0 18731.0 23492.0
2000 1475.0 21774.0 27869.0

All the elements of an index¶

Python syntax forbides to use the : sign between parentheses. Therefore we cannot do df.loc[(:, 1960),:] to retreive all the values of 1960. We must use at choice:

  • slice(None) instead of :
  • pd.IndexSlice [:, 1960]

Attention df.loc['children', 'France'] does not work because we do not specify the year. On the other hand, df.loc['children']['France'] works because the sub-array of children is extracted and on this result, the column France is requested.

In [13]:
df.loc[(slice(None), 1960), :]
Out[13]:
Argentina Egypt France Sweden USA
incomes 1960 5251.00 430.00 7499.00 11360.0 13723.00
children 1960 3.11 6.63 2.77 2.2 3.67
In [14]:
df.loc[pd.IndexSlice[:, 1990:2010], :]
Out[14]:
Argentina Egypt France Sweden USA
incomes 1990 5581.00 1153.00 18731.00 23492.00 28298.00
2000 7695.00 1475.00 21774.00 27869.00 35081.00
2010 10749.00 1975.00 22758.00 32631.00 37329.00
children 1990 2.99 4.35 1.75 2.14 2.07
2000 2.48 3.31 1.82 1.56 2.05
2010 2.22 2.88 1.98 1.99 1.93

Multi-columns¶

Multi-index can also be multi-column. Everything is symmetrical. We can also have multi-index and multi-columns with as much level of depth as we want.

In [15]:
df.T  # transpose
Out[15]:
incomes children
1960 1970 1980 1990 2000 2010 1960 1970 1980 1990 2000 2010
Argentina 5251.0 6611.0 7540.0 5581.0 7695.0 10749.0 3.11 3.07 3.33 2.99 2.48 2.22
Egypt 430.0 565.0 856.0 1153.0 1475.0 1975.0 6.63 5.94 5.37 4.35 3.31 2.88
France 7499.0 11572.0 15641.0 18731.0 21774.0 22758.0 2.77 2.49 1.83 1.75 1.82 1.98
Sweden 11360.0 16570.0 19470.0 23492.0 27869.0 32631.0 2.20 1.92 1.68 2.14 1.56 1.99
USA 13723.0 18228.0 22630.0 28298.0 35081.0 37329.0 3.67 2.46 1.82 2.07 2.05 1.93
In [16]:
df.T.children[1970]
Out[16]:
Argentina    3.07
Egypt        5.94
France       2.49
Sweden       1.92
USA          2.46
Name: 1970, dtype: float64

Column to index and vice versa¶

If set_levels allows you to modify an index level, [set_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set _index.html) allows to modify an index either by giving directly its values, either by choosing a column to becomes the index (most common case).

The inverse is possible with reset_index.

In [17]:
df2 = df.reset_index(level=1) # index to column
print(df2.columns)
df2.rename(columns={'level_1':'year'}, inplace=True)
df2
Index(['level_1', 'Argentina', 'Egypt', 'France', 'Sweden', 'USA'], dtype='object')
Out[17]:
year Argentina Egypt France Sweden USA
incomes 1960 5251.00 430.00 7499.00 11360.00 13723.00
incomes 1970 6611.00 565.00 11572.00 16570.00 18228.00
incomes 1980 7540.00 856.00 15641.00 19470.00 22630.00
incomes 1990 5581.00 1153.00 18731.00 23492.00 28298.00
incomes 2000 7695.00 1475.00 21774.00 27869.00 35081.00
incomes 2010 10749.00 1975.00 22758.00 32631.00 37329.00
children 1960 3.11 6.63 2.77 2.20 3.67
children 1970 3.07 5.94 2.49 1.92 2.46
children 1980 3.33 5.37 1.83 1.68 1.82
children 1990 2.99 4.35 1.75 2.14 2.07
children 2000 2.48 3.31 1.82 1.56 2.05
children 2010 2.22 2.88 1.98 1.99 1.93
In [18]:
df2.set_index('year', append=True)  # without append it would remove the previous index (try it!)
Out[18]:
Argentina Egypt France Sweden USA
year
incomes 1960 5251.00 430.00 7499.00 11360.00 13723.00
1970 6611.00 565.00 11572.00 16570.00 18228.00
1980 7540.00 856.00 15641.00 19470.00 22630.00
1990 5581.00 1153.00 18731.00 23492.00 28298.00
2000 7695.00 1475.00 21774.00 27869.00 35081.00
2010 10749.00 1975.00 22758.00 32631.00 37329.00
children 1960 3.11 6.63 2.77 2.20 3.67
1970 3.07 5.94 2.49 1.92 2.46
1980 3.33 5.37 1.83 1.68 1.82
1990 2.99 4.35 1.75 2.14 2.07
2000 2.48 3.31 1.82 1.56 2.05
2010 2.22 2.88 1.98 1.99 1.93

When adding an index level it can be usefull to sort the indexes with sort_index to improve the display and some algorithms that will apply on the DataFrame.

Stack & Unstack¶

It is possible to stack the structure with less dimension and more depth. Thus our 3-dimensional table, 2 depths for the indexes and a depth for the columns can become a Serie (thus 1 dimension) at 3 depths.

In [19]:
# let use a smaller DataFrame
df_small =df.loc[(['incomes','children'], [1970,1990,2010]), ['Argentina','Sweden']]
df_small
Out[19]:
Argentina Sweden
incomes 1970 6611.00 16570.00
1990 5581.00 23492.00
2010 10749.00 32631.00
children 1970 3.07 1.92
1990 2.99 2.14
2010 2.22 1.99
In [20]:
df_small.stack()
Out[20]:
incomes   1970  Argentina     6611.00
                Sweden       16570.00
          1990  Argentina     5581.00
                Sweden       23492.00
          2010  Argentina    10749.00
                Sweden       32631.00
children  1970  Argentina        3.07
                Sweden           1.92
          1990  Argentina        2.99
                Sweden           2.14
          2010  Argentina        2.22
                Sweden           1.99
dtype: float64

We can see that our 3-dimensional dataset.

To unstack we use unstack:

Direction¶

  • stack moves columns to subindex until you have a Series
  • unstack moves indexes to sub-columns until you have a Series
In [21]:
df_small.unstack()
Out[21]:
Argentina Sweden
1970 1990 2010 1970 1990 2010
incomes 6611.00 5581.00 10749.00 16570.00 23492.00 32631.00
children 3.07 2.99 2.22 1.92 2.14 1.99
In [22]:
df_small.unstack().unstack()
Out[22]:
Argentina  1970  incomes      6611.00
                 children        3.07
           1990  incomes      5581.00
                 children        2.99
           2010  incomes     10749.00
                 children        2.22
Sweden     1970  incomes     16570.00
                 children        1.92
           1990  incomes     23492.00
                 children        2.14
           2010  incomes     32631.00
                 children        1.99
dtype: float64

Là encore on arrive à une séries mais avec l'ordre des index différent du premier cas puisqu'il s'agit des colonnes, sous-colonnes et sous-sous-colonnes (qui deviennent l'index de la séries).

Direct operations on multi-indexes¶

  • Reorder To have indexes in the order of your choice you can use:
    • swaplevel which allows to invert 2 indexes (use axis = 'columns' if multi-columns)
    • reorder_levels which allows global reordering of indexes
  • Rename set_levels changes values of a multi-index level
  • Remove droplevel removes specified level of multi-index

Columns can also be reordered by specifying the axis:

  • on indexes (axis = 0 or axis = 'index')
  • on columns (axis = 1 or axis = 'columns')
In [23]:
df_small.reorder_levels([1,0], axis='index').sort_index()  # swaplevel(0,1) would have done the same
Out[23]:
Argentina Sweden
1970 children 3.07 1.92
incomes 6611.00 16570.00
1990 children 2.99 2.14
incomes 5581.00 23492.00
2010 children 2.22 1.99
incomes 10749.00 32631.00

To get a better grasp, let's look with 3 indexes:

In [24]:
dfss = df_small.stack()
dfss
Out[24]:
incomes   1970  Argentina     6611.00
                Sweden       16570.00
          1990  Argentina     5581.00
                Sweden       23492.00
          2010  Argentina    10749.00
                Sweden       32631.00
children  1970  Argentina        3.07
                Sweden           1.92
          1990  Argentina        2.99
                Sweden           2.14
          2010  Argentina        2.22
                Sweden           1.99
dtype: float64
In [25]:
dfss.swaplevel(0,1)
Out[25]:
1970  incomes   Argentina     6611.00
                Sweden       16570.00
1990  incomes   Argentina     5581.00
                Sweden       23492.00
2010  incomes   Argentina    10749.00
                Sweden       32631.00
1970  children  Argentina        3.07
                Sweden           1.92
1990  children  Argentina        2.99
                Sweden           2.14
2010  children  Argentina        2.22
                Sweden           1.99
dtype: float64

Ce n'est pas très joli d'avoir plusieurs fois la même année dans l'index (simple problème d'affichage). Pour éviter cela on va trier l'index dans le cas suivant après avoir totalement réordonné l'index.

In [26]:
dfss.reorder_levels([2,1,0]).sort_index()
Out[26]:
Argentina  1970  children        3.07
                 incomes      6611.00
           1990  children        2.99
                 incomes      5581.00
           2010  children        2.22
                 incomes     10749.00
Sweden     1970  children        1.92
                 incomes     16570.00
           1990  children        2.14
                 incomes     23492.00
           2010  children        1.99
                 incomes     32631.00
dtype: float64
In [31]:
# Renomer

df2 = df.copy()
df2.index = df2.index.set_levels(['enfants', 'revenus'], level=0)
df2
Out[31]:
Argentina Egypt France Sweden USA
enfants 1960 5251.00 430.00 7499.00 11360.00 13723.00
1970 6611.00 565.00 11572.00 16570.00 18228.00
1980 7540.00 856.00 15641.00 19470.00 22630.00
1990 5581.00 1153.00 18731.00 23492.00 28298.00
2000 7695.00 1475.00 21774.00 27869.00 35081.00
2010 10749.00 1975.00 22758.00 32631.00 37329.00
revenus 1960 3.11 6.63 2.77 2.20 3.67
1970 3.07 5.94 2.49 1.92 2.46
1980 3.33 5.37 1.83 1.68 1.82
1990 2.99 4.35 1.75 2.14 2.07
2000 2.48 3.31 1.82 1.56 2.05
2010 2.22 2.88 1.98 1.99 1.93
In [32]:
# Effacer

df2.droplevel(1)
Out[32]:
Argentina Egypt France Sweden USA
enfants 5251.00 430.00 7499.00 11360.00 13723.00
enfants 6611.00 565.00 11572.00 16570.00 18228.00
enfants 7540.00 856.00 15641.00 19470.00 22630.00
enfants 5581.00 1153.00 18731.00 23492.00 28298.00
enfants 7695.00 1475.00 21774.00 27869.00 35081.00
enfants 10749.00 1975.00 22758.00 32631.00 37329.00
revenus 3.11 6.63 2.77 2.20 3.67
revenus 3.07 5.94 2.49 1.92 2.46
revenus 3.33 5.37 1.83 1.68 1.82
revenus 2.99 4.35 1.75 2.14 2.07
revenus 2.48 3.31 1.82 1.56 2.05
revenus 2.22 2.88 1.98 1.99 1.93
In [ ]: