np03 Manipulations.ipynb

reorganizing an array (reindexing)
aggregating 2 or more arrays
splitting an array into 2 or more

Before looking at these points, let's look at how Numpy presents the dimensions of a multidimensional array with the notion of axes.

In [1]:

import numpy as np

# An array of marks for 3 exams taken by 4 students in two subjects
# (therefore 6 marks per student, or 12 per subject)

#                    stud.1     stud.2     stud.3     stud.4
marks = np.array([[[7,13,11],  [7,7,13],  [5,9,11],  [7,17,15]],    # subject 1
                   [[8,12,14], [8,12,12], [8,12,10], [12,16,12]]])  # subject 2
marks.shape

Out[1]:

(2, 4, 3)

The axes¶

An array has axes, which correspond to the axes of a coordinate system in space. The axes are ordered by the nesting of the square brackets, from the outermost to the innermost. In 2D, an array of arrays is an array of rows, each row being a 1D array of values. The order is therefore rows then columns (unlike the $(x,y)$ axes in space). In 3D, the order is row, column, depth if you think of the array as an image; otherwise the axes are simply 0, 1 and 2.

Many array operations work along one of the axes of the array, so it is important to understand what axes are.

Let's look at the marks example above. Its axes are:

axis 0: subjects
axis 1: students
axis 2: exams

Taking the mean along axis 1 means collecting the values that lie along axis 1 (here the 4 students) and computing on them, which here yields one mean per subject and per exam.

In [2]:

marks.mean(axis=1)   # give means for each exam in each subject

Out[2]:

array([[ 6.5, 11.5, 12.5],
       [ 9. , 13. , 12. ]])

Another way to look at axes is to think of them as projection axes. If I project a 3D object along the $y$ axis, the result is a 2D object in $(x,z)$. There is thus a reduction in dimension.

If I reduce (sum, mean, ...) along axis 0 an array of shape (2,4,3), such as our marks array, I lose dimension 0 and the result therefore has shape (4,3).

In [3]:

marks.mean(axis=0).shape  # mean along axis 0 (subjects), therefore this axis disappears

Out[3]:

(4, 3)
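As a quick check of the projection idea, the sketch below reduces a zero-filled stand-in array with the same shape as marks along each axis in turn and records the shape of the result. The names data and reduced_shapes are illustrative only.

```python
import numpy as np

# Stand-in with the same shape as marks: (subjects, students, exams)
data = np.zeros((2, 4, 3))

# Reducing along one axis removes that axis from the shape
reduced_shapes = {axis: data.mean(axis=axis).shape for axis in range(data.ndim)}

for axis, shape in reduced_shapes.items():
    print(f"mean(axis={axis}) -> shape {shape}")
```

Each reduction removes exactly one entry from the shape, the one at the position of the chosen axis.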

Some functions that support axes¶

Every function that takes a set of values and produces a result should support the axis concept (I have not checked them all, so do not hesitate to give me a counter-example). We have the following mathematical functions:

arithmetic: sum, prod, cumsum, cumprod
statistics: min, max, argmin, argmax, mean (average), average (weighted average), std (standard deviation), var, median, percentile, quantile
others: gradient, diff, fft

Moreover, you can sort the values of an array along the axis of your choice with sort. Shuffling is more limited: np.random.shuffle only shuffles along axis 0 (the newer Generator.shuffle, obtained through np.random.default_rng(), accepts an axis argument).
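A few of these functions applied to the marks array, as a minimal sketch. The variable names are illustrative, and the shuffle uses the Generator API (np.random.default_rng) with its axis argument rather than np.random.shuffle.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

# Index of the highest mark along axis 2, shape (2, 4)
best_exam = marks.argmax(axis=2)

# Running total along axis 2, same shape as marks
running = marks.cumsum(axis=2)

# Sort the values along axis 2 (np.sort returns a sorted copy)
sorted_marks = np.sort(marks, axis=2)

# Shuffle along axis 1 (students), in place, on a copy
rng = np.random.default_rng(0)
shuffled = marks.copy()
rng.shuffle(shuffled, axis=1)

print(best_exam)
print(sorted_marks[0])
```

The last value of the running total along an axis equals the sum along that axis, which is a handy sanity check.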

Apply a function along an axis¶

The function apply_along_axis applies a function that works on 1D arrays to an array along an axis. This is the axis that will disappear in the result:

In [4]:

def diff_min_max(a):
    print('->', a, a.max() - a.min())
    return a.max() - a.min()

np.apply_along_axis(diff_min_max, axis=-1, arr=marks)   # -1 is the last axis, the exams in our case

-> [ 7 13 11] 6
-> [ 7  7 13] 6
-> [ 5  9 11] 6
-> [ 7 17 15] 10
-> [ 8 12 14] 6
-> [ 8 12 12] 4
-> [ 8 12 10] 4
-> [12 16 12] 4

Out[4]:

array([[ 6,  6,  6, 10],
       [ 6,  4,  4,  4]])
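Since apply_along_axis calls a Python function on every 1D slice, a reduction like this one can also be written directly with array methods. The sketch below is one possible equivalent; np.ptp ("peak to peak") is a NumPy function that computes max minus min along an axis.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

# max - min along the last axis, without a Python-level function call per slice
spread = marks.max(axis=-1) - marks.min(axis=-1)

# Same computation with the dedicated function
spread_ptp = np.ptp(marks, axis=-1)

print(spread)
```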

Question: this is the difference between the marks, but which ones?

Apply a function along several axes¶

Some operations accept a tuple of axes rather than a single axis.

In [5]:

print('a.max \n', marks.max(axis=(1,2)), '\n') 
print('a.max keepdim \n', marks.max(axis=(1,2), keepdims=True), '\n')

a.max 
 [17 16] 

a.max keepdim 
 [[[17]]

 [[16]]]
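One reason to ask for keepdims=True is that the result still broadcasts against the original array. The sketch below divides marks by its maximum over axes 1 and 2; the names top and relative are illustrative.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

# Shape (2, 1, 1): the reduced axes are kept with length 1
top = marks.max(axis=(1, 2), keepdims=True)

# (2, 4, 3) / (2, 1, 1) broadcasts; a result of shape (2,) would not
relative = marks / top

print(top.shape, relative.shape)
```

Without keepdims the maximum has shape (2,), which cannot be broadcast against (2, 4, 3) without reshaping it first.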

Question: what do the 2 output values correspond to?

You can also use the function apply_over_axes to apply a function over the given axes.

Beware: the function passed as argument receives the whole array and the axis on which it must work. The axes are handled one after the other, and each call receives the result of the previous one.

In [6]:

def mymax(array, axis):
    print('Apply over axis', axis)
    print(array, '\n')
    return array.max(axis)

np.apply_over_axes(mymax, marks, axes=(1,2))

Apply over axis 1
[[[ 7 13 11]
  [ 7  7 13]
  [ 5  9 11]
  [ 7 17 15]]

 [[ 8 12 14]
  [ 8 12 12]
  [ 8 12 10]
  [12 16 12]]] 

Apply over axis 2
[[[ 7 17 15]]

 [[12 16 14]]]

Out[6]:

array([[[17]],

       [[16]]])
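For a reduction such as max, the result of apply_over_axes can be compared with a single call that takes a tuple of axes and keeps the dimensions. This is a sketch to check the equivalence on the marks array, not a general statement about every function.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

# Axis 1 is reduced first, then axis 2; reduced axes are kept with length 1
step_by_step = np.apply_over_axes(np.max, marks, axes=(1, 2))

# Single call over both axes
direct = marks.max(axis=(1, 2), keepdims=True)

print(step_by_step.shape, np.array_equal(step_by_step, direct))
```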

Arranging a table¶

We have already seen reshape to change the shape of an array. To flatten an array to 1 dimension, we have:

ravel, which returns a 1D view of the array when possible (a copy otherwise)
flatten, which returns a 1D copy of the array
flat, which returns an iterator over the array

unravel_index transforms a 1D index into a multidimensional index.
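The view versus copy distinction can be checked with np.shares_memory, and unravel_index pairs naturally with argmax, which returns an index into the flattened array. A minimal sketch on the marks array, with illustrative variable names:

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

flat_view = marks.ravel()      # view: marks is contiguous in memory
flat_copy = marks.flatten()    # always a copy
flat_of_T = marks.T.ravel()    # the transpose is not contiguous, so ravel must copy

print(np.shares_memory(marks, flat_view))   # True
print(np.shares_memory(marks, flat_copy))   # False
print(np.shares_memory(marks, flat_of_T))   # False

# argmax without axis works on the flattened array
flat_index = marks.argmax()
position = np.unravel_index(flat_index, marks.shape)
print(flat_index, position, marks[position])
```

Here position is a tuple (subject, student, exam) that can be used directly to index marks.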

Reorder axes¶

`moveaxis` moves an axis¶

In our marks example, the 3 axes are subjects, students and exams. The moveaxis function moves an axis. If I want the exams to become the first axis so that the marks of each exam stand out, I move axis 2 to position 0 and the other axes slide to make room: axis 0 becomes axis 1 and axis 1 becomes axis 2:

In [7]:

print('marks.shape = ',marks.shape, '\n')
b = np.moveaxis(marks, 2, 0) 
print('b.shape = ', b.shape)
b

marks.shape =  (2, 4, 3) 

b.shape =  (3, 2, 4)

Out[7]:

array([[[ 7,  7,  5,  7],
        [ 8,  8,  8, 12]],

       [[13,  7,  9, 17],
        [12, 12, 12, 16]],

       [[11, 13, 11, 15],
        [14, 12, 10, 12]]])

It is now easier to see that the first exam was difficult.

`swapaxes` swaps 2 axes¶

Rather than inserting one axis at a new position and shifting the others, you may want to swap two of them. Here is how to get the marks for each subject and for each exam:

In [8]:

marks.swapaxes(1,2)

Out[8]:

array([[[ 7,  7,  5,  7],
        [13,  7,  9, 17],
        [11, 13, 11, 15]],

       [[ 8,  8,  8, 12],
        [12, 12, 12, 16],
        [14, 12, 10, 12]]])

`transpose` to do everything¶

Finally, transpose lets you reorder all the axes as you want. Thus transpose((2,0,1)) puts:

axis 2 in position 0,
axis 0 in position 1,
axis 1 in position 2.
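The three reordering functions can express the same operations, which the following sketch checks on the marks array. The variable names are illustrative.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

by_transpose = marks.transpose((2, 0, 1))   # axes become (exams, subjects, students)
by_moveaxis = np.moveaxis(marks, 2, 0)      # same reordering
by_swap = marks.swapaxes(1, 2)              # axes become (subjects, exams, students)

print(by_transpose.shape)                                        # (3, 2, 4)
print(np.array_equal(by_transpose, by_moveaxis))                 # True
print(np.array_equal(by_swap, marks.transpose((0, 2, 1))))       # True

# Element [k, i, j] of the transposed array is element [i, j, k] of marks
print(by_transpose[1, 0, 3] == marks[0, 3, 1])                   # True

# The reordered array is a view: no data is copied
print(np.shares_memory(by_transpose, marks))                     # True
```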

A simpler and faster apply over axis¶

Unfortunately the apply_over_axes function is not optimized, so in some cases it may be preferable to loop over the array yourself. This means putting the axes that will remain first and the axes being reduced last:

In [9]:

print("Means per students", [m.mean() for m in marks.transpose((1,0,2))])

%timeit [m.mean() for m in marks.transpose((1,0,2))]
%timeit np.apply_over_axes(np.mean, marks, axes=(0,2))

Means per students [10.833333333333334, 9.833333333333334, 9.166666666666666, 13.166666666666666]
24 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.6 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
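When the reduction is one that NumPy already provides as a method, such as mean, passing a tuple of axes (as with max earlier) is another option that avoids both the loop and apply_over_axes. The sketch below only checks that the results agree; it makes no timing claim.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

# Reduce over subjects (axis 0) and exams (axis 2), keeping students (axis 1)
per_student = marks.mean(axis=(0, 2))

# Same values as the loop over the transposed array
per_student_loop = np.array([m.mean() for m in marks.transpose((1, 0, 2))])

print(per_student)
print(np.allclose(per_student, per_student_loop))
```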

Changing the order of array elements¶

You can reverse the values of an array along an axis with flip; the same can be done through indexing. Thus np.flip(a, n) is equivalent to a[:,:,..,::-1,:,..,:] with ::-1 in position $n$.

In [10]:

a = np.arange(24).reshape([2,3,4])
np.flip(a,0)

Out[10]:

array([[[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]],

       [[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]]])
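The equivalence between flip and slicing with ::-1 can be verified axis by axis. A short sketch on the same array a:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# ::-1 placed at the position of the flipped axis
same_axis0 = np.array_equal(np.flip(a, 0), a[::-1])
same_axis1 = np.array_equal(np.flip(a, 1), a[:, ::-1])
same_axis2 = np.array_equal(np.flip(a, 2), a[:, :, ::-1])

# flip also accepts a tuple of axes
same_two_axes = np.array_equal(np.flip(a, (0, 2)), a[::-1, :, ::-1])

print(same_axis0, same_axis1, same_axis2, same_two_axes)
```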

You can roll values along an axis with roll by specifying by how many positions to shift them; values pushed past the end wrap around to the beginning:

In [11]:

np.roll(a, 2, axis=1)              # roll elements by 2 along axis 1

Out[11]:

array([[[ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [ 0,  1,  2,  3]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23],
        [12, 13, 14, 15]]])
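Because roll wraps around, shifts are effectively taken modulo the length of the axis, and a negative shift rolls in the other direction. A small sketch on the same array a (axis 1 has length 3):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

forward = np.roll(a, 2, axis=1)     # shift by 2 positions
backward = np.roll(a, -1, axis=1)   # shift by 1 position the other way
full_turn = np.roll(a, 3, axis=1)   # shift by the axis length

print(np.array_equal(forward, backward))   # True: 2 and -1 agree modulo 3
print(np.array_equal(full_turn, a))        # True: back to the start

# Without axis, the array is flattened, rolled, then given its shape back
rolled_flat = np.roll(a, 1)
print(rolled_flat[0, 0, 0])                # 23, the former last element
```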

Transposition also applies whatever the number of dimensions. By default it reverses the order of the axes, but you can specify the desired output order.

In [12]:

a.T

Out[12]:

array([[[ 0, 12],
        [ 4, 16],
        [ 8, 20]],

       [[ 1, 13],
        [ 5, 17],
        [ 9, 21]],

       [[ 2, 14],
        [ 6, 18],
        [10, 22]],

       [[ 3, 15],
        [ 7, 19],
        [11, 23]]])

In [13]:

np.transpose(a, (0,2,1))

Out[13]:

array([[[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]],

       [[12, 16, 20],
        [13, 17, 21],
        [14, 18, 22],
        [15, 19, 23]]])

Aggregation¶

Concatenation¶

The basic function is concatenate, which takes the axis along which to concatenate. In my opinion this is the safest method, and it works whatever the number of dimensions.

However, for 2D or 3D arrays, we can use:

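vstack (or its alias row_stack, deprecated in recent NumPy versions) for vertical concatenation
hstack or column_stack for horizontal concatenation (the two differ for 1D arrays, which column_stack turns into columns first)
dstack for depth-wise concatenation

All of these functions take a sequence of arrays to concatenate as their argument. Of course the shapes of the arrays must be compatible.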

In [14]:

a = np.zeros((2,3))
b = np.ones((2,3))

print(np.concatenate((a,b), axis=0), '\n')   # same as vstack
print(np.hstack((a,b)))                      # same as concatenate with axis=1

[[0. 0. 0.]
 [0. 0. 0.]
 [1. 1. 1.]
 [1. 1. 1.]] 

[[0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1.]]
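The helper functions behave differently when given 1D arrays, which is where they stop being plain shortcuts for concatenate. The sketch below compares the resulting shapes for two small 1D arrays x and y (illustrative names).

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

shapes = {
    "concatenate": np.concatenate((x, y)).shape,   # (6,)   end to end
    "hstack": np.hstack((x, y)).shape,             # (6,)   same as concatenate
    "vstack": np.vstack((x, y)).shape,             # (2, 3) each array becomes a row
    "column_stack": np.column_stack((x, y)).shape, # (3, 2) each array becomes a column
    "dstack": np.dstack((x, y)).shape,             # (1, 3, 2)
}

for name, shape in shapes.items():
    print(f"{name:13s} {shape}")
```

For 2D inputs such as the zeros and ones arrays above, hstack and column_stack give the same result.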

Stacking¶

Unlike concatenation, stacking adds a dimension. It is useful for storing a set of 2D arrays, images for example, in a 3D array. We use the function stack.

In [15]:

c = np.stack((a,b))   #  c[0] is a
c

Out[15]:

array([[[0., 0., 0.],
        [0., 0., 0.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

Note that stack has an axis option that sets the position of the new axis along which the given arrays are stored.
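A short sketch of the axis option of stack, reusing the zeros and ones arrays. The new axis always has length 2 (the number of stacked arrays); only its position changes.

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((2, 3))

first = np.stack((a, b), axis=0)    # shape (2, 2, 3), first[0] is a
middle = np.stack((a, b), axis=1)   # shape (2, 2, 3), middle[:, 0] is a
last = np.stack((a, b), axis=-1)    # shape (2, 3, 2), last[..., 0] is a

print(first.shape, middle.shape, last.shape)
print(np.array_equal(last[..., 1], b))   # True
```

Here first and middle happen to have the same shape because a has 2 rows, but their contents are laid out differently.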

Splitting¶

The inverse of concatenation is splitting, done with split, which takes as arguments:

the array to split
the number of pieces or the indices at which to cut
the direction (the axis)

To recover the two arrays that produced the result of the previous cell, we would split in 2 along axis 0. We can also split along another axis.

In [16]:

e,f = np.split(c, 2, 1)  # splits in 2 along axis 1
print("split part 1\n", e, '\n')
print("split part 2\n", f)

split part 1
 [[[0. 0. 0.]]

 [[1. 1. 1.]]] 

split part 2
 [[[0. 0. 0.]]

 [[1. 1. 1.]]]

There are also vsplit, hsplit and dsplit to split along axes 0, 1 and 2 respectively.
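The sketch below illustrates the two forms of the second argument of split (a number of pieces or a list of cut indices), together with vsplit and hsplit. The array_split function, which tolerates pieces of unequal size, is not mentioned above and is shown only for comparison; all variable names are illustrative.

```python
import numpy as np

# Splitting along axis 0 recovers the two stacked arrays, with a leading axis of length 1
c = np.stack((np.zeros((2, 3)), np.ones((2, 3))))
a_back, b_back = np.split(c, 2, axis=0)
print(a_back.shape)                       # (1, 2, 3)

# A list of indices gives the cut points
x = np.arange(10)
parts = np.split(x, [3, 7])               # x[:3], x[3:7], x[7:]
print([p.tolist() for p in parts])

# split requires equal pieces; array_split accepts an uneven division
chunks = np.array_split(x, 3)
print([len(chunk) for chunk in chunks])   # [4, 3, 3]

# vsplit cuts along axis 0 (rows), hsplit along axis 1 (columns)
m = np.arange(12).reshape(3, 4)
top, bottom = np.vsplit(m, [1])
left, right = np.hsplit(m, 2)
print(top.shape, bottom.shape, left.shape, right.shape)
```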

From Python to Numpy¶

If you want to dig and look at many examples, you can read N. Rougier's book From Python to Numpy.

Pandas too¶

We will meet these manipulations again with Pandas, the super spreadsheet of Python. It also works on array-like structures, but without the constraint that all values have the same type.
