Array manipulations include:
- reorganizing an array (reindexing)
- aggregating two or more arrays
- splitting an array into two or more
Before looking at these points, let's see how NumPy presents the dimensions of a multidimensional array through the notion of axes.
import numpy as np
# An array of marks for 3 exams taken by 4 students in two subjects
# (therefore 6 marks per student, or 12 per subject)
#                   stud.1     stud.2    stud.3    stud.4
marks = np.array([[[7,13,11], [7,7,13], [5,9,11], [7,17,15]],      # subject 1
                  [[8,12,14], [8,12,12], [8,12,10], [12,16,12]]])  # subject 2
marks.shape
(2, 4, 3)
The axes¶
An array has axes, which correspond to the axes of a coordinate system in space. The order of the axes follows the nesting of the square brackets, from the outermost to the innermost. In 2D, an array of arrays is an array of rows, each row being a 1D array of values. The order is therefore rows then columns (unlike the $(x,y)$ axes in space). In 3D the order is row, column, depth if you think of an image; otherwise the axes are simply 0, 1 and 2.
Many operations on arrays are done along one of the axes of the array, so it is important to understand what axes are.
Let's look at the marks example above. The axes are
- subjects (axis 0)
- students (axis 1)
- exams (axis 2)
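As a small illustration of this axis order, here is a sketch that rebuilds the `marks` array from above and reads a few values. An index is read as (subject, student, exam), each counted from 0; the variable names are only illustrative.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],      # subject 1
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])  # subject 2

# Index order follows the axes: (subject, student, exam).
one_mark = marks[1, 3, 0]     # subject 2, student 4, exam 1
student_2 = marks[:, 1, :]    # every mark of student 2, shape (2, 3)
exam_3 = marks[:, :, 2]       # marks of exam 3 for all subjects and students, shape (2, 4)

print(one_mark)
print(student_2.shape, exam_3.shape)
```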
Taking the mean along axis 1 means gathering the values along axis 1 (the students) and performing the calculation on them, here an average.
marks.mean(axis=1) # gives the mean for each exam in each subject
array([[ 6.5, 11.5, 12.5], [ 9. , 13. , 12. ]])
Another way to look at axes is to think of them as projection axes. If I project a 3D object along the $y$ axis, the result is a 2D object in $(x,z)$. There is thus a reduction in dimension.
If I reduce (sum, mean, ...) along axis 0 an array of shape (2,4,3), such as our marks array, I lose dimension 0 and the result therefore has shape (4,3).
marks.mean(axis=0).shape # mean along axis 0 (subjects), therefore this axis disappears
(4, 3)
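The following sketch, which assumes the same `marks` array, lists the shape obtained when reducing along each axis in turn. It also shows the `keepdims=True` option, which keeps the reduced axis with length 1 instead of dropping it.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

# Each reduction drops the axis it runs along.
shapes = {axis: marks.sum(axis=axis).shape for axis in range(marks.ndim)}
print(shapes)

# keepdims=True keeps the reduced axis as a dimension of length 1.
kept = marks.sum(axis=0, keepdims=True)
print(kept.shape)
```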
Some functions that support axes¶
All functions that operate on a set of values to produce a result should support the axis concept (I have not checked them all, so do not hesitate to give me a counter-example). We have the following mathematical functions:
- arithmetic: sum, prod, cumsum, cumprod
- statistics: min, max, argmin, argmax, mean (average), average (weighted average), std (standard deviation), var, median, percentile, quantile
- others: gradient, diff, fft
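Here is a short sketch of three of these functions applied along the exam axis of `marks`. The weights passed to `average` are hypothetical (the last exam counts double). Note that a cumulative function such as `cumsum` works along an axis without removing it.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

# Index of the best exam for each subject and student (first one in case of a tie).
best_exam = marks.argmax(axis=2)

# Running total over the exams: the shape is unchanged.
running = marks.cumsum(axis=2)

# Weighted average over the exams; the weights are an illustrative assumption.
weighted = np.average(marks, axis=2, weights=[1, 1, 2])

print(best_exam)
print(running.shape, weighted.shape)
```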
Moreover, the values of an array can be sorted along the axis of your choice with sort.
They can also be shuffled with shuffle, but only along axis 0.
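A minimal sketch of both operations on `marks`. It assumes the `shuffle` method of a NumPy random `Generator` (the text may have meant `np.random.shuffle`), with an arbitrary seed.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

# np.sort returns a sorted copy; the axis selects which values are sorted together.
sorted_exams = np.sort(marks, axis=2)      # sorts the 3 marks of each student
sorted_students = np.sort(marks, axis=1)   # sorts each exam across the 4 students

# Shuffling works in place and, by default, along axis 0:
# here only the order of the subjects can change.
rng = np.random.default_rng(0)             # arbitrary seed
shuffled = marks.copy()
rng.shuffle(shuffled)

print(sorted_exams[0, 0], sorted_students[0, :, 0])
```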
Apply a function along an axis¶
The function apply_along_axis applies a 1D function to an array along an axis. When the function returns a scalar, this is the axis that disappears in the result:
def diff_min_max(a):
    print('->', a, a.max() - a.min())
    return a.max() - a.min()
np.apply_along_axis(diff_min_max, axis=-1, arr=marks) # -1 is the last axis, marks in our case
-> [ 7 13 11] 6 -> [ 7 7 13] 6 -> [ 5 9 11] 6 -> [ 7 17 15] 10 -> [ 8 12 14] 6 -> [ 8 12 12] 4 -> [ 8 12 10] 4 -> [12 16 12] 4
array([[ 6, 6, 6, 10], [ 6, 4, 4, 4]])
Question: this is the difference between the marks, but which ones?
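When the 1D function is built from reductions that already accept `axis`, the same result can be computed without calling a Python function on each slice. The sketch below compares the two approaches; it is an alternative formulation, not a replacement for apply_along_axis in general.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

# Slice by slice, with a Python function called on every 1D slice.
spread_loop = np.apply_along_axis(lambda a: a.max() - a.min(), axis=-1, arr=marks)

# Same computation using the axis argument of max and min directly.
spread_vec = marks.max(axis=-1) - marks.min(axis=-1)

print(np.array_equal(spread_loop, spread_vec))
```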
Apply a function along several axes¶
Some operations may take a list of axes and not a single axis.
print('a.max \n', marks.max(axis=(1,2)), '\n')
print('a.max keepdim \n', marks.max(axis=(1,2), keepdims=True), '\n')
a.max [17 16] a.max keepdim [[[17]] [[16]]]
Question: what do the 2 output values correspond to?
One can also use the function apply_over_axes to apply a function along the given axes.
Beware: the function passed as argument receives the whole array and the axis on which it must work. The axes are processed one after the other, and the array is modified at each step.
def mymax(array, axis):
    print('Apply over axis', axis)
    print(array, '\n')
    return array.max(axis)
np.apply_over_axes(mymax, marks, axes=(1,2))
Apply over axis 1 [[[ 7 13 11] [ 7 7 13] [ 5 9 11] [ 7 17 15]] [[ 8 12 14] [ 8 12 12] [ 8 12 10] [12 16 12]]] Apply over axis 2 [[[ 7 17 15]] [[12 16 14]]]
array([[[17]], [[16]]])
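The result keeps all three dimensions, the reduced ones having length 1. The sketch below checks that, for `max`, this matches a reduction over a tuple of axes with `keepdims=True`.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

# apply_over_axes reduces axis 1, then axis 2, keeping each as a length-1 dimension.
over = np.apply_over_axes(np.max, marks, axes=(1, 2))

# Same values and same shape with a tuple of axes and keepdims=True.
direct = marks.max(axis=(1, 2), keepdims=True)

print(over.shape, np.array_equal(over, direct))
```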
Arranging a table¶
We have already seen reshape to change the shape of an array and flatten to flatten it to 1 dimension. Let's have a look at other array manipulation functions.
Reorder axes¶
moveaxis moves an axis¶
In our marks example, the 3 axes are subjects, students and exams.
The moveaxis function is used to move an axis. If I want the exams to become the first axis, so that the marks of each exam stand out, I move axis 2 to position 0 and the other axes slide to make room: axis 0 becomes axis 1 and axis 1 becomes axis 2:
print('marks.shape = ',marks.shape, '\n')
b = np.moveaxis(marks, 2, 0)
print('b.shape = ', b.shape)
b
marks.shape = (2, 4, 3) b.shape = (3, 2, 4)
array([[[ 7, 7, 5, 7], [ 8, 8, 8, 12]], [[13, 7, 9, 17], [12, 12, 12, 16]], [[11, 13, 11, 15], [14, 12, 10, 12]]])
It is easier to see that the first examination was difficult.
swapaxes swaps 2 axes¶
Rather than inserting one axis at a new position and shifting the others, you may want to swap two of them. Here is how to get the marks for each subject and for each exam:
marks.swapaxes(1,2)
array([[[ 7, 7, 5, 7], [13, 7, 9, 17], [11, 13, 11, 15]], [[ 8, 8, 8, 12], [12, 12, 12, 16], [14, 12, 10, 12]]])
transpose to do everything¶
Finally, transpose lets you reorder all the axes as you want. Thus transpose((2,0,1)) puts
- axis 2 in position 0,
- axis 0 in position 1,
- axis 1 in position 2.
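To make the link between the three functions concrete, here is a sketch on the `marks` array. It checks that transpose((2,0,1)) gives the same array as moving axis 2 to position 0, and that swapping axes 1 and 2 is the transpose with order (0,2,1). It also checks that the result shares its memory with `marks`, so no data is copied.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

t = marks.transpose((2, 0, 1))          # new axis order: exams, subjects, students

same_as_move = np.array_equal(t, np.moveaxis(marks, 2, 0))
same_as_swap = np.array_equal(marks.transpose((0, 2, 1)), marks.swapaxes(1, 2))

# The transposed array is a view on the same data.
is_view = np.shares_memory(t, marks)

print(t.shape, same_as_move, same_as_swap, is_view)
```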
A simpler and faster apply over axis¶
Unfortunately the apply_over_axes function is not optimized, so in some cases it may be preferable to loop over the array, which means putting the axes that remain first and the axes being reduced last:
print("Means per students", [m.mean() for m in marks.transpose((1,0,2))])
%timeit [m.mean() for m in marks.transpose((1,0,2))]
%timeit np.apply_over_axes(np.mean, marks, axes=(0,2))
Means per students [10.833333333333334, 9.833333333333334, 9.166666666666666, 13.166666666666666] 24 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 29.6 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
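For a reduction such as the mean, which accepts a tuple of axes, a third option is to pass both axes directly. The sketch below only checks that it gives the same values as the loop; timings depend on the machine and the array size, so measure before drawing conclusions.

```python
import numpy as np

marks = np.array([[[7, 13, 11], [7, 7, 13], [5, 9, 11], [7, 17, 15]],
                  [[8, 12, 14], [8, 12, 12], [8, 12, 10], [12, 16, 12]]])

# Loop version: students first, then one mean per student.
per_student_loop = np.array([m.mean() for m in marks.transpose((1, 0, 2))])

# Direct version: reduce over subjects (axis 0) and exams (axis 2) in one call.
per_student_vec = marks.mean(axis=(0, 2))

print(per_student_vec)
print(np.allclose(per_student_loop, per_student_vec))
```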
The flip function reverses the order of the elements along an axis:
a = np.arange(24).reshape([2,3,4])
np.flip(a,0)
array([[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]], [[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]])
You can roll values along an axis with roll by specifying how far to shift them; values pushed past the end come back at the start:
np.roll(a, 2, axis=1) # roll elements by 2 along axis 1
array([[[ 4, 5, 6, 7], [ 8, 9, 10, 11], [ 0, 1, 2, 3]], [[16, 17, 18, 19], [20, 21, 22, 23], [12, 13, 14, 15]]])
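Both operations can be expressed in other ways, which may help to see what they do. In this sketch on the same array `a`, flip is compared with reversed slices, and a roll is undone by rolling the opposite way.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# Flipping along axis 0 is the same as a reversed slice on that axis.
flipped = np.flip(a, 0)
same_as_slice = np.array_equal(flipped, a[::-1])

# flip also accepts a tuple of axes.
both = np.flip(a, axis=(0, 2))

# Rolling wraps around: axis 1 has length 3, so a shift of 2 equals a shift of -1.
rolled = np.roll(a, 2, axis=1)
same_shift = np.array_equal(rolled, np.roll(a, -1, axis=1))

# Rolling back by the opposite amount restores the original array.
restored = np.roll(rolled, -2, axis=1)

print(same_as_slice, same_shift, np.array_equal(restored, a))
```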
Transposition also works whatever the number of dimensions. By default it reverses the order of the axes, but you can specify the desired output order.
a.T
array([[[ 0, 12], [ 4, 16], [ 8, 20]], [[ 1, 13], [ 5, 17], [ 9, 21]], [[ 2, 14], [ 6, 18], [10, 22]], [[ 3, 15], [ 7, 19], [11, 23]]])
np.transpose(a, (0,2,1))
array([[[ 0, 4, 8], [ 1, 5, 9], [ 2, 6, 10], [ 3, 7, 11]], [[12, 16, 20], [13, 17, 21], [14, 18, 22], [15, 19, 23]]])
Aggregation¶
Concatenation¶
The basic function is concatenate, which takes the axis along which to concatenate. This is, in my opinion, the safest method, and it works whatever the number of dimensions.
However, for 2D or 3D arrays, we can use:
- vstack or row_stack for vertical concatenation
- hstack or column_stack for horizontal concatenation
- dstack for depth-wise concatenation
All of these functions take a list of arrays to concatenate as an argument. Of course the shapes of the arrays must be compatible.
a = np.zeros((2,3))
b = np.ones((2,3))
print(np.concatenate((a,b), axis=0), '\n') # same as vstack
print(np.hstack((a,b))) # same as concatenate with axis=1
[[0. 0. 0.] [0. 0. 0.] [1. 1. 1.] [1. 1. 1.]] [[0. 0. 0. 1. 1. 1.] [0. 0. 0. 1. 1. 1.]]
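The sketch below compares the shapes produced by the three helpers on the same 2D arrays `a` and `b`. For 2D inputs, dstack first gives each array a third axis of length 1 and then concatenates along it.

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((2, 3))

v = np.vstack((a, b))   # rows of b under the rows of a
h = np.hstack((a, b))   # columns of b to the right of the columns of a
d = np.dstack((a, b))   # a and b side by side along a new third axis

print(v.shape, h.shape, d.shape)

# The same results written with concatenate.
v2 = np.concatenate((a, b), axis=0)
h2 = np.concatenate((a, b), axis=1)
d2 = np.concatenate((a[..., np.newaxis], b[..., np.newaxis]), axis=2)
```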
c = np.stack((a,b)) # c[0] is a
c
array([[[0., 0., 0.], [0., 0., 0.]], [[1., 1., 1.], [1., 1., 1.]]])
Note that stack has an axis option to indicate the position of the new axis along which the given arrays are stacked.
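As an illustration, this sketch stacks the same two arrays with each possible value of `axis` and shows where the original arrays end up in the result.

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((2, 3))

# The new axis (of length 2, one entry per stacked array) is inserted at position `axis`.
shapes = {axis: np.stack((a, b), axis=axis).shape for axis in (0, 1, 2)}
print(shapes)

s0 = np.stack((a, b), axis=0)   # s0[0] is a, s0[1] is b
s1 = np.stack((a, b), axis=1)   # s1[:, 0] is a, s1[:, 1] is b
s2 = np.stack((a, b), axis=2)   # s2[:, :, 0] is a, s2[:, :, 1] is b
```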
Splitting¶
The inverse of concatenation is splitting, done with split, which takes as arguments:
- the array to split
- the number of pieces, or the indices at which to cut
- the direction (the axis)
To recover the two arrays that produced the result of the previous cell, we would cut it in 2 along axis 0. We can also cut along another axis:
e,f = np.split(c, 2, 1) # splits in 2 along axis 1
print("split part 1\n", e, '\n')
print("split part 2\n", f)
split part 1 [[[0. 0. 0.]] [[1. 1. 1.]]] split part 2 [[[0. 0. 0.]] [[1. 1. 1.]]]
There are also vsplit, hsplit and dsplit to split along axes 0, 1 and 2 respectively.
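A short sketch of the second form of split, where a list of indices says where to cut, followed by vsplit (axis 0) and hsplit (axis 1). The array `x` is a hypothetical example, unrelated to the arrays above.

```python
import numpy as np

x = np.arange(24).reshape(4, 6)

# Cut along axis 1 before columns 1 and 4: pieces of 1, 3 and 2 columns.
left, middle, right = np.split(x, [1, 4], axis=1)
print(left.shape, middle.shape, right.shape)

# Two equal pieces along axis 0 (rows) and along axis 1 (columns).
top, bottom = np.vsplit(x, 2)
west, east = np.hsplit(x, 2)
print(top.shape, west.shape)

# Concatenating the pieces along the same axis gives back the original array.
rebuilt = np.concatenate((left, middle, right), axis=1)
```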
From Python to Numpy¶
If you want to dig and look at many examples, you can read N. Rougier's book From Python to Numpy.
Pandas too¶
We will find these manipulations again in Pandas, the super spreadsheet of Python. It also works on array-like structures, but without the constraint that all values have the same type.