Time series¶

A time series is a series with dates as index.

  1. Creation & use
  2. Data manipulation
  3. Examples

Note: see what is a date/time in Python if you haven't: [Introduction to DateTime](../lesson2 Deeper in Python/21 datetime.ipynb)

Ref: http://pandas.pydata.org/pandas-docs/stable/timeseries.html

Creation and uses¶

Dates are ordered and can be given in a list of dates, as any index, or with the methods pd.date_range(start, end, periode, frequence) where

  • you should choose between end and period, period being the number of iteration
  • frequence is define by these acronymes:
B    business day frequency
C    custom business day frequency (experimental)
D    calendar day frequency
W    weekly frequency
M    month end frequency
BM   business month end frequency
CBM  custom business month end frequency
MS   month start frequency
BMS  business month start frequency
CBMS custom business month start frequency
Q    quarter end frequency
BQ   business quarter end frequency
QS   quarter start frequency
BQS  business quarter start frequency
A    year end frequency
BA   business year end frequency
AS   year start frequency
BAS  business year start frequency
BH   business hour frequency
H    hourly frequency
T, min minutely frequency
S      secondly frequency
L, ms  milliseconds
U, us  microseconds
N      nanoseconds

see http://pandas.pydata.org/pandas-docs/stable/timeseries.html

In [1]:
import numpy as np
import pandas as pd
np.random.seed(1)
In [2]:
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B') # begin, end, only business days
dates
Out[2]:
DatetimeIndex(['2016-08-29', '2016-08-30', '2016-08-31', '2016-09-01',
               '2016-09-02', '2016-09-05', '2016-09-06'],
              dtype='datetime64[ns]', freq='B')

Avec cet index on peut créer un tableau chronologique :

It is easy to extract parts of a TimeSeries:

In [3]:
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0,5,7),
                     'pression'   : 1 + np.random.random(7)/10 },
                    index=dates)
tdf1
Out[3]:
temperature pression
2016-08-29 23 1.039658
2016-08-30 24 1.038791
2016-08-31 20 1.066975
2016-09-01 21 1.093554
2016-09-02 23 1.084631
2016-09-05 20 1.031327
2016-09-06 20 1.052455

As for usual tables we can select the parts that interest us with loc and filters. It is also possible to constrain the dates:

In [4]:
tdf1.loc['2016-08']  # just August
Out[4]:
temperature pression
2016-08-29 23 1.039658
2016-08-30 24 1.038791
2016-08-31 20 1.066975
In [5]:
tdf1.loc['2016-09-03':]  # after that date even if the date is not in the index
Out[5]:
temperature pression
2016-09-05 20 1.031327
2016-09-06 20 1.052455

Manipulation¶

Fill the holes¶

If two sources of information are incomplete, let's use methods we have already seen to fill in the gaps.

In [6]:
tdf2 = tdf1.copy()
tdf1.drop(tdf1.index[[0,1,3]], inplace=True)   # we remove some data
tdf2.drop(tdf2.index[[5,6]], inplace=True)     # more data removed
tdf2.drop(columns='pression', inplace=True)
display(tdf1, tdf2)
temperature pression
2016-08-31 20 1.066975
2016-09-02 23 1.084631
2016-09-05 20 1.031327
2016-09-06 20 1.052455
temperature
2016-08-29 23
2016-08-30 24
2016-08-31 20
2016-09-01 21
2016-09-02 23

We use merge to aggregate the data from the two arrays. As we want to be based on the temperature as well as on the index, it is necessary to set index as a column (it is either merge on index or on columns).

In [7]:
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(), on=['temperature', 'index'], how='outer')
display(res)
res.set_index('index').sort_index()
index temperature pression
0 2016-08-31 20 1.066975
1 2016-09-05 20 1.031327
2 2016-09-06 20 1.052455
3 2016-09-01 21 NaN
4 2016-08-29 23 NaN
5 2016-09-02 23 1.084631
6 2016-08-30 24 NaN
Out[7]:
temperature pression
index
2016-08-29 23 NaN
2016-08-30 24 NaN
2016-08-31 20 1.066975
2016-09-01 21 NaN
2016-09-02 23 1.084631
2016-09-05 20 1.031327
2016-09-06 20 1.052455

If the two data sources do not agree on a value, what happens?

In [8]:
tdf1.loc['2016-08-31','temperature'] = 19
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(), on=['temperature', 'index'], how='outer')
res = res.set_index('index').sort_index()
res
Out[8]:
temperature pression
index
2016-08-29 23 NaN
2016-08-30 24 NaN
2016-08-31 19 1.066975
2016-08-31 20 NaN
2016-09-01 21 NaN
2016-09-02 23 1.084631
2016-09-05 20 1.031327
2016-09-06 20 1.052455

merge in outer mode keeps all values, also at 2 different temperatures for 08/31.

Interpolation¶

As with Numpy, the interpolation can be done by taking into account the dates and therefore the difference between 2 successive dates.

In [9]:
res.interpolate(method='time')
Out[9]:
temperature pression
index
2016-08-29 23 NaN
2016-08-30 24 NaN
2016-08-31 19 1.066975
2016-08-31 20 1.066975
2016-09-01 21 1.075803
2016-09-02 23 1.084631
2016-09-05 20 1.031327
2016-09-06 20 1.052455

It is also possible to change the index and ask to recalculate the values ​​on the new index. Strangely, this only works on a Serie (one column):

In [10]:
tdf1['temperature'].resample('30h').interpolate('time')
Out[10]:
2016-08-31 00:00:00    19.00
2016-09-01 06:00:00    19.25
2016-09-02 12:00:00    19.50
2016-09-03 18:00:00    19.75
2016-09-05 00:00:00    20.00
Freq: 30h, Name: temperature, dtype: float64

To do it on a complete DataFrame we can interpolate columns one by one or do that:

In [11]:
interpol = tdf1.asfreq('30h')
display(interpol)
tmp = pd.concat([tdf1, interpol]).sort_index().interpolate(method='time').drop_duplicates()
tmp.loc[interpol.index]
temperature pression
2016-08-31 00:00:00 19.0 1.066975
2016-09-01 06:00:00 NaN NaN
2016-09-02 12:00:00 NaN NaN
2016-09-03 18:00:00 NaN NaN
2016-09-05 00:00:00 20.0 1.031327
Out[11]:
temperature pression
2016-08-31 00:00:00 19.00 1.066975
2016-09-01 06:00:00 21.50 1.078010
2016-09-02 12:00:00 22.50 1.075747
2016-09-03 18:00:00 21.25 1.053537
2016-09-05 00:00:00 20.00 1.031327

Grouping data¶

When a column has dates of type datetime, we can sense that it will not be possible to do a groupby directly since all the dates are generally different (at least down to the millisecond). So, it is necessary to group the data according to an interval, which is done with dt.to_period:

In [12]:
df = pd.DataFrame({'date': pd.date_range(start="2020-01-01", periods=15, freq='5D'),
                   'day sales': np.random.randint(50,size=15)}).sort_values('date')

df
Out[12]:
date day sales
0 2020-01-01 42
1 2020-01-06 28
2 2020-01-11 29
3 2020-01-16 14
4 2020-01-21 4
5 2020-01-26 23
6 2020-01-31 23
7 2020-02-05 41
8 2020-02-10 49
9 2020-02-15 30
10 2020-02-20 32
11 2020-02-25 22
12 2020-03-01 13
13 2020-03-06 41
14 2020-03-11 9
In [13]:
df.groupby(df['date'].dt.to_period('W'))['day sales'].mean()
Out[13]:
date
2019-12-30/2020-01-05    42.0
2020-01-06/2020-01-12    28.5
2020-01-13/2020-01-19    14.0
2020-01-20/2020-01-26    13.5
2020-01-27/2020-02-02    23.0
2020-02-03/2020-02-09    41.0
2020-02-10/2020-02-16    39.5
2020-02-17/2020-02-23    32.0
2020-02-24/2020-03-01    17.5
2020-03-02/2020-03-08    41.0
2020-03-09/2020-03-15     9.0
Freq: W-SUN, Name: day sales, dtype: float64

We can see that in date, we have the start and end of the week, which is indeed a period. Note that Freq indicates W-Sun to highlight that the weeks end on Sunday.

Resample¶

It is possible to group data by a date, for example, the last day of the week. In this case date remains a date and you need to use resample with the argument

  • 'W-Sun' to group all the data of the week on Sunday and have the same behavior as above. You can choose a different day of the week.
  • 'ME' to group the data at the end of the month
  • 'YE' to group the data at the end of the year. You can choose the month with Y-Mar for a sliding window.
In [14]:
df.resample('W-Sun', on='date').mean()  # without 'on', dates shoud be in the index
Out[14]:
day sales
date
2020-01-05 42.0
2020-01-12 28.5
2020-01-19 14.0
2020-01-26 13.5
2020-02-02 23.0
2020-02-09 41.0
2020-02-16 39.5
2020-02-23 32.0
2020-03-01 17.5
2020-03-08 41.0
2020-03-15 9.0

More¶

For more information on time tables, see the page on time series: http://pandas.pydata.org/pandas-docs/stable/timeseries.html

In [ ]: