A time series is a Series or DataFrame whose index is made of dates.
Note: if you have not yet seen what a date/time is in Python, see: [Introduction to DateTime](../lesson2 Deeper in Python/21 datetime.ipynb)
Ref: http://pandas.pydata.org/pandas-docs/stable/timeseries.html
Dates are ordered and can be given as a list of dates, like any index, or generated with the method pd.date_range(start, end, periods, freq), where the frequency freq is one of:
Alias | Frequency
---|---
B | business day
C | custom business day (experimental)
D | calendar day
W | weekly
M | month end
BM | business month end
CBM | custom business month end
MS | month start
BMS | business month start
CBMS | custom business month start
Q | quarter end
BQ | business quarter end
QS | quarter start
BQS | business quarter start
A | year end
BA | business year end
AS | year start
BAS | business year start
BH | business hour
H | hourly
T, min | minutely
S | secondly
L, ms | milliseconds
U, us | microseconds
N | nanoseconds
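To get a feel for these aliases, here is a small sketch (not part of the original notebook) generating ranges with a few of them:

```python
import pandas as pd

# a few of the frequency aliases listed above
bdays = pd.date_range('2016-08-28', periods=5, freq='B')    # business days
months = pd.date_range('2016-01-01', periods=3, freq='MS')  # month starts
hours = pd.date_range('2016-08-28', periods=4, freq='h')    # hourly

print(bdays)   # starts on 2016-08-29: the 28th is a Sunday, not a business day
print(months)
print(hours)
```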
import numpy as np
import pandas as pd
np.random.seed(1)
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B') # begin, end, only business days
dates
DatetimeIndex(['2016-08-29', '2016-08-30', '2016-08-31', '2016-09-01', '2016-09-02', '2016-09-05', '2016-09-06'], dtype='datetime64[ns]', freq='B')
With this index we can create a time-indexed DataFrame, from which it will be easy to extract parts:
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0,5,7),
'pression' : 1 + np.random.random(7)/10 },
index=dates)
tdf1
temperature | pression | |
---|---|---|
2016-08-29 | 23 | 1.039658 |
2016-08-30 | 24 | 1.038791 |
2016-08-31 | 20 | 1.066975 |
2016-09-01 | 21 | 1.093554 |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
As with ordinary tables, we can select the parts we are interested in with loc
and boolean filters.
It is also possible to constrain the dates with partial date strings:
tdf1.loc['2016-08'] # just August
temperature | pression | |
---|---|---|
2016-08-29 | 23 | 1.039658 |
2016-08-30 | 24 | 1.038791 |
2016-08-31 | 20 | 1.066975 |
tdf1.loc['2016-09-03':] # from this date on, even though it is not in the index
temperature | pression | |
---|---|---|
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
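Boolean filters based on the date attributes of the index work as well. A small sketch (rebuilding tdf1 with the same seed as above):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B')
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0, 5, 7),
                     'pression': 1 + np.random.random(7) / 10},
                    index=dates)

# keep only the rows whose date falls in September
september = tdf1[tdf1.index.month == 9]
print(september)

# keep only Mondays (dayofweek: Monday == 0)
mondays = tdf1[tdf1.index.dayofweek == 0]
print(mondays)
```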
If two sources of information are incomplete, we can use methods we have already seen to fill in the gaps.
tdf2 = tdf1.copy()
tdf1.drop(tdf1.index[[0,1,3]], inplace=True) # we remove some data
tdf2.drop(tdf2.index[[5,6]], inplace=True) # more data removed
tdf2.drop(columns='pression', inplace=True)
display(tdf1, tdf2)
temperature | pression | |
---|---|---|
2016-08-31 | 20 | 1.066975 |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
temperature | |
---|---|
2016-08-29 | 23 |
2016-08-30 | 24 |
2016-08-31 | 20 |
2016-09-01 | 21 |
2016-09-02 | 23 |
We use merge
to aggregate the data from the two tables. Since we want to match on the temperature as well as on the index, the index has to be turned into a column (merge operates either on the index or on columns):
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(), on=['temperature', 'index'], how='outer')
display(res)
res.set_index('index').sort_index()
index | temperature | pression | |
---|---|---|---|
0 | 2016-08-31 | 20 | 1.066975 |
1 | 2016-09-05 | 20 | 1.031327 |
2 | 2016-09-06 | 20 | 1.052455 |
3 | 2016-09-01 | 21 | NaN |
4 | 2016-08-29 | 23 | NaN |
5 | 2016-09-02 | 23 | 1.084631 |
6 | 2016-08-30 | 24 | NaN |
temperature | pression | |
---|---|---|
index | ||
2016-08-29 | 23 | NaN |
2016-08-30 | 24 | NaN |
2016-08-31 | 20 | 1.066975 |
2016-09-01 | 21 | NaN |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
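For this particular gap-filling task, combine_first is a possible shortcut: it aligns the two tables on their index and takes values from the first, falling back on the second. A sketch rebuilding the two truncated tables:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B')
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0, 5, 7),
                     'pression': 1 + np.random.random(7) / 10},
                    index=dates)
tdf2 = tdf1.copy()
tdf1 = tdf1.drop(tdf1.index[[0, 1, 3]])                        # incomplete source 1
tdf2 = tdf2.drop(tdf2.index[[5, 6]]).drop(columns='pression')  # incomplete source 2

# combine_first aligns on the index: tdf1's values first, tdf2's as fallback
res = tdf1.combine_first(tdf2)
print(res)
```

Note that combine_first matches rows purely on the index, so it cannot detect a disagreement between the sources the way the merge on ['temperature', 'index'] does below.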
If the two data sources do not agree on a value, what happens?
tdf1.loc['2016-08-31','temperature'] = 19
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(), on=['temperature', 'index'], how='outer')
res = res.set_index('index').sort_index()
res
temperature | pression | |
---|---|---|
index | ||
2016-08-29 | 23 | NaN |
2016-08-30 | 24 | NaN |
2016-08-31 | 19 | 1.066975 |
2016-08-31 | 20 | NaN |
2016-09-01 | 21 | NaN |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
merge
in outer
mode keeps all values, hence the two different temperatures for 08/31.
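To trace which table each row came from, merge also accepts indicator=True, which adds a _merge column. A sketch on the same data:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B')
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0, 5, 7),
                     'pression': 1 + np.random.random(7) / 10},
                    index=dates)
tdf2 = tdf1.copy()
tdf1 = tdf1.drop(tdf1.index[[0, 1, 3]])
tdf2 = tdf2.drop(tdf2.index[[5, 6]]).drop(columns='pression')
tdf1.loc['2016-08-31', 'temperature'] = 19   # the disagreeing value

# indicator=True tags each row: 'left_only', 'right_only' or 'both'
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(),
               on=['temperature', 'index'], how='outer', indicator=True)
print(res['_merge'].value_counts())
```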
As with NumPy, interpolation can take the dates into account, and therefore the gap between two successive dates.
res.interpolate(method='time')
temperature | pression | |
---|---|---|
index | ||
2016-08-29 | 23 | NaN |
2016-08-30 | 24 | NaN |
2016-08-31 | 19 | 1.066975 |
2016-08-31 | 20 | 1.066975 |
2016-09-01 | 21 | 1.075803 |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
It is also possible to change the index and ask for the values to be recomputed on the new index. Strangely, this only works on a Series (one column):
tdf1['temperature'].resample('30h').interpolate('time')
2016-08-31 00:00:00    19.00
2016-09-01 06:00:00    19.25
2016-09-02 12:00:00    19.50
2016-09-03 18:00:00    19.75
2016-09-05 00:00:00    20.00
Freq: 30h, Name: temperature, dtype: float64
To do it on a complete DataFrame we can interpolate the columns one by one, or do this:
interpol = tdf1.asfreq('30h')
display(interpol)
tmp = pd.concat([tdf1, interpol]).sort_index().interpolate(method='time').drop_duplicates()
tmp.loc[interpol.index]
temperature | pression | |
---|---|---|
2016-08-31 00:00:00 | 19.0 | 1.066975 |
2016-09-01 06:00:00 | NaN | NaN |
2016-09-02 12:00:00 | NaN | NaN |
2016-09-03 18:00:00 | NaN | NaN |
2016-09-05 00:00:00 | 20.0 | 1.031327 |
temperature | pression | |
---|---|---|
2016-08-31 00:00:00 | 19.00 | 1.066975 |
2016-09-01 06:00:00 | 21.50 | 1.078010 |
2016-09-02 12:00:00 | 22.50 | 1.075747 |
2016-09-03 18:00:00 | 21.25 | 1.053537 |
2016-09-05 00:00:00 | 20.00 | 1.031327 |
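For the record, the "column by column" route is a straightforward loop over the Series version. Beware, though: like the Series version, it only keeps observations that fall exactly on the new grid (here the 2016-09-02 readings are lost), which is precisely what the concat trick above avoids. A sketch, rebuilding tdf1 as it stands at this point:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B')
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0, 5, 7),
                     'pression': 1 + np.random.random(7) / 10},
                    index=dates)
tdf1 = tdf1.drop(tdf1.index[[0, 1, 3]])
tdf1.loc['2016-08-31', 'temperature'] = 19

# interpolate each column separately on the 30h grid, then reassemble
cols = {c: tdf1[c].resample('30h').interpolate('time') for c in tdf1.columns}
res = pd.DataFrame(cols)
print(res)
```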
When a column holds dates of type datetime
, we can sense that a direct groupby
will not work, since the dates are generally all different (at least down to the millisecond).
So the data must be grouped by interval, which is done with dt.to_period
:
df = pd.DataFrame({'date': pd.date_range(start="2020-01-01", periods=15, freq='5D'),
'day sales': np.random.randint(50,size=15)}).sort_values('date')
df
date | day sales | |
---|---|---|
0 | 2020-01-01 | 42 |
1 | 2020-01-06 | 28 |
2 | 2020-01-11 | 29 |
3 | 2020-01-16 | 14 |
4 | 2020-01-21 | 4 |
5 | 2020-01-26 | 23 |
6 | 2020-01-31 | 23 |
7 | 2020-02-05 | 41 |
8 | 2020-02-10 | 49 |
9 | 2020-02-15 | 30 |
10 | 2020-02-20 | 32 |
11 | 2020-02-25 | 22 |
12 | 2020-03-01 | 13 |
13 | 2020-03-06 | 41 |
14 | 2020-03-11 | 9 |
df.groupby(df['date'].dt.to_period('W'))['day sales'].mean()
date
2019-12-30/2020-01-05    42.0
2020-01-06/2020-01-12    28.5
2020-01-13/2020-01-19    14.0
2020-01-20/2020-01-26    13.5
2020-01-27/2020-02-02    23.0
2020-02-03/2020-02-09    41.0
2020-02-10/2020-02-16    39.5
2020-02-17/2020-02-23    32.0
2020-02-24/2020-03-01    17.5
2020-03-02/2020-03-08    41.0
2020-03-09/2020-03-15     9.0
Freq: W-SUN, Name: day sales, dtype: float64
We can see that in date
we now have the start and end of the week, which is indeed a period. Note that Freq
indicates W-SUN
, highlighting that the weeks end on Sunday.
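The same grouping works with any period alias; for example by month (hypothetical data, with a fixed seed so the grouping is reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=15, freq='5D'),
                   'day sales': np.random.randint(50, size=15)})

# group by calendar month instead of by week
monthly = df.groupby(df['date'].dt.to_period('M'))['day sales'].mean()
print(monthly)
```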
It is also possible to group the data by a date, for example the last day of each week. In that case date
remains a date, and you need to use resample
with an anchored frequency such as W-SUN
(weeks ending on Sunday; likewise A-MAR anchors years to end in March).
df.resample('W-SUN', on='date').mean() # without 'on', the dates should be in the index
day sales | |
---|---|
date | |
2020-01-05 | 42.0 |
2020-01-12 | 28.5 |
2020-01-19 | 14.0 |
2020-01-26 | 13.5 |
2020-02-02 | 23.0 |
2020-02-09 | 41.0 |
2020-02-16 | 39.5 |
2020-02-23 | 32.0 |
2020-03-01 | 17.5 |
2020-03-08 | 41.0 |
2020-03-15 | 9.0 |
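Note that resample produces disjoint bins. For a genuinely sliding window over time there is rolling, which accepts a time offset; this example is a sketch, not part of the original data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'day sales': np.random.randint(50, size=15)},
                  index=pd.date_range('2020-01-01', periods=15, freq='5D'))

# mean over a sliding 15-day window ending at each date
rolling = df['day sales'].rolling('15D').mean()
print(rolling)
```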
For more information on time tables, see the page on time series: http://pandas.pydata.org/pandas-docs/stable/timeseries.html