A time series is a series with dates as index.
Note: see what is a date/time in Python if you haven't: [Introduction to DateTime](../lesson2 Deeper in Python/21 datetime.ipynb)
Ref: http://pandas.pydata.org/pandas-docs/stable/timeseries.html
Dates are ordered and can be given in a list of dates, as any index, or with the methods pd.date_range(start, end, periode, frequence)
where
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
see http://pandas.pydata.org/pandas-docs/stable/timeseries.html
import numpy as np
import pandas as pd
np.random.seed(1)
dates = pd.date_range('2016-08-28', '2016-09-06', freq='B') # begin, end, only business days
dates
DatetimeIndex(['2016-08-29', '2016-08-30', '2016-08-31', '2016-09-01', '2016-09-02', '2016-09-05', '2016-09-06'], dtype='datetime64[ns]', freq='B')
Avec cet index on peut créer un tableau chronologique :
It is easy to extract parts of a TimeSeries:
tdf1 = pd.DataFrame({'temperature': 20 + np.random.randint(0,5,7),
'pression' : 1 + np.random.random(7)/10 },
index=dates)
tdf1
temperature | pression | |
---|---|---|
2016-08-29 | 23 | 1.039658 |
2016-08-30 | 24 | 1.038791 |
2016-08-31 | 20 | 1.066975 |
2016-09-01 | 21 | 1.093554 |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
As for usual tables we can select the parts that interest us with loc
and filters.
It is also possible to constrain the dates:
tdf1.loc['2016-08'] # just August
temperature | pression | |
---|---|---|
2016-08-29 | 23 | 1.039658 |
2016-08-30 | 24 | 1.038791 |
2016-08-31 | 20 | 1.066975 |
tdf1.loc['2016-09-03':] # after that date even if the date is not in the index
temperature | pression | |
---|---|---|
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
If two sources of information are incomplete, let's use methods we have already seen to fill in the gaps.
tdf2 = tdf1.copy()
tdf1.drop(tdf1.index[[0,1,3]], inplace=True) # we remove some data
tdf2.drop(tdf2.index[[5,6]], inplace=True) # more data removed
tdf2.drop(columns='pression', inplace=True)
display(tdf1, tdf2)
temperature | pression | |
---|---|---|
2016-08-31 | 20 | 1.066975 |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
temperature | |
---|---|
2016-08-29 | 23 |
2016-08-30 | 24 |
2016-08-31 | 20 |
2016-09-01 | 21 |
2016-09-02 | 23 |
We use merge
to aggregate the data from the two arrays. As we want to be based on the temperature as well as on the index, it is necessary to set index as a column (it is either merge on index or on columns).
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(), on=['temperature', 'index'], how='outer')
display(res)
res.set_index('index').sort_index()
index | temperature | pression | |
---|---|---|---|
0 | 2016-08-31 | 20 | 1.066975 |
1 | 2016-09-05 | 20 | 1.031327 |
2 | 2016-09-06 | 20 | 1.052455 |
3 | 2016-09-01 | 21 | NaN |
4 | 2016-08-29 | 23 | NaN |
5 | 2016-09-02 | 23 | 1.084631 |
6 | 2016-08-30 | 24 | NaN |
temperature | pression | |
---|---|---|
index | ||
2016-08-29 | 23 | NaN |
2016-08-30 | 24 | NaN |
2016-08-31 | 20 | 1.066975 |
2016-09-01 | 21 | NaN |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
If the two data sources do not agree on a value, what happens?
tdf1.loc['2016-08-31','temperature'] = 19
res = pd.merge(tdf1.reset_index(), tdf2.reset_index(), on=['temperature', 'index'], how='outer')
res = res.set_index('index').sort_index()
res
temperature | pression | |
---|---|---|
index | ||
2016-08-29 | 23 | NaN |
2016-08-30 | 24 | NaN |
2016-08-31 | 19 | 1.066975 |
2016-08-31 | 20 | NaN |
2016-09-01 | 21 | NaN |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
merge
inouter
mode keeps all values, also at 2 different temperatures for 08/31.
As with Numpy, the interpolation can be done by taking into account the dates and therefore the difference between 2 successive dates.
res.interpolate(method='time')
temperature | pression | |
---|---|---|
index | ||
2016-08-29 | 23 | NaN |
2016-08-30 | 24 | NaN |
2016-08-31 | 19 | 1.066975 |
2016-08-31 | 20 | 1.066975 |
2016-09-01 | 21 | 1.075803 |
2016-09-02 | 23 | 1.084631 |
2016-09-05 | 20 | 1.031327 |
2016-09-06 | 20 | 1.052455 |
It is also possible to change the index and ask to recalculate the values on the new index. Strangely, this only works on a Serie (one column):
tdf1['temperature'].resample('30h').interpolate('time')
2016-08-31 00:00:00 19.00 2016-09-01 06:00:00 19.25 2016-09-02 12:00:00 19.50 2016-09-03 18:00:00 19.75 2016-09-05 00:00:00 20.00 Freq: 30h, Name: temperature, dtype: float64
To do it on a complete DataFrame we can interpolate columns one by one or do that:
interpol = tdf1.asfreq('30h')
display(interpol)
tmp = pd.concat([tdf1, interpol]).sort_index().interpolate(method='time').drop_duplicates()
tmp.loc[interpol.index]
temperature | pression | |
---|---|---|
2016-08-31 00:00:00 | 19.0 | 1.066975 |
2016-09-01 06:00:00 | NaN | NaN |
2016-09-02 12:00:00 | NaN | NaN |
2016-09-03 18:00:00 | NaN | NaN |
2016-09-05 00:00:00 | 20.0 | 1.031327 |
temperature | pression | |
---|---|---|
2016-08-31 00:00:00 | 19.00 | 1.066975 |
2016-09-01 06:00:00 | 21.50 | 1.078010 |
2016-09-02 12:00:00 | 22.50 | 1.075747 |
2016-09-03 18:00:00 | 21.25 | 1.053537 |
2016-09-05 00:00:00 | 20.00 | 1.031327 |
Since version 0.18, the resample
method has become a temporal group-by which simplifies the
usual groupby
operations on Time-indexed DataFrame:
tdf1.resample('W').mean() # to get means for each week
temperature | pression | |
---|---|---|
2016-09-04 | 21.0 | 1.075803 |
2016-09-11 | 20.0 | 1.041891 |
For more information on time tables, see the page on time series: http://pandas.pydata.org/pandas-docs/stable/timeseries.html