Look at a DataFrame¶
The first step in analyzing a DataFrame is simply to look at it. Of course, if the table is very large, a plain print (or display) will not suffice. Here are two ways to look at a large table in detail.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 4),  # 1000 x 4 array of random daily variations
                  columns=['A', 'B', 'C', 'D'],
                  index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()  # cumulative sum, to get a smoothly varying series (temperature, stock price...)
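To see what the cumulative sum does, here is a minimal sketch on a much smaller table. The fixed seed, the 5 x 2 shape and the names `steps`, `walk` and `recovered` are assumptions made for this illustration; they are not part of the example above.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, so that the numbers are the same on every run
rng = np.random.default_rng(42)

# Five days of random daily variations for two columns
steps = pd.DataFrame(rng.standard_normal((5, 2)),
                     columns=['A', 'B'],
                     index=pd.date_range('2000-01-01', periods=5))

# Each row of walk is the sum of all the steps up to and including that day
walk = steps.cumsum()
print(walk)

# diff() undoes cumsum(), except for the first row, which has no predecessor
recovered = walk.diff().iloc[1:]
```

Because each value builds on the previous one, consecutive rows stay close to each other, which is what makes the curve look like a real-world measurement rather than pure noise.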
Dtale¶
Dtale is a package that displays a table interactively. It offers many features and helps you get a good understanding of the data.
To try it out, you can:
- click on column names to do many things
- use the menus (go to the narrow space above the columns and then you will see them appear)
- click in a cell to modify its value
- click in the triangle at the top left to open the table in another window or perform global manipulations
As much as I appreciate this tool for exploring data, I do not use it to modify the table, because I would lose the history of the changes made. I make changes with Pandas commands, which I can edit and rerun if needed.
import dtale
dtale.show(df)
The following command makes the Jupyter notebook use the full width of the window (by default it often uses only part of it):
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Charts¶
We poor humans have trouble reading thousands of numbers. A good graph, on the other hand, makes the data understandable at a glance. That is why the next chapter is entirely devoted to the various graphics libraries.
In the meantime, let's use the DataFrame's plot method, which is often sufficient. There is no need to go into its details for now: it works like the function of the same name in the Matplotlib library, which we will see in the next chapter.
Note: in Jupyter, you need to run %matplotlib inline so that the graphs are displayed (this only needs to be done once per notebook). If you have a high-resolution screen, use the retina mode to get a sharper rendering.
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
df.plot(figsize=(14,6))
<Axes: >
Sliding window¶
A very useful Pandas tool for preparing data to be plotted is the sliding window, to which you apply a function of your choice (often the mean, but any of the statistical functions mentioned earlier can be used).
The size of the sliding window can be:
- an integer that indicates the number of rows chosen
- a time interval (only for timeseries)
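The difference between the two kinds of window size shows up as soon as a date is missing. The following is a minimal sketch on a hypothetical five-value series with a gap on 2000-01-04; the data and the names `s`, `by_rows` and `by_time` are assumptions made for this illustration.

```python
import pandas as pd

# Daily series with one missing day (2000-01-04)
idx = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                      '2000-01-05', '2000-01-06'])
s = pd.Series([1.0, 2.0, 3.0, 5.0, 6.0], index=idx)

# Integer window: always the last 3 rows, whatever their dates.
# The result is NaN until 3 rows are available.
by_rows = s.rolling(window=3).mean()

# Time window: every row that falls within the last 3 days.
# A single row is enough to get a result, so there is no leading NaN.
by_time = s.rolling('3D').mean()

print(pd.DataFrame({'3 rows': by_rows, '3 days': by_time}))
```

On 2000-01-05 the integer window averages the values 2, 3 and 5, while the time window only sees 3 and 5, because 2000-01-02 is more than three days back.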
df.rolling(window=30).mean().plot(subplots=True)
array([<Axes: >, <Axes: >, <Axes: >, <Axes: >], dtype=object)
df['A'].rolling('30D').max().plot(color='red')
df['A'].rolling('30D').mean().plot(color='purple', style=':')
df['A'].rolling('30D').min().plot(title="Rolling max / mean / min over 30 days of A", color='orange')
<Axes: title={'center': 'Rolling max / mean / min over 30 days of A'}>
You can also do statistics on sliding windows:
df.A.cov(df.A.rolling(window=30).mean())
32.62982887672163
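Since the table above is random and unseeded, the value you obtain will differ from the one shown. As a minimal sketch, the same computation can be made reproducible with a seeded generator, and a statistic can also be computed inside each window, for example a rolling correlation between two columns. The seed, the two-column table and the names `smooth`, `cov_full` and `roll_corr` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, so that the result is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 2)).cumsum(axis=0),
                  columns=['A', 'B'],
                  index=pd.date_range('2000-01-01', periods=1000))

# Rolling mean over 30 rows; the first 29 values are NaN
smooth = df['A'].rolling(window=30).mean()

# Covariance between the series and its rolling mean.
# cov() ignores the rows where either value is missing.
cov_full = df['A'].cov(smooth)

# Correlation between A and B, recomputed inside each 30-row window
roll_corr = df['A'].rolling(window=30).corr(df['B'])

print(cov_full)
print(roll_corr.dropna().head())
```

The first call gives a single number for the whole series, whereas the rolling correlation gives one value per date, which can itself be plotted.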