import numpy as np
import pandas as pd
np.random.seed(1) # set random seed to get the same random series
Clean up your data¶
Data cleansing is a fundamental step in data processing. Since you have a lot of data, you're likely to have:
- duplicates (which Pandas finds easily),
- unnecessary data (sometimes redundant, sometimes simply unused; it takes up memory for nothing, so removing it frees space),
- errors (NaN if you're lucky),
- inconsistencies (not always easy to find but curves or statistics can make them appear),
- inaccuracies (as for inconsistencies),
- missing data (that Pandas finds when loading or as NaN in the DataFrame).
In the last 4 cases, once the problematic entries are found, the question arises of how to correct them.
Cleaning everything up can take a long time.
Remove duplicates¶
Pandas offers the drop_duplicates method to remove duplicates.
df1 = df1.drop_duplicates()                          # returns a new DataFrame
df1.drop_duplicates(inplace=True)                    # modifies df1 in place
df1.drop_duplicates(subset=["A", "C"], inplace=True) # duplicates judged on columns A and C only
Be careful: this method, like many others, does not modify the DataFrame it is called on by default; it returns a new one.
The subset argument lets you choose which columns are used to look for duplicates.
np.random.seed(0)
df = pd.DataFrame({'foo': np.random.randint(3,size=7), 'bar': np.random.randint(3,size=7), 'baz': np.random.randint(3,size=7)})
df.iloc[1] = df.iloc[0]
df
 | foo | bar | baz |
---|---|---|---|
0 | 0 | 2 | 2 |
1 | 0 | 2 | 2 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 1 |
4 | 1 | 2 | 1 |
5 | 2 | 1 | 1 |
6 | 0 | 2 | 0 |
df.drop_duplicates(subset=["foo", "baz"], inplace=True)
df
 | foo | bar | baz |
---|---|---|---|
0 | 0 | 2 | 2 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 1 |
5 | 2 | 1 | 1 |
6 | 0 | 2 | 0 |
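By default the first occurrence is kept; the keep parameter changes this. Here is a minimal sketch on a small hypothetical frame d (our df above has already been deduplicated in place):
d = pd.DataFrame({'x': [1, 1, 2]})
d.drop_duplicates(keep='last')   # keeps the second 1 instead of the first
d.drop_duplicates(keep=False)    # drops every duplicated row, keeping only the 2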
If you want to work on the duplicates themselves, the duplicated method tells you which rows are duplicates:
df = pd.DataFrame({'foo': np.random.randint(5,size=5), 'bar': np.random.randint(5,size=5)})
df.iloc[1] = df.iloc[0]
df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool
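The resulting Boolean Series can itself serve as a filter to inspect the offending rows; with keep=False every occurrence is flagged, not just the repeats (a small sketch reusing the df above):
df[df.duplicated(keep=False)]   # shows rows 0 and 1, i.e. both copies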
Remove unnecessary data¶
For this the command is [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). You can choose to remove columns or rows.
As with the previous method, assign the result to the DataFrame or use inplace to store the result.
df.drop(columns='foo')
 | bar |
---|---|
0 | 0 |
1 | 0 |
2 | 3 |
3 | 0 |
4 | 1 |
df.drop(index=[2,3]) # 2 & 3 are index labels, not positions
 | foo | bar |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
4 | 3 | 1 |
You can also compute the rows to remove with a logical filter:
to_be_dropped = df[df.foo % 2 == 0].index   # labels of the rows where foo is even
df.drop(index=to_be_dropped)
 | foo | bar |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
2 | 3 | 3 |
4 | 3 | 1 |
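Note that drop raises a KeyError when a label is absent. If some labels may already have been removed, errors='ignore' skips them silently; a hedged sketch (label 99 does not exist in df):
df.drop(index=[2, 3, 99], errors='ignore')   # 99 is absent and silently skipped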
Manage NaN¶
A NaN (Not a Number) is the result of a computation error, or marks missing data when the loading method has chosen this way to encode it (see http://pandas.pydata.org/pandas-docs/stable/missing_data.html ). For example:
- 0/0 gives a NaN,
- adding a column to a table without specifying all the values puts NaN where information is missing.
df.at[0,'bar'] = 0
df.at[0,'foo'] = 0    # so that 0/0 gives a Not a Number below
df.at[1,'foo'] = 3    # nonzero divisor: row 1 gives 0.0 / 3 = 0.0
df.at[2,'bar'] = None # None is stored as NaN
df['div'] = df.bar / df.foo
df
 | foo | bar | div |
---|---|---|---|
0 | 0 | 0.0 | NaN |
1 | 3 | 0.0 | 0.000000 |
2 | 3 | NaN | NaN |
3 | 0 | 0.0 | NaN |
4 | 3 | 1.0 | 0.333333 |
We can locate NaN with the following logical masks:
df.isna()    # True where the cell is NaN
df.isnull()  # alias of isna: flags NaN for numerics, None for objects, NaT for datetimes
 | foo | bar | div |
---|---|---|---|
0 | False | False | True |
1 | False | False | False |
2 | False | True | True |
3 | False | False | True |
4 | False | False | False |
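Since True counts as 1, summing the mask gives the number of NaN per column, a quick way to see where the holes are:
df.isna().sum()   # foo: 0, bar: 1, div: 3 for the df above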
These masks can also be used to replace NaN, e.g. df.loc[df.bar.isna(), 'bar'] = 7 (the .loc form avoids chained-assignment warnings), or you can do it directly with fillna:
df.bar.fillna(7, inplace=True)
df
 | foo | bar | div |
---|---|---|---|
0 | 0 | 0.0 | NaN |
1 | 3 | 0.0 | 0.000000 |
2 | 3 | 7.0 | NaN |
3 | 0 | 0.0 | NaN |
4 | 3 | 1.0 | 0.333333 |
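fillna also accepts a dict to use a different fill value per column; a small sketch (the values 0 and -1 are arbitrary placeholders):
df.fillna({'bar': 0, 'div': -1})   # per-column fill values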
Sometimes we prefer to remove the rows containing NaN:
df.dropna(inplace=True)
df
 | foo | bar | div |
---|---|---|---|
1 | 3 | 0.0 | 0.000000 |
4 | 3 | 1.0 | 0.333333 |
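dropna can be more selective: how='all' only drops rows where every value is NaN, and subset restricts the test to some columns. A hedged sketch on a hypothetical frame d:
d = pd.DataFrame({'a': [1, None, None], 'b': [2, 5, None]})
d.dropna(how='all')        # drops only the last row, where a and b are both NaN
d.dropna(subset=['a'])     # drops the rows where column a is NaN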
Estimation of NaN¶
It is possible to estimate missing or false data. If the data is ordered, we can even interpolate to fill holes or to spot inconsistent values.
dates = pd.date_range('2016-08-01', periods=8, freq='D')
temperature = pd.DataFrame({'temp': [21.5, 24, 25.5, None, 25.2, None, None, 20.1]}, index=dates)
temperature.drop(temperature.index[2], inplace=True) # so the index is no longer evenly spaced
temperature
 | temp |
---|---|
2016-08-01 | 21.5 |
2016-08-02 | 24.0 |
2016-08-04 | NaN |
2016-08-05 | 25.2 |
2016-08-06 | NaN |
2016-08-07 | NaN |
2016-08-08 | 20.1 |
You can simply indicate that a missing value copies the previous one. If several NaNs follow each other, you control how far the last non-NaN value propagates by specifying limit, the maximum number of consecutive NaN to fill this way.
temperature.ffill(limit=1) # forward fill (backward is bfill)
 | temp |
---|---|
2016-08-01 | 21.5 |
2016-08-02 | 24.0 |
2016-08-04 | 24.0 |
2016-08-05 | 25.2 |
2016-08-06 | 25.2 |
2016-08-07 | NaN |
2016-08-08 | 20.1 |
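The symmetric operation is bfill, which copies the next valid value backwards. With limit=1, 2016-08-04 would take 25.2 and 2016-08-07 would take 20.1, while 2016-08-06 would stay NaN:
temperature.bfill(limit=1)   # backward fill, at most 1 step back from each valid value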
We can also interpolate. Be careful: the only method that takes the dates in the index into account is time.
temperature.interpolate(method='linear')
 | temp |
---|---|
2016-08-01 | 21.5 |
2016-08-02 | 24.0 |
2016-08-04 | 24.6 |
2016-08-05 | 25.2 |
2016-08-06 | 23.5 |
2016-08-07 | 21.8 |
2016-08-08 | 20.1 |
temperature.interpolate(method='time').iloc[2]
temp    24.8
Name: 2016-08-04 00:00:00, dtype: float64
The possible methods are described in SciPy.
method : {'linear', 'time', 'index', 'values', 'nearest', 'zero',
'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline',
'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'}
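Some of these methods, such as 'polynomial' and 'spline', need SciPy installed and an order argument; a hedged sketch:
temperature.interpolate(method='polynomial', order=2)   # quadratic fit, requires SciPy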
Replace¶
Sometimes we want to replace values other than NaN. This can be done with a filter, but replace is faster for simple cases.
df2 = pd.DataFrame({'foo': np.random.randint(10,size=5000), 'bar': np.random.randint(10,size=5000)})
%timeit df2.replace([1,2],[11,12]) # replace 1 and 2 by 11 and 12 respectively
%timeit df2.replace([3,4],134) # replace 3 and 4 by 134
%timeit df2[df2==5] = 105
%timeit df2[(df2==6) | (df2==7)] = 167
134 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
138 µs ± 6.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
286 µs ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
402 µs ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
replace can also use regular expressions.
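For instance, to rewrite strings matching a pattern (a small sketch on a hypothetical frame d):
d = pd.DataFrame({'col': ['bat', 'foo', 'bait']})
d.replace(to_replace=r'^ba.$', value='new', regex=True)   # 'bat' -> 'new'; 'bait' and 'foo' untouched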