import numpy as np
import pandas as pd
np.random.seed(1) # set random seed to get the same random series
Data cleansing is a fundamental step in data processing. Since you have a lot of data, you are likely to have duplicates, rows or columns to remove, missing values (NaN), and erroneous values. Once the problematic cases are found, the question arises of how to correct them. Cleaning everything up can take a long time.
Pandas offers the drop_duplicates method to remove duplicates.
df1 = df1.drop_duplicates()                           # returns a new DataFrame without duplicates
df1.drop_duplicates(inplace=True)                     # modifies df1 in place
df1.drop_duplicates(subset=["A", "C"], inplace=True)  # looks for duplicates in columns A and C only
Be careful: like many other methods, this one does not modify its DataFrame by default but returns a new one; assign the result or pass inplace=True. The subset argument lets you choose in which columns to look for duplicates.
np.random.seed(0)
df = pd.DataFrame({'foo': np.random.randint(3,size=7), 'bar': np.random.randint(3,size=7), 'baz': np.random.randint(3,size=7)})
df.iloc[1] = df.iloc[0]  # make row 1 a duplicate of row 0
df
| | foo | bar | baz |
|---|---|---|---|
| 0 | 0 | 2 | 2 |
| 1 | 0 | 2 | 2 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 2 | 1 |
| 5 | 2 | 1 | 1 |
| 6 | 0 | 2 | 0 |
df.drop_duplicates(subset=["foo", "baz"], inplace=True)
df
| | foo | bar | baz |
|---|---|---|---|
| 0 | 0 | 2 | 2 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 1 |
| 5 | 2 | 1 | 1 |
| 6 | 0 | 2 | 0 |
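Note that by default it is the first occurrence of each duplicate that is kept. The keep argument of drop_duplicates controls this behaviour; a minimal sketch, not applied to the DataFrame above:

df.drop_duplicates(keep='last')   # keep the last occurrence of each duplicated row
df.drop_duplicates(keep=False)    # remove all occurrences, including the first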
If you want to work on the duplicates themselves, there is the duplicated method to find out which rows are duplicates:
df = pd.DataFrame({'foo': np.random.randint(5,size=5), 'bar': np.random.randint(5,size=5)})
df.iloc[1] = df.iloc[0]  # make row 1 a duplicate of row 0
df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool
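Since duplicated returns a boolean Series, it can be used as a filter to inspect the duplicated rows before deciding what to do with them; a minimal sketch:

print(df.duplicated().sum())  # number of rows flagged as duplicates
df[df.duplicated()]           # extract those rows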
To remove rows or columns, the command is [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). As with the previous method, assign the result to the DataFrame or use inplace=True to store the result.
df.drop(columns='foo')  # remove the column foo
| | bar |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 3 |
| 3 | 0 |
| 4 | 1 |
df.drop(index=[2,3]) # 2 & 3 are labels
| | foo | bar |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 4 | 3 | 1 |
You can also specify the rows to remove as the result of a logical filter:
to_be_dropped = df[df.foo % 2 == 0].index
print("To be dropped:", to_be_dropped)
df.drop(index = to_be_dropped)
To be dropped: Index([3], dtype='int64')
| | foo | bar |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 3 | 3 |
| 4 | 3 | 1 |
A NaN (Not a Number) is the result of a computation error, or of missing data if the loading method has chosen this way to represent it (see http://pandas.pydata.org/pandas-docs/stable/missing_data.html). For example:
df = df.drop(index=1)
df.at[0,'bar'] = 0
df.at[0,'foo'] = 0    # 0/0 will give a NaN in the division below
df.at[1,'foo'] = 3    # re-creates row 1, so df.at[1,'bar'] is NaN since row 1 has just been removed
df.at[2,'bar'] = None # None is stored as NaN
df['div'] = df.bar / df.foo
df
| | foo | bar | div |
|---|---|---|---|
| 0 | 0.0 | 0.0 | NaN |
| 2 | 3.0 | NaN | NaN |
| 3 | 0.0 | 0.0 | NaN |
| 4 | 3.0 | 1.0 | 0.333333 |
| 1 | 3.0 | NaN | NaN |
We can locate the NaNs with the following logical filter:
df.isna()
| | foo | bar | div |
|---|---|---|---|
| 0 | False | False | True |
| 2 | False | True | True |
| 3 | False | False | True |
| 4 | False | False | False |
| 1 | False | True | True |
and use it to replace the NaNs, for example with df.loc[df.bar.isna(), 'bar'] = 7, or directly with fillna:
df.bar = df.bar.fillna(7)
df
| | foo | bar | div |
|---|---|---|---|
| 0 | 0.0 | 0.0 | NaN |
| 2 | 3.0 | 7.0 | NaN |
| 3 | 0.0 | 0.0 | NaN |
| 4 | 3.0 | 1.0 | 0.333333 |
| 1 | 3.0 | 7.0 | NaN |
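fillna can also take a different value for each column through a dictionary, or a value computed from the data; a minimal sketch with arbitrary fill values:

df.fillna({'bar': 7, 'div': 0})      # a fill value per column
df['div'].fillna(df['div'].mean())   # fill with a statistic of the column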
Sometimes we prefer to remove the rows containing NaNs:
df.dropna(inplace=True)
df
| | foo | bar | div |
|---|---|---|---|
| 4 | 3.0 | 1.0 | 0.333333 |
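dropna has arguments to control what exactly is removed; a minimal sketch of the most useful ones (applied before the dropna above, they would keep more rows):

df.dropna(how='all')        # drop only the rows where all values are NaN
df.dropna(subset=['bar'])   # only consider NaNs in the column bar
df.dropna(axis=1)           # drop the columns containing NaN instead of the rows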
It is possible to estimate missing or false data. If the data is ordered, we can even interpolate to fill holes or to detect inconsistent values.
dates = pd.date_range('2016-08-01', periods=8, freq='D')
temperature = pd.DataFrame({'temp': [21.5, 24, 25.5, None, 25.2, None, None, 20.1]}, index=dates)
temperature.drop(temperature.index[2], inplace=True) # so index is not linear anymore
temperature
| | temp |
|---|---|
| 2016-08-01 | 21.5 |
| 2016-08-02 | 24.0 |
| 2016-08-04 | NaN |
| 2016-08-05 | 25.2 |
| 2016-08-06 | NaN |
| 2016-08-07 | NaN |
| 2016-08-08 | 20.1 |
You can simply indicate that a missing value copies the previous value. If several NaNs follow each other, you must specify how many of them should take the last non-NaN value. This is done with limit, which indicates for how many consecutive NaNs the operation is applied.
temperature.ffill(limit=1) # forward fill (backward is bfill)
| | temp |
|---|---|
| 2016-08-01 | 21.5 |
| 2016-08-02 | 24.0 |
| 2016-08-04 | 24.0 |
| 2016-08-05 | 25.2 |
| 2016-08-06 | 25.2 |
| 2016-08-07 | NaN |
| 2016-08-08 | 20.1 |
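The symmetric method is bfill, which propagates the next valid value backwards; a minimal sketch on the same data:

temperature.bfill(limit=1)  # 2016-08-04 gets 25.2 and 2016-08-07 gets 20.1; 2016-08-06 stays NaN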
We can also interpolate. Be careful: the only method that takes the dates of the index into account is time.
temperature.interpolate(method='linear')
| | temp |
|---|---|
| 2016-08-01 | 21.5 |
| 2016-08-02 | 24.0 |
| 2016-08-04 | 24.6 |
| 2016-08-05 | 25.2 |
| 2016-08-06 | 23.5 |
| 2016-08-07 | 21.8 |
| 2016-08-08 | 20.1 |
temperature.interpolate(method='time').iloc[2]
temp    24.8
Name: 2016-08-04 00:00:00, dtype: float64
The possible methods are described in SciPy.
method: {'linear', 'time', 'index', 'values', 'nearest', 'zero',
'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline',
'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'}
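To see the difference between linear, which treats the values as equally spaced, and the methods that use the index values, here is a small sketch on a numeric, irregular index:

s = pd.Series([0.0, None, None, 9.0], index=[0, 1, 2, 9])
s.interpolate(method='linear')   # gives 3.0 and 6.0, evenly spaced between the known points
s.interpolate(method='index')    # gives 1.0 and 2.0, proportional to the index values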
Sometimes we want to replace values other than NaN. This can be done with a filter, but replace is faster for simple cases.
df2 = pd.DataFrame({'foo': np.random.randint(10,size=5000), 'bar': np.random.randint(10,size=5000)})
%timeit df2.replace([1,2],[11,12]) # replace 1 and 2 by 11 and 12 respectively
%timeit df2.replace([3,4],134) # replace 3 and 4 by 134
%timeit df2[df2==5] = 105
%timeit df2[(df2==6) | (df2==7)] = 167
132 μs ± 1.89 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
133 μs ± 2.68 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
226 μs ± 1.41 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
342 μs ± 9.21 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
replace can also use regular expressions.
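A minimal sketch with an arbitrary pattern, assuming string values:

s = pd.Series(['item_01', 'item_02', 'other'])
s.replace(r'^item_\d+$', 'item', regex=True)  # 'item_01' and 'item_02' become 'item'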