In [1]:
import numpy as np
import pandas as pd

np.random.seed(1)  # set random seed to get the same random series

Clean up your data¶

Data cleansing is a fundamental step in data processing. With a lot of data you are likely to have:

  • duplicates (which Pandas finds easily),
  • unnecessary data (sometimes redundant, sometimes simply unused; it occupies memory that removing it would free),
  • errors (NaN if you're lucky),
  • inconsistencies (not always easy to find, but plots or statistics can reveal them),
  • inaccuracies (as for inconsistencies),
  • missing data (which Pandas flags when loading, or as NaN in the DataFrame).

In the last four cases, once the problematic values have been found, the question is how to correct them.

Getting everything clean can take a long time.

Remove duplicates¶

Pandas offers the drop_duplicates method to remove duplicates.

  • df1 = df1.drop_duplicates()
  • df1.drop_duplicates(inplace = True)
  • df1.drop_duplicates(subset=["A", "C"], inplace = True)

Be careful: this method, like many others, does not modify its DataFrame in place by default.

The subset argument lets you choose which columns are checked for duplicates.

In [2]:
np.random.seed(0)
df = pd.DataFrame({'foo': np.random.randint(3,size=7), 'bar': np.random.randint(3,size=7), 'baz': np.random.randint(3,size=7)})
df.iloc[1] = df.iloc[0]
df
Out[2]:
foo bar baz
0 0 2 2
1 0 2 2
2 0 0 1
3 1 0 1
4 1 2 1
5 2 1 1
6 0 2 0
In [3]:
df.drop_duplicates(subset=["foo", "baz"], inplace=True)
df
Out[3]:
foo bar baz
0 0 2 2
2 0 0 1
3 1 0 1
5 2 1 1
6 0 2 0

If you want to work on the duplicates themselves, the duplicated method tells you which rows are duplicated:

In [4]:
df = pd.DataFrame({'foo': np.random.randint(5,size=5), 'bar': np.random.randint(5,size=5)})
df.iloc[1] = df.iloc[0]
df.duplicated()
Out[4]:
0    False
1     True
2    False
3    False
4    False
dtype: bool
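As a quick sketch (with made-up data), the keep argument lets duplicated flag every member of a duplicate group rather than only the later copies, which is convenient for inspecting all of them at once:

```python
import pandas as pd

df = pd.DataFrame({'foo': [0, 0, 1], 'bar': [2, 2, 3]})

# keep=False flags every member of a duplicate group, not only the repeats
mask = df.duplicated(keep=False)
print(df[mask])   # rows 0 and 1
```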

Remove unnecessary data¶

For this the command is [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). You can choose to remove columns or rows.

As with the previous method, assign the result back to the DataFrame or use inplace=True to keep it.

In [5]:
df.drop(columns='foo')
Out[5]:
bar
0 0
1 0
2 3
3 0
4 1
In [6]:
df.drop(index=[2,3])  # 2 & 3 are labels
Out[6]:
foo bar
0 1 0
1 1 0
4 3 1

You can also specify the rows to remove as the result of a boolean filter:

In [7]:
to_be_dropped = df[df.foo % 2 == 0].index
df.drop(index = to_be_dropped)
Out[7]:
foo bar
0 1 0
1 1 0
2 3 3
4 3 1
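The same result can be obtained directly with boolean indexing, keeping the rows that match the filter instead of collecting an index and dropping it. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 1, 3, 2, 3], 'bar': [0, 0, 3, 0, 1]})

dropped = df.drop(index=df[df.foo % 2 == 0].index)  # drop rows with even foo
kept = df[df.foo % 2 != 0]                          # keep rows with odd foo

print(dropped.equals(kept))  # True
```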

Manage NaN¶

A NaN, Not a Number, is the result of a computation error, or of missing data when the loading method has chosen this way to flag it (see http://pandas.pydata.org/pandas-docs/stable/missing_data.html). For example:

  • 0/0 gives a NaN,
  • adding a column to a DataFrame without specifying all the values puts NaN where information is missing.
In [8]:
df.at[0,'bar'] = 0
df.at[0,'foo'] = 0      # so 0/0 below produces a NaN
df.at[1,'foo'] = 3
df.at[2,'bar'] = None   # None becomes NaN
df['div'] = df.bar / df.foo
df
Out[8]:
foo bar div
0 0 0.0 NaN
1 3 0.0 0.000000
2 3 NaN NaN
3 0 0.0 NaN
4 3 1.0 0.333333

We can extract NaN with the following logical filters:

In [9]:
df.isna()     # True where a cell is missing
df.isnull()   # alias of isna: flags NaN for numerics, None for objects, NaT for datetimes
Out[9]:
foo bar div
0 False False True
1 False False False
2 False True True
3 False False True
4 False False False

and use them to replace NaN, either with df.bar[df.bar.isna()] = 7 or directly with fillna:

In [10]:
df.bar.fillna(7, inplace=True)
df
Out[10]:
foo bar div
0 0 0.0 NaN
1 3 0.0 0.000000
2 3 7.0 NaN
3 0 0.0 NaN
4 3 1.0 0.333333

Sometimes we prefer to remove lines containing NaN:

In [11]:
df.dropna(inplace=True)
df
Out[11]:
foo bar div
1 3 0.0 0.000000
4 3 1.0 0.333333
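dropna also accepts how, thresh and subset to fine-tune which rows are removed; a short sketch on invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan]})

print(df.dropna(how='all'))      # drop a row only if ALL its values are NaN
print(df.dropna(thresh=2))       # keep rows with at least 2 non-NaN values
print(df.dropna(subset=['b']))   # only check column 'b' for NaN
```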

Estimation of NaN¶

It is possible to estimate missing or false data. If the data is ordered, we can even make an interpolation to fill holes or find inconsistent data.

In [12]:
dates = pd.date_range('2016-08-01', periods=8, freq='D')
temperature = pd.DataFrame({'temp': [21.5, 24, 25.5, None, 25.2, None, None, 20.1]}, index=dates)
temperature.drop(temperature.index[2], inplace=True) # so index is not linear anymore
temperature
Out[12]:
temp
2016-08-01 21.5
2016-08-02 24.0
2016-08-04 NaN
2016-08-05 25.2
2016-08-06 NaN
2016-08-07 NaN
2016-08-08 20.1

The simplest fix is to say that a missing value copies the previous valid value (forward fill).

When several NaNs follow each other, the limit argument sets how many consecutive NaNs are filled from the last non-NaN value.

In [13]:
temperature.ffill(limit=1) # forward fill (backward is bfill)
Out[13]:
temp
2016-08-01 21.5
2016-08-02 24.0
2016-08-04 24.0
2016-08-05 25.2
2016-08-06 25.2
2016-08-07 NaN
2016-08-08 20.1
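A minimal sketch of the difference between ffill with a limit and bfill, on an invented series:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

r1 = s.ffill(limit=1)  # fill at most 1 consecutive NaN forward
r2 = s.bfill()         # fill backward from the next valid value

print(r1.tolist())  # [1.0, 1.0, nan, 4.0]
print(r2.tolist())  # [1.0, 4.0, 4.0, 4.0]
```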

We can also interpolate.

Be careful: the only method that takes the dates in the index into account is time.

In [14]:
temperature.interpolate(method='linear')
Out[14]:
temp
2016-08-01 21.5
2016-08-02 24.0
2016-08-04 24.6
2016-08-05 25.2
2016-08-06 23.5
2016-08-07 21.8
2016-08-08 20.1
In [15]:
temperature.interpolate(method='time').iloc[2]
Out[15]:
temp    24.8
Name: 2016-08-04 00:00:00, dtype: float64
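The 24.8 above is the time-weighted value: 2016-08-04 lies two days after the 24.0 reading (2016-08-02) and one day before the 25.2 one (2016-08-05), so it sits two thirds of the way between them:

```python
# 2016-08-04 is 2/3 of the way from 2016-08-02 (24.0) to 2016-08-05 (25.2)
value = 24.0 + (2 / 3) * (25.2 - 24.0)
print(round(value, 1))  # 24.8
```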

The possible methods are described in SciPy:

method: {'linear', 'time', 'index', 'values', 'nearest', 'zero',
'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline',
'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'}

(figure: comparison of linear and quadratic interpolation)

Replace¶

Sometimes we want to replace values other than NaN. This can be done with a filter, but replace is faster for simple cases.

In [16]:
df2 = pd.DataFrame({'foo': np.random.randint(10,size=5000), 'bar': np.random.randint(10,size=5000)})
In [17]:
%timeit df2.replace([1,2],[11,12])   # replace 1 and 2 by 11 and 12 respectively
%timeit df2.replace([3,4],134)       # replace 3 and 4 by 134
%timeit df2[df2==5] = 105
%timeit df2[(df2==6) | (df2==7)] = 167
134 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
138 µs ± 6.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
286 µs ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
402 µs ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

replace can also use regular expressions.
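For example, with regex=True the pattern is matched inside each string (a small sketch on invented values):

```python
import pandas as pd

s = pd.Series(['foo123', 'bar456', 'baz'])

# with regex=True, every run of digits is replaced inside each string
cleaned = s.replace(r'\d+', '', regex=True)
print(cleaned.tolist())  # ['foo', 'bar', 'baz']
```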
