Copies and views in Pandas
Arrays and DataFrames can hold a large amount of data, so it is good to avoid copies, even temporary ones, whenever possible. Extracting a sub-table just to read it therefore does not require a copy. If we then want to modify that sub-table, we have to ask whether the original table should be modified as well.
In practice, Pandas decides whether it returns a copy or a view. If it returned a view, modifying the sub-table also modifies the main table; if it returned a copy, the main table is left untouched. Pandas also emits a warning (SettingWithCopyWarning) to flag this uncertainty.
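One way to see which case applies is to check whether the sub-table and the main table are backed by the same memory. The sketch below is only an illustration: the numeric data and the helper name `shares_memory` are assumptions made for this example, and the answer for the plain slice can depend on the Pandas version and on the dtypes involved. Only the explicit `.copy()` is guaranteed to be independent.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: numeric columns keep the memory check simple.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

sub = df.loc[1:3, :]               # view or copy: Pandas decides
sub_copy = df.loc[1:3, :].copy()   # always an independent copy


def shares_memory(a, b, column):
    """Return True if `column` of a and b overlaps in memory (illustrative helper)."""
    return np.shares_memory(a[column].to_numpy(), b[column].to_numpy())


slice_shares = shares_memory(df, sub, "x")
copy_shares = shares_memory(df, sub_copy, "x")

print("slice shares memory with df:", slice_shares)
print("copy shares memory with df: ", copy_shares)
```

If the first line prints True, the slice is currently a view and a write through it could reach `df`; the second line should always print False.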
import pandas as pd
from IPython.display import display, HTML
CSS = """
.output {
flex-direction: row;
}
"""
HTML('<style>{}</style>'.format(CSS))
print(pd.__version__) # behaviour can change with the version
2.2.1
df = pd.DataFrame({'A':list('qwer'), 'B':list('asdf')})
df
| | A | B |
---|---|---|
0 | q | a |
1 | w | s |
2 | e | d |
3 | r | f |
df2 = df.loc[1:3,:]
df.loc[2,'B'] = 'X' # careful: here 2 is a label (it just happens to equal the positional index)
df2.loc[1,'A'] = 'Z'
/tmp/ipykernel_897/1281438451.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.loc[1,'A'] = 'Z'
df2 is potentially a copy of part of df (but it could be a view).
from IPython.display import display
display(df, df2)
| | A | B |
---|---|---|
0 | q | a |
1 | Z | s |
2 | e | X |
3 | r | f |
| | A | B |
---|---|---|
1 | Z | s |
2 | e | X |
3 | r | f |
So we can see here that df2 is a view of df, since a change made to one of them is visible in the other (both contain X and Z).
However, if I add a column C to df2 and then assign that column to df, columns A and B of the two DataFrames are still views of the same data, but the two C columns are distinct, hence copies!
df = pd.DataFrame({'A':list('qwer'), 'B':list('asdf')})
df2 = df.loc[1:3,:]
df2['C'] = df.A + df.B
df.loc[2,'B'] = 'X'
df2.loc[1,'A'] = 'Z'
df['C'] = df2['C']
df2.loc[3,'C'] = 'AB'
/tmp/ipykernel_897/3317636207.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['C'] = df.A + df.B
display(df, df2)
| | A | B | C |
---|---|---|---|
0 | q | a | NaN |
1 | Z | s | ws |
2 | e | X | ed |
3 | r | f | rf |
| | A | B | C |
---|---|---|---|
1 | Z | s | ws |
2 | e | X | ed |
3 | r | f | AB |
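Note that the result may differ on your machine (for example with another Pandas version), since you can never be sure whether Pandas treats df2 as a view or as a copy. Row 0 of df holds NaN in column C because the assignment df['C'] = df2['C'] aligns on index labels and df2 has no label 0.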
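When the intent is to change the original table, one option is to skip the intermediate df2 and write through df with a single `.loc` call, which is also what the warning message suggests. This is a minimal sketch under that assumption, not something the notebook above does; it reuses the same toy data.

```python
import pandas as pd

df = pd.DataFrame({"A": list("qwer"), "B": list("asdf")})

# Rows and columns are resolved on df itself in one call, so there is no
# intermediate sub-table whose view/copy status could change the result.
df.loc[1, "A"] = "Z"      # single cell, addressed by labels
df.loc[2:3, "B"] = "X"    # label slice: rows 2 and 3, both included

# Extract the sub-table afterwards, for reading only.
sub = df.loc[1:3, :]
print(df)
print(sub)
```

Because every write targets df directly, the values seen in `sub` are simply whatever df contains at the time of the extraction.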
copy to make sure you have a copy
If you want to be sure of the result, make explicit copies:
df = pd.DataFrame({'A':list('qwer'), 'B':list('asdf')})
df2 = df.loc[1:3,:].copy()
df.loc[1,'A'] = 'X'
df2.loc[2,'B'] = 'Z'
display(df, df2)
| | A | B |
---|---|---|
0 | q | a |
1 | X | s |
2 | e | d |
3 | r | f |
| | A | B |
---|---|---|
1 | w | s |
2 | e | Z |
3 | r | f |
To sum up:
- in read-only mode, views are fine
- when writing, copies are preferred.