Ambiguity in DataFrame.replace regex handling #6777

Closed
@adamdivak


Dear developers,

I find it confusing how DataFrame.replace decides whether to interpret the to_replace values as plain string literals or as regexes.

When you pass a flat dictionary of from and to values, the keys are interpreted as plain string literals and work as expected. If you pass a nested dictionary (a column name mapped to a dictionary of the same values), the keys are interpreted as regexes, even if regex=False is passed explicitly. This causes problems when you want to replace values that contain regex special characters.

Let's see an example:

@buddha[J:T26]|19> df = pd.DataFrame({'a' : ['()', 'something else']})
@buddha[J:T26]|20> df
              <20>
                a
0              ()
1  something else

[2 rows x 1 columns]
@buddha[J:T26]|21> df.replace({'()' : 'parantheses'})
              <21>
                a
0     parantheses
1  something else

[2 rows x 1 columns]
@buddha[J:T26]|22> df.replace({'a' : {'()' : 'parantheses'}})
              <22>
                                                   a
0                parantheses(parantheses)parantheses
1  paranthesessparanthesesoparanthesesmparanthese...

[2 rows x 1 columns]

As you can see, in the first case the () was replaced as expected. In the second case, even though the only change was to specify the column in which the replacement should occur, the parentheses were treated as a regex. The same happens if I pass regex=False.
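The output of the second call is consistent with the key being used as a regex pattern: as a regex, () is an empty group that matches the empty string at every position. The sketch below reproduces that effect with the standard re module only. It is an illustration of the suspected mechanism, not a trace of what pandas does internally, and the variable names are made up for the example.

```python
import re

value = '()'

# Treated as a regex, '()' is an empty group: it matches the empty string
# before, between, and after every character of the input.
as_regex = re.sub('()', 'parantheses', value)
print(as_regex)  # parantheses(parantheses)parantheses

# Escaping the pattern makes it match the two literal characters instead.
as_literal = re.sub(re.escape('()'), 'parantheses', value)
print(as_literal)  # parantheses
```

The first result is the same string that appears in row 0 of output <22> above.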

My current workaround is to make sure that the from values are valid regexes that match the intended text, for example by escaping special characters.
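A minimal sketch of that workaround, assuming the keys of the nested dictionary are going to be treated as regexes: escape them with re.escape and ask for regex matching explicitly, so the behaviour does not depend on how regex=False is handled. The names pattern and result are only for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({'a': ['()', 'something else']})

# Escape the key so that, when it is used as a regex, it matches the
# literal text "()" rather than an empty group.
pattern = re.escape('()')

# Request regex matching explicitly for the nested (per-column) form.
result = df.replace({'a': {pattern: 'parentheses'}}, regex=True)
print(result)
```

Note that regex replacement acts on substrings, so an escaped key also replaces occurrences inside longer strings. If only whole-cell matches should be replaced, anchoring the pattern with ^ and $ would be one option.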

Cheers,
Adam

P.S. Pandas is awesome, thanks a lot for all the effort!

@Buddha[J:T26]|24> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 23 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.13.1
Cython: None
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.2.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013.9
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.8.4
lxml: 3.3.1
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None

Labels: API Design, Bug, Strings
