The current workaround is to call str.replace(<compiled_re>.pattern, flags=<compiled_re>.flags),
which is relatively ugly and verbose in my opinion.
Here's a contrived example of removing stopwords and normalizing whitespace afterwards:
import pandas as pd
import re
some_names = pd.Series(["three weddings and a funeral", "the big lebowski", "florence and the machine"])
stopwords = ["the", "a", "and"]
stopwords_re = re.compile(r"(\s+)?\b({})\b(\s+)?".format("|".join(stopwords)), re.IGNORECASE)
whitespace_re = re.compile(r"\s+")
# desired code:
# some_names.str.replace(stopwords_re, " ").str.strip().str.replace(whitespace_re, " ")
# actual code:
some_names.\
str.replace(stopwords_re.pattern, " ", flags=stopwords_re.flags).\
str.strip().str.replace(whitespace_re.pattern, " ", flags=whitespace_re.flags)
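For reference, here is the workaround as a complete runnable script. Note that re.IGNORECASE must be passed to re.compile rather than to str.format, and regex=True is spelled out since recent pandas versions no longer default to regex matching:

```python
import re

import pandas as pd

some_names = pd.Series(
    ["three weddings and a funeral", "the big lebowski", "florence and the machine"]
)
stopwords = ["the", "a", "and"]
# re.IGNORECASE goes to re.compile, not str.format.
stopwords_re = re.compile(
    r"(\s+)?\b({})\b(\s+)?".format("|".join(stopwords)), re.IGNORECASE
)
whitespace_re = re.compile(r"\s+")

# The workaround: unpack .pattern and .flags by hand for each compiled regex.
cleaned = (
    some_names.str.replace(
        stopwords_re.pattern, " ", flags=stopwords_re.flags, regex=True
    )
    .str.strip()
    .str.replace(whitespace_re.pattern, " ", flags=whitespace_re.flags, regex=True)
)
print(cleaned.tolist())
# ['three weddings funeral', 'big lebowski', 'florence machine']
```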
Why do I think this is better?
- It's nice to have commonly used regular expressions compiled and to carry their flags around with them (and also allows the use of "verbose" regular expressions)
- It's not that compiled regular expressions should quack like strings; rather, the current API makes strings quack like compiled regular expressions while not letting compiled regular expressions quack their own quack.
Is there a good reason not to implement this?
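For what it's worth, the dispatch being asked for seems small. Here is a minimal sketch on a single string, assuming a free-standing helper rather than actual pandas internals (the name str_replace and the error message are illustrative only):

```python
import re


def str_replace(value, pat, repl, flags=0, case=None):
    """Illustrative sketch of the requested dispatch -- not pandas code.

    Accept either a pattern string or a pre-compiled regex for ``pat``.
    """
    if isinstance(pat, re.Pattern):
        # A compiled regex already carries its flags; reject conflicting args.
        if flags != 0 or case is not None:
            raise ValueError("case and flags cannot be set when pat is compiled")
        compiled = pat
    else:
        if case is False:
            flags |= re.IGNORECASE
        compiled = re.compile(pat, flags)
    return compiled.sub(repl, value)
```

With this kind of dispatch, str_replace("foo   bar", re.compile(r"\s+"), " ") would just work, and a pattern's flags (including re.VERBOSE) would travel with the compiled object instead of being threaded through by hand.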