DOC: update the pandas.errors.DtypeWarning docstring #20208
pandas/errors/__init__.py
Outdated
See Also
--------
pd.read_csv : Read CSV (comma-separated) file into a DataFrame.
pd -> pandas I think.
OK!
pandas/errors/__init__.py
Outdated
Important to notice that df2 will contain both `str` and `int` for the
same input, '1'.

>>> df2.iloc[262140,0]
PEP8: space after comma
Done!
pandas/errors/__init__.py
Outdated
... 'b':['b']*300000})
>>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
>>> df2 = pd.read_csv('test', sep='\t')
Traceback (most recent call last):
I think if you add a line like
>>> df2 = pd.read_csv('test', sep='\t')
... # doctest: +SKIP
The doctest might work.
Yeah, I don't think it will be possible to get Warnings to work with doctest, so adding a # doctest: +SKIP is fine.
I'm sorry, but I couldn't understand what the doctest: +SKIP is supposed to do. I've added it right after the read_csv and I just got an error. Can you help me?
################################################################################
################################### Doctests ###################################
################################################################################
Line 28, in pandas.errors.DtypeWarning
Failed example:
df2 = pd.read_csv('test', sep=' ')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing
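For context on the failure above: the standard-library doctest parser only recognizes a `# doctest: +SKIP` directive when it appears in the example's source (on the `>>>` line or a `...` continuation line). On a line of its own after the example it is treated as expected output, which matches the "Expected: ... Got nothing" report. The sketch below illustrates this with two made-up docstrings; `GOOD`, `BAD`, `undefined_function`, and `run_docstring` are hypothetical names used only for illustration.

```python
import doctest

# Directive attached to the example source: the example is skipped entirely,
# so the undefined name is never evaluated.
GOOD = """
>>> undefined_function()  # doctest: +SKIP
"""

# Directive on its own line: doctest reads it as expected output, so the
# example runs, prints nothing, and fails the comparison.
BAD = """
>>> x = 1
# doctest: +SKIP
some expected text
"""


def run_docstring(text, name):
    """Run the examples in `text` and return the number of failures."""
    test = doctest.DocTestParser().get_doctest(
        text, globs={}, name=name, filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the count matters here.
    results = runner.run(test, out=lambda message: None)
    return results.failed


good_failures = run_docstring(GOOD, "good")
bad_failures = run_docstring(BAD, "bad")
```

Here `good_failures` counts failures for the correctly placed directive and `bad_failures` for the misplaced one.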
pandas/errors/__init__.py
Outdated
uniform dtypes in a column(s) of a given CSV file.
Warning raised when importing different dtypes in a column from a file.

Raised for a dtype incompatibility. This can happen whenever `pd.read_csv`
Can you also use read_csv here instead of pd.read_csv? (same below for read_table)
It's done!
pandas/errors/__init__.py
Outdated
>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
... 'b':['b']*300000})
>>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
Can you add '.csv' for the temp file?
I would also leave out the sep='\t' and na_rep='NA', as they are not relevant to illustrating the warning.
Done!
pandas/errors/__init__.py
Outdated
Warning that is raised for a dtype incompatibility. This
can happen whenever `pd.read_csv` encounters non-
uniform dtypes in a column(s) of a given CSV file.
Warning raised when importing different dtypes in a column from a file.
when reading (not importing)
Tks!
pandas/errors/__init__.py
Outdated
or `pd.read_table` encounter non-uniform dtypes in a column(s) of a given
CSV file.

It only happens when dealing with larger files.
Instead of this comment, you can add to the Notes section that this can happen in larger files. It's because the dtype checking happens per chunk that is read.
Tks!
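The per-chunk point made in this thread can be illustrated with a toy model. This is a pure-Python sketch under stated assumptions, not pandas's actual parser: `infer_chunk` and `read_in_chunks` are hypothetical helpers, and the chunk size of 5 is chosen only so that a chunk boundary falls inside a run of '1' values.

```python
def infer_chunk(values):
    """Toy inference: ints if every value in the chunk parses, else strings."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return list(values)


def read_in_chunks(values, chunk_size):
    """Infer types chunk by chunk and concatenate the results."""
    column = []
    for start in range(0, len(values), chunk_size):
        column.extend(infer_chunk(values[start:start + chunk_size]))
    return column


raw = ["1"] * 6 + ["X"] * 3 + ["1"] * 6

# Chunks of 5: the first and last chunks are all digits and become ints,
# the middle chunk contains 'X' and stays as strings.
column = read_in_chunks(raw, chunk_size=5)
types_seen = sorted({type(v).__name__ for v in column})

# Inferring over the whole column at once gives a single type.
whole = read_in_chunks(raw, chunk_size=len(raw))
whole_types = sorted({type(v).__name__ for v in whole})
```

In this model the same input '1' ends up as an `int` in one chunk and a `str` in another, which is the kind of mixed column the warning describes.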
pandas/errors/__init__.py
Outdated
Notes
-----
Despite the warning, the CSV file is imported with mixed types in a single
column. See the examples below to better understand this issue.
imported -> read. The dtype of the column will be object.
Codecov Report
@@ Coverage Diff @@
## master #20208 +/- ##
=========================================
Coverage ? 91.7%
=========================================
Files ? 150
Lines ? 49156
Branches ? 0
=========================================
Hits ? 45078
Misses ? 4078
Partials ? 0
Continue to review full report at Codecov.
…arning Made some adjustments after peer's suggestions.
@jorisvandenbossche running this doctest will leave behind a test.csv file, which isn't ideal.
We can
- convert this to a regular code-block
- Use io.StringIO, which is uglier but doesn't leave behind a file
- Add a bit to os.remove('test.csv')
Preferences?
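The last two options in the list above can be sketched with the standard library alone. The example uses the `csv` module rather than pandas so that it runs without third-party packages; `pandas.read_csv` also accepts file-like objects, so the in-memory pattern should carry over, but treat that mapping as an assumption. `ROWS`, `roundtrip_in_memory`, and `roundtrip_with_cleanup` are hypothetical names.

```python
import csv
import io
import os
import tempfile

ROWS = [["a", "b"], ["1", "b"], ["X", "b"]]


def roundtrip_in_memory(rows):
    """Write and read CSV through an in-memory buffer; no file is created."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)  # rewind before reading back
    return list(csv.reader(buffer))


def roundtrip_with_cleanup(rows):
    """Write and read CSV through a real file, then remove it."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        with open(path, newline="") as handle:
            result = list(csv.reader(handle))
    finally:
        os.remove(path)  # runs even if reading fails
    return result, path


in_memory = roundtrip_in_memory(ROWS)
on_disk, used_path = roundtrip_with_cleanup(ROWS)
```

The in-memory version needs the extra `seek(0)` step; the file-based version needs explicit cleanup, which is the trade-off raised in the comment.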
pandas/errors/__init__.py
Outdated
One way to solve this issue is using the parameter `converters` in the
`read_csv` and `read_table` functions to explicit the conversion:

>>> df2 = pd.read_csv('test', sep='\t', converters={'a': str})
FYI, this is still test, not test.csv
Another thing: I think we should recommend dtype={'a': str} instead?
Hmm, yes, that's a general problem though, so let's discuss that somewhere else
@TomAugspurger see #20302
Small comment in addition to the one of @TomAugspurger
Looks good for the rest!
pandas/errors/__init__.py
Outdated
... 'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> DtypeWarning: Columns (0) have mixed types... # doctest: +SKIP
Can you remove the >>> on this line? (It might be that you need to move the "# doctest: +SKIP" to the line above, after the read_csv.)
@jorisvandenbossche @TomAugspurger I've added "import os" and an "os.remove('test.csv')" at the end of the example, just as you've agreed on #20302 , and also put "# doctest: +SKIP" right before the warning, like this:
>>> import os
>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
... 'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> os.remove('test.csv')
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
But I got this message on validation:
################################################################################
################################### Doctests ###################################
################################################################################
Line 32, in pandas.errors.DtypeWarning
Failed example:
os.remove('test.csv')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing
And also, removing ".csv" from to_csv and read_csv, like Tom suggested, raises two possible errors:
os.remove('test.csv'): FileNotFoundError: [Errno 2] No such file or directory: 'test.csv'
os.remove('test'): FileNotFoundError: File b'test' does not exist
What do you guys recommend?
- PEP8
- Move warning to "output" section
- Fixed second read example
- Cleanup
Updated the example, mind taking a look @jorisvandenbossche? This unfortunately leaks the warning to stdout. @hissashirocha we decided in another issue to just
I went back and forth on where to put the import :)
@TomAugspurger the only problem now is that the actual warning is separated from the code block, which looks a bit strange in the html output .. The alternative you mentioned would be
But I agree this boilerplate is a bit unfortunate and distracts from what we actually want to show
Let's go ahead with the current solution for now. We can always later look into adding our own doctest-extensions (it should be possible to add doctest options while running our own doctests: eg https://github.com/astropy/pytest-doctestplus/blob/master/pytest_doctestplus/output_checker.py)
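As background on the leaking warning, one standard-library way to keep a warning out of the output while still inspecting it is `warnings.catch_warnings(record=True)`. This is a minimal sketch and not necessarily the alternative discussed in this thread; `MixedTypesWarning` and `noisy_read` are hypothetical stand-ins for `DtypeWarning` and `read_csv`.

```python
import warnings


class MixedTypesWarning(Warning):
    """Hypothetical stand-in for pandas.errors.DtypeWarning."""


def noisy_read():
    """Hypothetical reader that warns and still returns mixed data."""
    warnings.warn("Columns (0) have mixed types.", MixedTypesWarning)
    return ["1", 1]


def call_and_capture():
    """Call noisy_read() and return its result plus any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure the warning is recorded even if filters would hide it.
        warnings.simplefilter("always")
        result = noisy_read()
    return result, [str(w.message) for w in caught]


result, messages = call_and_capture()
```

The cost is the extra `with` block around the call, which is the sort of boilerplate that distracts from the example itself.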
@hissashirocha Thanks for the PR!
Ah was just going to push this
Looks OK. I'll make a new PR.
Ah, that's a good one :)
It was a pleasure! This is my first time contributing to pandas and I really enjoyed it. Thanks for your help!
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks: