DOC: update the pandas.errors.DtypeWarning docstring #20208

hissashirocha · 2018-03-10T17:46:35Z

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

PR title is "DOC: update the docstring"
The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
The html version looks good: python doc/make.py --single <your-function-or-method>
It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
#################### Docstring (pandas.errors.DtypeWarning) ####################
################################################################################

Warning raised when reading different dtypes in a column from a file.

Raised for a dtype incompatibility. This can happen whenever `read_csv`
or `read_table` encounter non-uniform dtypes in a column(s) of a given
CSV file.

See Also
--------
pandas.read_csv : Read CSV (comma-separated) file into a DataFrame.
pandas.read_table : Read general delimited file into a DataFrame.

Notes
-----
This warning is issued when dealing with larger files because the dtype
checking happens per chunk read.

Despite the warning, the CSV file is read with mixed types in a single
column which will be an object type. See the examples below to better
understand this issue.

Examples
--------
This example creates and reads a large CSV file with a column that contains
`int` and `str`.

>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
...                    'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> DtypeWarning: Columns (0) have mixed types... # doctest: +SKIP

Important to notice that df2 will contain both `str` and `int` for the
same input, '1'.

>>> df2.iloc[262140, 0]
'1'
>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> df2.iloc[262150, 0]
1
>>> type(df2.iloc[262150, 0])
<class 'int'>

One way to solve this issue is using the parameter `converters` in the
`read_csv` and `read_table` functions to explicit the conversion:

>>> df2 = pd.read_csv('test', sep=' ', converters={'a': str})
scripts/validate_docstrings.py:268: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  runner.run(test)

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	No returns section found

TomAugspurger · 2018-03-10T17:55:17Z

pandas/errors/__init__.py

+
+    See Also
+    --------
+    pd.read_csv : Read CSV (comma-separated) file into a DataFrame.


pd -> pandas I think.

TomAugspurger · 2018-03-10T17:56:29Z

pandas/errors/__init__.py

+    Important to notice that df2 will contain both `str` and `int` for the
+    same input, '1'.
+
+    >>> df2.iloc[262140,0]


PEP8: space after comma

TomAugspurger · 2018-03-10T18:00:16Z

pandas/errors/__init__.py

+    ...                    'b':['b']*300000})
+    >>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
+    >>> df2 = pd.read_csv('test', sep='\t')
+    Traceback (most recent call last):


I think if you add a line like

>>> df2 = pd.read_csv('test', sep='\t') ... # doctest: +SKIP

The doctest might work.

Yeah, I don't think it will be possible to get Warnings to work with doctest, so adding a # doctest: +SKIP is fine

I'm sorry, but I couldn't understand what the doctest: +SKIP is supposed to do. I've added it right after the read_csv and I just got an error. Can you help me?

################################################################################
################################### Doctests ###################################
################################################################################

Line 28, in pandas.errors.DtypeWarning
Failed example:
df2 = pd.read_csv('test', sep=' ')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing
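
A minimal sketch of why the run above fails, using only the standard-library doctest module: a directive is read only from an example's source line, so a `# doctest: +SKIP` placed on the following line is treated as expected output. The sample strings and the helper `run_sample` are hypothetical, written for illustration, and a plain assignment stands in for `read_csv`.

```python
import doctest

# Directive on the source line: the example is skipped and never executed.
GOOD = """
>>> 1 / 0  # doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
"""

# Directive on the next line: doctest reads it as expected output,
# so the example runs, prints nothing, and fails.
BAD = """
>>> x = 1
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
"""


def run_sample(text):
    """Run one docstring-like sample and return doctest's TestResults."""
    test = doctest.DocTestParser().get_doctest(
        text, globs={}, name="sample", filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda message: None)


good = run_sample(GOOD)
bad = run_sample(BAD)
print(good.failed, bad.failed)
```

The second sample reproduces the "Expected ... Got nothing" report shown above.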

jorisvandenbossche · 2018-03-10T18:36:04Z

pandas/errors/__init__.py

-    uniform dtypes in a column(s) of a given CSV file.
+    Warning raised when importing different dtypes in a column from a file.
+
+    Raised for a dtype incompatibility. This can happen whenever `pd.read_csv`


Can you also use read_csv here instead of pd.read_csv ? (same below for read_table)

jorisvandenbossche · 2018-03-10T18:38:45Z

pandas/errors/__init__.py

+
+    >>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
+    ...                    'b':['b']*300000})
+    >>> df.to_csv('test', sep='\t', index=False, na_rep='NA')


can you add '.csv' for the temp file ?
I would also leave out the sep='\t' and na_rep='NA', as they are not relevant to illustrating the warning

jorisvandenbossche · 2018-03-10T18:40:06Z

pandas/errors/__init__.py

+    ...                    'b':['b']*300000})
+    >>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
+    >>> df2 = pd.read_csv('test', sep='\t')
+    Traceback (most recent call last):


Yeah, I don't think it will be possible to get Warnings to work with doctest, so adding a # doctest: +SKIP is fine

jreback · 2018-03-10T19:50:39Z

pandas/errors/__init__.py

-    Warning that is raised for a dtype incompatibility. This
-    can happen whenever `pd.read_csv` encounters non-
-    uniform dtypes in a column(s) of a given CSV file.
+    Warning raised when importing different dtypes in a column from a file.


when reading (not importing)

jreback · 2018-03-10T19:51:40Z

pandas/errors/__init__.py

+    or `pd.read_table` encounter non-uniform dtypes in a column(s) of a given
+    CSV file.
+
+    It only happens when dealing with larger files.


Instead of this comment, you can add in the Notes section that this can happen in larger files. It's because the dtype checking happens per chunk that is read.

jreback · 2018-03-10T19:52:07Z

pandas/errors/__init__.py

+    Notes
+    -----
+    Despite the warning, the CSV file is imported with mixed types in a single
+    column. See the examples below to better understand this issue.


imported -> read. The dtype of the column will be object.

codecov · 2018-03-11T18:50:30Z

Codecov Report

❗ No coverage uploaded for pull request base (master@0d86742).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #20208   +/-   ##
=========================================
  Coverage          ?    91.7%           
=========================================
  Files             ?      150           
  Lines             ?    49156           
  Branches          ?        0           
=========================================
  Hits              ?    45078           
  Misses            ?     4078           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.08% <ø> (?)`
#single	`41.85% <ø> (?)`

Impacted Files	Coverage Δ
pandas/errors/__init__.py	`92.3% <ø> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0d86742...9e7e129.

…arning Made some adjustments after peer's suggestions.

TomAugspurger

@jorisvandenbossche running this doctest will leave behind a test.csv file, which isn't ideal.

We can

- convert this to a regular code-block
- use io.StringIO, which is uglier but doesn't leave behind a file
- add a bit to os.remove('test.csv')

Preferences?
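
For illustration, a minimal sketch of the second and third options with a tiny frame (too small to trigger the warning). It assumes pandas is importable as `pd`; the in-memory buffer and the temporary directory are choices made here, not part of the PR.

```python
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': ['1', 'X', '1'], 'b': ['b'] * 3})

# Option: io.StringIO, nothing is written to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)  # rewind so read_csv starts at the beginning
from_buffer = pd.read_csv(buffer)

# Option: write a real file, then remove it once it has been read.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.csv')
    df.to_csv(path, index=False)
    from_file = pd.read_csv(path)
    os.remove(path)  # the cleanup step discussed above
    file_left_behind = os.path.exists(path)
```

Both reads give the same frame; the difference is only whether a file exists in between.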

TomAugspurger · 2018-03-12T14:54:19Z

pandas/errors/__init__.py

+    One way to solve this issue is using the parameter `converters` in the
+    `read_csv` and `read_table` functions to explicit the conversion:
+
+    >>> df2 = pd.read_csv('test', sep='\t', converters={'a': str})


FYI, this is still test not test.csv

Another thing: I think we should recommend dtype={'a': str} instead ?

jorisvandenbossche · 2018-03-12T15:06:43Z

Hmm, yes, that's a general problem though, so let's discuss that somewhere else

jorisvandenbossche · 2018-03-12T15:09:15Z

@TomAugspurger see #20302

jorisvandenbossche

Small comment in addition to the one of @TomAugspurger

Looks good for the rest!

jorisvandenbossche · 2018-03-12T15:17:46Z

pandas/errors/__init__.py

+    ...                    'b':['b']*300000})
+    >>> df.to_csv('test.csv', index=False)
+    >>> df2 = pd.read_csv('test.csv')
+    >>> DtypeWarning: Columns (0) have mixed types... # doctest: +SKIP


Can you remove the >>> on this line? (it might be you need to move the "# doctest: +SKIP" to the line above after the read_csv)

@jorisvandenbossche @TomAugspurger I've added "import os" and an "os.remove('test.csv')" at the end of the example, just as you agreed in #20302, and also put "# doctest: +SKIP" right before the warning, like this:

>>> import os
>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
...                    'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> os.remove('test.csv')
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...

But I got this message on validation:

################################################################################
################################### Doctests ###################################
################################################################################

Line 32, in pandas.errors.DtypeWarning
Failed example:
os.remove('test.csv')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing

And also, removing ".csv" from to_csv and read_csv, like Tom suggested, raises two possible errors:

os.remove('test.csv'): FileNotFoundError: [Errno 2] No such file or directory: 'test.csv'
os.remove('test'): FileNotFoundError: File b'test' does not exist

What do you guys recommend?

jorisvandenbossche · 2018-03-12T15:19:34Z

pandas/errors/__init__.py

+    One way to solve this issue is using the parameter `converters` in the
+    `read_csv` and `read_table` functions to explicit the conversion:
+
+    >>> df2 = pd.read_csv('test', sep='\t', converters={'a': str})


Another thing: I think we should recommend dtype={'a': str} instead ?
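
A small sketch of the `dtype` suggestion, assuming pandas is importable as `pd` and using hypothetical in-memory CSV text rather than the `test.csv` file from the docstring. Declaring the column as `str` up front keeps every value as text, so the column is not left to type inference.

```python
import io

import pandas as pd

# Hypothetical in-memory data; column 'a' mixes digits and a letter.
csv_text = "a,b\n1,b\nX,b\n1,b\n"

df2 = pd.read_csv(io.StringIO(csv_text), dtype={'a': str})

# Every value in 'a' is a Python str, including the '1' entries.
value_types = {type(value) for value in df2['a']}
values = df2['a'].tolist()
```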

- PEP8 - Move warning to "output" section - Fixed second read example - Cleanup

TomAugspurger · 2018-03-13T14:24:28Z

Updated the example, mind taking a look @jorisvandenbossche?

This unfortunately leaks the warning to stdout. # doctest: +SKIP doesn't work because we need the example to run, so that df2 is defined. I'm not sure of a good way, but when running with pytest we can maybe globally capture those? I don't want to pollute the docstring with things like catch_warnings.

@hissashirocha we decided in another issue to just os.remove the file when done.

TomAugspurger · 2018-03-13T14:46:03Z

I went back and forth on where to put the import :)

jorisvandenbossche · 2018-03-13T14:48:57Z

@TomAugspurger the only problem now is that the actual warning is separated from the code block, which looks a bit strange in the html output.
But if you don't leave the blank line, you get a failing doctest.

The alternative you mentioned would be

>>> import warnings
>>> with warnings.catch_warnings(record=True) as w:
...     df2 = pd.read_csv('test.csv')
>>> w[0].message
pandas.errors.DtypeWarning('Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.')

But I agree this boilerplate is a bit unfortunate and distracts from what we actually want to show

jorisvandenbossche · 2018-03-13T14:55:00Z

Let's go ahead with the current solution for now. We can always later look into adding our own doctest-extensions (it should be possible to add doctest options while running our own doctests: eg https://github.com/astropy/pytest-doctestplus/blob/master/pytest_doctestplus/output_checker.py)

jorisvandenbossche · 2018-03-13T14:55:42Z

@hissashirocha Thanks for the PR!

TomAugspurger · 2018-03-13T14:59:51Z

Ah was just going to push this

diff --git a/pandas/errors/__init__.py b/pandas/errors/__init__.py
index 16654f022..ff34df64c 100644
--- a/pandas/errors/__init__.py
+++ b/pandas/errors/__init__.py
@@ -68,8 +68,7 @@ class DtypeWarning(Warning):
     ...                    'b': ['b'] * 300000})
     >>> df.to_csv('test.csv', index=False)
     >>> df2 = pd.read_csv('test.csv')
-
-        DtypeWarning: Columns (0) have mixed types
+    ... # DtypeWarning: Columns (0) have mixed types

Looks OK. I'll make a new PR.
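
A rough check of why the `...` line in the diff above passes, assuming standard doctest parsing: a line starting with `... ` continues the preceding source, so the comment is compiled together with the statement and no output is expected. The sample text is hypothetical and uses a plain assignment in place of `read_csv`.

```python
import doctest

# The "... # ..." line is parsed as part of the first example's source,
# so the comment is not compared against any output.
SAMPLE = """
>>> x = 1
... # DtypeWarning: Columns (0) have mixed types
>>> x
1
"""

test = doctest.DocTestParser().get_doctest(
    SAMPLE, globs={}, name="sample", filename=None, lineno=0
)
first_source = test.examples[0].source

runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test, out=lambda message: None)
```

The warning text then renders inside the same code block in the HTML output while remaining an ordinary comment to doctest.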

jorisvandenbossche · 2018-03-13T15:03:08Z

Ah, that's a good one :)

hissashirocha · 2018-03-13T16:20:59Z

It was a pleasure! This is my first time contributing to pandas and I really enjoyed it. Thanks for your help!

hissashirocha added 4 commits March 10, 2018 14:31

Improving docstring for DtypeWarning.

6cbdd7d

Improving docstring for DtypeWarning.

ab7f790

Improving docstring for DtypeWarning.

9423867

Improving docstring for DtypeWarning.

ed7e372

TomAugspurger reviewed Mar 10, 2018

jorisvandenbossche reviewed Mar 10, 2018

jreback requested changes Mar 10, 2018

jreback added Docs Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Mar 10, 2018

Improving docstring for DtypeWarning.

76fc248

Merge remote-tracking branch 'upstream/master' into docstring_dtype_w…

12c0ac5

…arning Made some adjustments after peer's suggestions.

TomAugspurger reviewed Mar 12, 2018

jorisvandenbossche mentioned this pull request Mar 12, 2018

DOC: file IO in doctest examples #20302

Closed

jorisvandenbossche reviewed Mar 12, 2018

Updated example

9be8cfd

- PEP8 - Move warning to "output" section - Fixed second read example - Cleanup

TomAugspurger added this to the 0.23.0 milestone Mar 13, 2018

TomAugspurger approved these changes Mar 13, 2018

move os import

9e7e129

jorisvandenbossche approved these changes Mar 13, 2018

jorisvandenbossche merged commit df87fd3 into pandas-dev:master Mar 13, 2018

TomAugspurger mentioned this pull request Mar 13, 2018

DOC: Formatting for DtypeWarning Docstring [ci skip] #20329

Merged

hissashirocha deleted the docstring_dtype_warning branch March 13, 2018 16:24
