DOC: update the pandas.errors.DtypeWarning docstring #20208

Merged

Contributor

@hissashirocha hissashirocha commented Mar 10, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
#################### Docstring (pandas.errors.DtypeWarning) ####################
################################################################################

Warning raised when reading different dtypes in a column from a file.

Raised for a dtype incompatibility. This can happen whenever `read_csv`
or `read_table` encounter non-uniform dtypes in a column(s) of a given
CSV file.

See Also
--------
pandas.read_csv : Read CSV (comma-separated) file into a DataFrame.
pandas.read_table : Read general delimited file into a DataFrame.

Notes
-----
This warning is issued when dealing with larger files because the dtype
checking happens per chunk read.

Despite the warning, the CSV file is read with mixed types in a single
column which will be an object type. See the examples below to better
understand this issue.

Examples
--------
This example creates and reads a large CSV file with a column that contains
`int` and `str`.

>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
...                    'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> DtypeWarning: Columns (0) have mixed types... # doctest: +SKIP

Important to notice that df2 will contain both `str` and `int` for the
same input, '1'.

>>> df2.iloc[262140, 0]
'1'
>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> df2.iloc[262150, 0]
1
>>> type(df2.iloc[262150, 0])
<class 'int'>

One way to solve this issue is using the parameter `converters` in the
`read_csv` and `read_table` functions to explicit the conversion:

>>> df2 = pd.read_csv('test', sep=' ', converters={'a': str})
scripts/validate_docstrings.py:268: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  runner.run(test)

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	No returns section found


See Also
--------
pd.read_csv : Read CSV (comma-separated) file into a DataFrame.
Contributor

pd -> pandas I think.

Contributor Author

OK!

Important to notice that df2 will contain both `str` and `int` for the
same input, '1'.

>>> df2.iloc[262140,0]
Contributor

PEP8: space after comma

Contributor Author

Done!

... 'b':['b']*300000})
>>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
>>> df2 = pd.read_csv('test', sep='\t')
Traceback (most recent call last):
Contributor

I think if you add a line like

>>> df2 = pd.read_csv('test', sep='\t')
... # doctest: +SKIP

The doctest might work.

Member

Yeah, I don't think it will be possible to get Warnings to work with doctest, so adding a # doctest: +SKIP is fine

Contributor Author

I'm sorry, but I couldn't understand what the doctest: +SKIP is supposed to do. I added it right after the read_csv and just got an error. Can you help me?

################################################################################
################################### Doctests ###################################
################################################################################


Line 28, in pandas.errors.DtypeWarning
Failed example:
df2 = pd.read_csv('test', sep=' ')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing
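
The failure above is consistent with how doctest parses directives: `# doctest: +SKIP` is only recognized when it is part of an example's source line. Placed on its own line below the statement, it is treated as expected output. A minimal sketch using only the standard library (the sample strings and the `run` helper are made up for illustration):

```python
import doctest

# Directive on the source line: the example is skipped entirely.
GOOD = '''
>>> 1 / 0  # doctest: +SKIP
'never checked'
>>> 1 + 1
2
'''

# Directive on its own line: it becomes part of the expected output.
BAD = '''
>>> x = 1
# doctest: +SKIP
'some warning text'
'''


def run(source):
    """Run one docstring-like string and return doctest's TestResults."""
    test = doctest.DocTestParser().get_doctest(source, {}, "demo", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None)


good = run(GOOD)
bad = run(BAD)
print("good failures:", good.failed)
print("bad failures:", bad.failed)
```

In the second string, `x = 1` prints nothing while doctest expects two lines of text, which mirrors the "Expected ... Got nothing" report in the comment above.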

uniform dtypes in a column(s) of a given CSV file.
Warning raised when importing different dtypes in a column from a file.

Raised for a dtype incompatibility. This can happen whenever `pd.read_csv`
Member

Can you also use read_csv here instead of pd.read_csv ? (same below for read_table)

Contributor Author

It's done!


>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
... 'b':['b']*300000})
>>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
Member

Can you add '.csv' to the temp file name?
I would also leave out the sep='\t' and na_rep='NA', as they are not relevant to illustrating the warning.

Contributor Author

Done!

Warning that is raised for a dtype incompatibility. This
can happen whenever `pd.read_csv` encounters non-
uniform dtypes in a column(s) of a given CSV file.
Warning raised when importing different dtypes in a column from a file.
Contributor

when reading (not importing)

Contributor Author

Tks!

or `pd.read_table` encounter non-uniform dtypes in a column(s) of a given
CSV file.

It only happens when dealing with larger files.
Contributor

Instead of this comment, you can add in the Notes section that this can happen in larger files. It's because the dtype checking happens per chunk that is read.
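
To make the per-chunk explanation concrete, here is a toy model in plain Python. It is not pandas' parser: `infer_chunk` and `read_column` are hypothetical helpers, and the chunk size is tiny so the effect shows up with twelve fields.

```python
def infer_chunk(values):
    """Toy inference: a chunk becomes int only if every field parses as int."""
    try:
        return [int(v) for v in values]
    except ValueError:
        # One non-numeric field keeps the whole chunk as strings.
        return list(values)


def read_column(fields, chunk_size):
    """Infer types chunk by chunk, then glue the chunks together."""
    column = []
    for start in range(0, len(fields), chunk_size):
        column.extend(infer_chunk(fields[start:start + chunk_size]))
    return column


fields = ["1"] * 4 + ["X"] * 4 + ["1"] * 4

# Chunked read: the last chunk holds only "1" fields, so it is parsed as int.
chunked = read_column(fields, chunk_size=5)
# Single pass over everything: the "X" fields force str for the whole column.
whole = read_column(fields, chunk_size=len(fields))

print({type(v).__name__ for v in chunked})
print({type(v).__name__ for v in whole})
```

In this model the same input `'1'` ends up as `str` in one chunk and `int` in another. That is the situation the docstring example shows, and it may be why the warning text suggests `low_memory=False` or an explicit `dtype`.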

Contributor Author

Tks!

Notes
-----
Despite the warning, the CSV file is imported with mixed types in a single
column. See the examples below to better understand this issue.
Contributor

imported -> read. The dtype of the column will be object.

@jreback jreback added the Docs, Error Reporting (incorrect or improved errors from pandas), and IO CSV (read_csv, to_csv) labels Mar 10, 2018

codecov bot commented Mar 11, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@0d86742).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #20208   +/-   ##
=========================================
  Coverage          ?    91.7%           
=========================================
  Files             ?      150           
  Lines             ?    49156           
  Branches          ?        0           
=========================================
  Hits              ?    45078           
  Misses            ?     4078           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.08% <ø> (?)
#single     41.85% <ø> (?)

Impacted Files              Coverage Δ
pandas/errors/__init__.py   92.3% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0d86742...9e7e129.

…arning

Made some adjustments after peer's suggestions.
Contributor

@TomAugspurger TomAugspurger left a comment

@jorisvandenbossche running this doctest will leave behind a test.csv file, which isn't ideal.

We can

  • convert this to a regular code-block
  • Use io.StringIO, which is uglier but doesn't leave behind a file
  • Add a bit to os.remove('test.csv')

Preferences?
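
For reference, the `io.StringIO` option could look roughly like the sketch below. It uses a small made-up frame rather than the 300000-row example, so it only shows the round trip without a file on disk and is not expected to trigger the warning.

```python
import io

import pandas as pd

# Small illustrative frame; the column names follow the docstring example.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write the CSV text into an in-memory buffer instead of 'test.csv'.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind so read_csv starts from the beginning of the buffer.
buffer.seek(0)
df2 = pd.read_csv(buffer)
print(df2)
```

Nothing is left behind on disk, at the cost of the extra buffer handling that the comment above calls uglier.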

One way to solve this issue is using the parameter `converters` in the
`read_csv` and `read_table` functions to explicit the conversion:

>>> df2 = pd.read_csv('test', sep='\t', converters={'a': str})
Contributor

FYI, this is still test not test.csv

Member

Another thing: I think we should recommend dtype={'a': str} instead ?
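
A minimal sketch of the `dtype` suggestion, using a tiny in-memory CSV (the data is made up and far too small to trigger the warning; it only shows the effect of the parameter):

```python
import io

import pandas as pd

# Column 'a' mixes numeric-looking and non-numeric fields.
csv_text = "a,b\n1,b\nX,b\n1,b\n"

# Declaring the dtype up front asks the parser to keep every field
# in 'a' as a string instead of inferring a type.
df2 = pd.read_csv(io.StringIO(csv_text), dtype={"a": str})

value_types = {type(v).__name__ for v in df2["a"]}
print(value_types)
```

With the dtype declared, every value in column `a` comes back as a string, including the `'1'` fields.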

@jorisvandenbossche
Member

Hmm, yes, that's a general problem though, so let's discuss that somewhere else

@jorisvandenbossche
Member

@TomAugspurger see #20302

Member

@jorisvandenbossche jorisvandenbossche left a comment

Small comment in addition to the one of @TomAugspurger

Looks good for the rest!

... 'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> DtypeWarning: Columns (0) have mixed types... # doctest: +SKIP
Member

Can you remove the >>> on this line? (it might be you need to move the "# doctest: +SKIP" to the line above after the read_csv)

Contributor Author

@jorisvandenbossche @TomAugspurger I've added "import os" and an "os.remove('test.csv')" at the end of the example, just as you agreed in #20302, and also put "# doctest: +SKIP" right before the warning, like this:

>>> import os
>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
...                    'b':['b']*300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
>>> os.remove('test.csv')
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...

But I got this message on validation:

################################################################################
################################### Doctests ###################################
################################################################################


Line 32, in pandas.errors.DtypeWarning
Failed example:
os.remove('test.csv')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing

And also, removing ".csv" from to_csv and read_csv, like Tom suggested, raises two possible errors:

os.remove('test.csv'): FileNotFoundError: [Errno 2] No such file or directory: 'test.csv'
os.remove('test'): FileNotFoundError: File b'test' does not exist

What do you guys recommend?

- PEP8
- Move warning to "output" section
- Fixed second read example
- Cleanup
@TomAugspurger
Contributor

Updated the example, mind taking a look @jorisvandenbossche?

This unfortunately leaks the warning to stdout. # doctest: +SKIP doesn't work because we need the example to run, so that df2 is defined. I'm not sure of a good way, but when running with pytest we can maybe globally capture those? I don't want to pollute the docstring with things like catch_warnings.

@hissashirocha we decided in another issue to just os.remove the file when done.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Mar 13, 2018
@TomAugspurger
Contributor

I went back and forth on where to put the import :)

@jorisvandenbossche
Member

@TomAugspurger the only problem now is that the actual warning is separated from the code block, which looks a bit strange in the html output.
But if we don't leave the blank line, we get a failing doctest.

The alternative you mentioned would be

>>> import warnings
>>> with warnings.catch_warnings(record=True) as w:
...     df2 = pd.read_csv('test.csv')
>>> w[0].message
pandas.errors.DtypeWarning('Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.')

But I agree this boilerplate is a bit unfortunate and distracts from what we actually want to show
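
The `catch_warnings` pattern above can be tried without `test.csv`. A minimal sketch, where `MixedTypesWarning` and `read_something` are hypothetical stand-ins for `DtypeWarning` and `read_csv`:

```python
import warnings


class MixedTypesWarning(Warning):
    """Hypothetical stand-in for pandas.errors.DtypeWarning."""


def read_something():
    """Hypothetical reader that warns and still returns mixed data."""
    warnings.warn("Columns (0) have mixed types.", MixedTypesWarning)
    return ["1", 1]


with warnings.catch_warnings(record=True) as caught:
    # Record every warning, even one that was already shown earlier.
    warnings.simplefilter("always")
    data = read_something()

print(caught[0].category.__name__, "-", caught[0].message)
```

The call still returns its data, and the warning is available as an object for inspection. The extra lines are the boilerplate the comment above would rather keep out of the docstring.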

@jorisvandenbossche
Member

Let's go ahead with the current solution for now. We can always later look into adding our own doctest-extensions (it should be possible to add doctest options while running our own doctests: eg https://github.com/astropy/pytest-doctestplus/blob/master/pytest_doctestplus/output_checker.py)
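
As a rough idea of what such a doctest extension could look like with the standard library alone: the sketch below registers a custom option flag and an output checker that honours it. The flag name `IGNORE_OUTPUT` and the `LenientChecker` class are illustrative assumptions, not something pandas ships.

```python
import doctest

# Register a new directive name so "# doctest: +IGNORE_OUTPUT" parses.
IGNORE_OUTPUT = doctest.register_optionflag("IGNORE_OUTPUT")


class LenientChecker(doctest.OutputChecker):
    """Accept any output for examples marked with +IGNORE_OUTPUT."""

    def check_output(self, want, got, optionflags):
        if optionflags & IGNORE_OUTPUT:
            return True
        return super().check_output(want, got, optionflags)


SOURCE = '''
>>> value = 1 + 1  # doctest: +IGNORE_OUTPUT
SomeWarning: text shown to the reader but never compared
>>> value
2
'''

test = doctest.DocTestParser().get_doctest(SOURCE, {}, "demo", None, 0)
runner = doctest.DocTestRunner(checker=LenientChecker(), verbose=False)
results = runner.run(test, out=lambda text: None)
print("failures:", results.failed)
```

Unlike `+SKIP`, the marked example still runs, so `value` is defined for the lines that follow.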

@jorisvandenbossche jorisvandenbossche merged commit df87fd3 into pandas-dev:master Mar 13, 2018
@jorisvandenbossche
Member

@hissashirocha Thanks for the PR!

@TomAugspurger
Contributor

Ah was just going to push this

diff --git a/pandas/errors/__init__.py b/pandas/errors/__init__.py
index 16654f022..ff34df64c 100644
--- a/pandas/errors/__init__.py
+++ b/pandas/errors/__init__.py
@@ -68,8 +68,7 @@ class DtypeWarning(Warning):
     ...                    'b': ['b'] * 300000})
     >>> df.to_csv('test.csv', index=False)
     >>> df2 = pd.read_csv('test.csv')
-
-        DtypeWarning: Columns (0) have mixed types
+    ... # DtypeWarning: Columns (0) have mixed types

Looks OK. I'll make a new PR.
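
The diff above appears to rely on doctest treating a `...` line as a continuation of the source, so a comment placed there is part of the statement, not expected output. A small standard-library sketch of that behaviour (the warning text and variable name are made up):

```python
import doctest

SOURCE = '''
>>> total = 1 + 1
... # SomeWarning: shown to the reader, ignored by doctest
>>> total
2
'''

test = doctest.DocTestParser().get_doctest(SOURCE, {}, "demo", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test, out=lambda text: None)

# The comment line belongs to the first example's source; nothing is expected.
first = test.examples[0]
source_lines = first.source.splitlines()
print(source_lines)
print("expected output:", repr(first.want))
print("failures:", results.failed)
```

The statement still executes, the comment stays visually attached to the code block, and no blank line is needed to keep the doctest passing.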

@jorisvandenbossche
Member

Ah, that's a good one :)

@hissashirocha
Contributor Author

It was a pleasure! This is my first time contributing to pandas and I really enjoyed it. Thanks for your help!

@hissashirocha hissashirocha deleted the docstring_dtype_warning branch March 13, 2018 16:24
Labels
Docs, Error Reporting, IO CSV
4 participants