DOC: update the pandas.errors.DtypeWarning docstring #20208

Merged
Changes from 4 commits
50 changes: 47 additions & 3 deletions pandas/errors/__init__.py
@@ -38,9 +38,53 @@ class ParserError(ValueError):

class DtypeWarning(Warning):
"""
Warning that is raised for a dtype incompatibility. This
can happen whenever `pd.read_csv` encounters non-
uniform dtypes in a column(s) of a given CSV file.
Warning raised when importing different dtypes in a column from a file.
Contributor

when reading (not importing)

Contributor Author

Thanks!


Raised for a dtype incompatibility. This can happen whenever `pd.read_csv`
Member

Can you also use `read_csv` here instead of `pd.read_csv`? (Same below for `read_table`.)

Contributor Author

It's done!

or `pd.read_table` encounter non-uniform dtypes in a column(s) of a given
CSV file.

It only happens when dealing with larger files.
Contributor

Instead of this sentence, you can add in the Notes section that this can happen with larger files. It's because the dtype checking happens per chunk that is read.

Contributor Author

Thanks!
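
To make the point about per-chunk dtype checking concrete, here is a minimal sketch. It assumes that passing an explicit `chunksize` to `read_csv` is a fair stand-in for the parser's internal chunking, which cannot be observed directly. The in-memory CSV and the variable names are illustrative only and are not part of the PR.

```python
import io

import pandas as pd

# Tiny in-memory CSV: the first two rows look numeric, the third does not.
csv_text = "a\n1\n1\nX\n1\n"

# Read two rows at a time so each chunk has its dtype inferred on its own.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

first, second = chunks
print(first["a"].dtype)      # an integer dtype: only '1' values were seen
print(second["a"].tolist())  # strings: this chunk contained 'X'
```

The same input, `1`, ends up as an integer in one chunk and as a string in the other, which is the situation the warning reports when chunks are combined into a single column.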


See Also
--------
pd.read_csv : Read CSV (comma-separated) file into a DataFrame.
Contributor

pd -> pandas I think.

Contributor Author

OK!

pd.read_table : Read general delimited file into a DataFrame.

Notes
-----
Despite the warning, the CSV file is imported with mixed types in a single
column. See the examples below to better understand this issue.
Contributor

imported -> read. The dtype of the column will be object.
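
As a small illustration of the object-dtype remark, the sketch below builds a mixed column by hand rather than reading a file. The series and its values are hypothetical, chosen only to mirror the `'1'`, `1`, `'X'` mix from the docstring example.

```python
import pandas as pd

# A column holding both str and int values, as in the mixed-type case.
mixed = pd.Series(["1", 1, "X"], name="a")

print(mixed.dtype)  # object

# Each element keeps its own Python type inside the object column.
type_names = [type(value).__name__ for value in mixed]
print(type_names)
```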


Examples
--------
This example creates and reads a large CSV file with a column that contains
`int` and `str`.

>>> df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000,
... 'b':['b']*300000})
>>> df.to_csv('test', sep='\t', index=False, na_rep='NA')
Member

Can you add '.csv' to the temp file name?
I would also leave out sep='\t' and na_rep='NA', as they are not relevant to illustrating the warning.

Contributor Author

Done!

>>> df2 = pd.read_csv('test', sep='\t')
Traceback (most recent call last):
Contributor

I think if you add a line like

>>> df2 = pd.read_csv('test', sep='\t')
... # doctest: +SKIP

The doctest might work.

Member

Yeah, I don't think it will be possible to get Warnings to work with doctest, so adding a # doctest: +SKIP is fine

Contributor Author

I'm sorry, but I couldn't understand what `# doctest: +SKIP` is supposed to do. I added it right after the read_csv call and just got an error. Can you help me?

################################################################################
################################### Doctests ###################################
################################################################################


Line 28, in pandas.errors.DtypeWarning
Failed example:
df2 = pd.read_csv('test', sep=' ')
Expected:
# doctest: +SKIP
DtypeWarning: Columns (0) have mixed types...
Got nothing
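
The failure above is consistent with the directive having been placed on a line of its own, where doctest treats it as expected output. The following standard-library sketch shows the difference. The helper `failures` and the three sample docstrings are hypothetical and not taken from the PR.

```python
import doctest


def failures(docstring):
    """Run the examples in ``docstring`` and return how many failed."""
    test = doctest.DocTestParser().get_doctest(docstring, {}, "demo", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the count matters here.
    return runner.run(test, out=lambda text: None).failed


# Directive on the source line: the example is skipped, so 1 / 0 never runs.
same_line = """
>>> 1 / 0  # doctest: +SKIP
"""

# Directive on a "..." continuation line: still part of the source, so skipped.
continuation = """
>>> 1 / 0
... # doctest: +SKIP
"""

# Directive on a bare line: doctest reads it as expected output and fails.
bare_line = """
>>> x = 1
# doctest: +SKIP
"""

print(failures(same_line), failures(continuation), failures(bare_line))
```

Only the third form fails, because the comment is compared against the (empty) output of `x = 1` instead of being parsed as a directive.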

...
DtypeWarning: Columns (0) have mixed types...

Important to notice that df2 will contain both `str` and `int` for the
same input, '1'.

>>> df2.iloc[262140,0]
Contributor

PEP8: space after comma

Contributor Author

Done!

'1'
>>> type(df2.iloc[262140,0])
<class 'str'>
>>> df2.iloc[262150,0]
1
>>> type(df2.iloc[262150,0])
<class 'int'>

One way to solve this issue is using the parameter `converters` in the
`read_csv` and `read_table` functions to explicit the conversion:

>>> df2 = pd.read_csv('test', sep='\t', converters={'a': str})
Contributor

FYI, this is still test not test.csv

Member

Another thing: I think we should recommend `dtype={'a': str}` instead?
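
For comparison, a minimal sketch of the two options side by side. It reads a small in-memory CSV rather than the 300,000-row file from the docstring, so it does not trigger the warning. The data and variable names are illustrative assumptions.

```python
import io

import pandas as pd

csv_text = "a,b\n1,x\n2,y\n"

# Default inference: column 'a' is parsed as integers.
default = pd.read_csv(io.StringIO(csv_text))

# Option used in the docstring: convert each value of column 'a' with str.
via_converters = pd.read_csv(io.StringIO(csv_text), converters={"a": str})

# Option suggested in review: declare the column's dtype up front.
via_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"a": str})

print(default["a"].tolist())         # [1, 2]
print(via_converters["a"].tolist())  # ['1', '2']
print(via_dtype["a"].tolist())       # ['1', '2']
```

Both options yield strings for every row, so neither leaves the column's type to per-chunk inference. `dtype` states the intent declaratively, while `converters` runs a Python callable on each value.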

"""

