Description

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Problem description
Currently, to map the Pandas data type to a SQLAlchemy type (https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1066-L1069), the code reads:
```python
elif col_type == "integer":
    if col.dtype == "int32":
        return Integer
    else:
        return BigInteger
```
This means integers with a capacity of less than 32 bits are written to the database as if they were 64-bit. One way to end up with smaller integers is through `pd.to_numeric` with `downcast='integer'`.
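To illustrate, here is a minimal, self-contained sketch (SQLite is used purely because an in-memory database is easy to inspect; the same type mapping applies to Postgres, and the column name is made up):

```python
import pandas as pd
import sqlalchemy

# pd.to_numeric downcasts these small values to int8 rather than
# the default int64.
df = pd.DataFrame({"small": [1, 2, 3]})
df["small"] = pd.to_numeric(df["small"], downcast="integer")
print(df.dtypes)  # small    int8

# Write to an in-memory database and inspect the resulting schema:
# the int8 column comes out as BIGINT, not a smaller integer type.
engine = sqlalchemy.create_engine("sqlite://")
df.to_sql("example", engine, index=False)
print(sqlalchemy.inspect(engine).get_columns("example"))
```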
Since there is already a check for `col_type == "integer"`, I think switching the `col.dtype` check to this could be a possible fix:
```python
elif col_type == "integer":
    if col.dtype == "int64":
        return BigInteger
    else:
        return Integer
```
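To make the intended effect concrete, here is a standalone sketch of the proposed branch (the function name is made up; this is not the actual pandas internals):

```python
from sqlalchemy.types import BigInteger, Integer

def proposed_integer_mapping(dtype_name):
    # Only true 64-bit integers need BIGINT; int8/int16/int32 all
    # fit within a standard 32-bit INTEGER column.
    if dtype_name == "int64":
        return BigInteger
    return Integer

for name in ["int8", "int16", "int32", "int64"]:
    print(f"{name} -> {proposed_integer_mapping(name).__name__}")
```

One caveat a PR would probably need to address: unsigned dtypes such as `uint32` can exceed the 32-bit signed range, so they may need separate handling.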
But I'm not sure how to get started with an official PR or if this is a sane thing to do.
For context (this part is not an issue for Pandas, just explaining my interest in this issue), I discovered this when using the `df.to_sql` method to persist datasets to a Postgres database. From there, I use Postgraphile to auto-generate a GraphQL endpoint. I found that the `BigInteger` type ends up resolving as a string, because JavaScript can't safely represent such large numbers. This would be fine if the source data warranted the high capacity, but for me it often doesn't.
Expected Output
That the column types written to a new table in the database more tightly match their data types in Pandas.
Workarounds
If using `df.to_sql`, one could set the `dtype` argument, but this can be unwieldy when there are many columns in the DataFrame.
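For example (a sketch; the engine, data, and table name are placeholders), the `dtype` mapping can at least be built programmatically rather than written out by hand:

```python
import numpy as np
import pandas as pd
import sqlalchemy
from sqlalchemy.types import Integer

engine = sqlalchemy.create_engine("sqlite://")       # placeholder engine
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # placeholder data

# Force every integer column to a plain INTEGER instead of BIGINT.
int_cols = df.select_dtypes(include=[np.integer]).columns
df.to_sql("my_table", engine, index=False,
          dtype={col: Integer for col in int_cols})
```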
For now, I have this helper function to downcast integer columns to their lowest capacity and then back up to 32-bit, so they persist to the database with the desired column type:
```python
import pandas as pd

def cast_to_int32(df):
    # Downcast to the lowest possible representation.
    for col in df.select_dtypes(include=['int']).columns.values:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    # Upcast back to 32-bit (since that's what gets persisted correctly).
    for col in df.select_dtypes(include=['int8', 'int16']).columns.values:
        df[col] = df[col].astype('int32')
    return df
```
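Usage is then just a matter of running the helper before persisting (another sketch with placeholder data, engine, and table name):

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")  # placeholder engine
df = pd.DataFrame({"a": [1, 2, 3]})             # placeholder data

df = cast_to_int32(df)
df.to_sql("my_table", engine, index=False)  # column "a" persists as INTEGER
```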
Output of `pd.show_versions()`
```
INSTALLED VERSIONS
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-39-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None
```