Description
Research
-
I have searched the [pandas] tag on StackOverflow for similar questions.
-
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
Question about pandas
I'd like to bring attention to this issue where to_sql
is being blocked by an unrelated process because pandas requires checking information_schema
metadata tables.
My example is MSSQL
(SQL Server) specific but I think it could apply to other DB systems.
steps to reproduce:
- run python code first (automatically creates table)
- run create table (new unrelated table) in an open transaction
-
- key point is that this is unrelated to the table you are inserting in pandas
- re-run python code to insert 10 records. this gets blocked bc it can't query
information_schema
Other scenarios that lock information_schema
cause this blocking in pandas (rebuilding clustered columnstore index on large table)
Notes
In my opinion, we should be able to insert into an existing table no matter if there are other processes going on.
-
for someone who knows the pandas internals, maybe they can chime in?
-
see relevant Stack Overflow question
-
Sqlalchemy GH issue - SQLAlchemy hangs during insert while querying [INFORMATION_SCHEMA].[TABLES] when an open transaction exists that has created a new table sqlalchemy/sqlalchemy#5679
Root cause
- This only occurs if the table name contains any uppercase letters.
- Per comments below, it is related to pandas case_sensitivity check line number.
- Due to this PR for mysql
- See this SQL Alchemy comment
**SQL code to create unrelated table **
begin tran
-- in same database
create table dbo.MyOtherTableUnrelatedProcess
(
a int
)
--rollback
code i used for testing the python piece
import pandas as pd
import sqlalchemy as sa
import urllib
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry', 'Isabel', 'Jack']
})
host = 'myhost'
schema = 'mydb'
params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};"
"SERVER=" + host + ";"
"DATABASE=" + schema + ";"
"trusted_connection=yes")
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params), fast_executemany=True)
print('writing')
# note: MyMainTable is blocked because of the open transaction on unrelated process. to_sql can't querying information_schema
df.to_sql('MyMainTable', con=engine, if_exists='append', index=False)
print('finished')
checking the server, this is the hung query called by pandas/sqlalchemy
SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
FROM [INFORMATION_SCHEMA].[TABLES]
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(@P1 AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(@P2 AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]