
ENH: Pluggable SQL performance #36893

@xhochy

Description


Currently the pandas SQL logic uses SQLAlchemy, with results returned as Python objects before being converted to a DataFrame. While the API is simple, the intermediate Python objects give it poor performance characteristics. Faster alternatives already exist, but with inconsistent and more complicated APIs.

In addition to lacking a uniform API, these implementations are only concerned with fast result encoding/decoding. Functionality like the automatic table creation we have in pandas.DataFrame.to_sql doesn't exist there.

Thus it would be nice to have a way to use these connector implementations behind the standard pandas API.

Faster alternatives

  • bcpandas: Use BCP to insert data into MS SQL Server
  • turbodbc: Fast access to databases that have an ODBC driver, via Apache Arrow (fetchallarrow().to_pandas(); see the sketch after this list), e.g. MS SQL or Exasol.
  • snowflake-connector-python: Brings native Apache Arrow to speed via fetch_all_pandas()
  • pyarrow.jvm / JDBC: Use pyarrow's JVM module to get faster access to JDBC results via Arrow
  • postgres-copy-arrow-decode: Not yet open-sourced (shame on me): a Cython-based encoder/decoder for Postgres' COPY BINARY command that decodes Postgres' binary protocol from/to Arrow. Works together with psycopg2 and gives roughly a 2x speedup, plus type stability, over the COPY CSV method in pandas' docs.
  • PostgresAdapter: NumPy support for Postgres connections
  • d6tstack: Fast insert into Postgres/MySQL/MSSQL via CSV files
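
As an example of how inconsistent these fast paths currently are, the turbodbc route looks roughly like this (the DSN and query are placeholders):

import turbodbc

# Placeholder DSN; works for any database with an ODBC driver.
connection = turbodbc.connect(dsn="my_dsn")
cursor = connection.cursor()
cursor.execute("SELECT * FROM my_table")

# turbodbc decodes the result set directly into Apache Arrow and only
# then converts to pandas, skipping intermediate Python objects.
df = cursor.fetchallarrow().to_pandas()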

General implementation idea

  • pandas users should only deal with read_sql and to_sql in their current form.
  • There shouldn't be any new hard dependencies in pandas.
  • The SQLAlchemy engine is a nice uniform interface for specifying a database connection; keep this.
  • We only need a limited set of operations implemented by the performance backend, basically to_sql(DataFrame) and read_sql(query) -> DataFrame. Table creation, index adjustment and further convenience functionality can still be handled by the high-level SQLAlchemy layer.

Implementation idea (1) – Dispatch on type(engine.raw_connection().connection)

SQLAlchemy exposes the underlying connection of the database driver via engine.raw_connection(). This is a useful way to detect how we connect to the database. We could provide a registry where each backend implementation provides a function supports_connection(engine.raw_connection().connection) -> bool to determine whether it can be used.
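
A rough sketch of such a registry, assuming a hypothetical register_sql_backend hook (none of these names exist in pandas today):

_SQL_BACKENDS = []

def register_sql_backend(backend):
    # A backend package would call this at import time.
    _SQL_BACKENDS.append(backend)

def _resolve_backend(sqlalchemy_engine):
    # Unwrap the raw DBAPI connection behind the SQLAlchemy engine and
    # ask each registered backend whether it supports it.
    raw = sqlalchemy_engine.raw_connection().connection
    for backend in _SQL_BACKENDS:
        if backend.supports_connection(raw):
            return backend
    return None  # no match: fall back to the plain SQLAlchemy path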

Pro:

  • Users don't need to change their code. If the backend is loaded, they automatically get the speedup.

Con:

  • Users need to take care that the backend is loaded; otherwise queries will work but stay slow.
  • Only one implementation can be registered per database connection class.

Implementation idea (2) – Extend the method= param

pandas.DataFrame.to_sql already has a method parameter where the user can supply a callable that is used to insert the data into the database. Currently the callable gets a row iterator, not a DataFrame, so the interface is already hard-wired to convert the intermediate result into Python objects. Instead of providing a row iterator, we could pass the original DataFrame to this callable, as sketched below.
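
A minimal sketch of the two contracts; the current signature is the documented one, the proposed one is illustrative:

import pandas as pd

# Current contract: pandas hands the callable an iterator of row
# tuples, so the conversion to Python objects has already happened.
def insert_rows(table, conn, keys, data_iter):
    for row in data_iter:
        ...  # per-row or batched insert

# Proposed contract: pass the DataFrame itself, so a backend can
# bulk-encode it (e.g. via Arrow) without the Python-object detour.
def insert_frame(table, conn, keys, df: pd.DataFrame):
    ...  # bulk insert straight from the frame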

Pro:

  • Clear control on which method is used
  • Backend implementations could be used via method=turbodbc.pandas.to_sql

Con:

  • Potentially a breaking change to the method parameter, or a second parameter needs to be added that does nearly the same thing.
  • Needs explicit usage for the speedup.

Implementation idea (3) - Introduce engine= param

As we have with the Parquet and CSV IO implementations, we could also provide an engine parameter that lets users switch implementations by name. A prototype implementation would look like:

import pandas as pd

class DatabaseEngine:

    name = "fastengine"

    @staticmethod
    def supports_connection(connection) -> bool:  # for engine="auto"
        return isinstance(connection, FastConnection)

    @staticmethod
    def to_sql(engine, df: pd.DataFrame, table: str):
        ...

    @staticmethod
    def from_sql(engine, query: str) -> pd.DataFrame:
        ...

pd.register_sql_backend(DatabaseEngine.name, DatabaseEngine)
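
Usage could then look like this, where the engine= keyword on the SQL functions is the proposed addition:

df.to_sql("my_table", con, engine="fastengine")
result = pd.read_sql("SELECT * FROM my_table", con, engine="auto")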

Pro:

  • In contrast to (1), here you would get an error if the backend is not loaded.
  • Clear control on which method is used.
  • Users don't need to provide the exact function, only the name of the engine.
  • We could provide an engine="auto" setting which, on explicit usage, tries to find a matching backend and otherwise falls back to the plain SQLAlchemy implementation.
  • We can provide some of these engines as part of pandas, others can come from third-party libraries.

Con:

  • Needs explicit usage for the speedup.

Personally, I would prefer this approach.
