Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.
This repository was archived by the owner on May 17, 2024. It is now read-only.

The data-diff shows incorrect row differences between Postgres and BigQuery table #532

Closed
@chandu-chai

Description

@chandu-chai

I ran the following command to use data-diff for a comparison of tables between PostgreSQL and BigQuery:
Make sure to include the following (minus sensitive information):

        "postgresql://<username>:<password>@<db_host>:<port>/<database>" \
        application \
        "bigquery://<project_name>/<dataset_name>" \
        application \
        -k id -c id \
        --debug

The primary key column in both tables is id, with a format of application-xxxxxxxxYYYYYY-zzzz (e.g., application-2gXxHwCux0HqFNsV721hvTPhJRa97A-0dJgh).

Issue:
The data-diff output showed a difference of >500 rows between the two tables, while the actual difference is only around 4 records.

Further Details:
PostgreSQL version: PostgreSQL 13.4.

After the segment split, data-diff prepares the following queries for both Postgres and BigQuery (multiple queries like this are created for each segment):

PostgreSQL query (generated by data-diff):

SELECT count(*), sum(('x' || substring(md5("id"::varchar), 18))::bit(60)::bigint) FROM "application" WHERE ("id" >= 'application-2gIxHwCux0HqFNSV727hvTPhJRa97A-AAAAA') AND ("id" < 'application-4-xuP3G1v 6g238AtkhFZAHY843gchXXXXX')

BigQuery query (generated by data-diff):

SELECT count(*), sum(cast(cast( ('0x' || substr(TO_HEX(md5(cast(`id` as string))), 18)) as int64) as numeric)) FROM `<project>`.`<dataset>`.`application` WHERE (`id` >= 'application-2gIxHwCux0HqFNSV727hvTPhJRa97A-AAAAA') AND (`id` < 'application-4-xuP3G1v 6g238AtkhFZAHY843gchXXXXX')

Upon running these queries separately on Postgres and BigQuery, there is a significant difference in row count.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingnon-dbtUse cases outside of dbtstale_immuneImmunity to stale bottriage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions