The data-diff shows incorrect row differences between Postgres and BigQuery table #532
Description
I ran the following command to use data-diff for a comparison of tables between PostgreSQL and BigQuery:
Make sure to include the following (minus sensitive information):
"postgresql://<username>:<password>@<db_host>:<port>/<database>" \
application \
"bigquery://<project_name>/<dataset_name>" \
application \
-k id -c id \
--debug
The primary key column in both tables is id, with a format of application-xxxxxxxxYYYYYY-zzzz (e.g., application-2gXxHwCux0HqFNsV721hvTPhJRa97A-0dJgh).
Issue:
The data-diff output showed a difference of >500 rows between the two tables, while the actual difference is only around 4 records.
Further Details:
PostgreSQL version: PostgreSQL 13.4.
After the segment split, data-diff prepares the following queries for both Postgres and BigQuery (multiple queries like this are created for each segment):
PostgreSQL query (generated by data-diff):
SELECT count(*), sum(('x' || substring(md5("id"::varchar), 18))::bit(60)::bigint) FROM "application" WHERE ("id" >= 'application-2gIxHwCux0HqFNSV727hvTPhJRa97A-AAAAA') AND ("id" < 'application-4-xuP3G1v 6g238AtkhFZAHY843gchXXXXX')
BigQuery query (generated by data-diff):
SELECT count(*), sum(cast(cast( ('0x' || substr(TO_HEX(md5(cast(`id` as string))), 18)) as int64) as numeric)) FROM `<project>`.`<dataset>`.`application` WHERE (`id` >= 'application-2gIxHwCux0HqFNSV727hvTPhJRa97A-AAAAA') AND (`id` < 'application-4-xuP3G1v 6g238AtkhFZAHY843gchXXXXX')
Upon running these queries separately on Postgres and BigQuery, there is a significant difference in row count.