API: Avoid Hidden numeric heuristics #53781

Open
@mroeschke

There are several places where pandas uses hidden heuristics or thresholds to dictate behavior that is neither obvious nor configurable to the user. IIRC, there have been bugs in rolling and to_datetime that only appeared when the data contained a particular value or was a certain size, which makes them hard to diagnose.

Ideally we should:

  1. Not change behavior due to some data characteristic introspection
  2. At least expose an option that lets the user control the heuristic
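
As a rough illustration of point 2, here is a minimal sketch of what exposing such thresholds could look like. `HeuristicOptions`, its field names, and `should_use_numexpr` are hypothetical and not part of pandas; the default values are copied from the list below, and the direction of the comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HeuristicOptions:
    """Hypothetical container for thresholds that are currently hard-coded."""

    numexpr_min_elements: int = 1_000_000  # mirrors _MIN_ELEMENTS from the list
    to_datetime_cache_at: int = 50  # mirrors start_caching_at from the list


def should_use_numexpr(n_elements: int, options: HeuristicOptions) -> bool:
    # Assumption: the fast path is taken only above the threshold.
    return n_elements > options.numexpr_min_elements


defaults = HeuristicOptions()
tuned = HeuristicOptions(numexpr_min_elements=5_000_000)

print(should_use_numexpr(2_000_000, defaults))  # default threshold
print(should_use_numexpr(2_000_000, tuned))  # user-raised threshold
```

With an object like this, the same input size can take either code path depending on a value the user can see and change, rather than on a constant buried in the implementation.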

CSV reading tokenizer chunksize

int64_t DEFAULT_CHUNKSIZE = 256 * 1024

CSV line buffer size

heuristic = 2**20 // self.table_width
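
To make the data dependence concrete, the arithmetic in that expression can be evaluated for a few table widths. This is only a sketch of the formula quoted above: `line_buffer_heuristic` is a hypothetical name, and how the result is used afterwards is not shown here.

```python
def line_buffer_heuristic(table_width: int) -> int:
    """Evaluate the quoted expression 2**20 // table_width."""
    return 2**20 // table_width


for width in (1, 10, 1000):
    # Wider tables get a proportionally smaller value.
    print(width, line_buffer_heuristic(width))
```

The value shrinks as the number of columns grows, so two files with the same number of rows can be processed with different buffer sizes.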

Number of elements at which numexpr is used automatically

_MIN_ELEMENTS = 1_000_000

TDA iter chunk size processing

chunksize = 10000

Something pytables related

_max_selectors = 31

chunksize = 100000

Number of elements at which to_datetime automatically uses caching

start_caching_at = 50
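
A size-triggered cache of this kind can be sketched in plain Python as follows. This is not the pandas implementation: `parse_dates` is a hypothetical helper, the strict `>` comparison is an assumption, and any additional conditions the real code applies are left out. The point is only that the threshold becomes a visible parameter.

```python
from datetime import datetime

START_CACHING_AT = 50  # value quoted above


def parse_dates(strings, fmt="%Y-%m-%d", start_caching_at=START_CACHING_AT):
    """Parse date strings, memoizing repeats only for large enough inputs.

    Returns the parsed values and whether the cache path was taken.
    """
    use_cache = len(strings) > start_caching_at  # assumed comparison
    if not use_cache:
        return [datetime.strptime(s, fmt) for s in strings], False

    cache = {}
    parsed = []
    for s in strings:
        if s not in cache:
            cache[s] = datetime.strptime(s, fmt)
        parsed.append(cache[s])
    return parsed, True


_, cached_small = parse_dates(["2020-01-01"] * 10)
_, cached_large = parse_dates(["2020-01-01"] * 100)
_, cached_forced = parse_dates(["2020-01-01"] * 10, start_caching_at=5)
print(cached_small, cached_large, cached_forced)
```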

Chunk size to use when writing csv

return (100000 // (len(self.cols) or 1)) or 1
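
The expression above can be worked through for a few column counts. In this sketch `csv_write_chunksize` is a hypothetical standalone function that takes the column count directly instead of reading `self.cols`.

```python
def csv_write_chunksize(n_cols: int) -> int:
    """Evaluate (100000 // (n_cols or 1)) or 1 for a given column count."""
    return (100000 // (n_cols or 1)) or 1


for n_cols in (0, 10, 200_000):
    print(n_cols, csv_write_chunksize(n_cols))
```

The two `or 1` guards cover the edge cases: zero columns would otherwise divide by zero, and more than 100000 columns would otherwise produce a chunk size of 0.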

Number of regexes to cache when parsing times

_CACHE_MAX_SIZE = 5 # Max number of regexes stored in _regex_cache

Rank tolerance

float64_t FP_ERR = 1e-13
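
To show why a fixed tolerance can surprise users, here is a minimal pure-Python sketch of average ranking in which adjacent sorted values closer than a tolerance count as ties. It assumes this is how the constant is applied; `average_rank` is a hypothetical function, not the pandas routine.

```python
FP_ERR = 1e-13  # value quoted above


def average_rank(values, tol=FP_ERR):
    """Average ranks where sorted neighbours within `tol` are treated as tied."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0  # first sorted position of the current tie group
    for pos in range(len(order)):
        is_last = pos == len(order) - 1
        if is_last or abs(values[order[pos + 1]] - values[order[pos]]) > tol:
            # Mean of the 1-based positions start+1 .. pos+1.
            avg = (start + pos) / 2 + 1
            for k in range(start, pos + 1):
                ranks[order[k]] = avg
            start = pos + 1
    return ranks


data = [1.0, 1.0 + 1e-14, 2.0]
print(average_rank(data))  # first two values fall within the tolerance
print(average_rank(data, tol=0.0))  # exact comparison keeps them distinct
```

With the default tolerance the first two values share a rank even though they are different floats, which is the kind of value-dependent behavior the issue describes.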

isin algorithm selection

len(comps_array) > 1_000_000

Value formatting

has_large_values = (abs_vals > 1e6).any()

Number of elements to populate hash table

_SIZE_CUTOFF = 1_000_000
