ENH: DataFrame.describe allows UDFs and/or selectable metrics #45737

Closed
@attack68

DataFrame.describe() generates descriptive statistics for columns in a DataFrame. The "descriptive statistics" have been specifically chosen and hard-coded, and are also somewhat dtype dependent.

I find it quite odd that the function offers a lot of customisation over which columns to include or exclude based on dtype, and over which percentiles to sample, but doesn't offer the ability to supply a set of predefined functions that the user might want to see (or to drop ones they don't want to see, such as percentiles).
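For reference, a minimal sketch of the customisation that `describe` already exposes through its `percentiles` and `include` parameters. The sample frame and the variable names are illustrative only:

```python
import pandas as pd

# Illustrative data: two numeric columns and one string column.
df = pd.DataFrame({"A": [1, 3], "B": [2, 4], "C": ["x", "y"]})

# Default call: numeric columns only, with the hard-coded set of statistics.
numeric_summary = df.describe()

# Choose which percentiles are reported.
tails = df.describe(percentiles=[0.1, 0.9])

# Choose columns by dtype.
numbers_only = df.describe(include=["number"])
everything = df.describe(include="all")
```

In every case the set of statistics is still chosen by pandas; the caller only controls which columns and which percentiles appear.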

A rough idea is to propose a new argument, e.g. metrics, which overrides the defaults and defines the metrics as a specific set of functions, each given as a str or callable:

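from pandas import DataFrame, Series

def udf_name(s):
    return s.sum()

df = DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
# `metrics` is the argument proposed in this issue, not an existing parameter.
df.describe(metrics=["sum", Series.mean, lambda s: s.count(), lambda s: s.dtype, udf_name])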
              A      B
sum           4      6
mean          2      3
<lambda>      2      2
<lambda>  int64  int64
udf_name      4      6
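Until something like this exists, similar output can be approximated in user code. The sketch below is only an illustration of the idea, not a proposed implementation: `describe_with` is a hypothetical helper name, and it assumes that string metrics name zero-argument `Series` methods and that every callable takes a `Series`.

```python
import pandas as pd


def describe_with(df, metrics):
    """Hypothetical helper: one row per metric, one column per DataFrame column."""
    labels, rows = [], []
    for metric in metrics:
        if isinstance(metric, str):
            # Assumption: a string names a Series method, e.g. "sum" -> Series.sum.
            label = metric
            func = lambda s, name=metric: getattr(s, name)()
        else:
            # Callables are labelled by their __name__ ("<lambda>" for lambdas).
            label = getattr(metric, "__name__", repr(metric))
            func = metric
        labels.append(label)
        rows.append({col: func(df[col]) for col in df.columns})
    # Duplicate row labels (e.g. two lambdas) are kept, as in the output above.
    return pd.DataFrame(rows, index=labels)


def udf_name(s):
    return s.sum()


df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
result = describe_with(
    df, ["sum", pd.Series.mean, lambda s: s.count(), lambda s: s.dtype, udf_name]
)
```

For plain reducers, `df.agg(["sum", "mean"])` already produces a table of this shape; the helper only adds uniform handling of arbitrary callables and row labels.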

Note this came up in the context of trying to add sub-total or additional rows to a Styler, based on the underlying data (#43894), and following issues:
