Description
One of the uncontroversial points from #2 is that DataFrames have column labels / names. I'd like to discuss two specific points on this before merging the results into that issue.
- What type can the column labels be? Should they be limited to just strings?
- Do we require uniqueness of column labels?
I'm a bit unsure whether these are getting too far into the implementation side of things. Should we just take no stance on either of these?
My responses:
- We should probably allow labels to be any type.
Operations like `crosstab` / `pivot` place a column from the input dataframe into the column labels of the output.
We'll need to be careful with how this interacts with the indexing API, since a label like the tuple `('my', 'label')` might introduce ambiguities (e.g. the full list of labels is `['my', 'label', ('my', 'label')]`).
Is it reasonable to require each label to be hashable? Pandas requires this, to facilitate lookup in a hashtable.
- We cannot require uniqueness.
Dataframes are commonly used to wrangle real-world data into shape, and real-world data is messy. If an implementation wants to ensure uniqueness (perhaps on a per-object basis) then it can offer that separately. But the API should at least allow for duplicate labels.
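
To make the two points above concrete, here is a minimal sketch in plain Python (no dataframe library involved). The `build_index` helper and the `labels` list are hypothetical, made up for illustration; they are not part of any proposed API. The sketch assumes labels are hashable, allows duplicates by mapping each label to every position where it occurs, and shows how a tuple key can be read in two different ways.

```python
from collections import defaultdict

# Hypothetical column labels: a tuple label plus a duplicated "my".
labels = ["my", "label", ("my", "label"), "my"]


def build_index(labels):
    """Map each label to every position where it occurs (duplicates allowed)."""
    index = defaultdict(list)
    for position, label in enumerate(labels):
        try:
            hash(label)  # labels must be hashable to be used as dict keys
        except TypeError:
            raise TypeError(f"unhashable column label: {label!r}") from None
        index[label].append(position)
    return dict(index)


index = build_index(labels)

# Reading the key as ONE label selects the tuple-labelled column.
as_single_label = index[("my", "label")]

# Reading the same key as a SEQUENCE of labels selects different columns.
as_label_sequence = [pos for part in ("my", "label") for pos in index[part]]

print(as_single_label)    # [2]
print(as_label_sequence)  # [0, 3, 1]
```

The same key gives two different answers depending on the interpretation, which is the ambiguity the indexing API would need to resolve. A list such as `["my", "label"]` could not be a label at all under the hashability assumption, since `build_index` would raise `TypeError` for it.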