Skip to content

CLN: .values -> ._values #34083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 19, 2020
Merged

CLN: .values -> ._values #34083

merged 8 commits into from
May 19, 2020

Conversation

jbrockmendel
Copy link
Member

No description provided.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully convinced we should also use _values for DataFrame, but you might have your reasons, if can explain?

def values(self):
return self.obj.values
def values(self) -> np.ndarray:
return self.obj._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj is a DataFrame here, so is there a good reason to change values to _values ? (in practice, it's actually _values that is implemented in terms of values)

The reason to do this for Series (_values being the "internal" values) doesn't really apply for DataFrame?

@jbrockmendel
Copy link
Member Author

Not fully convinced we should also use _values for DataFrame, but you might have your reasons, if can explain?

The process of going through the code looking for problematic usages of .values is grep-based, so doing this change in places where they are equivalent trims the problem for future passes.

@jorisvandenbossche
Copy link
Member

I am not the one making those changes, so it is easier for me to say, but: I am not convinced this is worth the easier grepping, as this change makes reading the code more confusing IMO. For me, _values signals that it is something like "the underlying values" that are directly accessed (at least that is how it is used for Series, so therefore that is how I start to interpret this).
But for DataFrame that is never the case, and this _values is in many cases not a cheap access at all.

Personally, I would rather remove the _values attribute on DataFrame. I don't think it serves any purpose?

If the use of .values for DataFrame is nonetheless problematic, I think we should rather look for a different attribute or method instead of _values (to_numpy() could be used, I think, if we don't want to invent a new one)

@jreback jreback added the Clean label May 9, 2020
@jreback jreback added this to the 1.1 milestone May 12, 2020
@jreback
Copy link
Contributor

jreback commented May 12, 2020

this looks ok to me.

@jorisvandenbossche
Copy link
Member

@jbrockmendel what do you think about using to_numpy() for DataFrame ?

@jbrockmendel
Copy link
Member Author

what do you think about using to_numpy() for DataFrame ?

In cases where obj.to_numpy() is equivalent to np.asarray(obj), I prefer the latter.

@jorisvandenbossche
Copy link
Member

That's fine for me as well, I don't have a strong preference, it was just as a suggestion to have an alternative for .values (you don't want) and ._values (I don't want).
For DataFrame, df.values, df._values, df.to_numpy() or np.asarray(df) are AFAIK all basically equivalent.

@jbrockmendel
Copy link
Member Author

it was just as a suggestion to have an alternative for .values (you don't want) and ._values (I don't want).

That's a reasonable thought. One more pitch for ._values before punting on this:

I use the heuristic: "when doing this with Series, would i use _values?" If so, using it in the DataFrame context is more internally consistent and potentially amenable to code-sharing. If I see np.asarray i have to assume that there is a particular reason why EA is not allowed.

@jorisvandenbossche
Copy link
Member

is more internally consistent

It's only consistent in "naming", but not in what it does. DataFrame._values is most of the time something quite different as Series._values, and hence using a "consistent" name is not necessarily helpful IMO

potentially amenable to code-sharing

Do you know if for example in this PR there are actually cases of this? (where the object in question could be both Series or DataFrame).

We certainly have (had) things like obj = getattr(obj, "_values", obj) that untransparantly handled different kinds of things (and dataframe and series at the same time), but I think we generally try to move away from that?

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche how strong are your opinions on this? if they're a blocker, ill revert the DataFrame pieces so we can push the rest through

@jorisvandenbossche
Copy link
Member

Or directly use np.asarray(..) instead of reverting those lines? (since you already need to identify those)


I still think _values to access a dataframe's 2D array is not helping code clarity. You mentioned "potential code sharing" as one reason, but so to repeat my question, do you have an example of such a case? (in this PR, or elsewhere?)

@jbrockmendel
Copy link
Member Author

Or directly use np.asarray(..) instead of reverting those lines?

we've already established this as a pattern i dislike. the status quo seems to be the only point of agreement

@jorisvandenbossche
Copy link
Member

It was you who proposed asarray, BTW

@jbrockmendel
Copy link
Member Author

It was you who proposed asarray, BTW

I proposed asarray specifically for those cases where we want an ndarray and only an ndarray

@jorisvandenbossche
Copy link
Member

I proposed asarray specifically for those cases where we want an ndarray and only an ndarray

Which is what DataFrame.values/DataFrame._values does?

@jbrockmendel
Copy link
Member Author

Which is what DataFrame.values/DataFrame._values does?

We're going in circles. I'm going to revert the edits of DataFrame.values->DataFrame._values.

@jorisvandenbossche
Copy link
Member

I think it is useful to try to clear up our misunderstanding, as it will come up again.

I proposed asarray specifically for those cases where we want an ndarray and only an ndarray

So do I read this correct that you are OK with using np.asarray(obj) instead of obj.values/obj._values if we know we want an ndarray and only an ndarray?

But then, for a DataFrame, the only return values can be an ndarray, so for dataframes we know we always want / will get an ndarray? So doesn't fall that under the above description of cases where we can use np.asarray(..) ?

@jbrockmendel
Copy link
Member Author

I think it is useful to try to clear up our misunderstanding, as it will come up again.

I'd prefer to punt on this until then.

But then, for a DataFrame, the only return values can be an ndarray, so for dataframes we know we always want / will get an ndarray

I'm thinking about a future where we may have 2D EAs.

@jbrockmendel
Copy link
Member Author

controversial parts reverted; green

@jreback jreback merged commit c3c4ccf into pandas-dev:master May 19, 2020
@jreback
Copy link
Contributor

jreback commented May 19, 2020

k thanks

@jbrockmendel jbrockmendel deleted the lessvalues branch May 19, 2020 21:32
PuneethaPai pushed a commit to PuneethaPai/pandas that referenced this pull request May 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants