-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: option to export df to Stata dataset with value labels #41042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: option to export df to Stata dataset with value labels #41042
Conversation
I added a
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from these points, which are mostly minor:
- Have you created a dta with the full range of supported column types and a mix of fully and partially labeled columns and verified it is correct in Stata? Round tripping is a good start, but it is also necessary to directly verify that the dta renders correctly in Stata.
- What happens when column names are not legal Stata variable names (e.g., they need to be munged by pre/postpending _). I don't see anything that explicitly hadnles this case, and it might be automagicaly handeled. It would be good to add a test where a few columns had ad names.
pandas/core/frame.py
Outdated
@@ -2301,6 +2301,7 @@ def to_stata( | |||
time_stamp: datetime.datetime | None = None, | |||
data_label: str | None = None, | |||
variable_labels: dict[Hashable, str] | None = None, | |||
value_labels: dict[str, dict[float | int, str]] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are not keyword only, I don't think this can be inserted here since it could produce a crash in otherwise correct code. Would be great to move most of these to be keyword only, but that is a different issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this might need to move to the end of the list. Ideally you would make it keyword only, e.g.,
storage_options: StorageOptions = None,
*,
value_labels: dict[Hashable, dict[float | int, str]] | None = None,
Also note that it should be dict[Hashable, ...
not str
Thanks for the feedback! I made the typing changes and will go ahead and create a sample dta for some additional tests |
You need to import Literal when type checking, and make sure An example: Lines 141 to 144 in accf345
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good. A few final questions. Probably all correct and I'm just not seeing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really minor clarifications. Ping when done and I'll allow CI to run, and then we can see about merging.
@bashtage I clarified the comments you mentioned. I left the type checking as is to be consistent with the rest of the project. If that's ok, I think everything is addressed. Thanks again for the feedback |
If the CI is green, then we'll get it merged. Thanks. |
@bashtage I can't tell where things are failing. Any suggestions? |
): | ||
super().__init__() | ||
self._convert_dates = {} if convert_dates is None else convert_dates | ||
self._write_index = write_index | ||
self._time_stamp = time_stamp | ||
self._data_label = data_label | ||
self._variable_labels = variable_labels | ||
self._non_cat_value_labels = value_labels | ||
self._value_labels: list[StataValueLabel] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you rename this to _cat_value_labels to distingish and make consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_value_labels is a combined list of categorical and non-categorical labels. there isn't a standalone _cat_value_labels because they're added straight to _value_labels after converting categoricals in the dataframe
pandas/io/stata.py
Outdated
@@ -2222,17 +2286,52 @@ def _write_bytes(self, value: bytes) -> None: | |||
""" | |||
self.handles.handle.write(value) # type: ignore[arg-type] | |||
|
|||
def _prepare_non_cat_value_labels(self, data: DataFrame) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is modifying global state and is pretty opaque to the reader. ideally these would return values and state would be changed in the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of the state change to the caller.
Close and reopen to triger CI. |
@lmcindewar Can you take a look @jreback 's comments? |
Failure is a random one due to codecov. |
Typo in docstring
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add a release note (1.4 I think)
@simonjayhawkins I'd completely missed your comment, I added a short release note. I think that should be it unless there's anything left from the previous reviews |
LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All issues have been addressed and once green looks good to merge.
@bashtage I haven't been able to nail down what's causing these type errors. It passes type checking with an added annotation on line 2542 in io/stata.py: self.data: DataFrame = data but I wasn't sure if there was an underlying issue that needs to be fixed |
Try adding |
Failures seem unrelated to anything in this PR. |
Can you rebase on master to fix the merge conflict. and then I think this is ready. |
I rebased it. |
@jreback I think this PR is in good shape and should be merged. 3 unrelated failures in doc tests. |
@lmcindewar can you rebase so we can look into merging? |
b9f0bc7
to
0311142
Compare
@bashtage I rebased it, let me know if there's anything else needed to merge it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. over to you @bashtage
@lmcindewar would be nice to add an example to the doc-string as well (can be here or in a followup)
Thanks @lmcindewar . It would be great if you could do an example in a follow-up. |
@bashtage Will do. Thanks again for all of the reviews |
GH38454