Skip to content

ENH: option to export df to Stata dataset with value labels #41042

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

lmcindewar
Copy link
Contributor

@lmcindewar lmcindewar commented Apr 19, 2021

GH38454

@jbrockmendel jbrockmendel added the IO Stata read_stata, to_stata label Apr 19, 2021
@lmcindewar
Copy link
Contributor Author

I added a StataNonCatValueLabel to format value labels for non categorical variables. In the StataWriter:

  • added value_labels as optional argument. The format is as close to what is returned from StataReader.value_labels(), but the keys should be column names instead of column labels
  • renamed _is_col_cat to _has_value_labels
  • _prepare_non_cat_value_labels formats the value labels and checks that columns exist and are numeric

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from these points, which are mostly minor:

  1. Have you created a dta with the full range of supported column types and a mix of fully and partially labeled columns and verified it is correct in Stata? Round tripping is a good start, but it is also necessary to directly verify that the dta renders correctly in Stata.
  2. What happens when column names are not legal Stata variable names (e.g., they need to be munged by pre/postpending _). I don't see anything that explicitly hadnles this case, and it might be automagicaly handeled. It would be good to add a test where a few columns had ad names.

@@ -2301,6 +2301,7 @@ def to_stata(
time_stamp: datetime.datetime | None = None,
data_label: str | None = None,
variable_labels: dict[Hashable, str] | None = None,
value_labels: dict[str, dict[float | int, str]] | None = None,
Copy link
Contributor

@bashtage bashtage Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are not keyword only, I don't think this can be inserted here since it could produce a crash in otherwise correct code. Would be great to move most of these to be keyword only, but that is a different issue.

Copy link
Contributor

@bashtage bashtage Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this might need to move to the end of the list. Ideally you would make it keyword only, e.g.,

storage_options: StorageOptions = None,
*,
value_labels: dict[Hashable, dict[float | int, str]] | None = None,

Also note that it should be dict[Hashable, ... not str

@lmcindewar
Copy link
Contributor Author

Thanks for the feedback! I made the typing changes and will go ahead and create a sample dta for some additional tests

@bashtage
Copy link
Contributor

You need to import Literal when type checking, and make sure from __future__ import annotations is at the top of the file.

An example:

if TYPE_CHECKING:
from typing import Literal
from pandas._typing import (

@lmcindewar
Copy link
Contributor Author

  1. I created a variety of labelled types and checked that labels are being applied and displayed correctly in Stata:

image

image

  1. The column names were being converted to valid Stata variables before the value labels were created. I added a check to use StataWriter._converted_names before checking the existing columns. I also moved the creation of _converted_names above _prepare_pandas, it was being populated in prepare pandas and reset to an empty dict in init

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. A few final questions. Probably all correct and I'm just not seeing it.

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really minor clarifications. Ping when done and I'll allow CI to run, and then we can see about merging.

@lmcindewar
Copy link
Contributor Author

@bashtage I clarified the comments you mentioned. I left the type checking as is to be consistent with the rest of the project. If that's ok, I think everything is addressed. Thanks again for the feedback

@bashtage
Copy link
Contributor

If the CI is green, then we'll get it merged. Thanks.

@lmcindewar
Copy link
Contributor Author

If the CI is green, then we'll get it merged. Thanks.

@bashtage I can't tell where things are failing. Any suggestions?

):
super().__init__()
self._convert_dates = {} if convert_dates is None else convert_dates
self._write_index = write_index
self._time_stamp = time_stamp
self._data_label = data_label
self._variable_labels = variable_labels
self._non_cat_value_labels = value_labels
self._value_labels: list[StataValueLabel] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you rename this to _cat_value_labels to distingish and make consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_value_labels is a combined list of categorical and non-categorical labels. there isn't a standalone _cat_value_labels because they're added straight to _value_labels after converting categoricals in the dataframe

@@ -2222,17 +2286,52 @@ def _write_bytes(self, value: bytes) -> None:
"""
self.handles.handle.write(value) # type: ignore[arg-type]

def _prepare_non_cat_value_labels(self, data: DataFrame) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is modifying global state and is pretty opaque to the reader. ideally these would return values and state would be changed in the caller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of the state change to the caller.

@bashtage bashtage closed this Jun 7, 2021
@bashtage bashtage reopened this Jun 7, 2021
@bashtage
Copy link
Contributor

bashtage commented Jun 7, 2021

Close and reopen to triger CI.

@bashtage
Copy link
Contributor

bashtage commented Jun 7, 2021

@lmcindewar Can you take a look @jreback 's comments?

@bashtage
Copy link
Contributor

bashtage commented Jun 7, 2021

Failure is a random one due to codecov.

@bashtage
Copy link
Contributor

Typo in docstring

-        """ Encode value labels. """
+        """Encode value labels."""

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add a release note (1.4 I think)

@lmcindewar
Copy link
Contributor Author

@simonjayhawkins I'd completely missed your comment, I added a short release note. I think that should be it unless there's anything left from the previous reviews

@simonjayhawkins simonjayhawkins added this to the 1.4 milestone Jul 19, 2021
@bashtage
Copy link
Contributor

LGTM.

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All issues have been addressed and once green looks good to merge.

@lmcindewar
Copy link
Contributor Author

@bashtage I haven't been able to nail down what's causing these type errors. It passes type checking with an added annotation on line 2542 in io/stata.py:

self.data: DataFrame = data

but I wasn't sure if there was an underlying issue that needs to be fixed

@bashtage
Copy link
Contributor

Try adding self.data = data after the super in __init__

@bashtage
Copy link
Contributor

Failures seem unrelated to anything in this PR.

@bashtage
Copy link
Contributor

Can you rebase on master to fix the merge conflict. and then I think this is ready.

@bashtage
Copy link
Contributor

I rebased it.

@bashtage
Copy link
Contributor

@jreback I think this PR is in good shape and should be merged. 3 unrelated failures in doc tests.

@bashtage
Copy link
Contributor

bashtage commented Sep 1, 2021

@lmcindewar can you rebase so we can look into merging?

@lmcindewar lmcindewar force-pushed the GH38454-export-stata-value-labels branch from b9f0bc7 to 0311142 Compare September 3, 2021 14:17
@lmcindewar
Copy link
Contributor Author

@bashtage I rebased it, let me know if there's anything else needed to merge it

@bashtage bashtage closed this Sep 3, 2021
@bashtage bashtage reopened this Sep 3, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. over to you @bashtage

@lmcindewar would be nice to add an example to the doc-string as well (can be here or in a followup)

@bashtage
Copy link
Contributor

Thanks @lmcindewar . It would be great if you could do an example in a follow-up.

@bashtage bashtage merged commit fd151ba into pandas-dev:master Sep 10, 2021
@lmcindewar
Copy link
Contributor Author

@bashtage Will do. Thanks again for all of the reviews

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: option to export df to Stata dataset with value labels
5 participants