ENH: option to export df to Stata dataset with value labels #41042

lmcindewar · 2021-04-19T15:17:22Z

GH38454

closes ENH: option to export df to Stata dataset with value labels #38454
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

lmcindewar · 2021-04-20T15:54:05Z

I added a StataNonCatValueLabel to format value labels for non categorical variables. In the StataWriter:

added value_labels as an optional argument. The format is as close as possible to what StataReader.value_labels() returns, but the keys should be column names instead of column labels
renamed _is_col_cat to _has_value_labels
_prepare_non_cat_value_labels formats the value labels and checks that columns exist and are numeric
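
A minimal sketch of how the new argument might be used, assuming a pandas release that includes this change (1.4 or later). The column names and labels are made up for illustration; the labels are read back with StataReader.value_labels() to show the matching format.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"rating": [1, 2, 3], "score": [0.5, 1.5, 2.5]})

# Keys are column names; each inner dict maps a numeric value to its label.
value_labels = {"rating": {1: "low", 2: "medium", 3: "high"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labelled.dta")
    df.to_stata(path, write_index=False, value_labels=value_labels)

    # Read the stored label sets back through the reader.
    with pd.read_stata(path, iterator=True) as reader:
        stored_labels = reader.value_labels()

# Normalise keys to plain ints, since the reader may return NumPy integers.
rating_labels = {int(k): v for k, v in stored_labels["rating"].items()}
```

Only the "rating" column is labelled here; "score" is written as a plain numeric variable.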

bashtage

Aside from these points, which are mostly minor:

Have you created a dta with the full range of supported column types and a mix of fully and partially labeled columns and verified it is correct in Stata? Round tripping is a good start, but it is also necessary to directly verify that the dta renders correctly in Stata.
What happens when column names are not legal Stata variable names (e.g., they need to be munged by pre/postpending _)? I don't see anything that explicitly handles this case, though it might be handled automagically. It would be good to add a test where a few columns have bad names.
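
To make the concern concrete, here is a rough sketch of the kind of name munging involved and of re-keying value labels by the munged names. Both helpers (sanitize_stata_name, rekey_value_labels) are hypothetical and are not the pandas implementation; reserved words, for example, are not handled.

```python
import re

MAX_NAME_LENGTH = 32  # Stata limits variable names to 32 characters


def sanitize_stata_name(name: object) -> str:
    """Hypothetical helper: coerce a column name into a legal Stata name."""
    # Only letters, digits and underscores are allowed.
    text = re.sub(r"[^A-Za-z0-9_]", "_", str(name))
    # Names must start with a letter or an underscore.
    if not text or text[0].isdigit():
        text = "_" + text
    return text[:MAX_NAME_LENGTH]


def rekey_value_labels(value_labels: dict, columns: list) -> dict:
    """Re-key labels by sanitized column name, rejecting unknown columns."""
    converted = {col: sanitize_stata_name(col) for col in columns}
    rekeyed = {}
    for col, labels in value_labels.items():
        if col not in converted:
            raise KeyError(f"Can't create value labels for {col!r}: not a column")
        rekeyed[converted[col]] = labels
    return rekeyed


columns = ["invalid~!", "6_invalid", (1, 2)]
labels = {"invalid~!": {1: "yes"}, (1, 2): {3: "three"}}
rekeyed = rekey_value_labels(labels, columns)
```

The point of the sketch is that labels supplied under the original column names need to follow the columns to their converted names, otherwise the lookup against the written variables fails.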

bashtage · 2021-04-23T11:06:50Z

pandas/core/frame.py

@@ -2301,6 +2301,7 @@ def to_stata(
        time_stamp: datetime.datetime | None = None,
        data_label: str | None = None,
        variable_labels: dict[Hashable, str] | None = None,
+        value_labels: dict[str, dict[float | int, str]] | None = None,


Since these are not keyword only, I don't think this can be inserted here since it could produce a crash in otherwise correct code. Would be great to move most of these to be keyword only, but that is a different issue.

I still think this might need to move to the end of the list. Ideally you would make it keyword only, e.g.,

storage_options: StorageOptions = None, *, value_labels: dict[Hashable, dict[float | int, str]] | None = None,

Also note that it should be dict[Hashable, ... not str
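
The positional-argument concern can be sketched with simplified stand-ins for the signature. The three functions below are hypothetical and much shorter than the real to_stata; they only show how inserting a parameter mid-list shifts later positional arguments, and how a keyword-only parameter avoids that.

```python
def write_old(path, data_label=None, variable_labels=None, version=114):
    """Signature before the change (simplified)."""
    return {"variable_labels": variable_labels, "version": version}


def write_inserted(path, data_label=None, variable_labels=None,
                   value_labels=None, version=114):
    """New parameter inserted mid-list: later positional arguments shift."""
    return {"variable_labels": variable_labels,
            "value_labels": value_labels,
            "version": version}


def write_keyword_only(path, data_label=None, variable_labels=None,
                       version=114, *, value_labels=None):
    """New parameter appended after `*`: existing positional calls still work."""
    return {"variable_labels": variable_labels,
            "value_labels": value_labels,
            "version": version}


# An existing caller that passes everything positionally.
args = ("out.dta", "my data", {"x": "X label"}, 118)

old = write_old(*args)
shifted = write_inserted(*args)    # 118 silently lands in value_labels
safe = write_keyword_only(*args)   # 118 is still the version
```

With the keyword-only form, passing value_labels positionally raises a TypeError instead of being silently misassigned.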

lmcindewar · 2021-04-26T14:40:00Z

Thanks for the feedback! I made the typing changes and will go ahead and create a sample dta for some additional tests

bashtage · 2021-04-28T12:29:45Z

You need to import Literal when type checking, and make sure from __future__ import annotations is at the top of the file.

An example:

pandas/pandas/core/series.py

Lines 141 to 144 in accf345

    
    if TYPE_CHECKING:
        from typing import Literal

        from pandas._typing import (
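
A self-contained sketch of the same pattern, for reference. The function normalize_byteorder is a hypothetical example and not part of pandas; the point is only that Literal is imported under TYPE_CHECKING and that the future import keeps the annotation from being evaluated at runtime.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, never at runtime.
    from typing import Literal


def normalize_byteorder(byteorder: Literal["<", ">", "little", "big"]) -> str:
    """Hypothetical helper: map a byte-order name to a struct-style prefix."""
    return {"little": "<", "big": ">"}.get(byteorder, byteorder)


prefix = normalize_byteorder("big")
```

Without the future import, the annotation would be evaluated when the function is defined and the missing runtime name Literal would raise a NameError.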

lmcindewar · 2021-05-01T13:18:05Z

I created a variety of labelled types and checked that the labels are applied and displayed correctly in Stata.

The column names were being converted to valid Stata variable names before the value labels were created. I added a check that uses StataWriter._converted_names before checking the existing columns. I also moved the creation of _converted_names above _prepare_pandas: it was being populated in _prepare_pandas and then reset to an empty dict in __init__.
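
The ordering problem described above can be illustrated with a toy pair of classes. BuggyWriter and FixedWriter are hypothetical and only mimic the shape of the issue, not the actual StataWriter code.

```python
class BuggyWriter:
    def __init__(self, columns):
        self._prepare(columns)          # fills self._converted_names ...
        self._converted_names = {}      # ... which is then wiped out here

    def _prepare(self, columns):
        # Record every column whose name had to be changed.
        self._converted_names = {
            col: col.replace(" ", "_") for col in columns if " " in col
        }


class FixedWriter(BuggyWriter):
    def __init__(self, columns):
        self._converted_names = {}      # initialise first
        self._prepare(columns)          # then let the helper populate it


buggy = BuggyWriter(["a b", "c"])
fixed = FixedWriter(["a b", "c"])
```

In the buggy ordering the mapping is always empty by the time anything looks at it, so value labels keyed by the original names cannot be matched to the converted columns.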

bashtage

It looks good. A few final questions. Probably all correct and I'm just not seeing it.

bashtage

Really minor clarifications. Ping when done and I'll allow CI to run, and then we can see about merging.

lmcindewar · 2021-05-27T13:30:13Z

@bashtage I clarified the comments you mentioned. I left the type checking as is to be consistent with the rest of the project. If that's ok, I think everything is addressed. Thanks again for the feedback

bashtage · 2021-05-27T13:40:36Z

If the CI is green, then we'll get it merged. Thanks.

lmcindewar · 2021-06-04T11:46:05Z

> If the CI is green, then we'll get it merged. Thanks.

@bashtage I can't tell where things are failing. Any suggestions?

jreback · 2021-06-04T13:28:48Z

pandas/io/stata.py

    ):
        super().__init__()
        self._convert_dates = {} if convert_dates is None else convert_dates
        self._write_index = write_index
        self._time_stamp = time_stamp
        self._data_label = data_label
        self._variable_labels = variable_labels
+        self._non_cat_value_labels = value_labels
+        self._value_labels: list[StataValueLabel] = []


why don't you rename this to _cat_value_labels to distinguish and make consistent

lmcindewar · 2021-06-09

_value_labels is a combined list of categorical and non-categorical labels. There isn't a standalone _cat_value_labels because the categorical labels are added straight to _value_labels after the categoricals in the dataframe are converted.

jreback · 2021-06-04T13:30:03Z

pandas/io/stata.py

@@ -2222,17 +2286,52 @@ def _write_bytes(self, value: bytes) -> None:
        """
        self.handles.handle.write(value)  # type: ignore[arg-type]

+    def _prepare_non_cat_value_labels(self, data: DataFrame) -> None:


this is modifying global state and is pretty opaque to the reader. ideally these would return values and state would be changed in the caller.

lmcindewar · 2021-06-16

Moved all of the state change to the caller.
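
A rough sketch of the shape being asked for, using a hypothetical LabelWriter class rather than the real StataWriter: the helper validates and returns the labels, and the only mutation of _value_labels happens in the caller.

```python
class LabelWriter:
    def __init__(self, columns, value_labels=None):
        self._columns = list(columns)
        self._non_cat_value_labels = value_labels or {}
        # Combined list of categorical and non-categorical labels.
        self._value_labels = []
        # State is changed here, in the caller, where the reader can see it.
        self._value_labels.extend(self._prepare_non_cat_value_labels())

    def _prepare_non_cat_value_labels(self):
        """Validate and return labels without touching self._value_labels."""
        prepared = []
        for col, labels in self._non_cat_value_labels.items():
            if col not in self._columns:
                raise KeyError(f"{col!r} is not a column")
            prepared.append((col, dict(labels)))
        return prepared


writer = LabelWriter(["a", "b"], {"a": {1: "one"}})
```

Because the helper returns its result, it can also be tested on its own without inspecting the writer's internal list.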

bashtage · 2021-06-07T11:10:29Z

Close and reopen to trigger CI.

bashtage · 2021-06-07T11:10:49Z

@lmcindewar Can you take a look at @jreback's comments?

bashtage · 2021-06-07T12:17:54Z

Failure is a random one due to codecov.

bashtage · 2021-06-16T14:05:58Z

Typo in docstring

-        """ Encode value labels. """
+        """Encode value labels."""

simonjayhawkins

can you also add a release note (1.4 I think)

lmcindewar · 2021-07-19T13:36:16Z

@simonjayhawkins I'd completely missed your comment. I've added a short release note. I think that should be it unless there's anything left from the previous reviews

doc/source/whatsnew/v1.4.0.rst

bashtage · 2021-07-19T13:45:56Z

LGTM.

bashtage

All issues have been addressed and once green looks good to merge.

lmcindewar · 2021-07-27T17:43:25Z

@bashtage I haven't been able to nail down what's causing these type errors. It passes type checking with an added annotation on line 2542 in io/stata.py:

self.data: DataFrame = data

but I wasn't sure if there was an underlying issue that needs to be fixed

bashtage · 2021-07-27T17:52:29Z

Try adding self.data = data after the super in __init__

bashtage · 2021-07-29T13:13:17Z

Failures seem unrelated to anything in this PR.

bashtage · 2021-07-29T13:14:10Z

Can you rebase on master to fix the merge conflict? Then I think this is ready.

bashtage · 2021-07-29T16:36:18Z

I rebased it.

bashtage · 2021-07-30T07:35:16Z

@jreback I think this PR is in good shape and should be merged. 3 unrelated failures in doc tests.

bashtage · 2021-09-01T14:12:14Z

@lmcindewar can you rebase so we can look into merging?

lmcindewar · 2021-09-03T14:21:29Z

@bashtage I rebased it, let me know if there's anything else needed to merge it

jreback

lgtm. over to you @bashtage

@lmcindewar would be nice to add an example to the doc-string as well (can be here or in a followup)

bashtage · 2021-09-10T08:48:15Z

Thanks @lmcindewar . It would be great if you could do an example in a follow-up.

lmcindewar · 2021-09-13T14:00:22Z

@bashtage Will do. Thanks again for all of the reviews

jbrockmendel added the IO Stata read_stata, to_stata label Apr 19, 2021

bashtage requested changes Apr 23, 2021

bashtage requested changes May 4, 2021

lmcindewar commented May 6, 2021

bashtage requested changes May 18, 2021

simonjayhawkins added the Enhancement label May 25, 2021

jreback requested changes Jun 4, 2021

bashtage closed this Jun 7, 2021

bashtage reopened this Jun 7, 2021

simonjayhawkins requested changes Jun 16, 2021

simonjayhawkins reviewed Jul 19, 2021

simonjayhawkins added this to the 1.4 milestone Jul 19, 2021

bashtage approved these changes Jul 19, 2021

lmcindewar added 17 commits September 3, 2021 10:11

ENH: option to export df to Stata dataset with value labels

4db9ccb

GH38454

Removing unnecessary list comprehension, flake8

5a3d6d9

Adding value_labels argument to DataFrame to_stata method

bbb43f8

Updating types and changing ValueError to KeyError for missing column

533a3d5

Using converted names for invalid Stata variable names

ed73d69

Moving value_labels to key word only for to_stata

ae4dca7

Adding tests for invalid Stata names and repeated value labels

8e57e46

Fixing Literal import

70dc88b

Moving label encoding to method

2796d1f

Updates from review: typing, documentation

277896a

Fixing mypy errors

c31034d

Clarifying comment on label length

85374fd

Removing duplication in value label class and returning labels from p…

4ac27db

…repare_non_cat

Adding versionaddeds

d2a5584

Typo in spacing of docstring

d30d1fe

Adding release note

05d4d74

WSetting data in Statawriter init

0311142

lmcindewar force-pushed the GH38454-export-stata-value-labels branch from b9f0bc7 to 0311142 Compare September 3, 2021 14:17

bashtage closed this Sep 3, 2021

bashtage reopened this Sep 3, 2021

Merge branch 'master' into GH38454-export-stata-value-labels

46d3783

jreback approved these changes Sep 4, 2021

bashtage merged commit fd151ba into pandas-dev:master Sep 10, 2021
