Although I have already written code for this (see below), I'm posting it as an issue rather than a pull request so that the design can be discussed first, because several points in my code are still open:
- Series of dtype object need an explicit cast, otherwise rpy2's numpy conversion handles them incorrectly
- The performance has not been profiled
- There is probably some room for optimization
- The function still needs a proper name
- The generation of an intermediate OrdDict object may cause problems in case of very large datasets
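To make the first point more concrete, here is a minimal sketch of how object columns could be inspected before casting them. It uses only pandas, and the helper name `object_columns_all_strings` is made up for illustration; it is not part of pandas or rpy2. The idea is that an object column is only safe to pass to a string vector without changing its values if every value is already a string.

```python
import pandas as pd


def object_columns_all_strings(frame):
    """Map each object-dtype column name to True if every value is a str."""
    result = {}
    for name in frame.columns:
        series = frame[name]
        if series.dtype == object:
            result[name] = all(isinstance(v, str) for v in series)
    return result


# Explicit dtype=object keeps the example independent of pandas' default
# string dtype, which differs between versions.
frame = pd.DataFrame({
    "label": pd.Series(["a", "b"], dtype=object),
    "mixed": pd.Series(["x", 3], dtype=object),
    "score": pd.Series([1.5, 2.5]),
})
report = object_columns_all_strings(frame)
```

Columns reported as `False` (here `mixed`) are the ones where a blanket cast to strings would change the values, which is the case the FIXME in the code below is about.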
The code in its current form is posted below. If there is interest, I will work towards integrating it into pandas.rpy.common and adding unit tests.
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri as numpy2ri
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc


def dataset_to_data_frame(dataset, strings_as_factors=True):
    # Activate conversion for numpy objects
    robjects.conversion.py2ri = numpy2ri.numpy2ri
    robjects.numpy2ri.activate()

    base = importr("base")
    columns = rlc.OrdDict()

    for column in dataset:
        value = dataset[column]

        # object type requires explicit cast
        if value.dtype == np.object_:
            value = robjects.StrVector(value)
            # FIXME: how to generalize it?

            # base.I() marks the vector "as is", so R does not turn it into a factor
            if not strings_as_factors:
                value = base.I(value)

        columns[column] = value

    dataframe = robjects.DataFrame(columns)
    dataframe.rownames = robjects.StrVector(dataset.index)

    # To prevent side-effects in other code
    robjects.conversion.py2ri = robjects.default_py2ri

    return dataframe
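
One design point worth discussing: the function above resets the converter only on the normal return path, so an exception inside the loop would leave the numpy conversion active for other code. A possible way around this, sketched here with a plain stand-in object rather than rpy2 itself, is a small context manager that restores the previous value in a `finally` block. The names `temporary_attribute` and `conversion` are hypothetical and only for illustration.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attribute(obj, name, value):
    """Set obj.<name> to value for the duration of the block, then restore it."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs even if the body raises, so the caller's state is restored.
        setattr(obj, name, previous)


# Hypothetical stand-in for a module-level conversion hook such as
# robjects.conversion.py2ri; no rpy2 import is needed to run this.
conversion = SimpleNamespace(py2ri="default")

with temporary_attribute(conversion, "py2ri", "numpy"):
    inside = conversion.py2ri
after = conversion.py2ri

try:
    with temporary_attribute(conversion, "py2ri", "numpy"):
        raise ValueError("conversion failed")
except ValueError:
    pass
after_error = conversion.py2ri
```

Whether this fits depends on how rpy2 expects the converter to be switched, so treat it as a pattern to discuss rather than a drop-in change.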