BUG: set keyword argument so zipfile actually compresses #21144


Merged: 13 commits, May 29, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
@@ -83,6 +83,7 @@ Indexing
I/O
^^^

- Bug in :class:`pandas.io.common.BytesZipFile` where zip compression produces uncompressed zip archive (:issue:`17778`)
Contributor

Can you reference 21144 (as well is fine)? That other issue was closed for 0.23.0.

Contributor Author

Hi @jreback @WillAyd, I tested 21144 on an older version of pandas (0.20) on Windows and the same issue occurred, so I think it's a pre-existing issue unrelated to this PR. This PR only addresses zip compression and doesn't touch gzip at all.

- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
-
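The whatsnew entry above can be reproduced with the standard library alone: `zipfile.ZipFile` defaults to `ZIP_STORED`, so unless `compression=zipfile.ZIP_DEFLATED` is passed, the "compressed" archive stores its members verbatim. A minimal stdlib-only sketch (not pandas code) of the before/after behavior:

```python
import io
import zipfile

data = b"0.123456,0.234567,0.567567\n" * 1000

# Default compression is ZIP_STORED: members are archived uncompressed.
stored = io.BytesIO()
with zipfile.ZipFile(stored, "w") as zf:
    zf.writestr("data.csv", data)

# Passing ZIP_DEFLATED (the change in this PR) actually deflates members.
deflated = io.BytesIO()
with zipfile.ZipFile(deflated, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.csv", data)

print(len(stored.getvalue()), len(deflated.getvalue()))
```

The stored archive is slightly larger than the raw data (zip headers plus the uncompressed member), while the deflated one is much smaller.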

8 changes: 4 additions & 4 deletions pandas/io/common.py
@@ -5,7 +5,7 @@
import codecs
import mmap
from contextlib import contextmanager, closing
-from zipfile import ZipFile
+import zipfile

from pandas.compat import StringIO, BytesIO, string_types, text_type
from pandas import compat
@@ -428,7 +428,7 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
return f, handles


-class BytesZipFile(ZipFile, BytesIO):
+class BytesZipFile(zipfile.ZipFile, BytesIO):
"""
Wrapper for standard library class ZipFile and allow the returned file-like
handle to accept byte strings via `write` method.
@@ -437,10 +437,10 @@ class BytesZipFile(ZipFile, BytesIO):
bytes strings into a member of the archive.
"""
# GH 17778
-def __init__(self, file, mode='r', **kwargs):
+def __init__(self, file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs):
if mode in ['wb', 'rb']:
mode = mode.replace('b', '')
-super(BytesZipFile, self).__init__(file, mode, **kwargs)
+super(BytesZipFile, self).__init__(file, mode, compression, **kwargs)
Member

Can we add tests and a whatsnew for this?

Member
@gfyoung May 21, 2018

Also, because you are modifying the default behavior, I'm not sure if we need a deprecation cycle for this (to be safe, I would imagine we should).

Contributor

No, this is a bug.

Member
@gfyoung May 21, 2018

Fair enough, though tests and whatsnew are still needed (just to be clear).

Contributor Author

Thanks. Added whatsnew and tests.


def write(self, data):
super(BytesZipFile, self).writestr(self.filename, data)
16 changes: 16 additions & 0 deletions pandas/tests/frame/test_to_csv.py
@@ -943,6 +943,22 @@ def test_to_csv_compression(self, compression):
with tm.decompress_file(filename, compression) as fh:
assert_frame_equal(df, read_csv(fh, index_col=0))

def test_to_csv_compression_size(self, compression):
Member
Since these are all the same tests I think it makes more sense to put in test_common and parametrize for the different writers, rather than splitting out across the various modules

Contributor Author
@minggli May 27, 2018

Makes sense. Done.


df = pd.concat(100 * [DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
columns=['X', 'Y', 'Z'])])

with ensure_clean() as filename:
import os
df.to_csv(filename, compression=compression)
file_size = os.path.getsize(filename)

if compression:
df.to_csv(filename, compression=None)
uncompressed_file_size = os.path.getsize(filename)
assert uncompressed_file_size > file_size
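Besides comparing on-disk sizes as the tests above do, the regression can be pinned down more directly by inspecting the archive member's `compress_type`. A stdlib-only sketch of that check (the member name `frame.csv` is illustrative, not from the PR):

```python
import io
import zipfile

buf = io.BytesIO()
# Mirrors the fixed behavior: members are written with ZIP_DEFLATED.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("frame.csv",
                b"X,Y,Z\n" + b"0.123456,0.234567,0.567567\n" * 500)

with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    # A stored (uncompressed) member would instead report ZIP_STORED
    # and compress_size == file_size.
    print(info.compress_type == zipfile.ZIP_DEFLATED,
          info.compress_size < info.file_size)
```

This asserts on *how* the member was written rather than on a size delta, so it would catch the bug even for inputs too small to shrink noticeably.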

def test_to_csv_date_format(self):
with ensure_clean('__tmp_to_csv_date_format__') as path:
dt_index = self.tsframe.index
18 changes: 18 additions & 0 deletions pandas/tests/io/json/test_compression.py
@@ -21,6 +21,24 @@ def test_compression_roundtrip(compression):
assert_frame_equal(df, pd.read_json(result))


def test_to_json_compression_size(compression):

df = pd.concat(100 * [pd.DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
columns=['X', 'Y', 'Z'])],
ignore_index=True)

with tm.ensure_clean() as filename:
import os
df.to_json(filename, compression=compression)
file_size = os.path.getsize(filename)

if compression:
df.to_json(filename, compression=None)
uncompressed_file_size = os.path.getsize(filename)
assert uncompressed_file_size > file_size


def test_read_zipped_json():
uncompressed_path = tm.get_data_path("tsframe_v012.json")
uncompressed_df = pd.read_json(uncompressed_path)
15 changes: 15 additions & 0 deletions pandas/tests/io/test_pickle.py
@@ -457,6 +457,21 @@ def test_read_infer(self, ext, get_random_path):

tm.assert_frame_equal(df, df2)

def test_compression_size(self, compression):

df = pd.concat(100 * [pd.DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
columns=['X', 'Y', 'Z'])])

with tm.ensure_clean() as filename:
df.to_pickle(filename, compression=compression)
file_size = os.path.getsize(filename)

if compression:
df.to_pickle(filename, compression=None)
uncompressed_file_size = os.path.getsize(filename)
assert uncompressed_file_size > file_size


# ---------------------
# test pickle compression
14 changes: 14 additions & 0 deletions pandas/tests/series/test_io.py
@@ -161,6 +161,20 @@ def test_to_csv_compression(self, compression):
index_col=0,
squeeze=True))

def test_to_csv_compression_size(self, compression):

s = Series(100 * [0.123456, 0.234567, 0.567567], name='X')

with ensure_clean() as filename:
import os
s.to_csv(filename, compression=compression, header=True)
file_size = os.path.getsize(filename)

if compression:
s.to_csv(filename, compression=None, header=True)
uncompressed_file_size = os.path.getsize(filename)
assert uncompressed_file_size > file_size


class TestSeriesIO(TestData):
