Release 0.4.0 #29

Merged
merged 23 commits into from
Dec 25, 2023
9f76885
Bump version
rhpvorderman Dec 22, 2023
18e3c7c
Use pythondevmode and libasan to detect resource errors
rhpvorderman Dec 22, 2023
4da3bc5
Fix resource warning when using the CLI
rhpvorderman Dec 22, 2023
1c0c601
Update changelog with buffer errors patch
rhpvorderman Dec 22, 2023
4853c0b
Merge pull request #25 from pycompression/asan
rhpvorderman Dec 22, 2023
30f3921
Add a GzipReader class
rhpvorderman Dec 22, 2023
5c137bb
Use _GzipReader and add tests
rhpvorderman Dec 22, 2023
3c47cf8
Remove implementation specific tests
rhpvorderman Dec 22, 2023
a3a30d7
Fix linting issue
rhpvorderman Dec 22, 2023
dde6454
Update changelog with C rewrite
rhpvorderman Dec 22, 2023
9abe43e
Fix error with zlib.compress not accepting wbits values
rhpvorderman Dec 22, 2023
61f7459
Remove redundant code
rhpvorderman Dec 22, 2023
8b1128a
Merge pull request #26 from pycompression/Cgzipreader
rhpvorderman Dec 22, 2023
e31d5d1
Add a CRC32 combine function
rhpvorderman Dec 22, 2023
cbea11b
Add parallelcompress type
rhpvorderman Dec 22, 2023
3154a58
Add code and tests for multithreading from python-isal
rhpvorderman Dec 22, 2023
6748f3e
Fix segmentation fault when deinitializing zstream
rhpvorderman Dec 22, 2023
508c5b7
Fix linting issues
rhpvorderman Dec 22, 2023
5db8590
Add gzip_ng_threaded to documentation
rhpvorderman Dec 23, 2023
a9e9aab
Set xfl byte according to zlib_ng compression levels
rhpvorderman Dec 23, 2023
094a9c0
Prevent errors in __exit__ method due to improper __init__
rhpvorderman Dec 24, 2023
c5502d6
Merge pull request #27 from pycompression/threading
rhpvorderman Dec 24, 2023
d581195
Set stable version
rhpvorderman Dec 25, 2023
12 changes: 12 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,18 @@ Changelog
.. This document is user facing. Please word the changes in such a way
.. that users understand how the changes affect the new version.

version 0.4.0
-----------------
+ Add a ``gzip_ng_threaded`` module that contains the ``gzip_ng_threaded.open``
function. This allows using multithreaded compression as well as escaping the
GIL.
+ The internal ``gzip_ng._GzipReader`` has been rewritten in C. As a result, the
overhead of decompressing files has been significantly reduced.
+ The C implementation of ``gzip_ng._GzipReader`` is now used in ``gzip_ng.decompress``. The
``_GzipReader`` can also read from objects that support the buffer protocol.
This has reduced overhead significantly.
+ Fix some unclosed buffer errors in the gzip_ng CLI.

version 0.3.0
-----------------
+ Source distributions on Linux now default to building with configure and
8 changes: 7 additions & 1 deletion README.rst
@@ -42,7 +42,7 @@ by providing Python bindings for the zlib-ng library.
This package provides Python bindings for the `zlib-ng
<https://github.com/zlib-ng/zlib-ng>`_ library.

``python-zlib-ng`` provides the bindings by offering two modules:
``python-zlib-ng`` provides the bindings by offering three modules:

+ ``zlib_ng``: A drop-in replacement for the zlib module that uses zlib-ng to
accelerate its performance.
@@ -51,6 +51,11 @@ This package provides Python bindings for the `zlib-ng
instead of ``zlib`` to perform its compression and checksum tasks, which
improves performance.

+ ``gzip_ng_threaded`` offers an ``open`` function which returns buffered read
or write streams that can be used to read and write large files while
escaping the GIL using one or multiple threads. This functionality only
works for streaming; seeking is not supported.

``zlib_ng`` and ``gzip_ng`` are almost fully compatible with ``zlib`` and
``gzip`` from the Python standard library. There are some minor differences;
see differences-with-zlib-and-gzip-modules_.
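
The ``gzip_ng_threaded.open`` function described in this hunk can be exercised with a simple streaming round trip. This is a minimal sketch, not the project's own example: the helper name `open_stream` is hypothetical, and the fallback to the stdlib `gzip` module is an assumption added only so the snippet runs where zlib-ng is not installed. Only the default arguments are used; options for choosing the thread count are not shown.

```python
import gzip
import os
import tempfile

try:
    # Module added in python-zlib-ng 0.4.0.
    from zlib_ng import gzip_ng_threaded
except ImportError:
    # Assumption: fall back to the stdlib so the sketch stays runnable.
    gzip_ng_threaded = None


def open_stream(path, mode):
    """Hypothetical helper: prefer the threaded zlib-ng stream if available."""
    if gzip_ng_threaded is not None:
        return gzip_ng_threaded.open(path, mode)
    return gzip.open(path, mode)


payload = b"streaming only, no seeking\n" * 1000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.gz")
    # Write the data front to back; the stream is never seeked.
    with open_stream(path, "wb") as out:
        out.write(payload)
    # Read it back the same way, start to end.
    with open_stream(path, "rb") as src:
        roundtrip = src.read()
```

Both streams are used strictly sequentially, which matches the README's note that seeking is not supported.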
@@ -68,6 +73,7 @@ The python-zlib-ng modules can be imported as follows

from zlib_ng import zlib_ng
from zlib_ng import gzip_ng
from zlib_ng import gzip_ng_threaded

``zlib_ng`` and ``gzip_ng`` are meant to be used as drop-in replacements, so
their API and functions are the same as those of the stdlib modules.
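
As a rough illustration of the drop-in claim, the sketch below compresses with ``zlib_ng`` and decompresses with the stdlib ``zlib``. The alias `z` and the stdlib fallback are assumptions made so the snippet also runs without zlib-ng installed; the compressed bytes may differ between the two libraries, but the format is the same.

```python
import zlib

try:
    from zlib_ng import zlib_ng as z
except ImportError:
    # Assumption: fall back to the stdlib so the sketch stays runnable.
    z = zlib

data = b"python-zlib-ng " * 500

# Same call signatures as the stdlib module.
compressed = z.compress(data)

# Output is a standard zlib stream, so the stdlib can decode it.
restored = zlib.decompress(compressed)

# Checksums are computed with the same algorithm and must agree.
crc_ng = z.crc32(data)
crc_std = zlib.crc32(data)
```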
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -113,6 +113,13 @@ API-documentation: zlib_ng.gzip_ng
:members:
:special-members: __init__

===========================================
API-documentation: zlib_ng.gzip_ng_threaded
===========================================

.. automodule:: zlib_ng.gzip_ng_threaded
:members: open

===============================
python -m zlib_ng.gzip_ng usage
===============================
2 changes: 1 addition & 1 deletion setup.py
@@ -123,7 +123,7 @@ def build_zlib_ng():

setup(
name="zlib-ng",
version="0.3.0",
version="0.4.0",
description="Drop-in replacement for zlib and gzip modules using zlib-ng",
author="Leiden University Medical Center",
author_email="r.h.p.vorderman@lumc.nl", # A placeholder for now
2 changes: 1 addition & 1 deletion src/zlib_ng/__init__.py
@@ -5,4 +5,4 @@
# This file is part of python-zlib-ng which is distributed under the
# PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2.

__version__ = "0.3.0"
__version__ = "0.4.0"
159 changes: 13 additions & 146 deletions src/zlib_ng/gzip_ng.py
@@ -25,9 +25,9 @@
import struct
import sys
import time
import _compression # noqa: I201 # Not third-party

from . import zlib_ng
from .zlib_ng import _GzipReader

__all__ = ["GzipFile", "open", "compress", "decompress", "BadGzipFile",
"READ_BUFFER_SIZE"]
@@ -36,19 +36,14 @@
_COMPRESS_LEVEL_TRADEOFF = zlib_ng.Z_DEFAULT_COMPRESSION
_COMPRESS_LEVEL_BEST = zlib_ng.Z_BEST_COMPRESSION

#: The amount of data that is read in at once when decompressing a file.
#: Increasing this value may increase performance.
#: 128K is also the size used by pigz and cat to read files from the
# filesystem.
READ_BUFFER_SIZE = 128 * 1024
# The amount of data that is read in at once when decompressing a file.
# Increasing this value may increase performance.
READ_BUFFER_SIZE = 512 * 1024

FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16
READ, WRITE = 1, 2

try:
BadGzipFile = gzip.BadGzipFile # type: ignore
except AttributeError: # Versions lower than 3.8 do not have BadGzipFile
BadGzipFile = OSError # type: ignore
BadGzipFile = gzip.BadGzipFile # type: ignore


# The open method was copied from the CPython source with minor adjustments.
@@ -149,7 +144,7 @@ def __init__(self, filename=None, mode=None,
zlib_ng.DEF_MEM_LEVEL,
0)
if self.mode == READ:
raw = _GzipNGReader(self.fileobj)
raw = _GzipReader(self.fileobj, READ_BUFFER_SIZE)
self._buffer = io.BufferedReader(raw)

def __repr__(self):
@@ -180,124 +175,9 @@ def write(self, data):
return length


class _GzipNGReader(gzip._GzipReader):
def __init__(self, fp):
# Call the init method of gzip._GzipReader's parent here.
# It is not very invasive and allows us to override _PaddedFile
_compression.DecompressReader.__init__(
self, gzip._PaddedFile(fp), zlib_ng._ZlibDecompressor,
wbits=-zlib_ng.MAX_WBITS)
# Set flag indicating start of a new member
self._new_member = True
self._last_mtime = None

def read(self, size=-1):
if size < 0:
return self.readall()
# size=0 is special because decompress(max_length=0) is not supported
if not size:
return b""

# For certain input data, a single
# call to decompress() may not return
# any data. In this case, retry until we get some data or reach EOF.
while True:
if self._decompressor.eof:
# Ending case: we've come to the end of a member in the file,
# so finish up this member, and read a new gzip header.
# Check the CRC and file size, and set the flag so we read
# a new member
self._read_eof()
self._new_member = True
self._decompressor = self._decomp_factory(
**self._decomp_args)

if self._new_member:
# If the _new_member flag is set, we have to
# jump to the next member, if there is one.
self._init_read()
if not self._read_gzip_header():
self._size = self._pos
return b""
self._new_member = False

# Read a chunk of data from the file
if self._decompressor.needs_input:
buf = self._fp.read(READ_BUFFER_SIZE)
uncompress = self._decompressor.decompress(buf, size)
else:
uncompress = self._decompressor.decompress(b"", size)
if self._decompressor.unused_data != b"":
# Prepend the already read bytes to the fileobj so they can
# be seen by _read_eof() and _read_gzip_header()
self._fp.prepend(self._decompressor.unused_data)

if uncompress != b"":
break
if buf == b"":
raise EOFError("Compressed file ended before the "
"end-of-stream marker was reached")

self._crc = zlib_ng.crc32(uncompress, self._crc)
self._stream_size += len(uncompress)
self._pos += len(uncompress)
return uncompress


# Aliases for improved compatibility with CPython gzip module.
GzipFile = GzipNGFile
_GzipReader = _GzipNGReader


def _read_exact(fp, n):
'''Read exactly *n* bytes from `fp`
This method is required because fp may be unbuffered,
i.e. return short reads.
'''
data = fp.read(n)
while len(data) < n:
b = fp.read(n - len(data))
if not b:
raise EOFError("Compressed file ended before the "
"end-of-stream marker was reached")
data += b
return data


def _read_gzip_header(fp):
'''Read a gzip header from `fp` and progress to the end of the header.
Returns last mtime if header was present or None otherwise.
'''
magic = fp.read(2)
if magic == b'':
return None

if magic != b'\037\213':
raise BadGzipFile('Not a gzipped file (%r)' % magic)

(method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
if method != 8:
raise BadGzipFile('Unknown compression method')

if flag & FEXTRA:
# Read & discard the extra field, if present
extra_len, = struct.unpack("<H", _read_exact(fp, 2))
_read_exact(fp, extra_len)
if flag & FNAME:
# Read and discard a null-terminated string containing the filename
while True:
s = fp.read(1)
if not s or s == b'\000':
break
if flag & FCOMMENT:
# Read and discard a null-terminated string containing a comment
while True:
s = fp.read(1)
if not s or s == b'\000':
break
if flag & FHCRC:
_read_exact(fp, 2) # Read & discard the 16-bit header CRC
return last_mtime
_GzipNGReader = _GzipReader


def _create_simple_gzip_header(compresslevel: int,
@@ -342,25 +222,9 @@ def decompress(data):
"""Decompress a gzip compressed string in one shot.
Return the decompressed string.
"""
decompressed_members = []
while True:
fp = io.BytesIO(data)
if _read_gzip_header(fp) is None:
return b"".join(decompressed_members)
# Use a zlib raw deflate compressor
do = zlib_ng.decompressobj(wbits=-zlib_ng.MAX_WBITS)
# Read all the data except the header
decompressed = do.decompress(data[fp.tell():])
if not do.eof or len(do.unused_data) < 8:
raise EOFError("Compressed file ended before the end-of-stream "
"marker was reached")
crc, length = struct.unpack("<II", do.unused_data[:8])
if crc != zlib_ng.crc32(decompressed):
raise BadGzipFile("CRC check failed")
if length != (len(decompressed) & 0xffffffff):
raise BadGzipFile("Incorrect length of data produced")
decompressed_members.append(decompressed)
data = do.unused_data[8:].lstrip(b"\x00")
fp = io.BytesIO(data)
reader = _GzipReader(fp, max(len(data), 16))
return reader.readall()


def _argument_parser():
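
The rewritten ``decompress`` in this hunk hands the whole input to ``_GzipReader`` and calls ``readall()``, replacing the old member-by-member loop. Concatenated gzip members should therefore still come back as one byte string, as they do with the stdlib. A minimal sketch of that behaviour, where the alias `gz` and the stdlib fallback are assumptions added so the snippet runs without zlib-ng installed:

```python
import gzip

try:
    from zlib_ng import gzip_ng as gz
except ImportError:
    # Assumption: fall back to the stdlib so the sketch stays runnable.
    gz = gzip

# Two independent gzip members, each with its own header and trailer.
first = gz.compress(b"hello ")
second = gz.compress(b"world")

# A gzip file may consist of several members back to back; decompress
# returns the concatenation of their payloads.
combined = gz.decompress(first + second)
```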
@@ -431,6 +295,7 @@ def main():
if yes_or_no not in {"y", "Y", "yes"}:
sys.exit("not overwritten")

out_buffer = None
if args.compress:
if args.file is None:
in_file = sys.stdin.buffer
@@ -470,6 +335,8 @@ def main():
in_file.close()
if out_file is not sys.stdout.buffer:
out_file.close()
if out_buffer is not None and out_buffer is not sys.stdout.buffer:
out_buffer.close()


if __name__ == "__main__": # pragma: no cover