xopen-0.8.4/0000775000372000037200000000000013555010000013503 5ustar travistravis00000000000000xopen-0.8.4/.codecov.yml0000664000372000037200000000026413555007765015760 0ustar travistravis00000000000000comment: off codecov: require_ci_to_pass: no coverage: precision: 1 round: down range: "70...100" status: project: yes patch: no changes: no comment: off xopen-0.8.4/PKG-INFO0000664000372000037200000001333313555010000014603 0ustar travistravis00000000000000Metadata-Version: 2.1 Name: xopen Version: 0.8.4 Summary: Open compressed files transparently Home-page: https://github.com/marcelm/xopen/ Author: Marcel Martin Author-email: mail@marcelm.net License: MIT Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master :target: https://travis-ci.org/marcelm/xopen :alt: .. image:: https://img.shields.io/pypi/v/xopen.svg?branch=master :target: https://pypi.python.org/pypi/xopen .. image:: https://img.shields.io/conda/v/conda-forge/xopen.svg :target: https://anaconda.org/conda-forge/xopen :alt: .. image:: https://codecov.io/gh/marcelm/xopen/branch/master/graph/badge.svg :target: https://codecov.io/gh/marcelm/xopen :alt: ===== xopen ===== This small Python module provides an ``xopen`` function that works like the built-in ``open`` function, but can also deal with compressed files. Supported compression formats are gzip, bzip2 and xz. They are automatically recognized by their file extensions `.gz`, `.bz2` or `.xz`. The focus is on being as efficient as possible on all supported Python versions. For example, ``xopen`` uses ``pigz``, which is a parallel version of ``gzip``, to open ``.gz`` files, which is faster than using the built-in ``gzip.open`` function. ``pigz`` can use multiple threads when compressing, but is also faster when reading ``.gz`` files, so it is used both for reading and writing if it is available. This module has originally been developed as part of the `cutadapt tool `_ that is used in bioinformatics to manipulate sequencing data. 
It has been in successful use within that software for a few years. ``xopen`` is compatible with Python versions 2.7 and 3.4 to 3.8. Usage ----- Open a file for reading:: from xopen import xopen with xopen('file.txt.xz') as f: content = f.read() Or without context manager:: from xopen import xopen f = xopen('file.txt.xz') content = f.read() f.close() Open a file in binary mode for writing:: from xopen import xopen with xopen('file.txt.gz', mode='wb') as f: f.write(b'Hello') Credits ------- The name ``xopen`` was taken from the C function of the same name in the `utils.h file which is part of BWA `_. Kyle Beauchamp has contributed support for appending to files. Ruben Vorderman contributed improvements to make reading gzipped files faster. Some ideas were taken from the `canopener project `_. If you also want to open S3 files, you may want to use that module instead. Changes ------- v0.8.4 ~~~~~~ * When reading gzipped files, force ``pigz`` to use only a single process. ``pigz`` cannot use multiple cores anyway when decompressing. By default, it would use extra I/O processes, which slightly reduces wall-clock time, but increases CPU time. Single-core decompression with ``pigz`` is still about twice as fast as regular ``gzip``. * Allow ``threads=0`` for specifying that no external ``pigz``/``gzip`` process should be used (then regular ``gzip.open()`` is used instead). v0.8.3 ~~~~~~ * When reading gzipped files, let ``pigz`` use at most four threads by default. This limit previously only applied when writing to a file. * Support Python 3.8 v0.8.0 ~~~~~~ * Speed improvements when iterating over gzipped files. v0.6.0 ~~~~~~ * For reading from gzipped files, xopen will now use a ``pigz`` subprocess. This is faster than using ``gzip.open``. * Python 2 support will be dropped in one of the next releases. v0.5.0 ~~~~~~ * By default, pigz is now only allowed to use at most four threads. 
This hopefully reduces problems some users had with too many threads when opening many files at the same time. * xopen now accepts pathlib.Path objects. Author ------ Marcel Martin (`@marcelm_ on Twitter `_) Links ----- * `Source code `_ * `Report an issue `_ * `Project page on PyPI (Python package index) `_ Platform: UNKNOWN Classifier: Development Status :: 5 - Production/Stable Classifier: License :: OSI Approved :: MIT License Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.4 Classifier: Programming Language :: Python :: 3.5 Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Classifier: Programming Language :: Python :: 3.8 Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4 Provides-Extra: dev xopen-0.8.4/src/0000775000372000037200000000000013555010000014272 5ustar travistravis00000000000000xopen-0.8.4/src/xopen.egg-info/0000775000372000037200000000000013555010000017115 5ustar travistravis00000000000000xopen-0.8.4/src/xopen.egg-info/PKG-INFO0000664000372000037200000001333313555007777020250 0ustar travistravis00000000000000Metadata-Version: 2.1 Name: xopen Version: 0.8.4 Summary: Open compressed files transparently Home-page: https://github.com/marcelm/xopen/ Author: Marcel Martin Author-email: mail@marcelm.net License: MIT Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master :target: https://travis-ci.org/marcelm/xopen :alt: .. image:: https://img.shields.io/pypi/v/xopen.svg?branch=master :target: https://pypi.python.org/pypi/xopen .. image:: https://img.shields.io/conda/v/conda-forge/xopen.svg :target: https://anaconda.org/conda-forge/xopen :alt: .. 
image:: https://codecov.io/gh/marcelm/xopen/branch/master/graph/badge.svg :target: https://codecov.io/gh/marcelm/xopen :alt: ===== xopen ===== This small Python module provides an ``xopen`` function that works like the built-in ``open`` function, but can also deal with compressed files. Supported compression formats are gzip, bzip2 and xz. They are automatically recognized by their file extensions `.gz`, `.bz2` or `.xz`. The focus is on being as efficient as possible on all supported Python versions. For example, ``xopen`` uses ``pigz``, which is a parallel version of ``gzip``, to open ``.gz`` files, which is faster than using the built-in ``gzip.open`` function. ``pigz`` can use multiple threads when compressing, but is also faster when reading ``.gz`` files, so it is used both for reading and writing if it is available. This module has originally been developed as part of the `cutadapt tool `_ that is used in bioinformatics to manipulate sequencing data. It has been in successful use within that software for a few years. ``xopen`` is compatible with Python versions 2.7 and 3.4 to 3.8. Usage ----- Open a file for reading:: from xopen import xopen with xopen('file.txt.xz') as f: content = f.read() Or without context manager:: from xopen import xopen f = xopen('file.txt.xz') content = f.read() f.close() Open a file in binary mode for writing:: from xopen import xopen with xopen('file.txt.gz', mode='wb') as f: f.write(b'Hello') Credits ------- The name ``xopen`` was taken from the C function of the same name in the `utils.h file which is part of BWA `_. Kyle Beauchamp has contributed support for appending to files. Ruben Vorderman contributed improvements to make reading gzipped files faster. Some ideas were taken from the `canopener project `_. If you also want to open S3 files, you may want to use that module instead. Changes ------- v0.8.4 ~~~~~~ * When reading gzipped files, force ``pigz`` to use only a single process. 
``pigz`` cannot use multiple cores anyway when decompressing. By default, it would use extra I/O processes, which slightly reduces wall-clock time, but increases CPU time. Single-core decompression with ``pigz`` is still about twice as fast as regular ``gzip``. * Allow ``threads=0`` for specifying that no external ``pigz``/``gzip`` process should be used (then regular ``gzip.open()`` is used instead). v0.8.3 ~~~~~~ * When reading gzipped files, let ``pigz`` use at most four threads by default. This limit previously only applied when writing to a file. * Support Python 3.8 v0.8.0 ~~~~~~ * Speed improvements when iterating over gzipped files. v0.6.0 ~~~~~~ * For reading from gzipped files, xopen will now use a ``pigz`` subprocess. This is faster than using ``gzip.open``. * Python 2 support will be dropped in one of the next releases. v0.5.0 ~~~~~~ * By default, pigz is now only allowed to use at most four threads. This hopefully reduces problems some users had with too many threads when opening many files at the same time. * xopen now accepts pathlib.Path objects. 
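The behaviour described above (compression format chosen from the file extension, ``pathlib.Path`` objects accepted) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not xopen's actual code: ``open_by_extension`` is a hypothetical name, ``os.fspath`` requires Python 3.6 or newer, and no ``pigz`` subprocess is involved, so none of the speed benefits apply.

```python
import bz2
import gzip
import lzma
import os


def open_by_extension(path, mode="rt"):
    """Hypothetical helper: choose an opener based on the file extension."""
    # os.fspath() turns str and pathlib.Path objects into a plain string
    filename = os.fspath(path)
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode)
    if filename.endswith(".xz"):
        return lzma.open(filename, mode)
    # No known compression extension: fall back to the built-in open()
    return open(filename, mode)
```

With the real package, the equivalent call would be ``xopen(path, mode)``, which additionally tries a ``pigz`` or ``gzip`` subprocess for ``.gz`` files unless ``threads=0`` is given.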
Author ------ Marcel Martin (`@marcelm_ on Twitter `_) Links ----- * `Source code `_ * `Report an issue `_ * `Project page on PyPI (Python package index) `_ Platform: UNKNOWN Classifier: Development Status :: 5 - Production/Stable Classifier: License :: OSI Approved :: MIT License Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.4 Classifier: Programming Language :: Python :: 3.5 Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Classifier: Programming Language :: Python :: 3.8 Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4 Provides-Extra: dev xopen-0.8.4/src/xopen.egg-info/dependency_links.txt0000664000372000037200000000000113555007777023216 0ustar travistravis00000000000000 xopen-0.8.4/src/xopen.egg-info/top_level.txt0000664000372000037200000000000613555007777021676 0ustar travistravis00000000000000xopen xopen-0.8.4/src/xopen.egg-info/SOURCES.txt0000664000372000037200000000064713555010000021010 0ustar travistravis00000000000000.codecov.yml .editorconfig .gitignore .travis.yml LICENSE README.rst pyproject.toml setup.cfg setup.py tox.ini src/xopen/__init__.py src/xopen/_version.py src/xopen.egg-info/PKG-INFO src/xopen.egg-info/SOURCES.txt src/xopen.egg-info/dependency_links.txt src/xopen.egg-info/requires.txt src/xopen.egg-info/top_level.txt tests/file.txt tests/file.txt.bz2 tests/file.txt.gz tests/file.txt.xz tests/hello.gz tests/test_xopen.pyxopen-0.8.4/src/xopen.egg-info/requires.txt0000664000372000037200000000006213555007777021546 0ustar travistravis00000000000000 [:python_version == "2.7"] bz2file [dev] pytest xopen-0.8.4/src/xopen/0000775000372000037200000000000013555010000015423 5ustar travistravis00000000000000xopen-0.8.4/src/xopen/_version.py0000664000372000037200000000016413555007777017655 0ustar travistravis00000000000000# coding: utf-8 # file generated by setuptools_scm # don't change, don't 
track in version control version = '0.8.4' xopen-0.8.4/src/xopen/__init__.py0000664000372000037200000003204413555007765017567 0ustar travistravis00000000000000""" Open compressed files transparently. """ from __future__ import print_function, division, absolute_import import gzip import sys import io import os import time import signal from subprocess import Popen, PIPE from ._version import version as __version__ _PY3 = sys.version > '3' if not _PY3: import bz2file as bz2 else: try: import bz2 except ImportError: bz2 = None try: import lzma except ImportError: lzma = None if _PY3: basestring = str try: import pathlib # Exists in Python 3.4+ except ImportError: pathlib = None try: from os import fspath # Exists in Python 3.6+ except ImportError: def fspath(path): if hasattr(path, "__fspath__"): return path.__fspath__() # Python 3.4 and 3.5 have pathlib, but do not support the file system # path protocol if pathlib is not None and isinstance(path, pathlib.Path): return str(path) if not isinstance(path, basestring): raise TypeError("path must be a string") return path def _available_cpu_count(): """ Number of available virtual or physical CPUs on this system Adapted from http://stackoverflow.com/a/1006301/715090 """ try: return len(os.sched_getaffinity(0)) except AttributeError: pass import re try: with open('/proc/self/status') as f: status = f.read() m = re.search(r'(?m)^Cpus_allowed:\s*(.*)$', status) if m: res = bin(int(m.group(1).replace(',', ''), 16)).count('1') if res > 0: return res except IOError: pass try: import multiprocessing return multiprocessing.cpu_count() except (ImportError, NotImplementedError): return 1 class Closing(object): """ Inherit from this class and implement a close() method to offer context manager functionality. 
""" def __enter__(self): return self def __exit__(self, *exc_info): self.close() def __del__(self): try: self.close() except: pass class PipedGzipWriter(Closing): """ Write gzip-compressed files by running an external gzip or pigz process and piping into it. pigz is tried first. It is fast because it can compress using multiple cores. If pigz is not available, a gzip subprocess is used. On Python 2, this saves CPU time because gzip.GzipFile is slower. On Python 3, gzip.GzipFile is on par with gzip itself, but running an external gzip can still reduce wall-clock time because the compression happens in a separate process. """ def __init__(self, path, mode='wt', compresslevel=6, threads=None): """ mode -- one of 'w', 'wt', 'wb', 'a', 'at', 'ab' compresslevel -- gzip compression level threads (int) -- number of pigz threads. If this is set to None, a reasonable default is used. At the moment, this means that the number of available CPU cores is used, capped at four to avoid creating too many threads. Use 0 to let pigz use all available cores. """ if mode not in ('w', 'wt', 'wb', 'a', 'at', 'ab'): raise ValueError("Mode is '{0}', but it must be 'w', 'wt', 'wb', 'a', 'at' or 'ab'".format(mode)) # TODO use a context manager self.outfile = open(path, mode) self.devnull = open(os.devnull, mode) self.closed = False self.name = path kwargs = dict(stdin=PIPE, stdout=self.outfile, stderr=self.devnull) # Setting close_fds to True in the Popen arguments is necessary due to # . # However, close_fds is not supported on Windows. See # . 
if sys.platform != 'win32': kwargs['close_fds'] = True if 'w' in mode and compresslevel != 6: extra_args = ['-' + str(compresslevel)] else: extra_args = [] pigz_args = ['pigz'] if threads is None: threads = min(_available_cpu_count(), 4) if threads != 0: pigz_args += ['-p', str(threads)] try: self.process = Popen(pigz_args + extra_args, **kwargs) self.program = 'pigz' except OSError: # pigz not found, try regular gzip try: self.process = Popen(['gzip'] + extra_args, **kwargs) self.program = 'gzip' except (IOError, OSError): self.outfile.close() self.devnull.close() raise except IOError: # TODO IOError is the same as OSError on Python 3.3 self.outfile.close() self.devnull.close() raise if _PY3 and 'b' not in mode: self._file = io.TextIOWrapper(self.process.stdin) else: self._file = self.process.stdin def write(self, arg): self._file.write(arg) def close(self): if self.closed: return self.closed = True self._file.close() retcode = self.process.wait() self.outfile.close() self.devnull.close() if retcode != 0: raise IOError("Output {0} process terminated with exit code {1}".format(self.program, retcode)) def __iter__(self): return self def __next__(self): raise io.UnsupportedOperation('not readable') class PipedGzipReader(Closing): """ Open a pipe to pigz for reading a gzipped file. Even though pigz is mostly used to speed up writing by using many compression threads, it is also faster when reading, even when forced to use a single thread (ca. 2x speedup). """ def __init__(self, path, mode='r', threads=None): """ Raise an OSError when pigz could not be found. """ if mode not in ('r', 'rt', 'rb'): raise ValueError("Mode is '{0}', but it must be 'r', 'rt' or 'rb'".format(mode)) pigz_args = ['pigz', '-cd', path] if threads is None: # Single threaded behaviour by default because: # - Using a single thread to read a file is the least unexpected # behaviour. (For users of xopen, who do not know which backend is used.) 
            # - There is quite a substantial overhead (+25% CPU time) when
            #   using multiple threads while there is only a 10% gain in wall
            #   clock time.
            threads = 1

        pigz_args += ['-p', str(threads)]

        self.process = Popen(pigz_args, stdout=PIPE, stderr=PIPE)
        self.name = path
        if _PY3 and 'b' not in mode:
            self._file = io.TextIOWrapper(self.process.stdout)
        else:
            self._file = self.process.stdout
        if _PY3:
            self._stderr = io.TextIOWrapper(self.process.stderr)
        else:
            self._stderr = self.process.stderr
        self.closed = False
        # Give the subprocess a little bit of time to report any errors (such as
        # a non-existing file)
        time.sleep(0.01)
        self._raise_if_error()

    def close(self):
        if self.closed:
            return
        self.closed = True
        retcode = self.process.poll()
        if retcode is None:
            # still running
            self.process.terminate()
            allow_sigterm = True
        else:
            allow_sigterm = False
        self.process.wait()
        self._raise_if_error(allow_sigterm=allow_sigterm)

    def __iter__(self):
        return self._file

    def _raise_if_error(self, allow_sigterm=False):
        """
        Raise IOError if process is not running anymore and the exit code is
        nonzero. If allow_sigterm is set and a SIGTERM exit code is
        encountered, no error is raised.
        """
        retcode = self.process.poll()
        if (
            retcode is not None and retcode != 0
            and not (allow_sigterm and retcode == -signal.SIGTERM)
        ):
            message = self._stderr.read().strip()
            raise IOError("{} (exit code {})".format(message, retcode))

    def read(self, *args):
        return self._file.read(*args)

    def readinto(self, *args):
        return self._file.readinto(*args)

    def readline(self, *args):
        return self._file.readline(*args)

    def seekable(self):
        return self._file.seekable()

    def peek(self, n=None):
        return self._file.peek(n)

    def readable(self):
        if _PY3:
            return self._file.readable()
        else:
            # Raise (not return) the exception so callers actually see the error
            raise NotImplementedError(
                "Python 2 does not support the readable() method."
) def writable(self): return self._file.writable() def flush(self): return None def _open_stdin_or_out(mode): # Do not return sys.stdin or sys.stdout directly as we want the returned object # to be closable without closing sys.stdout. std = dict(r=sys.stdin, w=sys.stdout)[mode[0]] if not _PY3: # Enforce str type on Python 2 # Note that io.open is slower than regular open() on Python 2.7, but # it appears to be the only API that has a closefd parameter. mode = mode[0] + 'b' return io.open(std.fileno(), mode=mode, closefd=False) def _open_bz2(filename, mode): if bz2 is None: raise ImportError("Cannot open bz2 files: The bz2 module is not available") if _PY3: return bz2.open(filename, mode) else: if mode[0] == 'a': raise ValueError("mode '{0}' not supported with BZ2 compression".format(mode)) return bz2.BZ2File(filename, mode) def _open_xz(filename, mode): if lzma is None: raise ImportError( "Cannot open xz files: The lzma module is not available (use Python 3.3 or newer)") return lzma.open(filename, mode) def _open_gz(filename, mode, compresslevel, threads): if sys.version_info[:2] == (2, 7): buffered_reader = io.BufferedReader buffered_writer = io.BufferedWriter else: buffered_reader = lambda x: x buffered_writer = lambda x: x if _PY3: exc = FileNotFoundError # was introduced in Python 3.3 else: exc = OSError if 'r' in mode: def open_with_threads(): return PipedGzipReader(filename, mode, threads=threads) def open_without_threads(): return buffered_reader(gzip.open(filename, mode)) else: def open_with_threads(): return PipedGzipWriter(filename, mode, compresslevel, threads=threads) def open_without_threads(): return buffered_writer(gzip.open(filename, mode, compresslevel=compresslevel)) if threads == 0: return open_without_threads() try: return open_with_threads() except exc: # pigz is not installed, use fallback return open_without_threads() def xopen(filename, mode='r', compresslevel=6, threads=None): """ A replacement for the "open" function that can also read and 
write compressed files transparently. The supported compression formats are gzip, bzip2 and xz. If the filename is '-', standard output (mode 'w') or standard input (mode 'r') is returned. The file type is determined based on the filename: .gz is gzip, .bz2 is bzip2, .xz is xz/lzma and no compression assumed otherwise. mode can be: 'rt', 'rb', 'at', 'ab', 'wt', or 'wb'. Also, the 't' can be omitted, so instead of 'rt', 'wt' and 'at', the abbreviations 'r', 'w' and 'a' can be used. In Python 2, the 't' and 'b' characters are ignored. Append mode ('a', 'at', 'ab') is not available with BZ2 compression and will raise an error. compresslevel is the compression level for writing to gzip files. This parameter is ignored for the other compression formats. threads only has a meaning when reading or writing gzip files. When threads is None (the default), reading or writing a gzip file is done with a pigz (parallel gzip) subprocess if possible. See PipedGzipWriter and PipedGzipReader. When threads = 0, no subprocess is used. """ if mode in ('r', 'w', 'a'): mode += 't' if mode not in ('rt', 'rb', 'wt', 'wb', 'at', 'ab'): raise ValueError("mode '{0}' not supported".format(mode)) if not _PY3: mode = mode[0] filename = fspath(filename) if compresslevel not in range(1, 10): raise ValueError("compresslevel must be between 1 and 9") if filename == '-': return _open_stdin_or_out(mode) elif filename.endswith('.bz2'): return _open_bz2(filename, mode) elif filename.endswith('.xz'): return _open_xz(filename, mode) elif filename.endswith('.gz'): return _open_gz(filename, mode, compresslevel, threads) else: # Python 2.6 and 2.7 have io.open, which we could use to make the returned # object consistent with the one returned in Python 3, but reading a file # with io.open() is 100 times slower (!) 
on Python 2.6, and still about # three times slower on Python 2.7 (tested with "for _ in io.open(path): pass") return open(filename, mode) xopen-0.8.4/.editorconfig0000664000372000037200000000013713555007765016211 0ustar travistravis00000000000000[*.py] charset=utf-8 end_of_line=lf insert_final_newline=true indent_style=space indent_size=4 xopen-0.8.4/tox.ini0000664000372000037200000000020413555007765015042 0ustar travistravis00000000000000[tox] envlist = py27,py34,py35,py36,py37,py38 [testenv] deps = pytest commands = pytest --doctest-modules --pyargs src/xopen tests xopen-0.8.4/.gitignore0000664000372000037200000000010213555007765015514 0ustar travistravis00000000000000__pycache__/ *.pyc *.egg-info *~ .tox venv/ src/xopen/_version.py xopen-0.8.4/setup.cfg0000664000372000037200000000030513555010000015322 0ustar travistravis00000000000000[bdist_wheel] universal = 1 [coverage:run] parallel = True include = */site-packages/xopen/* tests/* [coverage:paths] source = src/ **/site-packages/ [egg_info] tag_build = tag_date = 0 xopen-0.8.4/.travis.yml0000664000372000037200000000142713555007765015650 0ustar travistravis00000000000000language: python dist: xenial cache: directories: - $HOME/.cache/pip python: - "2.7" - "3.4" - "3.5" - "3.6" - "3.7" - "3.8" install: - sudo apt-get update && sudo apt-get install -y pigz - pip install --upgrade coverage codecov - pip install . script: - python setup.py --version # Detect encoding problems - coverage run -m pytest after_success: - coverage combine - codecov env: global: - TWINE_USERNAME=marcelm jobs: include: - stage: deploy services: - docker python: "3.6" install: python3 -m pip install twine if: tag IS present script: - | python3 setup.py sdist python3 -m pip wheel -w dist/ . 
ls -l dist/ python3 -m twine upload dist/xopen-* xopen-0.8.4/LICENSE0000664000372000037200000000207113555007765014540 0ustar travistravis00000000000000Copyright (c) 2010-2019 Marcel Martin Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. xopen-0.8.4/README.rst0000664000372000037200000000771413555007765015233 0ustar travistravis00000000000000.. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master :target: https://travis-ci.org/marcelm/xopen :alt: .. image:: https://img.shields.io/pypi/v/xopen.svg?branch=master :target: https://pypi.python.org/pypi/xopen .. image:: https://img.shields.io/conda/v/conda-forge/xopen.svg :target: https://anaconda.org/conda-forge/xopen :alt: .. image:: https://codecov.io/gh/marcelm/xopen/branch/master/graph/badge.svg :target: https://codecov.io/gh/marcelm/xopen :alt: ===== xopen ===== This small Python module provides an ``xopen`` function that works like the built-in ``open`` function, but can also deal with compressed files. 
Supported compression formats are gzip, bzip2 and xz.
They are automatically recognized by their file extensions ``.gz``, ``.bz2`` or ``.xz``.

The focus is on being as efficient as possible on all supported Python versions.
For example, ``xopen`` uses ``pigz``, which is a parallel version of ``gzip``,
to open ``.gz`` files, which is faster than using the built-in ``gzip.open``
function. ``pigz`` can use multiple threads when compressing, but is also faster
when reading ``.gz`` files, so it is used both for reading and writing if it is
available.

This module was originally developed as part of the `cutadapt tool `_
that is used in bioinformatics to manipulate sequencing data.
It has been in successful use within that software for a few years.

``xopen`` is compatible with Python versions 2.7 and 3.4 to 3.8.


Usage
-----

Open a file for reading::

    from xopen import xopen

    with xopen('file.txt.xz') as f:
        content = f.read()

Or without context manager::

    from xopen import xopen

    f = xopen('file.txt.xz')
    content = f.read()
    f.close()

Open a file in binary mode for writing::

    from xopen import xopen

    with xopen('file.txt.gz', mode='wb') as f:
        f.write(b'Hello')


Credits
-------

The name ``xopen`` was taken from the C function of the same name in the
`utils.h file which is part of BWA `_.

Kyle Beauchamp has contributed support for appending to files.

Ruben Vorderman contributed improvements to make reading gzipped files faster.

Some ideas were taken from the `canopener project `_.
If you also want to open S3 files, you may want to use that module instead.


Changes
-------

v0.8.4
~~~~~~

* When reading gzipped files, force ``pigz`` to use only a single process.
  ``pigz`` cannot use multiple cores anyway when decompressing. By default,
  it would use extra I/O processes, which slightly reduces wall-clock time,
  but increases CPU time. Single-core decompression with ``pigz`` is still
  about twice as fast as regular ``gzip``.
* Allow ``threads=0`` for specifying that no external ``pigz``/``gzip`` process should be used (then regular ``gzip.open()`` is used instead). v0.8.3 ~~~~~~ * When reading gzipped files, let ``pigz`` use at most four threads by default. This limit previously only applied when writing to a file. * Support Python 3.8 v0.8.0 ~~~~~~ * Speed improvements when iterating over gzipped files. v0.6.0 ~~~~~~ * For reading from gzipped files, xopen will now use a ``pigz`` subprocess. This is faster than using ``gzip.open``. * Python 2 support will be dropped in one of the next releases. v0.5.0 ~~~~~~ * By default, pigz is now only allowed to use at most four threads. This hopefully reduces problems some users had with too many threads when opening many files at the same time. * xopen now accepts pathlib.Path objects. Author ------ Marcel Martin (`@marcelm_ on Twitter `_) Links ----- * `Source code `_ * `Report an issue `_ * `Project page on PyPI (Python package index) `_ xopen-0.8.4/pyproject.toml0000664000372000037200000000010413555007765016442 0ustar travistravis00000000000000[build-system] requires = ["setuptools", "wheel", "setuptools_scm"] xopen-0.8.4/setup.py0000664000372000037200000000252413555007765015250 0ustar travistravis00000000000000import sys from setuptools import setup, find_packages if sys.version_info < (2, 7): sys.stdout.write("At least Python 2.7 is required.\n") sys.exit(1) with open('README.rst') as f: long_description = f.read() setup( name='xopen', use_scm_version={'write_to': 'src/xopen/_version.py'}, setup_requires=['setuptools_scm'], # Support pip versions that don't know about pyproject.toml author='Marcel Martin', author_email='mail@marcelm.net', url='https://github.com/marcelm/xopen/', description='Open compressed files transparently', long_description=long_description, license='MIT', package_dir={'': 'src'}, packages=find_packages('src'), install_requires=[ 'bz2file; python_version=="2.7"', ], extras_require={ 'dev': ['pytest'], }, 
python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4', classifiers=[ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.8", ] ) xopen-0.8.4/tests/0000775000372000037200000000000013555010000014645 5ustar travistravis00000000000000xopen-0.8.4/tests/hello.gz0000664000372000037200000000003113555007765016334 0ustar travistravis00000000000000ZH6xopen-0.8.4/tests/file.txt.bz20000664000372000037200000000016613555007765017054 0ustar travistravis00000000000000BZh91AY&SYӀ@ 1MTikt%B"(HN|BZh91AY&SYsS@e 1ē& 7"(H9xopen-0.8.4/tests/test_xopen.py0000664000372000037200000002332513555007765017444 0ustar travistravis00000000000000# coding: utf-8 from __future__ import print_function, division, absolute_import import io import os import random import sys import signal import time from contextlib import contextmanager import pytest from xopen import xopen, PipedGzipReader, PipedGzipWriter extensions = ["", ".gz", ".bz2"] try: import lzma extensions.append(".xz") except ImportError: lzma = None base = "tests/file.txt" files = [base + ext for ext in extensions] CONTENT_LINES = ['Testing, testing ...\n', 'The second line.\n'] CONTENT = ''.join(CONTENT_LINES) # File extensions for which appending is supported append_extensions = extensions[:] if sys.version_info[0] == 2: append_extensions.remove(".bz2") @pytest.fixture(params=extensions) def ext(request): return request.param @pytest.fixture(params=files) def fname(request): return request.param @pytest.fixture def large_gzip(tmpdir): path = str(tmpdir.join("large.gz")) random_text = ''.join(random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ\n') for _ in range(1024)) # Make the text a lot bigger in order to ensure that it is 
larger than the # pipe buffer size. random_text *= 1024 with xopen(path, 'w') as f: f.write(random_text) return path @pytest.fixture def truncated_gzip(large_gzip): with open(large_gzip, 'a') as f: f.truncate(os.stat(large_gzip).st_size - 10) return large_gzip def test_xopen_text(fname): with xopen(fname, 'rt') as f: lines = list(f) assert len(lines) == 2 assert lines[1] == 'The second line.\n', fname def test_xopen_binary(fname): with xopen(fname, 'rb') as f: lines = list(f) assert len(lines) == 2 assert lines[1] == b'The second line.\n', fname def test_no_context_manager_text(fname): f = xopen(fname, 'rt') lines = list(f) assert len(lines) == 2 assert lines[1] == 'The second line.\n', fname f.close() assert f.closed def test_no_context_manager_binary(fname): f = xopen(fname, 'rb') lines = list(f) assert len(lines) == 2 assert lines[1] == b'The second line.\n', fname f.close() assert f.closed def test_readinto(fname): # Test whether .readinto() works content = CONTENT.encode('utf-8') with xopen(fname, 'rb') as f: b = bytearray(len(content) + 100) length = f.readinto(b) assert length == len(content) assert b[:length] == content def test_pipedgzipreader_readinto(): # Test whether PipedGzipReader.readinto works content = CONTENT.encode('utf-8') with PipedGzipReader("tests/file.txt.gz", "rb") as f: b = bytearray(len(content) + 100) length = f.readinto(b) assert length == len(content) assert b[:length] == content if sys.version_info[0] != 2: def test_pipedgzipreader_textiowrapper(): with PipedGzipReader("tests/file.txt.gz", "rb") as f: wrapped = io.TextIOWrapper(f) assert wrapped.read() == CONTENT def test_readline(fname): first_line = CONTENT_LINES[0].encode('utf-8') with xopen(fname, 'rb') as f: assert f.readline() == first_line def test_readline_text(fname): with xopen(fname, 'r') as f: assert f.readline() == CONTENT_LINES[0] def test_readline_pipedgzipreader(): first_line = CONTENT_LINES[0].encode('utf-8') with PipedGzipReader("tests/file.txt.gz", "rb") as f: 
        assert f.readline() == first_line


def test_readline_text_pipedgzipreader():
    with PipedGzipReader("tests/file.txt.gz", "r") as f:
        assert f.readline() == CONTENT_LINES[0]


def test_xopen_has_iter_method(ext, tmpdir):
    path = str(tmpdir.join("out" + ext))
    with xopen(path, mode='w') as f:
        assert hasattr(f, '__iter__')


def test_pipedgzipwriter_has_iter_method(tmpdir):
    with PipedGzipWriter(str(tmpdir.join("out.gz"))) as f:
        assert hasattr(f, '__iter__')


@pytest.mark.parametrize("mode", ["rb", "rt"])
def test_pipedgzipreader_close(large_gzip, mode):
    with PipedGzipReader(large_gzip, mode=mode) as f:
        f.readline()
        time.sleep(0.2)
    # The subprocess should be properly terminated now


@pytest.mark.skipif(sys.version_info < (3, ), reason="Python 3 needed")
def test_partial_gzip_iteration_closes_correctly(large_gzip):
    class LineReader:
        def __init__(self, file):
            self.file = xopen(file, "rb")

        def __iter__(self):
            wrapper = io.TextIOWrapper(self.file)
            for line in wrapper:
                yield line

    f = LineReader(large_gzip)
    next(iter(f))
    f.file.close()


def test_nonexisting_file(ext):
    with pytest.raises(IOError):
        with xopen('this-file-does-not-exist' + ext) as f:
            pass  # pragma: no cover


def test_write_to_nonexisting_dir(ext):
    with pytest.raises(IOError):
        with xopen('this/path/does/not/exist/file.txt' + ext, 'w') as f:
            pass  # pragma: no cover


def test_invalid_mode():
    with pytest.raises(ValueError):
        with xopen("tests/file.txt.gz", mode="hallo") as f:
            pass  # pragma: no cover


def test_filename_not_a_string():
    with pytest.raises(TypeError):
        with xopen(123, mode="r") as f:
            pass  # pragma: no cover


def test_invalid_compression_level(tmpdir):
    path = str(tmpdir.join("out.gz"))
    with pytest.raises(ValueError) as e:
        with xopen(path, mode="w", compresslevel=17) as f:
            f.write("hello")  # pragma: no cover
    assert "between 1 and 9" in e.value.args[0]


@pytest.mark.parametrize("aext", append_extensions)
def test_append(aext, tmpdir):
    text = "AB".encode("utf-8")
    reference = text + text
    path = str(tmpdir.join("the-file" +
        aext))
    with xopen(path, "ab") as f:
        f.write(text)
    with xopen(path, "ab") as f:
        f.write(text)
    with xopen(path, "r") as f:
        for appended in f:
            pass
        reference = reference.decode("utf-8")
        assert appended == reference


@pytest.mark.parametrize("aext", append_extensions)
def test_append_text(aext, tmpdir):
    text = "AB"
    reference = text + text
    path = str(tmpdir.join("the-file" + aext))
    with xopen(path, "at") as f:
        f.write(text)
    with xopen(path, "at") as f:
        f.write(text)
    with xopen(path, "rt") as f:
        for appended in f:
            pass
        assert appended == reference


class TookTooLongError(Exception):
    pass


class timeout:
    # copied from https://stackoverflow.com/a/22348885/715090
    def __init__(self, seconds=1):
        self.seconds = seconds

    def handle_timeout(self, signum, frame):
        raise TookTooLongError()  # pragma: no cover

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)

    def __exit__(self, type, value, traceback):
        signal.alarm(0)


def test_truncated_gz(truncated_gzip):
    with timeout(seconds=2):
        with pytest.raises((EOFError, IOError)):
            f = xopen(truncated_gzip, "r")
            f.read()
            f.close()  # pragma: no cover


def test_truncated_gz_iter(truncated_gzip):
    with timeout(seconds=2):
        with pytest.raises((EOFError, IOError)):
            f = xopen(truncated_gzip, 'r')
            for line in f:
                pass
            f.close()  # pragma: no cover


def test_truncated_gz_with(truncated_gzip):
    with timeout(seconds=2):
        with pytest.raises((EOFError, IOError)):
            with xopen(truncated_gzip, 'r') as f:
                f.read()


def test_truncated_gz_iter_with(truncated_gzip):
    with timeout(seconds=2):
        with pytest.raises((EOFError, IOError)):
            with xopen(truncated_gzip, 'r') as f:
                for line in f:
                    pass


def test_bare_read_from_gz():
    with xopen('tests/hello.gz', 'rt') as f:
        assert f.read() == 'hello'


def test_read_piped_gzip():
    with PipedGzipReader('tests/hello.gz', 'rt') as f:
        assert f.read() == 'hello'


def test_write_pigz_threads(tmpdir):
    path = str(tmpdir.join('out.gz'))
    with xopen(path, mode='w', threads=3) as f:
        f.write('hello')
    with \
        xopen(path) as f:
        assert f.read() == 'hello'


if sys.version_info[0] >= 3:
    def test_read_gzip_no_threads():
        import gzip
        with xopen("tests/hello.gz", "rb", threads=0) as f:
            assert isinstance(f, gzip.GzipFile), f

    def test_write_gzip_no_threads(tmpdir):
        import gzip
        path = str(tmpdir.join("out.gz"))
        with xopen(path, "wb", threads=0) as f:
            assert isinstance(f, gzip.GzipFile), f


def test_write_stdout():
    f = xopen('-', mode='w')
    print("Hello", file=f)
    f.close()
    # ensure stdout is not closed
    print("Still there?")


def test_write_stdout_contextmanager():
    # Do not close stdout
    with xopen('-', mode='w') as f:
        print("Hello", file=f)
    # ensure stdout is not closed
    print("Still there?")


if sys.version_info[:2] >= (3, 4):
    # pathlib was added in Python 3.4
    from pathlib import Path

    def test_read_pathlib(fname):
        path = Path(fname)
        with xopen(path, mode='rt') as f:
            assert f.read() == CONTENT

    def test_read_pathlib_binary(fname):
        path = Path(fname)
        with xopen(path, mode='rb') as f:
            assert f.read() == bytes(CONTENT, 'ascii')

    def test_write_pathlib(ext, tmpdir):
        path = Path(str(tmpdir)) / ('hello.txt' + ext)
        with xopen(path, mode='wt') as f:
            f.write('hello')
        with xopen(path, mode='rt') as f:
            assert f.read() == 'hello'

    def test_write_pathlib_binary(ext, tmpdir):
        path = Path(str(tmpdir)) / ('hello.txt' + ext)
        with xopen(path, mode='wb') as f:
            f.write(b'hello')
        with xopen(path, mode='rb') as f:
            assert f.read() == b'hello'
xopen-0.8.4/tests/file.txt0000664000372000037200000000004613555007765016355 0ustar travistravis00000000000000
Testing, testing ...
The second line.
xopen-0.8.4/tests/file.txt.xz0000664000372000037200000000014013555007765017010 0ustar travistravis000000000000007zXZִF!t/%Testing, testing ... The second line. ]ݜa>&+N}YZ
xopen-0.8.4/tests/file.txt.gz0000664000372000037200000000006513555007765016775 0ustar travistravis00000000000000ȵW I-.KQ(0B2RSRr2Rs&