././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1715025058.9213526 partd-1.4.2/0000755000076500000240000000000014616232243011655 5ustar00jamesstaff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/LICENSE.txt0000644000076500000240000000274114455317026013510 0ustar00jamesstaffCopyright (c) 2015, Continuum Analytics, Inc. and contributors All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of Continuum Analytics nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715022964.0 partd-1.4.2/MANIFEST.in0000644000076500000240000000023214616226164013415 0ustar00jamesstaff
recursive-include partd *.py

include setup.py
include README.rst
include LICENSE.txt
include MANIFEST.in
include versioneer.py
include partd/_version.py
././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1715025058.9210765 partd-1.4.2/PKG-INFO0000644000076500000240000001101414616232243012747 0ustar00jamesstaff
Metadata-Version: 2.1
Name: partd
Version: 1.4.2
Summary: Appendable key-value storage
Maintainer-email: Matthew Rocklin
License: BSD
Project-URL: Homepage, http://github.com/dask/partd/
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/x-rst
License-File: LICENSE.txt
Requires-Dist: locket
Requires-Dist: toolz
Provides-Extra: complete
Requires-Dist: numpy>=1.20.0; extra == "complete"
Requires-Dist: pandas>=1.3; extra == "complete"
Requires-Dist: pyzmq; extra == "complete"
Requires-Dist: blosc; extra == "complete"

PartD
=====

|Build Status| |Version Status|

Key-value byte store with appendable values

    Partd stores key-value pairs.
    Values are raw bytes.
    We append to old values.

Partd excels at shuffling operations.

Operations
----------

PartD has two main operations, ``append`` and ``get``.


Example
-------

1.  Create a Partd backed by a directory::

        >>> import partd
        >>> p = partd.File('/path/to/new/dataset/')

2.  Append key-byte pairs to the dataset::

        >>> p.append({'x': b'Hello ', 'y': b'123'})
        >>> p.append({'x': b'world!', 'y': b'456'})

3.  Get the bytes associated with keys::

        >>> p.get('x')         # One key
        b'Hello world!'

        >>> p.get(['y', 'x'])  # List of keys
        [b'123456', b'Hello world!']

4.  Destroy the partd dataset::

        >>> p.drop()

That's it.


Implementations
---------------

We can back a partd by an in-memory dictionary::

    >>> p = Dict()

For larger amounts of data or to share data between processes we back a partd
by a directory of files.  This uses file-based locks for consistency::

    >>> p = File('/path/to/dataset/')

However, this can fail for many small writes.  In these cases you may wish to
buffer one partd with another, keeping a fixed maximum of data in the buffering
partd.  This writes the larger elements of the first partd to the second partd
when space runs low::

    >>> p = Buffer(Dict(), File(), available_memory=2e9)  # 2GB memory buffer

You might also want to have many distributed processes write to a single partd
consistently.  This can be done with a server:

*   Server Process::

        >>> p = Buffer(Dict(), File(), available_memory=2e9)  # 2GB memory buffer
        >>> s = Server(p, address='ipc://server')

*   Worker processes::

        >>> p = Client('ipc://server')  # Client machine talks to remote server


Encodings and Compression
-------------------------

Once we can robustly and efficiently append bytes to a partd we consider
compression and encodings.  This is generally available with the ``Encode``
partd, which accepts three functions: one to apply to bytes as they are
written, one to apply to bytes as they are read, and one to join bytestreams.
Common configurations already exist for common data and compression formats.

We may wish to compress and decompress data transparently as we interact with a
partd.  Objects like ``BZ2``, ``Blosc``, ``ZLib`` and ``Snappy`` exist and take
another partd as an argument::

    >>> p = File(...)
    >>> p = ZLib(p)

These work exactly as before; the (de)compression happens automatically.

Common data formats like Python lists, numpy arrays, and pandas
dataframes are also supported out of the box::

    >>> p = File(...)
    >>> p = Numpy(p)
    >>> p.append({'x': np.array([...])})

This lets us forget about bytes and think instead in our normal data types.

Composition
-----------

In principle we want to compose all of these choices together:

1.  Write policy:  ``Dict``, ``File``, ``Buffer``, ``Client``
2.  Encoding:  ``Pickle``, ``Numpy``, ``PandasColumns``, ``PandasBlocks``, ...
3.  Compression:  ``Blosc``, ``Snappy``, ...

Partd objects compose by nesting.  Here we make a partd that writes
pickle-encoded, BZ2-compressed bytes directly to disk::

    >>> p = Pickle(BZ2(File('foo')))

We could construct more complex systems that include compression,
serialization, buffering, and remote access::

    >>> server = Server(Buffer(Dict(), File(), available_memory=2e9))
    >>> client = Pickle(Snappy(Client(server.address)))
    >>> client.append({'x': [1, 2, 3]})

.. |Build Status| image:: https://github.com/dask/partd/workflows/CI/badge.svg
   :target: https://github.com/dask/partd/actions?query=workflow%3ACI
.. |Version Status| image:: https://img.shields.io/pypi/v/partd.svg
   :target: https://pypi.python.org/pypi/partd/
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/README.rst0000644000076500000240000000736014455317176013364 0ustar00jamesstaff
PartD
=====

|Build Status| |Version Status|

Key-value byte store with appendable values

    Partd stores key-value pairs.
    Values are raw bytes.
    We append to old values.

Partd excels at shuffling operations.

Operations
----------

PartD has two main operations, ``append`` and ``get``.


Example
-------

1.  Create a Partd backed by a directory::

        >>> import partd
        >>> p = partd.File('/path/to/new/dataset/')

2.  Append key-byte pairs to the dataset::

        >>> p.append({'x': b'Hello ', 'y': b'123'})
        >>> p.append({'x': b'world!', 'y': b'456'})

3.  Get the bytes associated with keys::

        >>> p.get('x')         # One key
        b'Hello world!'

        >>> p.get(['y', 'x'])  # List of keys
        [b'123456', b'Hello world!']

4.  Destroy the partd dataset::

        >>> p.drop()

That's it.

Implementations
---------------

We can back a partd by an in-memory dictionary::

    >>> p = Dict()

For larger amounts of data or to share data between processes we back a partd
by a directory of files.  This uses file-based locks for consistency::

    >>> p = File('/path/to/dataset/')

However, this can fail for many small writes.  In these cases you may wish to
buffer one partd with another, keeping a fixed maximum of data in the buffering
partd.  This writes the larger elements of the first partd to the second partd
when space runs low::

    >>> p = Buffer(Dict(), File(), available_memory=2e9)  # 2GB memory buffer

You might also want to have many distributed processes write to a single partd
consistently.  This can be done with a server:

*   Server Process::

        >>> p = Buffer(Dict(), File(), available_memory=2e9)  # 2GB memory buffer
        >>> s = Server(p, address='ipc://server')

*   Worker processes::

        >>> p = Client('ipc://server')  # Client machine talks to remote server


Encodings and Compression
-------------------------

Once we can robustly and efficiently append bytes to a partd we consider
compression and encodings.  This is generally available with the ``Encode``
partd, which accepts three functions: one to apply to bytes as they are
written, one to apply to bytes as they are read, and one to join bytestreams.
Common configurations already exist for common data and compression formats.

We may wish to compress and decompress data transparently as we interact with a
partd.  Objects like ``BZ2``, ``Blosc``, ``ZLib`` and ``Snappy`` exist and take
another partd as an argument::

    >>> p = File(...)
    >>> p = ZLib(p)

These work exactly as before; the (de)compression happens automatically.

Common data formats like Python lists, numpy arrays, and pandas
dataframes are also supported out of the box::

    >>> p = File(...)
    >>> p = Numpy(p)
    >>> p.append({'x': np.array([...])})

This lets us forget about bytes and think instead in our normal data types.

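To make the three-function idea concrete, here is a minimal, self-contained
sketch that uses only the standard library.  It is an illustration under stated
assumptions, not partd's implementation: ``ToyEncode`` is a hypothetical
stand-in backed by a plain dictionary, and ``frame`` / ``framesplit`` are
simplified helpers written for this example.  The length-prefixed framing is an
assumption about how chunk boundaries can be preserved, which matters because
each appended value is compressed separately.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length as an 8-byte unsigned integer."""
    return struct.pack("<Q", len(payload)) + payload


def framesplit(blob: bytes):
    """Yield each length-prefixed payload from a concatenation of frames."""
    offset = 0
    while offset < len(blob):
        (size,) = struct.unpack_from("<Q", blob, offset)
        offset += 8
        yield blob[offset:offset + size]
        offset += size


class ToyEncode:
    """Hypothetical Encode-style wrapper over an in-memory byte store."""

    def __init__(self, encode, decode, join):
        self.encode = encode  # applied to each value as it is written
        self.decode = decode  # applied to each stored chunk as it is read
        self.join = join      # combines the decoded chunks into one result
        self.store = {}       # key -> concatenated frames

    def append(self, data):
        for key, value in data.items():
            framed = frame(self.encode(value))
            self.store[key] = self.store.get(key, b"") + framed

    def get(self, key):
        raw = self.store.get(key, b"")
        return self.join([self.decode(chunk) for chunk in framesplit(raw)])


# zlib-compress on write, decompress on read, concatenate the pieces.
p = ToyEncode(zlib.compress, zlib.decompress, b"".join)
p.append({"x": b"Hello "})
p.append({"x": b"world!"})
print(p.get("x"))  # b'Hello world!'
```

Swapping the three functions changes the behavior without touching the storage
logic; for example, ``pickle.dumps``, ``pickle.loads`` and a list-concatenating
join would give a list-oriented variant of the same sketch.
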
Composition
-----------

In principle we want to compose all of these choices together:

1.  Write policy:  ``Dict``, ``File``, ``Buffer``, ``Client``
2.  Encoding:  ``Pickle``, ``Numpy``, ``PandasColumns``, ``PandasBlocks``, ...
3.  Compression:  ``Blosc``, ``Snappy``, ...

Partd objects compose by nesting.  Here we make a partd that writes
pickle-encoded, BZ2-compressed bytes directly to disk::

    >>> p = Pickle(BZ2(File('foo')))

We could construct more complex systems that include compression,
serialization, buffering, and remote access::

    >>> server = Server(Buffer(Dict(), File(), available_memory=2e9))
    >>> client = Pickle(Snappy(Client(server.address)))
    >>> client.append({'x': [1, 2, 3]})

.. |Build Status| image:: https://github.com/dask/partd/workflows/CI/badge.svg
   :target: https://github.com/dask/partd/actions?query=workflow%3ACI
.. |Version Status| image:: https://img.shields.io/pypi/v/partd.svg
   :target: https://pypi.python.org/pypi/partd/
././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1715025058.9171405 partd-1.4.2/partd/0000755000076500000240000000000014616232243012767 5ustar00jamesstaff
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715022964.0 partd-1.4.2/partd/__init__.py0000644000076500000240000000074714616226164015115 0ustar00jamesstaff
from contextlib import suppress

from .file import File
from .dict import Dict
from .buffer import Buffer
from .encode import Encode
from .pickle import Pickle
from .python import Python
from .compressed import *
with suppress(ImportError):
    from .numpy import Numpy
with suppress(ImportError):
    from .pandas import PandasColumns, PandasBlocks
with suppress(ImportError):
    from .zmq import Client, Server

from . 
import _version __version__ = _version.get_versions()['version'] ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1715025058.9215286 partd-1.4.2/partd/_version.py0000644000076500000240000000076114616232243015171 0ustar00jamesstaff # This file was generated by 'versioneer.py' (0.29) from # revision-control system data, or from the parent directory name of an # unpacked source archive. Distribution tarballs contain a pre-generated copy # of this file. import json version_json = ''' { "date": "2024-05-06T14:50:19-0500", "dirty": false, "error": null, "full-revisionid": "829b539a7355f2740124fd3f9a70ca4356f981d2", "version": "1.4.2" } ''' # END VERSION_JSON def get_versions(): return json.loads(version_json) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/buffer.py0000644000076500000240000000705414455317176014632 0ustar00jamesstafffrom .core import Interface from threading import Lock from toolz import merge_with, topk, accumulate, pluck from operator import add from bisect import bisect from collections import defaultdict from queue import Queue, Empty def zero(): return 0 class Buffer(Interface): def __init__(self, fast, slow, available_memory=1e9): self.lock = Lock() self.fast = fast self.slow = slow self.available_memory = available_memory self.lengths = defaultdict(zero) self.memory_usage = 0 Interface.__init__(self) def __getstate__(self): return {'fast': self.fast, 'slow': self.slow, 'memory_usage': self.memory_usage, 'lengths': self.lengths, 'available_memory': self.available_memory} def __setstate__(self, state): Interface.__setstate__(self, state) self.lock = Lock() self.__dict__.update(state) def append(self, data, lock=True, **kwargs): if lock: self.lock.acquire() try: for k, v in data.items(): self.lengths[k] += len(v) self.memory_usage += len(v) self.fast.append(data, lock=False, **kwargs) while self.memory_usage > self.available_memory: keys = 
keys_to_flush(self.lengths, 0.1, maxcount=20) self.flush(keys) finally: if lock: self.lock.release() def _get(self, keys, lock=True, **kwargs): if lock: self.lock.acquire() try: result = list(map(add, self.fast.get(keys, lock=False), self.slow.get(keys, lock=False))) finally: if lock: self.lock.release() return result def _iset(self, key, value, lock=True): """ Idempotent set """ if lock: self.lock.acquire() try: self.fast.iset(key, value, lock=False) finally: if lock: self.lock.release() def _delete(self, keys, lock=True): if lock: self.lock.acquire() try: self.fast.delete(keys, lock=False) self.slow.delete(keys, lock=False) finally: if lock: self.lock.release() def drop(self): self._iset_seen.clear() self.fast.drop() self.slow.drop() def __exit__(self, *args): self.drop() def flush(self, keys=None, block=None): """ Flush keys to disk Parameters ---------- keys: list or None list of keys to flush block: bool (defaults to None) Whether or not to block until all writing is complete If no keys are given then flush all keys """ if keys is None: keys = list(self.lengths) self.slow.append(dict(zip(keys, self.fast.get(keys)))) self.fast.delete(keys) for key in keys: self.memory_usage -= self.lengths[key] del self.lengths[key] def keys_to_flush(lengths, fraction=0.1, maxcount=100000): """ Which keys to remove >>> lengths = {'a': 20, 'b': 10, 'c': 15, 'd': 15, ... 
'e': 10, 'f': 25, 'g': 5} >>> keys_to_flush(lengths, 0.5) ['f', 'a'] """ top = topk(max(len(lengths) // 2, 1), lengths.items(), key=1) total = sum(lengths.values()) cutoff = min(maxcount, max(1, bisect(list(accumulate(add, pluck(1, top))), total * fraction))) result = [k for k, v in top[:cutoff]] assert result return result ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/compressed.py0000644000076500000240000000222614455317176015521 0ustar00jamesstafffrom contextlib import suppress from functools import partial from .encode import Encode __all__ = [] def bytes_concat(L): return b''.join(L) with suppress(ImportError, AttributeError): # In case snappy is not installed, or another package called snappy that does not implement compress / decompress. # For example, SnapPy (https://pypi.org/project/snappy/) import snappy Snappy = partial(Encode, snappy.compress, snappy.decompress, bytes_concat) __all__.append('Snappy') with suppress(ImportError): import zlib ZLib = partial(Encode, zlib.compress, zlib.decompress, bytes_concat) __all__.append('ZLib') with suppress(ImportError): import bz2 BZ2 = partial(Encode, bz2.compress, bz2.decompress, bytes_concat) __all__.append('BZ2') with suppress(ImportError): import blosc Blosc = partial(Encode, blosc.compress, blosc.decompress, bytes_concat) __all__.append('Blosc') ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/core.py0000644000076500000240000000432314455317176014305 0ustar00jamesstaffimport os import shutil import locket import string from toolz import memoize from contextlib import contextmanager from .utils import nested_get, flatten # http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python valid_chars = "-_.() " + string.ascii_letters + string.digits + os.path.sep def escape_filename(fn): """ Escape text so that it is a valid filename >>> escape_filename('Foo!bar?') 
    'Foobar'
    """
    return ''.join(filter(valid_chars.__contains__, fn))


def filename(path, key):
    return os.path.join(path, escape_filename(token(key)))


def token(key):
    """

    >>> token('hello')
    'hello'
    >>> token(('hello', 'world'))  # doctest: +SKIP
    'hello/world'
    """
    if isinstance(key, str):
        return key
    elif isinstance(key, tuple):
        return os.path.join(*map(token, key))
    else:
        return str(key)


class Interface:
    def __init__(self):
        self._iset_seen = set()

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._iset_seen = set()

    def iset(self, key, value, **kwargs):
        # Idempotent set: only the first write for a given key takes effect
        if key in self._iset_seen:
            return
        else:
            self._iset(key, value, **kwargs)
            self._iset_seen.add(key)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.drop()

    def iget(self, key):
        return self._get([key], lock=False)[0]

    def get(self, keys, **kwargs):
        if not isinstance(keys, list):
            return self.get([keys], **kwargs)[0]
        elif any(isinstance(key, list) for key in keys):  # nested case
            flatkeys = list(flatten(keys))
            result = self.get(flatkeys, **kwargs)
            return nested_get(keys, dict(zip(flatkeys, result)))
        else:
            return self._get(keys, **kwargs)

    def delete(self, keys, **kwargs):
        if not isinstance(keys, list):
            return self._delete([keys], **kwargs)
        else:
            return self._delete(keys, **kwargs)

    def pop(self, keys, **kwargs):
        # Use this object's own lock, get and delete: not every Interface
        # subclass (e.g. Dict, File) has a ``partd`` attribute
        with self.lock:
            result = self.get(keys, lock=False)
            self.delete(keys, lock=False)
        return result
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/dict.py0000644000076500000240000000324514455317026014274 0ustar00jamesstaff
from .core import Interface
from threading import Lock


class Dict(Interface):
    def __init__(self):
        self.lock = Lock()
        self.data = dict()
        Interface.__init__(self)

    def __getstate__(self):
        return {'data': self.data}

    def __setstate__(self, state):
        Interface.__setstate__(self, state)
        Dict.__init__(self)
        self.data = state['data']

    def append(self, data, lock=True, **kwargs):
        if lock: 
self.lock.acquire() try: for k, v in data.items(): if k not in self.data: self.data[k] = [] self.data[k].append(v) finally: if lock: self.lock.release() def _get(self, keys, lock=True, **kwargs): assert isinstance(keys, (list, tuple, set)) if lock: self.lock.acquire() try: result = [b''.join(self.data.get(key, [])) for key in keys] finally: if lock: self.lock.release() return result def _iset(self, key, value, lock=True): """ Idempotent set """ if lock: self.lock.acquire() try: self.data[key] = [value] finally: if lock: self.lock.release() def _delete(self, keys, lock=True): if lock: self.lock.acquire() try: for key in keys: if key in self.data: del self.data[key] finally: if lock: self.lock.release() def drop(self): self._iset_seen.clear() self.data.clear() def __exit__(self, *args): self.drop() ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/encode.py0000644000076500000240000000240714455317026014605 0ustar00jamesstafffrom .core import Interface from .file import File from toolz import valmap from .utils import frame, framesplit class Encode(Interface): def __init__(self, encode, decode, join, partd=None): if not partd or isinstance(partd, str): partd = File(partd) self.partd = partd self.encode = encode self.decode = decode self.join = join Interface.__init__(self) def __getstate__(self): return self.__dict__ __setstate__ = Interface.__setstate__ def append(self, data, **kwargs): data = valmap(self.encode, data) data = valmap(frame, data) self.partd.append(data, **kwargs) def _get(self, keys, **kwargs): raw = self.partd._get(keys, **kwargs) return [self.join([self.decode(frame) for frame in framesplit(chunk)]) for chunk in raw] def delete(self, keys, **kwargs): return self.partd.delete(keys, **kwargs) def _iset(self, key, value, **kwargs): return self.partd.iset(key, frame(self.encode(value)), **kwargs) def drop(self): return self.partd.drop() @property def lock(self): return self.partd.lock def 
__exit__(self, *args): self.drop() self.partd.__exit__(*args) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/file.py0000644000076500000240000000756214455317176014304 0ustar00jamesstaffimport atexit from contextlib import suppress import os import shutil import string import tempfile from .core import Interface import locket class File(Interface): def __init__(self, path=None, dir=None): if not path: path = tempfile.mkdtemp(suffix='.partd', dir=dir) cleanup_files.append(path) self._explicitly_given_path = False else: self._explicitly_given_path = True self.path = path if not os.path.exists(path): with suppress(OSError): os.makedirs(path) self.lock = locket.lock_file(self.filename('.lock')) Interface.__init__(self) def __getstate__(self): return {'path': self.path} def __setstate__(self, state): Interface.__setstate__(self, state) File.__init__(self, state['path']) def append(self, data, lock=True, fsync=False, **kwargs): if lock: self.lock.acquire() try: for k, v in data.items(): fn = self.filename(k) if not os.path.exists(os.path.dirname(fn)): os.makedirs(os.path.dirname(fn)) with open(fn, 'ab') as f: f.write(v) if fsync: os.fsync(f) finally: if lock: self.lock.release() def _get(self, keys, lock=True, **kwargs): assert isinstance(keys, (list, tuple, set)) if lock: self.lock.acquire() try: result = [] for key in keys: try: with open(self.filename(key), 'rb') as f: result.append(f.read()) except OSError: result.append(b'') finally: if lock: self.lock.release() return result def _iset(self, key, value, lock=True): """ Idempotent set """ fn = self.filename(key) if not os.path.exists(os.path.dirname(fn)): os.makedirs(os.path.dirname(fn)) if lock: self.lock.acquire() try: with open(self.filename(key), 'wb') as f: f.write(value) finally: if lock: self.lock.release() def _delete(self, keys, lock=True): if lock: self.lock.acquire() try: for key in keys: path = filename(self.path, key) if os.path.exists(path): 
os.remove(path) finally: if lock: self.lock.release() def drop(self): if os.path.exists(self.path): shutil.rmtree(self.path) self._iset_seen.clear() os.mkdir(self.path) def filename(self, key): return filename(self.path, key) def __exit__(self, *args): self.drop() os.rmdir(self.path) def __del__(self): if not self._explicitly_given_path: self.drop() os.rmdir(self.path) def filename(path, key): return os.path.join(path, escape_filename(token(key))) # http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python valid_chars = "-_.() " + string.ascii_letters + string.digits + os.path.sep def escape_filename(fn): """ Escape text so that it is a valid filename >>> escape_filename('Foo!bar?') 'Foobar' """ return ''.join(filter(valid_chars.__contains__, fn)) def token(key): """ >>> token('hello') 'hello' >>> token(('hello', 'world')) # doctest: +SKIP 'hello/world' """ if isinstance(key, str): return key elif isinstance(key, tuple): return os.path.join(*map(token, key)) else: return str(key) cleanup_files = list() @atexit.register def cleanup(): for fn in cleanup_files: if os.path.exists(fn): shutil.rmtree(fn) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/numpy.py0000644000076500000240000001012514455317176014522 0ustar00jamesstaff""" Store arrays We put arrays on disk as raw bytes, extending along the first dimension. Alongside each array x we ensure the value x.dtype which stores the string description of the array's dtype. 
""" from contextlib import suppress import pickle import numpy as np from toolz import valmap, identity, partial from .core import Interface from .file import File from .utils import frame, framesplit, suffix def serialize_dtype(dt): """ Serialize dtype to bytes >>> serialize_dtype(np.dtype('i4')) b'>> serialize_dtype(np.dtype('M8[us]')) b'>> parse_dtype(b'i4') dtype('int32') >>> parse_dtype(b"[('a', 'i4')]") dtype([('a', '= (0, 5, 2): unpack_kwargs = {'raw': False} else: unpack_kwargs = {'encoding': 'utf-8'} blocks = [msgpack.unpackb(f, **unpack_kwargs) for f in framesplit(bytes)] except Exception: blocks = [pickle.loads(f) for f in framesplit(bytes)] result = np.empty(sum(map(len, blocks)), dtype='O') i = 0 for block in blocks: result[i:i + len(block)] = block i += len(block) return result else: result = np.frombuffer(bytes, dtype) if copy: result = result.copy() return result compress_text = identity decompress_text = identity compress_bytes = lambda bytes, itemsize: bytes decompress_bytes = identity with suppress(ImportError): import blosc blosc.set_nthreads(1) compress_bytes = blosc.compress decompress_bytes = blosc.decompress compress_text = partial(blosc.compress, typesize=1) decompress_text = blosc.decompress with suppress(ImportError): from snappy import compress as compress_text from snappy import decompress as decompress_text def compress(bytes, dtype): if dtype == 'O': return compress_text(bytes) else: return compress_bytes(bytes, dtype.itemsize) def decompress(bytes, dtype): if dtype == 'O': return decompress_text(bytes) else: return decompress_bytes(bytes) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715024310.0 partd-1.4.2/partd/pandas.py0000644000076500000240000001622714616230666014626 0ustar00jamesstafffrom functools import partial import pickle import pandas as pd from packaging.version import Version PANDAS_GE_210 = Version(pd.__version__).release >= (2, 1, 0) PANDAS_GE_300 = Version(pd.__version__).major >= 3 
if PANDAS_GE_300: from pandas.api.internals import create_dataframe_from_blocks create_block_manager_from_blocks = None make_block = None else: create_dataframe_from_blocks = None try: from pandas.core.internals.managers import create_block_manager_from_blocks except ImportError: from pandas.core.internals import create_block_manager_from_blocks from pandas.core.internals import make_block from . import numpy as pnp from .core import Interface from .encode import Encode from .utils import extend, framesplit, frame from pandas.api.types import is_extension_array_dtype from pandas.api.extensions import ExtensionArray def is_extension_array(x): return isinstance(x, ExtensionArray) dumps = partial(pickle.dumps, protocol=pickle.HIGHEST_PROTOCOL) class PandasColumns(Interface): def __init__(self, partd=None): self.partd = pnp.Numpy(partd) Interface.__init__(self) def append(self, data, **kwargs): for k, df in data.items(): self.iset(extend(k, '.columns'), dumps(list(df.columns))) self.iset(extend(k, '.index-name'), dumps(df.index.name)) # TODO: don't use values, it does some work. 
Look at _blocks instead # pframe/cframe do this well arrays = {extend(k, col): df[col].values for k, df in data.items() for col in df.columns} arrays.update({extend(k, '.index'): df.index.values for k, df in data.items()}) # TODO: handle categoricals self.partd.append(arrays, **kwargs) def _get(self, keys, columns=None, **kwargs): if columns is None: columns = self.partd.partd.get([extend(k, '.columns') for k in keys], **kwargs) columns = list(map(pickle.loads, columns)) else: columns = [columns] * len(keys) index_names = self.partd.partd.get([extend(k, '.index-name') for k in keys], **kwargs) index_names = map(pickle.loads, index_names) keys = [[extend(k, '.index'), [extend(k, col) for col in cols]] for k, cols in zip(keys, columns)] arrays = self.partd.get(keys, **kwargs) return [pd.DataFrame(dict(zip(cols, arrs)), columns=cols, index=pd.Index(index, name=iname)) for iname, (index, arrs), cols in zip(index_names, arrays, columns)] def __getstate__(self): return {'partd': self.partd} def _iset(self, key, value): return self.partd._iset(key, value) def drop(self): return self.partd.drop() @property def lock(self): return self.partd.partd.lock def __exit__(self, *args): self.drop() self.partd.__exit__(self, *args) def __del__(self): self.partd.__del__() def index_to_header_bytes(ind): # These have special `__reduce__` methods, just use pickle if isinstance(ind, (pd.DatetimeIndex, pd.MultiIndex, pd.RangeIndex)): return None, dumps(ind) if isinstance(ind, pd.CategoricalIndex): cat = (ind.ordered, ind.categories) values = ind.codes else: cat = None values = ind.values if is_extension_array_dtype(ind): return None, dumps(ind) header = (type(ind), {k: getattr(ind, k, None) for k in ind._attributes}, values.dtype, cat) bytes = pnp.compress(pnp.serialize(values), values.dtype) return header, bytes def index_from_header_bytes(header, bytes): if header is None: return pickle.loads(bytes) typ, attr, dtype, cat = header data = pnp.deserialize(pnp.decompress(bytes, dtype), 
dtype, copy=True) if cat: data = pd.Categorical.from_codes(data, cat[1], ordered=cat[0]) return typ.__new__(typ, data=data, **attr) def block_to_header_bytes(block): values = block.values if isinstance(values, pd.Categorical): extension = ('categorical_type', (values.ordered, values.categories)) values = values.codes elif isinstance(block, pd.DatetimeTZDtype): extension = ('datetime64_tz_type', (block.values.tzinfo,)) values = values.view('i8') elif is_extension_array_dtype(block.dtype) or is_extension_array(values): extension = ("other", ()) else: extension = ('numpy_type', ()) header = (block.mgr_locs.as_array, values.dtype, values.shape, extension) if extension == ("other", ()): bytes = pickle.dumps(values) else: bytes = pnp.compress(pnp.serialize(values), values.dtype) return header, bytes def block_from_header_bytes(header, bytes, create_block: bool): placement, dtype, shape, (extension_type, extension_values) = header if extension_type == "other": values = pickle.loads(bytes) else: values = pnp.deserialize(pnp.decompress(bytes, dtype), dtype, copy=True).reshape(shape) if extension_type == 'categorical_type': values = pd.Categorical.from_codes(values, extension_values[1], ordered=extension_values[0]) elif extension_type == 'datetime64_tz_type': tz_info = extension_values[0] values = pd.DatetimeIndex(values).tz_localize('utc').tz_convert( tz_info) if create_block: return make_block(values, placement=placement) return values, placement def serialize(df): """ Serialize and compress a Pandas DataFrame Uses Pandas blocks, snappy, and blosc to deconstruct an array into bytes """ col_header, col_bytes = index_to_header_bytes(df.columns) ind_header, ind_bytes = index_to_header_bytes(df.index) headers = [col_header, ind_header] bytes = [col_bytes, ind_bytes] for block in df._mgr.blocks: h, b = block_to_header_bytes(block) headers.append(h) bytes.append(b) frames = [dumps(headers)] + bytes return b''.join(map(frame, frames)) def deserialize(bytes): """ Deserialize and 
decompress bytes back to a pandas DataFrame """ frames = list(framesplit(bytes)) headers = pickle.loads(frames[0]) bytes = frames[1:] axes = [index_from_header_bytes(headers[0], bytes[0]), index_from_header_bytes(headers[1], bytes[1])] blocks = [block_from_header_bytes(h, b, create_block=not PANDAS_GE_300) for (h, b) in zip(headers[2:], bytes[2:])] if PANDAS_GE_300: return pd.api.internals.create_dataframe_from_blocks(blocks, axes[1], axes[0]) elif PANDAS_GE_210: return pd.DataFrame._from_mgr(create_block_manager_from_blocks(blocks, axes), axes=axes) else: return pd.DataFrame(create_block_manager_from_blocks(blocks, axes)) def join(dfs): if not dfs: return pd.DataFrame() else: return pd.concat(dfs) PandasBlocks = partial(Encode, serialize, deserialize, join) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/pickle.py0000644000076500000240000000055014455317176014622 0ustar00jamesstaff""" get/put functions that consume/produce Python lists using Pickle to serialize """ import pickle from .encode import Encode from functools import partial def concat(lists): return sum(lists, []) Pickle = partial(Encode, partial(pickle.dumps, protocol=pickle.HIGHEST_PROTOCOL), pickle.loads, concat) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/python.py0000644000076500000240000000160614455317176014677 0ustar00jamesstaff""" get/put functions that consume/produce Python lists using msgpack or pickle to serialize. First we try msgpack (it's faster). If that fails then we default to pickle. 
""" import pickle try: from pandas import msgpack except ImportError: try: import msgpack except ImportError: msgpack = False from .encode import Encode from functools import partial def dumps(x): try: return msgpack.packb(x, use_bin_type=True) except: return pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL) def loads(x): try: if msgpack.version >= (0, 5, 2): unpack_kwargs = {'raw': False} else: unpack_kwargs = {'encoding': 'utf-8'} return msgpack.unpackb(x, **unpack_kwargs) except: return pickle.loads(x) def concat(lists): return sum(lists, []) Python = partial(Encode, dumps, loads, concat) ././@PaxHeader0000000000000000000000000000003200000000000010210 xustar0026 mtime=1715025058.92028 partd-1.4.2/partd/tests/0000755000076500000240000000000014616232243014131 5ustar00jamesstaff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_buffer.py0000644000076500000240000000303014455317026017013 0ustar00jamesstafffrom partd.dict import Dict from partd.file import File from partd.buffer import Buffer, keys_to_flush import pickle import shutil import os def test_partd(): a = Dict() b = Dict() with Buffer(a, b, available_memory=10) as p: p.append({'x': b'Hello', 'y': b'abc'}) assert a.get(['x', 'y']) == [b'Hello', b'abc'] p.append({'x': b'World!', 'y': b'def'}) assert a.get(['x', 'y']) == [b'', b'abcdef'] assert b.get(['x', 'y']) == [b'HelloWorld!', b''] result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] assert p.get('z') == b'' with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) def test_keys_to_flush(): lengths = {'a': 20, 'b': 10, 'c': 15, 'd': 15, 'e': 10, 'f': 25, 'g': 5} assert keys_to_flush(lengths, 0.5) == ['f', 'a'] def test_pickle(): with Dict() as a: with File() as b: c = Buffer(a, b) c.append({'x': b'123'}) d = pickle.loads(pickle.dumps(c)) assert d.get('x') == c.get('x') pickled_attrs = ('memory_usage', 'lengths', 'available_memory') for attr in 
pickled_attrs: assert hasattr(d, attr) assert getattr(d, attr) == getattr(c, attr) # special case Dict and File -- some attrs do not pickle assert hasattr(d, 'fast') assert d.fast.data == c.fast.data assert hasattr(d, 'slow') assert d.slow.path == c.slow.path ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_compressed.py0000644000076500000240000000133614455317026017715 0ustar00jamesstafffrom partd.compressed import ZLib import shutil import os import pickle def test_partd(): with ZLib() as p: p.append({'x': b'Hello', 'y': b'abc'}) p.append({'x': b'World!', 'y': b'def'}) assert os.path.exists(p.partd.filename('x')) assert os.path.exists(p.partd.filename('y')) result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] assert p.get('z') == b'' with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) assert not os.path.exists(p.partd.path) def test_pickle(): with ZLib() as p: p.append({'x': b'123'}) q = pickle.loads(pickle.dumps(p)) assert q.get('x') == b'123' ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_dict.py0000644000076500000240000000165414455317026016477 0ustar00jamesstafffrom partd.dict import Dict import shutil import os def test_partd(): with Dict() as p: p.append({'x': b'Hello', 'y': b'abc'}) p.append({'x': b'World!', 'y': b'def'}) result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] assert p.get('z') == b'' with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) def test_key_tuple(): with Dict() as p: p.append({('a', 'b'): b'123'}) assert p.get(('a', 'b')) == b'123' def test_iset(): with Dict() as p: p.iset('x', b'123') assert 'x' in p._iset_seen assert 'y' not in p._iset_seen p.iset('x', b'123') p.iset('x', b'123') assert p.get('x') == b'123' def test_delete_non_existent_key(): with Dict() as p: p.append({'x': b'123'}) p.delete(['x', 'y']) 
assert p.get(['x', 'y']) == [b'', b''] ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_encode.py0000644000076500000240000000127414455317026017007 0ustar00jamesstafffrom partd.file import File from partd.encode import Encode import zlib import shutil import os def test_partd(): with Encode(zlib.compress, zlib.decompress, b''.join) as p: p.append({'x': b'Hello', 'y': b'abc'}) p.append({'x': b'World!', 'y': b'def'}) result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] assert p.get('z') == b'' with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) def test_ensure(): with Encode(zlib.compress, zlib.decompress, b''.join) as p: p.iset('x', b'123') p.iset('x', b'123') p.iset('x', b'123') assert p.get('x') == b'123' ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_file.py0000644000076500000240000000341014455317026016463 0ustar00jamesstafffrom partd.file import File import shutil import os def test_partd(): with File() as p: p.append({'x': b'Hello', 'y': b'abc'}) p.append({'x': b'World!', 'y': b'def'}) assert os.path.exists(p.filename('x')) assert os.path.exists(p.filename('y')) result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] assert p.get('z') == b'' with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) assert not os.path.exists(p.path) def test_key_tuple(): with File() as p: p.append({('a', 'b'): b'123'}) assert os.path.exists(p.filename(('a', 'b'))) def test_iset(): with File() as p: p.iset('x', b'123') assert 'x' in p._iset_seen assert 'y' not in p._iset_seen p.iset('x', b'123') p.iset('x', b'123') assert p.get('x') == b'123' def test_nested_get(): with File() as p: p.append({'x': b'1', 'y': b'2', 'z': b'3'}) assert p.get(['x', ['y', 'z']]) == [b'1', [b'2', b'3']] def test_drop(): with File() as p: p.append({'x': b'123'}) p.iset('y', b'abc') 
assert p.get('x') == b'123' assert p.get('y') == b'abc' p.drop() assert p.get('x') == b'' assert p.get('y') == b'' p.append({'x': b'123'}) p.iset('y', b'def') assert p.get('x') == b'123' assert p.get('y') == b'def' def test_del(): f = File() assert f.path assert os.path.exists(f.path) f.__del__() assert not os.path.exists(f.path) with File('Foo') as p: p.__del__() assert os.path.exists(p.path) def test_specify_dirname(): with File(dir=os.getcwd()) as f: assert os.getcwd() in f.path ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/tests/test_numpy.py0000644000076500000240000000440314455317176016725 0ustar00jamesstaffimport pytest np = pytest.importorskip('numpy') # noqa import pickle import partd from partd.numpy import Numpy def test_numpy(): dt = np.dtype([('a', 'i4'), ('b', 'i2'), ('c', 'f8')]) with Numpy() as p: p.append({'a': np.array([10, 20, 30], dtype=dt['a']), 'b': np.array([ 1, 2, 3], dtype=dt['b']), 'c': np.array([.1, .2, .3], dtype=dt['c'])}) p.append({'a': np.array([70, 80, 90], dtype=dt['a']), 'b': np.array([ 7, 8, 9], dtype=dt['b']), 'c': np.array([.7, .8, .9], dtype=dt['c'])}) result = p.get(['a', 'c']) assert (result[0] == np.array([10, 20, 30, 70, 80, 90],dtype=dt['a'])).all() assert (result[1] == np.array([.1, .2, .3, .7, .8, .9],dtype=dt['c'])).all() with p.lock: # uh oh, possible deadlock result = p.get(['a'], lock=False) def test_nested(): with Numpy() as p: p.append({'x': np.array([1, 2, 3]), ('y', 1): np.array([4, 5, 6]), ('z', 'a', 3): np.array([.1, .2, .3])}) assert (p.get(('z', 'a', 3)) == np.array([.1, .2, .3])).all() def test_serialization(): with Numpy() as p: p.append({'x': np.array([1, 2, 3])}) q = pickle.loads(pickle.dumps(p)) assert (q.get('x') == [1, 2, 3]).all() array_of_lists = np.empty(3, dtype='O') array_of_lists[:] = [[1, 2], [3, 4], [5, 6]] @pytest.mark.parametrize('x', [np.array(['Alice', 'Bob', 'Charlie'], dtype='O'), array_of_lists]) def test_object_dtype(x): 
with Numpy() as p: p.append({'x': x}) p.append({'x': x}) assert isinstance(p.get('x'), np.ndarray) assert (p.get('x') == np.concatenate([x, x])).all() def test_datetime_types(): x = np.array(['2014-01-01T12:00:00'], dtype='M8[us]') y = np.array(['2014-01-01T12:00:00'], dtype='M8[s]') with Numpy() as p: p.append({'x': x, 'y': y}) assert p.get('x').dtype == x.dtype assert p.get('y').dtype == y.dtype def test_non_utf8_bytes(): a = np.array([b'\xc3\x28', b'\xa0\xa1', b'\xe2\x28\xa1', b'\xe2\x82\x28', b'\xf0\x28\x8c\xbc'], dtype='O') s = partd.numpy.serialize(a) assert (partd.numpy.deserialize(s, 'O') == a).all() ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/tests/test_pandas.py0000644000076500000240000001136114455317176017024 0ustar00jamesstaffimport pytest pytest.importorskip('pandas') # noqa import numpy as np import pandas as pd import pandas.testing as tm import os try: import pyarrow as pa except ImportError: pa = None from partd.pandas import PandasColumns, PandasBlocks, serialize, deserialize df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [1., 2., 3.], 'c': ['x', 'y', 'x']}, columns=['a', 'b', 'c'], index=pd.Index([1, 2, 3], name='myindex')) df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [10., 20., 30.], 'c': ['X', 'Y', 'X']}, columns=['a', 'b', 'c'], index=pd.Index([10, 20, 30], name='myindex')) def test_PandasColumns(): with PandasColumns() as p: assert os.path.exists(p.partd.partd.path) p.append({'x': df1, 'y': df2}) p.append({'x': df2, 'y': df1}) assert os.path.exists(p.partd.partd.filename('x')) assert os.path.exists(p.partd.partd.filename(('x', 'a'))) assert os.path.exists(p.partd.partd.filename(('x', '.index'))) assert os.path.exists(p.partd.partd.filename('y')) result = p.get(['y', 'x']) tm.assert_frame_equal(result[0], pd.concat([df2, df1])) tm.assert_frame_equal(result[1], pd.concat([df1, df2])) with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) assert not 
os.path.exists(p.partd.partd.path) def test_column_selection(): with PandasColumns('foo') as p: p.append({'x': df1, 'y': df2}) p.append({'x': df2, 'y': df1}) result = p.get('x', columns=['c', 'b']) tm.assert_frame_equal(result, pd.concat([df1, df2])[['c', 'b']]) def test_PandasBlocks(): with PandasBlocks() as p: assert os.path.exists(p.partd.path) p.append({'x': df1, 'y': df2}) p.append({'x': df2, 'y': df1}) assert os.path.exists(p.partd.filename('x')) assert os.path.exists(p.partd.filename('y')) result = p.get(['y', 'x']) tm.assert_frame_equal(result[0], pd.concat([df2, df1])) tm.assert_frame_equal(result[1], pd.concat([df1, df2])) with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) assert not os.path.exists(p.partd.path) @pytest.mark.parametrize('ordered', [False, True]) def test_serialize_categoricals(ordered): frame = pd.DataFrame({'x': [1, 2, 3, 4], 'y': pd.Categorical(['c', 'a', 'b', 'a'], ordered=ordered)}, index=pd.Categorical(['x', 'y', 'z', 'x'], ordered=ordered)) frame.index.name = 'foo' frame.columns.name = 'bar' for ind, df in [(0, frame), (1, frame.T)]: df2 = deserialize(serialize(df)) tm.assert_frame_equal(df, df2) def test_serialize_multi_index(): df = pd.DataFrame({'x': ['a', 'b', 'c', 'a', 'b', 'c'], 'y': [1, 2, 3, 4, 5, 6], 'z': [7., 8, 9, 10, 11, 12]}) df = df.groupby([df.x, df.y]).sum() df.index.name = 'foo' df.columns.name = 'bar' df2 = deserialize(serialize(df)) tm.assert_frame_equal(df, df2) @pytest.mark.parametrize('base', [ pd.Timestamp('1987-03-3T01:01:01+0001'), pd.Timestamp('1987-03-03 01:01:01-0600', tz='US/Central'), ]) def test_serialize(base): df = pd.DataFrame({'x': [ base + pd.Timedelta(seconds=i) for i in np.random.randint(0, 1000, size=10)], 'y': list(range(10)), 'z': pd.date_range('2017', periods=10)}) df2 = deserialize(serialize(df)) tm.assert_frame_equal(df, df2) def test_other_extension_types(): pytest.importorskip("pandas", minversion="0.25.0") a = pd.array([pd.Period("2000"), pd.Period("2001")]) df = 
pd.DataFrame({"A": a}) df2 = deserialize(serialize(df)) tm.assert_frame_equal(df, df2) @pytest.mark.parametrize("dtype", ["Int64", "Int32", "Float64", "Float32"]) def test_index_numeric_extension_types(dtype): pytest.importorskip("pandas", minversion="1.4.0") df = pd.DataFrame({"x": [1, 2, 3]}, index=[4, 5, 6]) df.index = df.index.astype(dtype) df2 = deserialize(serialize(df)) tm.assert_frame_equal(df, df2) @pytest.mark.parametrize( "dtype", [ "string[python]", pytest.param( "string[pyarrow]", marks=pytest.mark.skipif(pa is None, reason="Requires pyarrow"), ), ], ) def test_index_non_numeric_extension_types(dtype): pytest.importorskip("pandas", minversion="1.4.0") df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"]) df.index = df.index.astype(dtype) df2 = deserialize(serialize(df)) tm.assert_frame_equal(df, df2) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_partd.py0000644000076500000240000000240114455317026016655 0ustar00jamesstafffrom partd import File from partd.core import token, escape_filename, filename from partd import core import os import shutil from contextlib import contextmanager def test_partd(): path = 'tmp.partd' with File(path) as p: p.append({'x': b'Hello', 'y': b'abc'}) p.append({'x': b'World!', 'y': b'def'}) assert os.path.exists(p.filename('x')) assert os.path.exists(p.filename('y')) result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] assert p.get('z') == b'' with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) assert not os.path.exists(path) def test_key_tuple(): with File('foo') as p: p.append({('a', 'b'): b'123'}) assert os.path.exists(os.path.join(p.path, 'a', 'b')) def test_ensure(): with File('foo') as p: p.iset('x', b'123') p.iset('x', b'123') p.iset('x', b'123') assert p.get('x') == b'123' def test_filenames(): assert token('hello') == 'hello' assert token(('hello', 'world')) == os.path.join('hello', 'world') 
assert escape_filename(os.path.join('a', 'b')) == os.path.join('a', 'b') assert filename('dir', ('a', 'b')) == os.path.join('dir', 'a', 'b') ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_pickle.py0000644000076500000240000000137214455317026017020 0ustar00jamesstafffrom partd.pickle import Pickle import os import shutil def test_pickle(): with Pickle() as p: p.append({'x': ['Hello', 'World!'], 'y': [1, 2, 3]}) p.append({'x': ['Alice', 'Bob!'], 'y': [4, 5, 6]}) assert os.path.exists(p.partd.filename('x')) assert os.path.exists(p.partd.filename('y')) result = p.get(['y', 'x']) assert result == [[1, 2, 3, 4, 5, 6], ['Hello', 'World!', 'Alice', 'Bob!']] with p.lock: # uh oh, possible deadlock result = p.get(['x'], lock=False) assert not os.path.exists(p.partd.path) def test_ensure(): with Pickle() as p: p.iset('x', [1, 2, 3]) p.iset('x', [1, 2, 3]) assert p.get('x') == [1, 2, 3] ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_python.py0000644000076500000240000000037014455317026017067 0ustar00jamesstafffrom partd.python import dumps, loads import os import shutil from math import sin def test_pack_unpack(): data = [1, 2, b'Hello', 'Hello'] assert loads(dumps(data)) == data data = [1, 2, sin] assert loads(dumps(data)) == data ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624086.0 partd-1.4.2/partd/tests/test_utils.py0000644000076500000240000000040214455317026016702 0ustar00jamesstafffrom partd.utils import frame, framesplit import struct def test_frame(): assert frame(b'Hello') == struct.pack('Q', 5) + b'Hello' def test_framesplit(): L = [b'Hello', b'World!', b'123'] assert list(framesplit(b''.join(map(frame, L)))) == L ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 
partd-1.4.2/partd/tests/test_zmq.py0000644000076500000240000000640714455317176016372 0ustar00jamesstaffimport pytest pytest.importorskip('zmq') from partd.zmq import Server, keys_to_flush, File, Client from partd import core, Dict from threading import Thread from time import sleep from contextlib import contextmanager import pickle import os import shutil def test_server(): s = Server() try: s.start() s.append({'x': b'abc', 'y': b'1234'}) s.append({'x': b'def', 'y': b'5678'}) assert s.get(['x']) == [b'abcdef'] assert s.get(['x', 'y']) == [b'abcdef', b'12345678'] assert s.get(['x']) == [b'abcdef'] finally: s.close() def dont_test_flow_control(): path = 'bar' if os.path.exists('bar'): shutil.rmtree('bar') s = Server('bar', available_memory=1, n_outstanding_writes=3, start=False) p = Client(s.address) try: listen_thread = Thread(target=s.listen) listen_thread.start() """ Don't start these threads self._write_to_disk_thread = Thread(target=self._write_to_disk) self._write_to_disk_thread.start() self._free_frozen_sockets_thread = Thread(target=self._free_frozen_sockets) self._free_frozen_sockets_thread.start() """ p.append({'x': b'12345'}) sleep(0.1) assert s._out_disk_buffer.qsize() == 1 p.append({'x': b'12345'}) p.append({'x': b'12345'}) sleep(0.1) assert s._out_disk_buffer.qsize() == 3 held_append = Thread(target=p.append, args=({'x': b'123'},)) held_append.start() sleep(0.1) assert held_append.is_alive() # held! 
assert not s._frozen_sockets.empty() write_to_disk_thread = Thread(target=s._write_to_disk) write_to_disk_thread.start() free_frozen_sockets_thread = Thread(target=s._free_frozen_sockets) free_frozen_sockets_thread.start() sleep(0.2) assert not held_append.is_alive() assert s._frozen_sockets.empty() finally: s.close() @contextmanager def partd_server(**kwargs): with Server(**kwargs) as server: with Client(server.address) as p: yield (p, server) def test_partd_object(): with partd_server() as (p, server): p.append({'x': b'Hello', 'y': b'abc'}) p.append({'x': b'World!', 'y': b'def'}) result = p.get(['y', 'x']) assert result == [b'abcdef', b'HelloWorld!'] def test_delete(): with partd_server() as (p, server): p.append({'x': b'Hello'}) assert p.get('x') == b'Hello' p.delete(['x']) assert p.get('x') == b'' def test_iset(): with partd_server() as (p, server): p.iset('x', b'111') p.iset('x', b'111') assert p.get('x') == b'111' def test_tuple_keys(): with partd_server() as (p, server): p.append({('x', 'y'): b'123'}) assert p.get(('x', 'y')) == b'123' def test_serialization(): with partd_server() as (p, server): p.append({'x': b'123'}) q = pickle.loads(pickle.dumps(p)) assert q.get('x') == b'123' def test_drop(): with partd_server() as (p, server): p.append({'x': b'123'}) p.drop() assert p.get('x') == b'' def dont_test_server_autocreation(): with Client() as p: p.append({'x': b'123'}) assert p.get('x') == b'123' ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/utils.py0000644000076500000240000000702314455317176014515 0ustar00jamesstafffrom contextlib import contextmanager import os import shutil import tempfile import struct def raises(exc, lamda): try: lamda() return False except exc: return True @contextmanager def tmpfile(extension=''): extension = '.' 
+ extension.lstrip('.') handle, filename = tempfile.mkstemp(extension) os.close(handle) os.remove(filename) try: yield filename finally: if os.path.exists(filename): if os.path.isdir(filename): shutil.rmtree(filename) else: os.remove(filename) def frame(bytes): """ Pack the length of the bytes in front of the bytes TODO: This does a full copy. This should maybe be inlined somehow wherever this gets used instead. My laptop shows a data bandwidth of 2GB/s """ return struct.pack('Q', len(bytes)) + bytes def framesplit(bytes): """ Split buffer into frames of concatenated chunks >>> data = frame(b'Hello') + frame(b'World') >>> list(framesplit(data)) # doctest: +SKIP [b'Hello', b'World'] """ i = 0; n = len(bytes) chunks = list() while i < n: nbytes = struct.unpack('Q', bytes[i:i+8])[0] i += 8 yield bytes[i: i + nbytes] i += nbytes def partition_all(n, bytes): """ Partition bytes into evenly sized blocks The final block holds the remainder and so may not be of equal size >>> list(partition_all(2, b'Hello')) [b'He', b'll', b'o'] See Also: toolz.partition_all """ if len(bytes) < n: # zero copy fast common case yield bytes else: for i in range(0, len(bytes), n): yield bytes[i: i+n] def nested_get(ind, coll, lazy=False): """ Get nested index from collection Examples -------- >>> nested_get(1, 'abc') 'b' >>> nested_get([1, 0], 'abc') ['b', 'a'] >>> nested_get([[1, 0], [0, 1]], 'abc') [['b', 'a'], ['a', 'b']] """ if isinstance(ind, list): if lazy: return (nested_get(i, coll, lazy=lazy) for i in ind) else: return [nested_get(i, coll, lazy=lazy) for i in ind] else: return coll[ind] def flatten(seq): """ >>> list(flatten([1])) [1] >>> list(flatten([[1, 2], [1, 2]])) [1, 2, 1, 2] >>> list(flatten([[[1], [2]], [[1], [2]]])) [1, 2, 1, 2] >>> list(flatten(((1, 2), (1, 2)))) # Don't flatten tuples [(1, 2), (1, 2)] >>> list(flatten((1, 2, [3, 4]))) # support heterogeneous [1, 2, 3, 4] """ for item in seq: if isinstance(item, list): yield from flatten(item) else: yield item def 
suffix(key, term): """ Append a suffix to a key Works if the key is a string or a tuple >>> suffix('x', '.dtype') 'x.dtype' >>> suffix(('a', 'b', 'c'), '.dtype') ('a', 'b', 'c.dtype') """ if isinstance(key, str): return key + term elif isinstance(key, tuple): return key[:-1] + (suffix(key[-1], term),) else: return suffix(str(key), term) def extend(key, term): """ Extend a key with another element in a tuple Works if the key is a string or a tuple >>> extend('x', '.dtype') ('x', '.dtype') >>> extend(('a', 'b', 'c'), '.dtype') ('a', 'b', 'c', '.dtype') """ if isinstance(term, tuple): pass elif isinstance(term, str): term = (term,) else: term = (str(term),) if not isinstance(key, tuple): key = (key,) return key + term ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1689624190.0 partd-1.4.2/partd/zmq.py0000644000076500000240000002255214455317176014170 0ustar00jamesstaffimport zmq import logging from itertools import chain from bisect import bisect import socket from operator import add from time import sleep, time from toolz import accumulate, topk, pluck, merge, keymap import uuid from collections import defaultdict from contextlib import contextmanager, suppress from threading import Thread, Lock from datetime import datetime from multiprocessing import Process import traceback import sys from .dict import Dict from .file import File from .buffer import Buffer from . 
import core tuple_sep = b'-|-' logger = logging.getLogger(__name__) @contextmanager def logerrors(): try: yield except Exception as e: logger.exception(e) raise class Server: def __init__(self, partd=None, bind=None, start=True, block=False, hostname=None): self.context = zmq.Context() if partd is None: partd = Buffer(Dict(), File()) self.partd = partd self.socket = self.context.socket(zmq.ROUTER) if hostname is None: hostname = socket.gethostname() if isinstance(bind, str): bind = bind.encode() if bind is None: port = self.socket.bind_to_random_port('tcp://*') else: self.socket.bind(bind) port = int(bind.split(b':')[-1].rstrip(b'/')) self.address = ('tcp://%s:%d' % (hostname, port)).encode() self.status = 'created' self.partd.lock.acquire() self._lock = Lock() self._socket_lock = Lock() if start: self.start() if block: self.block() def start(self): if self.status != 'run': self.status = 'run' self._listen_thread = Thread(target=self.listen) self._listen_thread.start() logger.debug('Start server at %s', self.address) def block(self): """ Block until all threads close """ try: self._listen_thread.join() except AttributeError: pass def listen(self): with logerrors(): logger.debug('Start listening %s', self.address) while self.status != 'closed': if not self.socket.poll(100): continue with self._socket_lock: payload = self.socket.recv_multipart() address, command, payload = payload[0], payload[1], payload[2:] logger.debug('Server receives %s %s', address, command) if command == b'close': logger.debug('Server closes') self.ack(address) self.status = 'closed' break # self.close() elif command == b'append': keys, values = payload[::2], payload[1::2] keys = list(map(deserialize_key, keys)) data = dict(zip(keys, values)) self.partd.append(data, lock=False) logger.debug('Server appends %d keys', len(data)) self.ack(address) elif command == b'iset': key, value = payload key = deserialize_key(key) self.partd.iset(key, value, lock=False) self.ack(address) elif command == b'get': 
keys = list(map(deserialize_key, payload)) logger.debug('get %s', keys) result = self.get(keys) self.send_to_client(address, result) self.ack(address, flow_control=False) elif command == b'delete': keys = list(map(deserialize_key, payload)) logger.debug('delete %s', keys) self.partd.delete(keys, lock=False) self.ack(address, flow_control=False) elif command == b'syn': self.ack(address) elif command == b'drop': self.drop() self.ack(address) else: logger.debug("Unknown command: %s", command) raise ValueError("Unknown command: %r" % (command,)) def send_to_client(self, address, result): with logerrors(): if not isinstance(result, list): result = [result] with self._socket_lock: self.socket.send_multipart([address] + result) def ack(self, address, flow_control=True): with logerrors(): logger.debug('Server sends ack') self.send_to_client(address, b'ack') def append(self, data): self.partd.append(data, lock=False) logger.debug('Server appends %d keys', len(data)) def drop(self): with logerrors(): self.partd.drop() def get(self, keys): with logerrors(): logger.debug('Server gets keys: %s', keys) with self._lock: result = self.partd.get(keys, lock=False) return result def close(self): logger.debug('Server closes') self.status = 'closed' self.block() with suppress(zmq.error.ZMQError): self.socket.close(1) with suppress(zmq.error.ZMQError): self.context.destroy(3) self.partd.lock.release() def __enter__(self): self.start() return self def __exit__(self, *args): self.close() self.partd.__exit__(*args) def keys_to_flush(lengths, fraction=0.1, maxcount=100000): """ Which keys to remove >>> lengths = {'a': 20, 'b': 10, 'c': 15, 'd': 15, ... 
'e': 10, 'f': 25, 'g': 5} >>> keys_to_flush(lengths, 0.5) ['f', 'a'] """ top = topk(max(len(lengths) // 2, 1), lengths.items(), key=1) total = sum(lengths.values()) cutoff = min(maxcount, max(1, bisect(list(accumulate(add, pluck(1, top))), total * fraction))) result = [k for k, v in top[:cutoff]] assert result return result def serialize_key(key): """ >>> serialize_key('x') b'x' >>> serialize_key(('a', 'b', 1)) b'a-|-b-|-1' """ if isinstance(key, tuple): return tuple_sep.join(map(serialize_key, key)) if isinstance(key, bytes): return key if isinstance(key, str): return key.encode() return str(key).encode() def deserialize_key(text): """ >>> deserialize_key(b'x') b'x' >>> deserialize_key(b'a-|-b-|-1') (b'a', b'b', b'1') """ if tuple_sep in text: return tuple(text.split(tuple_sep)) else: return text from .core import Interface from .file import File class Client(Interface): def __init__(self, address=None, create_server=False, **kwargs): self.address = address self.context = zmq.Context() self.socket = self.context.socket(zmq.DEALER) logger.debug('Client connects to %s', address) self.socket.connect(address) self.send(b'syn', [], ack_required=False) self.lock = NotALock() # Server sequentializes everything Interface.__init__(self) def __getstate__(self): return {'address': self.address} def __setstate__(self, state): self.__init__(state['address']) logger.debug('Reconstruct client from pickled state') def send(self, command, payload, recv=False, ack_required=True): if ack_required: ack = self.socket.recv_multipart() assert ack == [b'ack'] logger.debug('Client sends command: %s', command) self.socket.send_multipart([command] + payload) if recv: result = self.socket.recv_multipart() else: result = None return result def _get(self, keys, lock=None): """ Lock argument is ignored. 
Everything is sequential (I think) """ logger.debug('Client gets %s %s', self.address, keys) keys = list(map(serialize_key, keys)) return self.send(b'get', keys, recv=True) def append(self, data, lock=None): logger.debug('Client appends %s %s', self.address, str(len(data)) + ' keys') data = keymap(serialize_key, data) payload = list(chain.from_iterable(data.items())) self.send(b'append', payload) def _delete(self, keys, lock=None): logger.debug('Client deletes %s %s', self.address, str(len(keys)) + ' keys') keys = list(map(serialize_key, keys)) self.send(b'delete', keys) def _iset(self, key, value): self.send(b'iset', [serialize_key(key), value]) def drop(self): self.send(b'drop', []) sleep(0.05) def close_server(self): self.send(b'close', []) def close(self): if hasattr(self, 'server_process'): with suppress(zmq.error.ZMQError): self.close_server() self.server_process.join() with suppress(zmq.error.ZMQError): self.socket.close(1) with suppress(zmq.error.ZMQError): self.context.destroy(1) def __exit__(self, type, value, traceback): self.drop() self.close() def __del__(self): self.close() class NotALock: def acquire(self): pass def release(self): pass def __enter__(self): return self def __exit__(self, *args): pass ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1715025058.9204898 partd-1.4.2/partd.egg-info/0000755000076500000240000000000014616232243014461 5ustar00jamesstaff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715025058.0 partd-1.4.2/partd.egg-info/PKG-INFO0000644000076500000240000001101414616232242015552 0ustar00jamesstaffMetadata-Version: 2.1 Name: partd Version: 1.4.2 Summary: Appendable key-value storage Maintainer-email: Matthew Rocklin License: BSD Project-URL: Homepage, http://github.com/dask/partd/ Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: 
Python :: 3.11 Classifier: Programming Language :: Python :: 3.12 Requires-Python: >=3.9 Description-Content-Type: text/x-rst License-File: LICENSE.txt Requires-Dist: locket Requires-Dist: toolz Provides-Extra: complete Requires-Dist: numpy>=1.20.0; extra == "complete" Requires-Dist: pandas>=1.3; extra == "complete" Requires-Dist: pyzmq; extra == "complete" Requires-Dist: blosc; extra == "complete" PartD ===== |Build Status| |Version Status| Key-value byte store with appendable values Partd stores key-value pairs. Values are raw bytes. New values are appended to existing ones. Partd excels at shuffling operations. Operations ---------- PartD has two main operations, ``append`` and ``get``. Example ------- 1. Create a Partd backed by a directory:: >>> import partd >>> p = partd.File('/path/to/new/dataset/') 2. Append key-byte pairs to the dataset:: >>> p.append({'x': b'Hello ', 'y': b'123'}) >>> p.append({'x': b'world!', 'y': b'456'}) 3. Get bytes associated with keys:: >>> p.get('x') # One key b'Hello world!' >>> p.get(['y', 'x']) # List of keys [b'123456', b'Hello world!'] 4. Destroy the partd dataset:: >>> p.drop() That's it. Implementations --------------- We can back a partd by an in-memory dictionary:: >>> p = Dict() For larger amounts of data or to share data between processes we back a partd by a directory of files. This uses file-based locks for consistency:: >>> p = File('/path/to/dataset/') However, this can fail for many small writes. In these cases you may wish to buffer one partd with another, keeping a fixed maximum of data in the buffering partd. This writes the larger elements of the first partd to the second partd when space runs low:: >>> p = Buffer(Dict(), File(), available_memory=2e9) # 2GB memory buffer You might also want to have many distributed processes write to a single partd consistently. 
This can be done with a server: * Server Process:: >>> p = Buffer(Dict(), File(), available_memory=2e9) # 2GB memory buffer >>> s = Server(p, address='ipc://server') * Worker processes:: >>> p = Client('ipc://server') # Client machine talks to remote server Encodings and Compression ------------------------- Once we can robustly and efficiently append bytes to a partd we consider compression and encodings. This is generally available with the ``Encode`` partd, which accepts three functions: one to apply to bytes as they are written, one to apply to bytes as they are read, and one to join bytestreams. Common configurations already exist for common data and compression formats. We may wish to compress and decompress data transparently as we interact with a partd. Objects like ``BZ2``, ``Blosc``, ``ZLib`` and ``Snappy`` exist and take another partd as an argument:: >>> p = File(...) >>> p = ZLib(p) These work exactly as before; the (de)compression happens automatically. Common data formats like Python lists, numpy arrays, and pandas dataframes are also supported out of the box:: >>> p = File(...) >>> p = Numpy(p) >>> p.append({'x': np.array([...])}) This lets us forget about bytes and think instead in our normal data types. Composition ----------- In principle we want to compose all of these choices together: 1. Write policy: ``Dict``, ``File``, ``Buffer``, ``Client`` 2. Encoding: ``Pickle``, ``Numpy``, ``Pandas``, ... 3. Compression: ``Blosc``, ``Snappy``, ... Partd objects compose by nesting. Here we make a partd that writes pickle-encoded, BZ2-compressed bytes directly to disk:: >>> p = Pickle(BZ2(File('foo'))) We could construct more complex systems that include compression, serialization, buffering, and remote access:: >>> server = Server(Buffer(Dict(), File(), available_memory=2e0)) >>> client = Pickle(Snappy(Client(server.address))) >>> client.append({'x': [1, 2, 3]}) .. 
|Build Status| image:: https://github.com/dask/partd/workflows/CI/badge.svg :target: https://github.com/dask/partd/actions?query=workflow%3ACI .. |Version Status| image:: https://img.shields.io/pypi/v/partd.svg :target: https://pypi.python.org/pypi/partd/ ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715025058.0 partd-1.4.2/partd.egg-info/SOURCES.txt0000644000076500000240000000140114616232242016340 0ustar00jamesstaffLICENSE.txt MANIFEST.in README.rst pyproject.toml setup.py partd/__init__.py partd/_version.py partd/buffer.py partd/compressed.py partd/core.py partd/dict.py partd/encode.py partd/file.py partd/numpy.py partd/pandas.py partd/pickle.py partd/python.py partd/utils.py partd/zmq.py partd.egg-info/PKG-INFO partd.egg-info/SOURCES.txt partd.egg-info/dependency_links.txt partd.egg-info/not-zip-safe partd.egg-info/requires.txt partd.egg-info/top_level.txt partd/tests/test_buffer.py partd/tests/test_compressed.py partd/tests/test_dict.py partd/tests/test_encode.py partd/tests/test_file.py partd/tests/test_numpy.py partd/tests/test_pandas.py partd/tests/test_partd.py partd/tests/test_pickle.py partd/tests/test_python.py partd/tests/test_utils.py partd/tests/test_zmq.py././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715025058.0 partd-1.4.2/partd.egg-info/dependency_links.txt0000644000076500000240000000000114616232242020526 0ustar00jamesstaff ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715025058.0 partd-1.4.2/partd.egg-info/not-zip-safe0000644000076500000240000000000114616232242016706 0ustar00jamesstaff ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715025058.0 partd-1.4.2/partd.egg-info/requires.txt0000644000076500000240000000007714616232242017064 0ustar00jamesstafflocket toolz [complete] numpy>=1.20.0 pandas>=1.3 pyzmq blosc ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715025058.0 
partd-1.4.2/partd.egg-info/top_level.txt0000644000076500000240000000000614616232242017206 0ustar00jamesstaff
partd
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715024310.0
partd-1.4.2/pyproject.toml0000644000076500000240000000223714616230666014604 0ustar00jamesstaff
[build-system]
requires = ["setuptools>=61.2", "versioneer[toml]==0.29"]
build-backend = "setuptools.build_meta"

[project]
name = "partd"
description = "Appendable key-value storage"
maintainers = [{name = "Matthew Rocklin", email = "mrocklin@gmail.com"}]
license = {text = "BSD"}
keywords = []
classifiers = [
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
]
readme = "README.rst"
urls = {Homepage = "http://github.com/dask/partd/"}
requires-python = ">=3.9"
dynamic = ["version"]
dependencies = [
    "locket",
    "toolz",
]

[project.optional-dependencies]
complete = [
    "numpy >= 1.20.0",
    "pandas >=1.3",
    "pyzmq",
    "blosc",
]

[tool.setuptools]
packages = ["partd"]
zip-safe = false
include-package-data = false

[tool.versioneer]
VCS = "git"
style = "pep440"
versionfile_source = "partd/_version.py"
versionfile_build = "partd/_version.py"
tag_prefix = ""
parentdir_prefix = "partd-"

[tool.pytest.ini_options]
addopts = "--strict-markers --strict-config"
filterwarnings = ["error"]
././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1715025058.9214165
partd-1.4.2/setup.cfg0000644000076500000240000000004614616232243013476 0ustar00jamesstaff
[egg_info]
tag_build =
tag_date = 0
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1715022964.0
partd-1.4.2/setup.py0000755000076500000240000000030214616226164013372 0ustar00jamesstaff
#!/usr/bin/env python

from __future__ import annotations

import versioneer
from setuptools import setup

setup(
    version=versioneer.get_version(),
    cmdclass=versioneer.get_cmdclass(),
)