cachey-0.2.1/LICENSE.txt

Copyright (c) 2015, Continuum Analytics, Inc. and contributors
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.

Neither the name of Continuum Analytics nor the names of any contributors may
be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
cachey-0.2.1/MANIFEST.in

recursive-include cachey *.py

include setup.py
include README.md
include LICENSE.txt
include MANIFEST.in
include requirements.txt

cachey-0.2.1/PKG-INFO

Metadata-Version: 2.1
Name: cachey
Version: 0.2.1
Summary: Caching mindful of computation/storage costs
Home-page: http://github.com/dask/cachey/
Maintainer: Matthew Rocklin
Maintainer-email: mrocklin@gmail.com
License: BSD
Description: Caching for Analytic Computations
        ---------------------------------

        Humans repeat stuff. Caching helps.

        Normal caching policies like LRU aren't well suited for analytic
        computations where both the cost of recomputation and the cost of
        storage routinely vary by one million or more. Consider the following
        computations

        ```python
        # Want this
        np.std(x)        # tiny result, costly to recompute

        # Don't want this
        np.transpose(x)  # huge result, cheap to recompute
        ```

        Cachey tries to hold on to values that have the following
        characteristics

        1.  Expensive to recompute (in seconds)
        2.  Cheap to store (in bytes)
        3.  Frequently used
        4.  Recently used

        It accomplishes this by adding the following to each item's score on
        each access

            score += compute_time / num_bytes * (1 + eps) ** tick_time

        for some small value of epsilon, which determines the memory
        half-life. This score has units of inverse bandwidth, decays old
        results exponentially, and amplifies repeated results roughly
        linearly.

        Example
        -------

        ```python
        >>> from cachey import Cache
        >>> c = Cache(1e9, 1)  # 1 GB, cut off anything with cost 1 or less

        >>> c.put('x', 'some value', cost=3)
        >>> c.put('y', 'other value', cost=2)

        >>> c.get('x')
        'some value'
        ```

        This also has a `memoize` method

        ```python
        >>> memo_f = c.memoize(f)
        ```

        Status
        ------

        Cachey is new and not robust.
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.6
Description-Content-Type: text/markdown

cachey-0.2.1/README.md

Caching for Analytic Computations
---------------------------------

Humans repeat stuff. Caching helps.

Normal caching policies like LRU aren't well suited for analytic computations
where both the cost of recomputation and the cost of storage routinely vary by
one million or more. Consider the following computations

```python
# Want this
np.std(x)        # tiny result, costly to recompute

# Don't want this
np.transpose(x)  # huge result, cheap to recompute
```

Cachey tries to hold on to values that have the following characteristics

1.  Expensive to recompute (in seconds)
2.  Cheap to store (in bytes)
3.  Frequently used
4.  Recently used

It accomplishes this by adding the following to each item's score on each
access

    score += compute_time / num_bytes * (1 + eps) ** tick_time

for some small value of epsilon, which determines the memory half-life. This
score has units of inverse bandwidth, decays old results exponentially, and
amplifies repeated results roughly linearly.

Example
-------

```python
>>> from cachey import Cache
>>> c = Cache(1e9, 1)  # 1 GB, cut off anything with cost 1 or less

>>> c.put('x', 'some value', cost=3)
>>> c.put('y', 'other value', cost=2)

>>> c.get('x')
'some value'
```

This also has a `memoize` method

```python
>>> memo_f = c.memoize(f)
```

Status
------

Cachey is new and not robust.
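The scoring rule in the README can be sketched with a toy scorer. This is an illustrative approximation written for this note, not cachey's actual `Scorer` class (which amortizes the `(1 + eps) ** tick` factor incrementally rather than recomputing it); names like `make_scorer` are invented here.

```python
from math import log

def make_scorer(halflife=1000):
    # eps sets the decay rate: after `halflife` ticks a new access carries
    # roughly twice the weight of an old one, so stale entries fade away.
    eps = log(2) / halflife
    scores = {}
    tick = [0]

    def touch(key, compute_time, num_bytes):
        # score += compute_time / num_bytes * (1 + eps) ** tick_time
        tick[0] += 1
        bump = compute_time / num_bytes * (1 + eps) ** tick[0]
        scores[key] = scores.get(key, 0.0) + bump
        return scores[key]

    return touch

touch = make_scorer(halflife=10)

# A tiny result that was slow to compute beats a huge, cheap-to-recompute one.
small_slow = touch('std', compute_time=1.0, num_bytes=8)
big_cheap = touch('transpose', compute_time=0.001, num_bytes=8_000_000)
assert small_slow > big_cheap

# Repeated access keeps raising a key's score.
assert touch('std', compute_time=1.0, num_bytes=8) > small_slow
```

The returned scores are what a cache would compare when deciding which entry to evict first: the lowest-scoring key goes.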
cachey-0.2.1/cachey/__init__.py

from .score import Scorer
from .cache import Cache
from .nbytes import nbytes

__version__ = '0.2.1'

cachey-0.2.1/cachey/cache.py

import time

from heapdict import heapdict

from .nbytes import nbytes
from .score import Scorer


def cost(nbytes, time):
    return float(time) / nbytes / 1e9


def memo_key(args, kwargs):
    result = (args, frozenset(list(kwargs.items())))
    try:
        hash(result)
    except TypeError:  # fall back for unhashable arguments
        result = tuple(map(id, args)), str(kwargs)
    return result


class Cache(object):
    """ A cache that prefers long-running, cheap-to-store computations

    This cache prefers computations that have the following properties:

    1.  Costly to compute (seconds)
    2.  Cheap to store (bytes)
    3.  Frequently used
    4.  Recently used

    Parameters
    ----------
    available_bytes: int
        The number of bytes of data to keep in the cache
    limit: float
        The minimum cost a computation must have to be considered for the
        cache
    scorer: Scorer, optional (alternative to halflife)
        A Scorer object (see cachey/score.py)
    halflife: int, optional (alternative to scorer)
        The halflife, in number of touches, of the score of a piece of data
    nbytes: function (defaults to nbytes from cachey/nbytes.py)
        Function to compute the number of bytes of an input
    cost: function (defaults to cost())
        Determine cost from nbytes and time
    cache_data: MutableMapping (defaults to dict())
        Dict-like object to use for the cache

    Example
    -------

    >>> from cachey import Cache
    >>> c = Cache(1e9, 10)  # 1GB of space, costs must be 10 or higher

    >>> c.put('x', 1, cost=50)
    >>> c.get('x')
    1

    >>> def inc(x):
    ...     return x + 1

    >>> memo_inc = c.memoize(inc)  # Memoize functions
    """
    def __init__(self, available_bytes, limit=0, scorer=None, halflife=1000,
                 nbytes=nbytes, cost=cost, hit=None, miss=None,
                 cache_data=None):
        if scorer is None:
            scorer = Scorer(halflife)
        self.scorer = scorer

        self.available_bytes = available_bytes
        self.limit = limit
        self.get_nbytes = nbytes
        self.cost = cost
        self.hit = hit
        self.miss = miss

        self.data = cache_data if cache_data is not None else dict()
        self.heap = heapdict()
        self.nbytes = dict()
        self.total_bytes = 0

    def put(self, key, value, cost, nbytes=None):
        """ Put key-value data into the cache with an associated cost

        >>> c = Cache(1e9, 10)
        >>> c.put('x', 10, cost=50)
        >>> c.get('x')
        10
        """
        if nbytes is None:
            nbytes = self.get_nbytes(value)

        if cost >= self.limit and nbytes < self.available_bytes:
            score = self.scorer.touch(key, cost)

            if (nbytes + self.total_bytes < self.available_bytes
                    or not self.heap
                    or score > self.heap.peekitem()[1]):
                self.data[key] = value
                self.heap[key] = score
                self.nbytes[key] = nbytes
                self.total_bytes += nbytes
                self.shrink()

    def get(self, key, default=None):
        """ Get the value associated with key, or default if not present

        >>> c = Cache(1e9, 10)
        >>> c.put('x', 10, cost=50)
        >>> c.get('x')
        10
        """
        score = self.scorer.touch(key)
        if key in self.data:
            value = self.data[key]
            if self.hit is not None:
                self.hit(key, value)
            self.heap[key] = score
            return value
        else:
            if self.miss is not None:
                self.miss(key)
            return default

    def retire(self, key):
        """ Retire/remove a key from the cache

        See Also:
            shrink
        """
        self.data.pop(key)
        self.total_bytes -= self.nbytes.pop(key)
        if key in self.heap:  # already removed when called via _shrink_one
            del self.heap[key]

    def _shrink_one(self):
        try:
            key, score = self.heap.popitem()
        except IndexError:  # heap is empty, nothing to evict
            return
        self.retire(key)

    def resize(self, available_bytes):
        """ Resize the cache

        Fits the cache into available_bytes by calling `shrink()`.
        """
        self.available_bytes = available_bytes
        self.shrink()

    def shrink(self):
        """ Retire keys from the cache until we're under the bytes budget

        See Also:
            retire
        """
        while self.total_bytes > self.available_bytes:
            self._shrink_one()

    def __contains__(self, key):
        return key in self.data

    def clear(self):
        while self.data:
            self._shrink_one()

    def __bool__(self):
        return bool(self.data)

    def memoize(self, func, key=memo_key):
        """ Create a cached function

        >>> def inc(x):
        ...     return x + 1

        >>> c = Cache(1e9)
        >>> memo_inc = c.memoize(inc)
        >>> memo_inc(1)  # computes first time
        2
        >>> memo_inc(1)  # uses cached result (if computation has a high score)
        2
        """
        def cached_func(*args, **kwargs):
            k = (func, key(args, kwargs))
            result = self.get(k)
            if result is None:
                start = time.time()
                result = func(*args, **kwargs)
                end = time.time()
                # use the configured nbytes/cost functions rather than the
                # module-level defaults
                nb = self.get_nbytes(result)
                self.put(k, result, self.cost(nb, end - start), nbytes=nb)
            return result
        return cached_func

cachey-0.2.1/cachey/nbytes.py

import sys


def _array(x):
    if x.dtype == 'O':
        return sys.getsizeof('0' * 100) * x.size
    elif str(x.dtype) == 'category':
        return _array(x.codes) + _array(x.categories)
    else:
        return x.nbytes


def nbytes(o):
    """ Number of bytes of an object

    >>> nbytes(123)  # doctest: +SKIP
    24

    >>> nbytes('Hello, world!')  # doctest: +SKIP
    50

    >>> import numpy as np
    >>> nbytes(np.ones(1000, dtype='i4'))
    4000
    """
    name = type(o).__module__ + '.' + type(o).__name__
    if name == 'pandas.core.series.Series':
        return _array(o._data.blocks[0].values) + _array(o.index._data)
    elif name == 'pandas.core.frame.DataFrame':
        return _array(o.index) + sum([_array(blk.values)
                                      for blk in o._data.blocks])
    elif name == 'numpy.ndarray':
        return _array(o)
    elif hasattr(o, 'nbytes'):
        return o.nbytes
    else:
        return sys.getsizeof(o)

cachey-0.2.1/cachey/score.py

from collections import defaultdict
from math import log


class Scorer(object):
    """ Object to track scores of cache entries

    Prefers computations that have the following properties:

    1.  Costly to compute (seconds)
    2.  Cheap to store (bytes)
    3.  Frequently used
    4.  Recently used

    This object tracks both the stated costs of keys and a separate score
    related to how frequently/recently they have been accessed.  It uses
    these to provide a score for the key used by the ``Cache`` object, which
    is the main usable object.

    Example
    -------

    >>> s = Scorer(halflife=10)
    >>> s.touch('x', cost=2)  # score is similar to cost
    2
    >>> s.touch('x')  # scores increase on every touch
    4.138629436111989
    """
    def __init__(self, halflife):
        self.cost = dict()
        self.time = defaultdict(lambda: 0)

        self._base_multiplier = 1 + log(2) / float(halflife)
        self.tick = 1
        self._base = 1

    def touch(self, key, cost=None):
        """ Update the score for key

        Provide a cost the first time and optionally thereafter.
""" time = self._base if cost is not None: self.cost[key] = cost self.time[key] += self._base time = self.time[key] else: try: cost = self.cost[key] self.time[key] += self._base time = self.time[key] except KeyError: return self._base *= self._base_multiplier return cost * time cachey-0.2.1/cachey/tests/0000755000076600000240000000000013632202326017037 5ustar taugspurgerstaff00000000000000cachey-0.2.1/cachey/tests/test_cache.py0000644000076600000240000000443113632201651021515 0ustar taugspurgerstaff00000000000000import sys from time import sleep from cachey import Cache, Scorer, nbytes def test_cache(): c = Cache(available_bytes=nbytes(1) * 3) c.put('x', 1, 10) assert c.get('x') == 1 assert 'x' in c c.put('a', 1, 10) c.put('b', 1, 10) c.put('c', 1, 10) assert set(c.data) == set('xbc') c.put('d', 1, 10) assert set(c.data) == set('xcd') c.clear() assert 'x' not in c assert not c.data assert not c.heap def test_cache_data_dict(): my_dict = {} c = Cache(available_bytes=nbytes(1) * 3, cache_data=my_dict) c.put('x', 1, 10) assert c.get('x') == 1 assert my_dict['x'] == 1 c.clear() assert 'x' not in c def test_cache_resize(): c = Cache(available_bytes=nbytes(1) * 3) c.put('x', 1, 10) assert c.get('x') == 1 assert 'x' in c c.put('a', 1, 10) c.put('b', 1, 10) c.put('c', 1, 10) assert set(c.data) == set('xbc') c.put('d', 1, 10) assert set(c.data) == set('xcd') # resize will shrink c.resize(available_bytes=nbytes(1) * 1) assert set(c.data) == set('x') c.resize(available_bytes=nbytes(1) * 10) assert set(c.data) == set('x') def test_cache_scores_update(): c = Cache(available_bytes=nbytes(1) * 2) c.put('x', 1, 1) c.put('y', 1, 1) c.get('x') c.get('x') c.get('x') c.put('z', 1, 1) assert set(c.data) == set('xz') def test_memoize(): c = Cache(available_bytes=nbytes(1) * 3) flag = [0] def slow_inc(x): flag[0] += 1 sleep(0.01) return x + 1 memo_inc = c.memoize(slow_inc) assert memo_inc(1) == 2 assert memo_inc(1) == 2 assert list(c.data.values()) == [2] def test_callbacks(): hit_flag = 
[False] def hit(key, value): hit_flag[0] = (key, value) miss_flag = [False] def miss(key): miss_flag[0] = key c = Cache(100, hit=hit, miss=miss) c.get('x') assert miss_flag[0] == 'x' assert hit_flag[0] == False c.put('y', 1, 1) c.get('y') assert hit_flag[0] == ('y', 1) def test_just_one_reference(): c = Cache(available_bytes=1000) o = object() x = sys.getrefcount(o) c.put('key', o, cost=10) y = sys.getrefcount(o) assert y == x + 1 c.retire('key') z = sys.getrefcount(o) assert z == x cachey-0.2.1/cachey/tests/test_nbytes.py0000644000076600000240000000103613632201651021754 0ustar taugspurgerstaff00000000000000from cachey import nbytes def test_obj(): assert nbytes('hello'*100) > 500 try: import pandas as pd import numpy as np def test_pandas(): x = np.random.random(1000) i = np.random.random(1000) s = pd.Series(x, index=i) assert nbytes(s) == nbytes(x) + nbytes(i) df = pd.DataFrame(s) assert nbytes(df) == nbytes(s) s = pd.Series(pd.Categorical(['a', 'b'] * 1000)) assert nbytes(s.cat.codes) < nbytes(s) < nbytes(s.cat.codes) * 2 except ImportError: pass cachey-0.2.1/cachey/tests/test_score.py0000644000076600000240000000056413632201651021570 0ustar taugspurgerstaff00000000000000from cachey import Scorer def test_Scorer(): s = Scorer(10) a = s.touch('x', 10) b = s.touch('y', 1) assert a > b a = s.touch('x') b = s.touch('x') assert a < b def test_halflife(): s = Scorer(1) a = s.touch('x', 10) b = s.touch('y', 1) b = s.touch('y', 1) b = s.touch('y', 1) b = s.touch('y', 1) assert b > a cachey-0.2.1/cachey.egg-info/0000755000076600000240000000000013632202326017367 5ustar taugspurgerstaff00000000000000cachey-0.2.1/cachey.egg-info/PKG-INFO0000644000076600000240000000504313632202326020466 0ustar taugspurgerstaff00000000000000Metadata-Version: 2.1 Name: cachey Version: 0.2.1 Summary: Caching mindful of computation/storage costs Home-page: http://github.com/dask/cachey/ Maintainer: Matthew Rocklin Maintainer-email: mrocklin@gmail.com License: BSD Description: Caching for Analytic 
Computations --------------------------------- Humans repeat stuff. Caching helps. Normal caching policies like LRU aren't well suited for analytic computations where both the cost of recomputation and the cost of storage routinely vary by one million or more. Consider the following computations ```python # Want this np.std(x) # tiny result, costly to recompute # Don't want this np.transpose(x) # huge result, cheap to recompute ``` Cachey tries to hold on to values that have the following characteristics 1. Expensive to recompute (in seconds) 2. Cheap to store (in bytes) 3. Frequently used 4. Recenty used It accomplishes this by adding the following to each items score on each access score += compute_time / num_bytes * (1 + eps) ** tick_time For some small value of epsilon (which determines the memory halflife.) This has units of inverse bandwidth, has exponential decay of old results and roughly linear amplification of repeated results. Example ------- ```python >>> from cachey import Cache >>> c = Cache(1e9, 1) # 1 GB, cut off anything with cost 1 or less >>> c.put('x', 'some value', cost=3) >>> c.put('y', 'other value', cost=2) >>> c.get('x') 'some value' ``` This also has a `memoize` method ```python >>> memo_f = c.memoize(f) ``` Status ------ Cachey is new and not robust. 
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.6
Description-Content-Type: text/markdown

cachey-0.2.1/cachey.egg-info/SOURCES.txt

LICENSE.txt
MANIFEST.in
README.md
requirements.txt
setup.py
cachey/__init__.py
cachey/cache.py
cachey/nbytes.py
cachey/score.py
cachey.egg-info/PKG-INFO
cachey.egg-info/SOURCES.txt
cachey.egg-info/dependency_links.txt
cachey.egg-info/not-zip-safe
cachey.egg-info/requires.txt
cachey.egg-info/top_level.txt
cachey/tests/test_cache.py
cachey/tests/test_nbytes.py
cachey/tests/test_score.py

cachey-0.2.1/cachey.egg-info/dependency_links.txt

cachey-0.2.1/cachey.egg-info/not-zip-safe

cachey-0.2.1/cachey.egg-info/requires.txt

heapdict

cachey-0.2.1/cachey.egg-info/top_level.txt

cachey

cachey-0.2.1/requirements.txt

heapdict

cachey-0.2.1/setup.cfg

[egg_info]
tag_build =
tag_date = 0

cachey-0.2.1/setup.py

#!/usr/bin/env python

from os.path import exists
from setuptools import setup

setup(name='cachey',
      version='0.2.1',
      description='Caching mindful of computation/storage costs',
      classifiers=[
          "Development Status :: 4 - Beta",
          "Intended Audience :: Developers",
          "Intended Audience :: Science/Research",
          "License :: OSI Approved :: BSD License",
          "Operating System :: OS Independent",
          "Programming Language :: Python :: 3",
          "Programming Language :: Python :: 3.6",
          "Programming Language :: Python :: 3.7",
          "Programming Language :: Python :: 3.8",
          "Topic :: Scientific/Engineering",
      ],
      url='http://github.com/dask/cachey/',
      maintainer='Matthew Rocklin',
      maintainer_email='mrocklin@gmail.com',
      license='BSD',
      keywords='',
      packages=['cachey'],
      python_requires='>=3.6',
      install_requires=list(open('requirements.txt').read().strip()
                            .split('\n')),
      long_description=(open('README.md').read() if exists('README.md')
                        else ''),
      long_description_content_type='text/markdown',
      zip_safe=False)