pax_global_header00006660000000000000000000000064135350373120014514gustar00rootroot0000000000000052 comment=86c972bf21fd18d4b02285275bb291acfacda69d
scitrack-0.1.8.1/000077500000000000000000000000001353503731200134645ustar00rootroot00000000000000
scitrack-0.1.8.1/.gitignore000066400000000000000000000005171353503731200154570ustar00rootroot00000000000000
*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml
tests/draw_results

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# vi
.*.swp
scitrack-0.1.8.1/.hgignore000066400000000000000000000003201353503731200152620ustar00rootroot00000000000000
syntax:glob
.svn
*.pyc
*.pyo
*.so
*.o
*.DS_Store
*.tmproj
*.rej
*.orig
*.wpr
*.pdf
_build/*
build
*htmlcov*
*.idea
*.coverage
*trackcomp.egg-info*
dist/*
*.cache
scitrack.egg*
*.sublime-*
*.wpu
*.pytest_cache
scitrack-0.1.8.1/.hgtags000066400000000000000000000002151353503731200147400ustar00rootroot00000000000000
2c80657fecfe617eab8b5e071da8b4b494ca3636 0.1.6
402b7daea661f3904f85f6f248f46c8d9f588704 0.1.7
038183f48645c7ba0417fa98946689f86efca803 0.1.8
scitrack-0.1.8.1/README.rst000066400000000000000000000127641353503731200151650ustar00rootroot00000000000000
##################
About ``scitrack``
##################

One of the critical challenges in scientific analysis is to track all the elements involved. This includes the arguments provided to a specific application, input data files referenced by those arguments and output data generated by the application. In addition, a minimal set of system-specific information should be tracked.

``scitrack`` is a library aimed at application developers writing scientific software to support this tracking of scientific computation. The library provides elementary functionality to support logging.
The primary capabilities concern generating checksums on input and output files and facilitating logging of the computational environment.

**********
Installing
**********

For the released version::

    $ pip install scitrack

For the very latest version::

    $ pip install git+https://github.com/HuttleyLab/scitrack

Or clone it::

    $ git clone git@github.com:HuttleyLab/scitrack.git

And then install::

    $ pip install ~/path/to/scitrack

*****************
``CachingLogger``
*****************

``scitrack`` provides a single class, ``CachingLogger``. It is essentially a wrapper around the ``logging`` module that also captures basic information about the system and the command line call used to invoke the application. In addition, the class provides convenience methods for logging both the path and the md5 hexdigest checksum of input/output files. A method is also provided for producing checksums of text data, which is useful when data come from a stream or a database, for instance.

All logging calls are cached until a path for a logfile is provided. The logger can also, optionally, create directories. When run in parallel using ``mpirun``, the process ID is appended to the hostname to help identify processors.

**********************************
Simple instantiation of the logger
**********************************

Create the logger. Setting ``create_dir=True`` means that the directory path is also created when the logfile is created.

.. code:: python

    from scitrack import CachingLogger
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = "somedir/some_path.log"

The last assignment triggers creation of ``somedir/some_path.log``.

*******************************************
Capturing a program's arguments and options
*******************************************

``scitrack`` will write the contents of ``sys.argv`` to the log file, prefixed by ``command_string``.
However, this only captures arguments specified on the command line. Tracking the value of optional arguments not specified, which may have default values, is critical to tracking the full command set. Doing this is your responsibility as a developer.

Here's one approach when using the ``click`` command line interface library. Below we create a simple ``click`` app and capture the required and optional argument values.

.. code:: python

    from scitrack import CachingLogger
    import click

    LOGGER = CachingLogger()

    @click.group()
    def main():
        """the main command"""
        pass

    @main.command()
    @click.option('-i', '--infile', type=click.Path(exists=True))
    @click.option('-t', '--test', is_flag=True, help='Run test.')
    def my_app(infile, test):
        # capture the local variables, at this point just provided arguments
        LOGGER.log_args()
        LOGGER.log_versions('numpy')
        LOGGER.input_file(infile)

        LOGGER.log_file_path = "some_path.log"

    if __name__ == "__main__":
        my_app()

The ``CachingLogger.log_message()`` method takes a message and a label. All other logging methods wrap ``log_message()``, providing a specific label. For instance, the method ``input_file()`` writes out two lines in the log.

- input_file_path, the absolute path to the input file
- input_file_path md5sum, the hex digest of the file

``output_file()`` behaves analogously. An additional method ``text_data()`` is useful for other data input/output sources (e.g. records from a database). For this to have value for arbitrary data types requires a systematic approach to ensuring the text conversion is robust across platforms.

The ``log_args()`` method captures all local variables within a scope.

The ``log_versions()`` method captures versions for the current file and that of a list of named packages, e.g. ``LOGGER.log_versions(['numpy', 'sklearn'])``.
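
To illustrate how a call such as ``log_args()`` can see default values that were never typed on the command line, the sketch below reads the locals of the calling function with the standard library ``inspect`` module. It is a simplified stand-in and not the ``scitrack`` implementation: ``capture_caller_locals`` and this ``my_app`` are hypothetical names used only for the example.

```python
import inspect


def capture_caller_locals():
    """Return a snapshot of the calling function's local variables.

    Hypothetical helper for illustration only.
    """
    caller = inspect.currentframe().f_back
    # getargvalues exposes the frame's locals; copy them into a plain dict
    return dict(inspect.getargvalues(caller).locals)


def my_app(infile="data.fasta", test=False):
    # called first, so only the arguments (including defaults) exist as locals
    params = capture_caller_locals()
    return params


captured = my_app(test=True)
print(captured)
```

Calling the helper as the first statement matters: any local variable defined before the call would be captured as well.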

Some sample output
==================

::

    2018-11-28 11:33:30 yourmachine.com:71779 INFO system_details : system=Darwin Kernel Version 18.2.0: Fri Oct 5 19:41:49 PDT 2018; root:xnu-4903.221.2~2/RELEASE_X86_64
    2018-11-28 11:33:30 yourmachine.com:71779 INFO python : 3.7.1
    2018-11-28 11:33:30 yourmachine.com:71779 INFO user : gavin
    2018-11-28 11:33:30 yourmachine.com:71779 INFO command_string : /Users/gavin/miniconda3/envs/py37/bin/py.test -s
    2018-11-28 11:33:30 yourmachine.com:71779 INFO input_file_path : /Users/gavin/repos/SciTrack/tests/sample.fasta
    2018-11-28 11:33:30 yourmachine.com:71779 INFO input_file_path md5sum : 96eb2c2632bae19eb65ea9224aaafdad
    2018-11-28 11:33:30 yourmachine.com:71779 INFO version : test_logging==0.1.5
    2018-11-28 11:33:30 yourmachine.com:71779 INFO version : numpy==1.15.1

**********************
Other useful functions
**********************

Two other useful functions are ``get_file_hexdigest`` and ``get_text_hexdigest``. The latter can take either unicode or ascii strings.
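
The sketch below, which assumes only the standard library ``hashlib``, shows why a file checksum and a text checksum of seemingly identical content can disagree when line endings differ. The helpers ``file_md5`` and ``text_md5`` are hypothetical stand-ins written for this example, not the ``scitrack`` functions themselves.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=8192):
    """md5 hex digest of a file, read as bytes in chunks (illustrative helper)"""
    md5 = hashlib.md5()
    with open(path, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def text_md5(text):
    """md5 hex digest of a string, encoded as utf-8 (illustrative helper)"""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


text = "line one\nline two\n"
windows_text = text.replace("\n", "\r\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    # write the bytes exactly, using Windows-style line endings
    with open(path, "wb") as out:
        out.write(windows_text.encode("utf-8"))

    same_bytes = file_md5(path) == text_md5(windows_text)
    differs = file_md5(path) != text_md5(text)

print(same_bytes, differs)
```

The two digests agree only when the text has exactly the same bytes as the file, so a checksum of text read in ``'r'`` mode may not match the checksum of the file it came from.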
scitrack-0.1.8.1/scitrack/000077500000000000000000000000001353503731200152675ustar00rootroot00000000000000
scitrack-0.1.8.1/scitrack/__init__.py000066400000000000000000000203641353503731200174050ustar00rootroot00000000000000
import hashlib
import importlib
import inspect
import logging
import os
import platform
import socket
import sys

from getpass import getuser

__author__ = "Gavin Huttley"
__copyright__ = "Copyright 2016, Gavin Huttley"
__credits__ = ["Gavin Huttley"]
__license__ = "BSD"
__version__ = "0.1.8.1"
__maintainer__ = "Gavin Huttley"
__email__ = "Gavin.Huttley@anu.edu.au"
__status__ = "Development"


VERSION_ATTRS = ["__version__", "version", "VERSION"]


def abspath(path):
    """returns an expanded, absolute path"""
    return os.path.abspath(os.path.expanduser(path))


def _create_path(path):
    """creates path"""
    if os.path.exists(path):
        return

    os.makedirs(path, exist_ok=True)


def get_package_name(object):
    """returns the package name for the provided object"""
    name = inspect.getmodule(object).__name__
    package = name.split(".")[0]
    return package


def get_version_for_package(package):
    """returns the version of package"""
    if type(package) == str:
        try:
            mod = importlib.import_module(package)
        except ModuleNotFoundError:
            raise ValueError("Unknown package %s" % package)
    elif inspect.ismodule(package):
        mod = package
    else:
        raise ValueError("Unknown type, package %s" % package)

    vn = None

    for v in VERSION_ATTRS:
        try:
            vn = getattr(mod, v)
            if callable(vn):
                vn = vn()

            break
        except AttributeError:
            pass

    if type(vn) in (tuple, list):
        vn = vn[0]

    del mod

    return vn


create_path = _create_path

FileHandler = logging.FileHandler


class CachingLogger(object):
    """stores log messages until a log filename is provided"""

    def __init__(self, log_file_path=None, create_dir=True, mode="w"):
        super(CachingLogger, self).__init__()
        self._log_file_path = None
        self._logfile = None
        self._started = False
        self.create_dir = create_dir
        self._messages = []
        self._hostname = socket.gethostname()
        self._mode = mode
        if log_file_path:
            self.log_file_path = log_file_path

    def _reset(self, mode="w"):
        self._log_file_path = None
        self._mode = mode
        self._started = False
        self._messages = []
        if self._logfile is not None:
            self._logfile.close()
            self._logfile = None

    @property
    def log_file_path(self):
        return self._log_file_path

    @log_file_path.setter
    def log_file_path(self, path):
        """set the log file path and then dump cached log messages"""
        path = abspath(path)
        if self.create_dir:
            dirname = os.path.dirname(path)
            create_path(dirname)

        self._log_file_path = path

        self._logfile = set_logger(self._log_file_path, mode=self.mode)
        for m in self._messages:
            logging.info(m)

        self._messages = []
        self._started = True

    @property
    def mode(self):
        """the logfile opening mode"""
        return self._mode

    @mode.setter
    def mode(self, mode):
        """the logfile file opening mode"""
        self._mode = mode

    def _record_file(self, file_class, file_path):
        """writes the file path and md5 checksum to log file"""
        file_path = abspath(file_path)
        md5sum = get_file_hexdigest(file_path)
        self.log_message(file_path, label=file_class)
        self.log_message(md5sum, label="%s md5sum" % file_class)

    def input_file(self, file_path, label="input_file_path"):
        """logs path and md5 checksum

        Argument:
            - label is inserted before the message"""
        self._record_file(label, file_path)

    def output_file(self, file_path, label="output_file_path"):
        """logs path and md5 checksum

        Argument:
            - label is inserted before the message"""
        self._record_file(label, file_path)

    def text_data(self, data, label=None):
        """logs md5 checksum for input text data.

        Argument:
            - label is inserted before the message

        For this to be useful you must ensure the text order is persistent."""
        assert label is not None, "You must provide a data label"
        md5sum = get_text_hexdigest(data)
        self.log_message(md5sum, label=label)

    def log_message(self, msg, label=None):
        """writes a log message

        Argument:
            - label is inserted before the message"""
        label = label or "misc"
        data = [label, msg]
        msg = " : ".join(data)
        if not self._started:
            self._messages.append(msg)
        else:
            logging.info(msg)

    def log_args(self, args=None):
        """save arguments to file using label='params'

        Argument:
            - args: if None, uses inspect module to get locals
              from the calling frame"""
        if args is None:
            parent = inspect.currentframe().f_back
            args = inspect.getargvalues(parent).locals

        # remove args whose value is a CachingLogger
        for k in list(args):
            if type(args[k]) == self.__class__:
                del args[k]

        self.log_message(str(args), label="params")

    def shutdown(self):
        """safely shutdown the logger"""
        logging.getLogger().removeHandler(self._logfile)
        self._logfile.flush()
        self._logfile.close()
        self._logfile = None
        self._reset()

    def log_versions(self, packages=None):
        """logs version from the global namespace where method is invoked,
        plus from any named packages"""
        if type(packages) == str or inspect.ismodule(packages):
            packages = [packages]
        elif packages is None:
            packages = []

        for i, p in enumerate(packages):
            if inspect.ismodule(p):
                packages[i] = p.__name__

        parent = inspect.currentframe().f_back
        g = parent.f_globals
        name = g.get("__package__", g.get("__name__", ""))
        if name:
            vn = get_version_for_package(name)
        else:
            vn = [g.get(v, None) for v in VERSION_ATTRS if g.get(v, None)]
            vn = None if not vn else vn[0]
            name = get_package_name(parent)

        versions = [(name, vn)]
        for package in packages:
            vn = get_version_for_package(package)
            versions.append((package, vn))

        for n_v in versions:
            self.log_message("%s==%s" % n_v, label="version")


def set_logger(log_file_path, level=logging.DEBUG, mode="w"):
    """setup logging"""
    handler = FileHandler(log_file_path, mode)
    handler.setLevel(level)
    hostpid = socket.gethostname() + ":" + str(os.getpid())
    fmt = "%(asctime)s\t" + hostpid + "\t%(levelname)s\t%(message)s"
    formatter = logging.Formatter(fmt, datefmt="%Y-%m-%d %H:%M:%S")
    handler.setFormatter(formatter)
    logging.root.addHandler(handler)
    logging.root.setLevel(level)
    logging.info("system_details : system=%s" % platform.version())
    logging.info("python : %s" % platform.python_version())
    logging.info("user : %s" % getuser())
    logging.info("command_string : %s" % " ".join(sys.argv))
    return handler


def get_file_hexdigest(filename):
    """returns the md5 hexadecimal checksum of the file

    NOTE
    ----
    The md5 sum of get_text_hexdigest can differ from get_file_hexdigest.
    This will occur if the line ending character differs from being read in
    'rb' versus 'r' modes.
    """
    # from
    # http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
    with open(filename, "rb") as infile:
        md5 = hashlib.md5()
        while True:
            data = infile.read(128)
            if not data:
                break

            md5.update(data)
    return md5.hexdigest()


def get_text_hexdigest(data):
    """returns md5 hexadecimal checksum of string/unicode data

    NOTE
    ----
    The md5 sum of get_text_hexdigest can differ from get_file_hexdigest.
    This will occur if the line ending character differs from being read in
    'rb' versus 'r' modes.
    """
    if data.__class__ not in ("".__class__, u"".__class__):
        raise TypeError("can only checksum string or unicode data")
    data = data.encode("utf-8")
    md5 = hashlib.md5()
    md5.update(data)
    return md5.hexdigest()
scitrack-0.1.8.1/setup.cfg000066400000000000000000000000341353503731200153020ustar00rootroot00000000000000
[bdist_wheel]
universal = 1
scitrack-0.1.8.1/setup.py000066400000000000000000000026661353503731200152100ustar00rootroot00000000000000
#!/usr/bin/env python
from setuptools import setup
import sys
import pathlib

__author__ = "Gavin Huttley"
__copyright__ = "Copyright 2016, Gavin Huttley"
__credits__ = ["Gavin Huttley"]
__license__ = "BSD"
__version__ = "0.1.8.1"
__maintainer__ = "Gavin Huttley"
__email__ = "Gavin.Huttley@anu.edu.au"
__status__ = "Development"

if sys.version_info < (3, 6):
    py_version = ".".join([str(n) for n in sys.version_info])
    raise RuntimeError("Python-3.6 or greater is required, Python-%s used." % py_version)

short_description = "scitrack"

readme_path = pathlib.Path(__file__).parent / "README.rst"

long_description = readme_path.read_text()

setup(
    name="scitrack",
    version=__version__,
    author="Gavin Huttley",
    author_email="gavin.huttley@anu.edu.au",
    description=short_description,
    long_description=long_description,
    platforms=["any"],
    license=[__license__],
    keywords=["science", "logging", "parallel"],
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Science/Research",
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: BSD License",
        "Topic :: Scientific/Engineering :: Bio-Informatics",
        "Topic :: Software Development :: Libraries :: Python Modules",
        "Operating System :: OS Independent",
    ],
    packages=["scitrack"],
    url="https://github.com/HuttleyLab/scitrack",
)
scitrack-0.1.8.1/tests/000077500000000000000000000000001353503731200146265ustar00rootroot00000000000000
scitrack-0.1.8.1/tests/sample.fasta000066400000000000000000000030341353503731200171270ustar00rootroot00000000000000
>Rhesus with extra words
tgtggcacaaatactcatgccagctcattacagcatgagaac---agtttgttactcact
aaagacagaatgaatgtagaaaaggctgaattctgtaataaaagcaaacagcctggcttg
gcaaggagccaacataacagatggactggaagtaaggaaacatgtaatgataggcagact
cccagcacagagaaaaaggtagatctgaatgctaatgccctgtatgagagaaaagaatgg
aataagcaaaaactgccatgctctgagaatcctagagacactgaagatgttccttgg
>Manatee
tgtggcacaaatactcatgccagctcattacagcatgagaatagcagtttattactcact
aaagacagaatgaatgtagaaaaggctgaattctgtcataaaagcaaacagcctggctta
acaaggagccagcagagcagatgggctgaaagtaaggaaacatgtaatgataggcagact
cctagcacagagaaaaaggtagatatgaatgctaatccattgtatgagagaaaagaagtg
aataagcagaaacctccatgctccgagagtgttagagatacacaagatattccttgg
>Pig
tgtggcacagatactcatgccagctcgttacagcatgagaacagcagtttattactcact
aaagacagaatgaatgtagaaaaggctgaattttgtaataaaagcaagcagcctgtctta
gcaaagagccaacagagcagatgggctgaaagtaagggcacatgtaatgataggcagact
cctaacacagagaaaaaggtagttctgaatactgatctcctgtatgggagaaacgaactg
aataagcagaaacctgcgtgctctgacagtcctagagattcccaagatgttccttgg
>GoldenMol
tgtggcacaaatactcatgccagctcattacagcatgagaacagcagtttattactcact
aaagacagaatgaatgtagaaaaggctgaattctgtaataaaaacaaacagtctggctta
gcgaggagccagcagagcagatgggctggaagtaaggcagcgtgcaatgacaagcagact
cctagcacacagacagagctatataggagtgctggtcccatgcacaggagaaaagaagta
aataagctgaaatctccatggtctgagagtcctggagctacccaagagattccttgg
>Rat
tgtggcacagatgctcgtgccagctcattacagcgtgggacccgcagtttattgttcact
gaggacagactggatgcagaaaaggctgaattctgtgatagaagcaaacagtctggcgca
gcagtgagccagcagagcagatgggctgacagtaaagaaacatgtaatggcaggccggtt
ccccgcactgagggaaaggcagatccaaatgtggattccctctgtggtagaaagcagtgg
aatcatccgaaaagcctgtgccctgagaattctggagctaccactgacgttccttgg
scitrack-0.1.8.1/tests/test_logging.py000066400000000000000000000144541353503731200176750ustar00rootroot00000000000000
# -*- coding: utf-8 -*-
import os
import shutil
import subprocess
import sys

from collections import Counter

from scitrack import (CachingLogger, get_package_name, get_text_hexdigest,
                      get_version_for_package, logging)

__author__ = "Gavin Huttley"
__copyright__ = "Copyright 2016, Gavin Huttley"
__credits__ = ["Gavin Huttley"]
__license__ = "BSD"
__version__ = "0.1.8.1"
__maintainer__ = "Gavin Huttley"
__email__ = "Gavin.Huttley@anu.edu.au"
__status__ = "Development"

LOGFILE_NAME = "delme.log"
DIRNAME = "delme"


def test_creates_path():
    """creates a log path"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(DIRNAME, LOGFILE_NAME)
    LOGGER.input_file("sample.fasta")
    LOGGER.shutdown()
    assert os.path.exists(DIRNAME)
    assert os.path.exists(os.path.join(DIRNAME, LOGFILE_NAME))
    try:
        shutil.rmtree(DIRNAME)
    except OSError:
        pass


def test_tracks_args():
    """details on host, python version should be present in log"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(LOGFILE_NAME)
    LOGGER.input_file("sample.fasta")
    LOGGER.shutdown()
    with open(LOGFILE_NAME, "r") as infile:
        contents = "".join(infile.readlines())
        for label in ["system_details", "python", "user", "command_string"]:
            assert contents.count(label) == 1, (label, contents.count(label))
    try:
        os.remove(LOGFILE_NAME)
    except OSError:
        pass


def test_tracks_locals():
    """details on local arguments should be present in log"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(LOGFILE_NAME)

    def track_func(a=1, b="abc"):
        LOGGER.log_args()

    track_func()
    LOGGER.shutdown()
    with open(LOGFILE_NAME, "r") as infile:
        for line in infile:
            index = line.find("params :")
            if index > 0:
                got = eval(line.split("params :")[1])
                break
    assert got == dict(a=1, b="abc")
    try:
        os.remove(LOGFILE_NAME)
        pass
    except OSError:
        pass


def test_package_inference():
    """correctly identify the package name"""
    name = get_package_name(CachingLogger)
    assert name == "scitrack"


def test_package_versioning():
    """correctly identify versions for specified packages"""
    vn = get_version_for_package("numpy")
    assert type(vn) is str
    try:  # not installed, but using ValueError rather than ImportError
        get_version_for_package("gobbledygook")
    except ValueError:
        pass

    try:
        get_version_for_package(1)
    except ValueError:
        pass


def test_tracks_versions():
    """should track versions"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(LOGFILE_NAME)
    LOGGER.input_file("sample.fasta")
    LOGGER.log_versions(["numpy"])
    LOGGER.shutdown()
    with open(LOGFILE_NAME, "r") as infile:
        contents = "".join(infile.readlines())
        for label in ["system_details", "python", "user", "command_string"]:
            assert contents.count(label) == 1, (label, contents.count(label))
        for line in contents.splitlines():
            if "version :" in line:
                if "numpy" not in line:
                    assert "==%s" % __version__ in line, line
                else:
                    assert "numpy" in line, line
        print("\n\n", contents)
    try:
        os.remove(LOGFILE_NAME)
    except OSError:
        pass


def test_tracks_versions_string():
    """should track version if package name is a string"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(LOGFILE_NAME)
    LOGGER.log_versions("numpy")
    LOGGER.shutdown()
    import numpy

    expect = "numpy==%s" % numpy.__version__
    del numpy
    with open(LOGFILE_NAME, "r") as infile:
        contents = "".join(infile.readlines())
        for line in contents.splitlines():
            if "version :" in line and "numpy" in line:
                assert expect in line, line
    try:
        os.remove(LOGFILE_NAME)
    except OSError:
        pass


def test_tracks_versions_module():
    """should track version if package is a module"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(LOGFILE_NAME)
    import numpy

    expect = "numpy==%s" % numpy.__version__
    LOGGER.log_versions(numpy)
    LOGGER.shutdown()
    del numpy
    with open(LOGFILE_NAME, "r") as infile:
        contents = "".join(infile.readlines())
        for line in contents.splitlines():
            if "version :" in line and "numpy" in line:
                assert expect in line, line
    try:
        os.remove(LOGFILE_NAME)
    except OSError:
        pass


def test_appending():
    """appending to an existing logfile should work"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = LOGFILE_NAME
    LOGGER.input_file("sample.fasta")
    LOGGER.shutdown()
    records = Counter()
    with open(LOGFILE_NAME) as infile:
        for line in infile:
            records[line] += 1
    vals = set(list(records.values()))
    assert vals == {1}
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.mode = "a"
    LOGGER.log_file_path = LOGFILE_NAME
    LOGGER.input_file("sample.fasta")
    LOGGER.shutdown()

    records = Counter()
    with open(LOGFILE_NAME) as infile:
        for line in infile:
            records[line] += 1
    vals = set(list(records.values()))
    assert vals == {2}

    try:
        os.remove(LOGFILE_NAME)
    except OSError:
        pass


def test_mdsum_input():
    """md5 sum of input file should be correct"""
    LOGGER = CachingLogger(create_dir=True)
    LOGGER.log_file_path = os.path.join(LOGFILE_NAME)
    LOGGER.input_file("sample.fasta")
    LOGGER.shutdown()

    with open(LOGFILE_NAME, "r") as infile:
        num = 0
        for line in infile:
            line = line.strip()
            if "input_file_path md5sum" in line:
                assert "96eb2c2632bae19eb65ea9224aaafdad" in line
                num += 1
        assert num == 1

    try:
        os.remove(LOGFILE_NAME)
    except OSError:
        pass


def test_md5sum_text():
    """md5 sum for text data should be computed"""
    data = u"åbcde"
    s = get_text_hexdigest(data)
    assert s
    data = "abcde"
    s = get_text_hexdigest(data)
    assert s