airr-1.3.1/MANIFEST.in
include requirements.txt
include README.rst
include NEWS.rst

# versioneer-generated
include versioneer.py
include airr/_version.py
airr-1.3.1/NEWS.rst
Version 1.3.1:  October 13, 2020
--------------------------------------------------------------------------------

1. Refactored ``merge_rearrangement`` to allow for a larger number of files.
2. Improved error handling in format validation operations.


Version 1.3.0:  May 30, 2020
--------------------------------------------------------------------------------

1. Updated schema set to v1.3.
2. Added ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire``
   to ``airr.interface`` to read, write and validate Repertoire metadata,
   respectively.
3. Added ``repertoire_template`` to ``airr.interface`` which will return a
   complete repertoire object where all fields have ``null`` values.
4. Added ``validate_object`` to ``airr.schema`` that will validate a single
   repertoire object against the schema.
5. Extended the ``airr-tools`` command-line program to validate both
   rearrangement and repertoire files.


Version 1.2.1:  October 5, 2018
--------------------------------------------------------------------------------

1. Fixed a bug in the Python reference library causing start coordinate values
   to be empty in some cases when writing data.


Version 1.2.0:  August 17, 2018
--------------------------------------------------------------------------------

1. Updated schema set to v1.2.
2. Several improvements to the ``validate_rearrangement`` function.
3. Changed behavior of all ``airr.interface`` functions to accept a file path
   (string) to a single Rearrangement TSV, instead of requiring a file handle
   as input.
4. Added ``base`` argument to ``RearrangementReader`` and ``RearrangementWriter``
   to support optional conversion of 1-based closed intervals in the TSV
   to Python-style 0-based half-open intervals. Defaults to conversion.
5. Added the custom exception ``ValidationError`` for handling validation checks.
6. Added the ``validate`` argument to ``RearrangementReader`` which will raise
   a ``ValidationError`` exception when reading files with missing required
   fields or invalid values for known field types.
7. Added ``validate`` argument to all type conversion methods in ``Schema``,
   which will now raise a ``ValidationError`` exception for values that cannot
   be converted when set to ``True``. When set to ``False`` (default), the
   previous behavior of assigning ``None`` as the converted value is retained.
8. Added ``validate_header`` and ``validate_row`` methods to ``Schema`` and
   removed validation methods from ``RearrangementReader``.
9. Removed automatic closure of file handle upon reaching the iterator end in
   ``RearrangementReader``.


Version 1.1.0:  May 1, 2018
--------------------------------------------------------------------------------

Initial release.
airr-1.3.1/PKG-INFO
Metadata-Version: 1.1
Name: airr
Version: 1.3.1
Summary: AIRR Community Data Representation Standard reference library for antibody and TCR sequencing data.
Home-page: http://docs.airr-community.org
Author: AIRR Community
Author-email: UNKNOWN
License: CC BY 4.0
Description: Installation
        ------------------------------------------------------------------------------
        
        Install in the usual manner from PyPI::
        
            > pip3 install airr --user
        
        Or from the `downloaded `__ source code directory::
        
            > python3 setup.py install --user
        
        Quick Start
        ------------------------------------------------------------------------------
        
        Reading AIRR Repertoire metadata files
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ``airr`` package contains functions to read and write AIRR repertoire
        metadata files. The file format is either YAML or JSON, and the package
        provides a light wrapper over the standard parsers. The file needs a
        ``json``, ``yaml``, or ``yml`` file extension so that the proper parser is
        utilized. All of the repertoires are loaded into memory at once and no
        streaming interface is provided::
        
            import airr
        
            # Load the repertoires
            data = airr.load_repertoire('input.airr.json')
            for rep in data['Repertoire']:
                print(rep)
        
        Why are the repertoires in a list versus in a dictionary keyed by the
        ``repertoire_id``? There are two primary reasons for this. First, the
        ``repertoire_id`` might not have been assigned yet. Some systems might allow
        MiAIRR metadata to be entered but the ``repertoire_id`` is assigned to that
        data later by another process. Without the ``repertoire_id``, the data could
        not be stored in a dictionary. Secondly, the list allows the repertoire data
        to have a default ordering. If you know that the repertoires all have a
        unique ``repertoire_id`` then you can quickly create a dictionary object
        using a comprehension::
        
            rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }
        
        Writing AIRR Repertoire metadata files
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        Writing AIRR repertoire metadata is also a light wrapper over standard YAML or JSON parsers.
        The ``airr`` library provides a function to create a blank repertoire object
        in the appropriate format with all of the required fields. As with the load
        function, the complete list of repertoires is written at once; there is no
        streaming interface::
        
            import airr
        
            # Create some blank repertoire objects in a list
            reps = []
            for i in range(5):
                reps.append(airr.repertoire_template())
        
            # Write the repertoires
            airr.write_repertoire('output.airr.json', reps)
        
        Reading AIRR Rearrangement TSV files
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ``airr`` package contains functions to read and write AIRR rearrangement
        files as either iterables or pandas data frames. The usage is
        straightforward, as the file format is a typical tab delimited file, but the
        package performs some additional validation and type conversion beyond using
        a standard CSV reader::
        
            import airr
        
            # Create an iterable that returns a dictionary for each row
            reader = airr.read_rearrangement('input.tsv')
            for row in reader:
                print(row)
        
            # Load the entire file into a pandas data frame
            df = airr.load_rearrangement('input.tsv')
        
        Writing AIRR formatted files
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        Similar to the read operations, write functions are provided for either
        creating a writer class to perform row-wise output or writing the entire
        contents of a pandas data frame to a file. Again, usage is straightforward
        with the ``airr`` output functions simply performing some type conversion
        and field ordering operations::
        
            import airr
        
            # Create a writer class for iterative row output
            writer = airr.create_rearrangement('output.tsv')
            for row in reader:
                writer.write(row)
        
            # Write an entire pandas data frame to a file
            airr.dump_rearrangement(df, 'file.tsv')
        
        By default, ``create_rearrangement`` will only write the ``required`` fields in the output file.
        Additional fields can be included in the output file by providing the
        ``fields`` parameter with an array of additional field names::
        
            # Specify additional fields in the output
            fields = ['new_calc', 'another_field']
            writer = airr.create_rearrangement('output.tsv', fields=fields)
        
        A common operation is to read an AIRR rearrangement file, and then write an
        AIRR rearrangement file with additional fields in it while keeping all of
        the existing fields from the original file. The ``derive_rearrangement``
        function provides this capability::
        
            import airr
        
            # Read rearrangement data and write new file with additional fields
            reader = airr.read_rearrangement('input.tsv')
            fields = ['new_calc']
            writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields)
            for row in reader:
                row['new_calc'] = 'a value'
                writer.write(row)
        
        Validating AIRR data files
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ``airr`` package can validate repertoire and rearrangement data files to
        ensure that they contain all required fields and that the field types match
        the AIRR Schema. This can be done with the ``airr-tools`` command line
        program or by calling the validate functions in the library::
        
            # Validate a rearrangement file
            airr-tools validate rearrangement -a input.tsv
        
            # Validate a repertoire metadata file
            airr-tools validate repertoire -a input.airr.json
        
        Combining Repertoire metadata and Rearrangement files
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ``airr`` package does not keep track of which repertoire metadata files
        are associated with rearrangement files, so users will need to handle those
        associations themselves. However, in the data, the ``repertoire_id`` field
        forms the link. The typical usage is that a program is going to perform some
        computation on the rearrangements, and it needs access to the repertoire
        metadata as part of the computation logic.
        This example code shows the basic framework for doing that, in this case
        performing a computation that depends on the subject's sex::
        
            import airr
        
            # Load the repertoires
            data = airr.load_repertoire('input.airr.json')
        
            # Put repertoires in dictionary keyed by repertoire_id
            rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }
        
            # Create an iterable for rearrangement data
            reader = airr.read_rearrangement('input.tsv')
            for row in reader:
                # get repertoire metadata with this rearrangement
                rep = rep_dict[row['repertoire_id']]
        
                # check the subject's sex
                if rep['subject']['sex'] == 'male':
                    # do male specific computation
                    pass
                elif rep['subject']['sex'] == 'female':
                    # do female specific computation
                    pass
                else:
                    # do other specific computation
                    pass
        
Keywords: AIRR,bioinformatics,sequencing,immunoglobulin,antibody,adaptive immunity,T cell,B cell,BCR,TCR
Platform: UNKNOWN
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
airr-1.3.1/README.rst
Installation
------------------------------------------------------------------------------

Install in the usual manner from PyPI::

    > pip3 install airr --user

Or from the `downloaded `__ source code directory::

    > python3 setup.py install --user

Quick Start
------------------------------------------------------------------------------

Reading AIRR Repertoire metadata files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR repertoire
metadata files. The file format is either YAML or JSON, and the package
provides a light wrapper over the standard parsers.
The file needs a ``json``, ``yaml``, or ``yml`` file extension so that the proper parser is utilized.
All of the repertoires are loaded into memory at once and no streaming
interface is provided::

    import airr

    # Load the repertoires
    data = airr.load_repertoire('input.airr.json')
    for rep in data['Repertoire']:
        print(rep)

Why are the repertoires in a list versus in a dictionary keyed by the
``repertoire_id``? There are two primary reasons for this. First, the
``repertoire_id`` might not have been assigned yet. Some systems might allow
MiAIRR metadata to be entered but the ``repertoire_id`` is assigned to that
data later by another process. Without the ``repertoire_id``, the data could
not be stored in a dictionary. Secondly, the list allows the repertoire data
to have a default ordering. If you know that the repertoires all have a
unique ``repertoire_id`` then you can quickly create a dictionary object
using a comprehension::

    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

Writing AIRR Repertoire metadata files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Writing AIRR repertoire metadata is also a light wrapper over standard YAML
or JSON parsers. The ``airr`` library provides a function to create a blank
repertoire object in the appropriate format with all of the required fields.
As with the load function, the complete list of repertoires is written at
once; there is no streaming interface::

    import airr

    # Create some blank repertoire objects in a list
    reps = []
    for i in range(5):
        reps.append(airr.repertoire_template())

    # Write the repertoires
    airr.write_repertoire('output.airr.json', reps)

Reading AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR rearrangement files as either iterables or pandas data frames.
The usage is straightforward, as the file format is a typical tab delimited
file, but the package performs some additional validation and type conversion
beyond using a standard CSV reader::

    import airr

    # Create an iterable that returns a dictionary for each row
    reader = airr.read_rearrangement('input.tsv')
    for row in reader:
        print(row)

    # Load the entire file into a pandas data frame
    df = airr.load_rearrangement('input.tsv')

Writing AIRR formatted files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to the read operations, write functions are provided for either
creating a writer class to perform row-wise output or writing the entire
contents of a pandas data frame to a file. Again, usage is straightforward
with the ``airr`` output functions simply performing some type conversion
and field ordering operations::

    import airr

    # Create a writer class for iterative row output
    writer = airr.create_rearrangement('output.tsv')
    for row in reader:
        writer.write(row)

    # Write an entire pandas data frame to a file
    airr.dump_rearrangement(df, 'file.tsv')

By default, ``create_rearrangement`` will only write the ``required`` fields
in the output file. Additional fields can be included in the output file by
providing the ``fields`` parameter with an array of additional field names::

    # Specify additional fields in the output
    fields = ['new_calc', 'another_field']
    writer = airr.create_rearrangement('output.tsv', fields=fields)

A common operation is to read an AIRR rearrangement file, and then write an
AIRR rearrangement file with additional fields in it while keeping all of the existing fields from the original file.
The ``derive_rearrangement`` function provides this capability::

    import airr

    # Read rearrangement data and write new file with additional fields
    reader = airr.read_rearrangement('input.tsv')
    fields = ['new_calc']
    writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields)
    for row in reader:
        row['new_calc'] = 'a value'
        writer.write(row)

Validating AIRR data files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package can validate repertoire and rearrangement data files to
ensure that they contain all required fields and that the field types match
the AIRR Schema. This can be done with the ``airr-tools`` command line
program or by calling the validate functions in the library::

    # Validate a rearrangement file
    airr-tools validate rearrangement -a input.tsv

    # Validate a repertoire metadata file
    airr-tools validate repertoire -a input.airr.json

Combining Repertoire metadata and Rearrangement files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package does not keep track of which repertoire metadata files
are associated with rearrangement files, so users will need to handle those
associations themselves. However, in the data, the ``repertoire_id`` field
forms the link. The typical usage is that a program is going to perform some
computation on the rearrangements, and it needs access to the repertoire
metadata as part of the computation logic.
This example code shows the basic framework for doing that, in this case
performing a computation that depends on the subject's sex::

    import airr

    # Load the repertoires
    data = airr.load_repertoire('input.airr.json')

    # Put repertoires in dictionary keyed by repertoire_id
    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

    # Create an iterable for rearrangement data
    reader = airr.read_rearrangement('input.tsv')
    for row in reader:
        # get repertoire metadata with this rearrangement
        rep = rep_dict[row['repertoire_id']]

        # check the subject's sex
        if rep['subject']['sex'] == 'male':
            # do male specific computation
            pass
        elif rep['subject']['sex'] == 'female':
            # do female specific computation
            pass
        else:
            # do other specific computation
            pass
airr-1.3.1/airr/__init__.py
"""
Reference library for AIRR schema for Ig/TCR rearrangements
"""
from airr.interface import *
from airr.schema import ValidationError

# versioneer-generated
from ._version import get_versions
__version__ = get_versions()['version']
del get_versions
airr-1.3.1/airr/_version.py

# This file was generated by 'versioneer.py' (0.18) from
# revision-control system data, or from the parent directory name of an
# unpacked source archive. Distribution tarballs contain a pre-generated copy
# of this file.

import json

version_json = '''
{
 "date": "2020-10-13T14:38:25-0700",
 "dirty": false,
 "error": null,
 "full-revisionid": "725baa9cacf0db009e1bfbb276837c9a05d1f965",
 "version": "1.3.1"
}
'''  # END VERSION_JSON


def get_versions():
    return json.loads(version_json)
airr-1.3.1/airr/interface.py
"""
Interface functions for file operations
"""
from __future__ import absolute_import

# System imports
import sys
import pandas as pd
from collections import OrderedDict
from itertools import chain
from pkg_resources import resource_filename
import json
import yaml
import yamlordereddictloader
from io import open

# Load imports
from airr.io import RearrangementReader, RearrangementWriter
from airr.schema import ValidationError, RearrangementSchema, RepertoireSchema


def read_rearrangement(filename, validate=False, debug=False):
    """
    Open an iterator to read an AIRR rearrangements file

    Arguments:
      filename (str): path to the input file.
      validate (bool): whether to validate data as it is read, raising a ValidationError
                       exception in the event of an error.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      airr.io.RearrangementReader: iterable reader class.
    """
    return RearrangementReader(open(filename, 'r'), validate=validate, debug=debug)


def create_rearrangement(filename, fields=None, debug=False):
    """
    Create an empty AIRR rearrangements file writer

    Arguments:
      filename (str): output file path.
      fields (list): additional non-required fields to add to the output.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      airr.io.RearrangementWriter: open writer class.
    """
    return RearrangementWriter(open(filename, 'w+'), fields=fields, debug=debug)


def derive_rearrangement(out_filename, in_filename, fields=None, debug=False):
    """
    Create an empty AIRR rearrangements file with fields derived from an existing file

    Arguments:
      out_filename (str): output file path.
      in_filename (str): existing file to derive fields from.
      fields (list): additional non-required fields to add to the output.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      airr.io.RearrangementWriter: open writer class.
    """
    # Read only the header of the existing file, then close it again
    with open(in_filename, 'r') as handle:
        in_fields = list(RearrangementReader(handle).fields)
    if fields is not None:
        in_fields.extend([f for f in fields if f not in in_fields])

    return RearrangementWriter(open(out_filename, 'w+'), fields=in_fields, debug=debug)


def load_rearrangement(filename, validate=False, debug=False):
    """
    Load the contents of an AIRR rearrangements file into a data frame

    Arguments:
      filename (str): input file path.
      validate (bool): whether to validate data as it is read, raising a ValidationError
                       exception in the event of an error.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      pandas.DataFrame: Rearrangement records as rows of a data frame.
    """
    # TODO: test pandas.read_csv with converters argument as an alternative
    # schema = RearrangementSchema
    # df = pd.read_csv(handle, sep='\t', header=0, index_col=None,
    #                  dtype=schema.numpy_types(), true_values=schema.true_values,
    #                  false_values=schema.false_values)
    # return df
    with open(filename, 'r') as handle:
        reader = RearrangementReader(handle, validate=validate, debug=debug)
        df = pd.DataFrame(list(reader))
    return df


def dump_rearrangement(dataframe, filename, debug=False):
    """
    Write the contents of a data frame to an AIRR rearrangements file

    Arguments:
      dataframe (pandas.DataFrame): data frame of rearrangement data.
      filename (str): output file path.
      debug (bool): debug flag.
        If True print debugging information to standard error.

    Returns:
      bool: True if the file is written without error.
    """
    # TODO: test pandas.DataFrame.to_csv with converters argument as an alternative
    # dataframe.to_csv(handle, sep='\t', header=True, index=False, encoding='utf-8')

    fields = dataframe.columns.tolist()
    with open(filename, 'w+') as handle:
        writer = RearrangementWriter(handle, fields=fields, debug=debug)
        for __, row in dataframe.iterrows():
            writer.write(row.to_dict())

    return True


def merge_rearrangement(out_filename, in_filenames, drop=False, debug=False):
    """
    Merge one or more AIRR rearrangements files

    Arguments:
      out_filename (str): output file path.
      in_filenames (list): list of input files to merge.
      drop (bool): drop flag. If True then drop fields that do not exist in all input
                   files, otherwise combine fields from all input files.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if files were successfully merged, otherwise False.
    """
    try:
        # gather fields from input files, closing each file once its header is read
        field_list = []
        for f in in_filenames:
            with open(f, 'r') as handle:
                field_list.append(RearrangementReader(handle, debug=False).fields)
        if drop:
            field_set = set.intersection(*map(set, field_list))
        else:
            field_set = set.union(*map(set, field_list))
        field_order = OrderedDict([(f, None) for f in chain(*field_list)])
        out_fields = [f for f in field_order if f in field_set]

        # write input files to output file sequentially
        readers = (RearrangementReader(open(f, 'r'), debug=debug) for f in in_filenames)
        with open(out_filename, 'w+') as handle:
            writer = RearrangementWriter(handle, fields=out_fields, debug=debug)
            for reader in readers:
                for r in reader:
                    writer.write(r)
                reader.close()
    except Exception as e:
        sys.stderr.write('Error occurred while merging AIRR rearrangement files: %s\n' % e)
        return False

    return True


def validate_rearrangement(filename, debug=False):
    """
    Validates an AIRR rearrangements file

    Arguments:
      filename (str): path of the file to validate.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if files passed validation, otherwise False.
    """
    valid = True
    if debug:
        sys.stderr.write('Validating: %s\n' % filename)

    # Open reader
    handle = open(filename, 'r')
    reader = RearrangementReader(handle, validate=True)

    # Validate header
    try:
        iter(reader)
    except ValidationError as e:
        valid = False
        if debug:
            sys.stderr.write('%s has validation error: %s\n' % (filename, e))

    # Validate each row
    i = 0
    while True:
        try:
            i = i + 1
            next(reader)
        except StopIteration:
            break
        except ValidationError as e:
            valid = False
            if debug:
                sys.stderr.write('%s at record %i has validation error: %s\n' % (filename, i, e))

    # Close
    handle.close()

    return valid


def load_repertoire(filename, validate=False, debug=False):
    """
    Load an AIRR repertoire metadata file

    Arguments:
      filename (str): path to the input file.
      validate (bool): whether to validate data as it is read, raising a ValidationError
                       exception in the event of an error.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      dict: contents of the file, with the list of Repertoire dictionaries
            stored under the 'Repertoire' key.
    """
    # Because the repertoires are read in completely, we do not bother
    # with a reader class.
    md = None

    # determine file type from extension and use appropriate loader
    ext = filename.split('.')[-1]
    if ext in ('yaml', 'yml'):
        with open(filename, 'r', encoding='utf-8') as handle:
            md = yaml.load(handle, Loader=yamlordereddictloader.Loader)
    elif ext == 'json':
        with open(filename, 'r', encoding='utf-8') as handle:
            md = json.load(handle)
    else:
        if debug:
            sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext))
        raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext))

    if md.get('Repertoire') is None:
        if debug:
            sys.stderr.write('%s is missing "Repertoire" key\n' % (filename))
        raise KeyError('Repertoire object cannot be found in the file')

    # validate if requested
    if validate:
        valid = True
        reps = md['Repertoire']
        i = 0
        for r in reps:
            try:
                RepertoireSchema.validate_object(r)
            except ValidationError as e:
                valid = False
                if debug:
                    sys.stderr.write('%s has repertoire at array position %i with validation error: %s\n' % (filename, i, e))
            i = i + 1
        if not valid:
            raise ValidationError('Repertoire file %s has validation errors\n' % (filename))

    # we do not perform any additional processing
    return md


def validate_repertoire(filename, debug=False):
    """
    Validates an AIRR repertoire metadata file

    Arguments:
      filename (str): path of the file to validate.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if files passed validation, otherwise False.
    """
    valid = True
    if debug:
        sys.stderr.write('Validating: %s\n' % filename)

    # load with validate
    try:
        data = load_repertoire(filename, validate=True, debug=debug)
    except TypeError:
        valid = False
    except KeyError:
        valid = False
    except ValidationError as e:
        valid = False
        if debug:
            sys.stderr.write('%s has validation error: %s\n' % (filename, e))

    return valid


def write_repertoire(filename, repertoires, info=None, debug=False):
    """
    Write an AIRR repertoire metadata file

    Arguments:
      filename (str): path to the output file.
      repertoires (list): array of repertoire objects.
      info (object): info object to write. Will write current AIRR Schema info if not specified.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if the file is written without error.
    """
    if not isinstance(repertoires, list):
        if debug:
            sys.stderr.write('Repertoires parameter is not a list\n')
        raise TypeError('Repertoires parameter is not a list')

    md = OrderedDict()
    if info is None:
        info = RearrangementSchema.info.copy()
        info['title'] = 'Repertoire metadata'
        info['description'] = 'Repertoire metadata written by AIRR Standards Python Library'
    md['Info'] = info
    md['Repertoire'] = repertoires

    # determine file type from extension and use appropriate writer
    ext = filename.split('.')[-1]
    if ext == 'yaml' or ext == 'yml':
        with open(filename, 'w') as handle:
            yaml.dump(md, handle, default_flow_style=False)
    elif ext == 'json':
        with open(filename, 'w') as handle:
            json.dump(md, handle, sort_keys=False, indent=2)
    else:
        if debug:
            sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext))
        raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % (ext))

    return True


def repertoire_template():
    """
    Return a blank repertoire object from the template. This object has the complete
    structure with all of the fields and all values set to None or empty string.

    Returns:
      object: empty repertoire object.
    """
    # TODO: I suppose we should dynamically create this from the schema
    # versus loading a template.

    # Load blank template
    f = resource_filename(__name__, 'specs/blank.airr.yaml')
    template = load_repertoire(f)

    return template['Repertoire'][0]
airr-1.3.1/airr/io.py
"""
Reference library for AIRR schema for Ig/TCR rearrangements
"""
from __future__ import print_function
import sys
import csv
from airr.schema import RearrangementSchema, ValidationError


class RearrangementReader:
    """
    Iterator for reading Rearrangement objects in TSV format

    Attributes:
      fields (list): field names in the input Rearrangement file.
      external_fields (list): list of fields in the input file that are not
                              part of the Rearrangement definition.
    """
    @property
    def fields(self):
        """
        Get list of fields

        Returns:
          list : field names.
        """
        return self.dict_reader.fieldnames

    @property
    def external_fields(self):
        """
        Get list of fields that are not in the Rearrangement schema

        Returns:
          list : field names.
        """
        return [f for f in self.dict_reader.fieldnames
                if f not in self.schema.properties]

    def __init__(self, handle, base=1, validate=False, debug=False):
        """
        Initialization

        Arguments:
          handle (file): file handle of the open Rearrangement file.
          base (int): one of 0 or 1 specifying the coordinate schema in the input file.
                      If 1, then the file is assumed to contain 1-based closed intervals
                      that will be converted to Python-style 0-based half-open intervals
                      for known fields. If 0, then values will be unchanged.
          validate (bool): perform validation. If True then basic validation will be
                           performed while reading the data. A ValidationError exception
                           will be raised if an error is found.
          debug (bool): debug state. If True prints debug information.

        Returns:
          airr.io.RearrangementReader: reader object.
        """
        # arguments
        self.handle = handle
        self.base = base
        self.debug = debug
        self.validate = validate
        self.schema = RearrangementSchema

        # data reader, collect field names
        self.dict_reader = csv.DictReader(self.handle, dialect='excel-tab')

    def __iter__(self):
        """
        Iterator initializer

        Returns:
          airr.io.RearrangementReader
        """
        # Validate fields
        if self.validate:
            self.schema.validate_header(self.dict_reader.fieldnames)

        return self

    def __next__(self):
        """
        Next method

        Returns:
          dict: parsed Rearrangement data.
        """
        # StopIteration from the underlying reader propagates to the caller
        row = next(self.dict_reader)

        for f in row:
            # row entry with no header
            if f is None:
                if self.validate:
                    raise ValidationError('row has extra data')
                else:
                    raise ValueError('row has extra data')

            # Convert types
            spec = self.schema.type(f)
            try:
                if spec == 'boolean':
                    row[f] = self.schema.to_bool(row[f], validate=self.validate)
                if spec == 'integer':
                    row[f] = self.schema.to_int(row[f], validate=self.validate)
                if spec == 'number':
                    row[f] = self.schema.to_float(row[f], validate=self.validate)
            except ValidationError as e:
                raise ValidationError('field %s has %s' % (f, e))

            # Adjust coordinates
            if f and f.endswith('_start') and self.base == 1:
                try:
                    row[f] = row[f] - 1
                except TypeError:
                    row[f] = None

        return row

    def close(self):
        """
        Closes the Rearrangement file
        """
        self.handle.close()

    def next(self):
        """
        Next method
        """
        return self.__next__()


class RearrangementWriter:
    """
    Writer class for Rearrangement objects in TSV format

    Attributes:
      fields (list): field names in the output Rearrangement file.
      external_fields (list): list of fields in the output file that are not
                              part of the Rearrangement definition.
    """
    @property
    def fields(self):
        """
        Get list of fields

        Returns:
          list : field names.
        """
        return self.dict_writer.fieldnames

    @property
    def external_fields(self):
        """
        Get list of fields that are not in the Rearrangement schema

        Returns:
          list : field names.
        """
        return [f for f in self.dict_writer.fieldnames
                if f not in self.schema.properties]

    def __init__(self, handle, fields=None, base=1, debug=False):
        """
        Initialization

        Arguments:
          handle (file): file handle of the open Rearrangement file.
          fields (list) : list of non-required fields to add. May include fields undefined by the schema.
          base (int): one of 0 or 1 specifying the coordinate schema in the output file.
                      Data provided to the writer is assumed to be in Python-style 0-based
                      half-open intervals.
If 1, then data will be converted to 1-based closed intervals for known fields before writing. If 0, then values will be unchanged. debug (bool): debug state. If True prints debug information. Returns: airr.io.RearrangementWriter: writer object. """ # arguments self.handle = handle self.base = base self.debug = debug self.schema = RearrangementSchema # order fields according to spec field_names = list(self.schema.required) if fields is not None: additional_fields = [] for f in fields: if f in self.schema.required: continue elif f in self.schema.optional: field_names.append(f) else: additional_fields.append(f) field_names.extend(additional_fields) # open writer and write header self.dict_writer = csv.DictWriter(self.handle, fieldnames=field_names, dialect='excel-tab', extrasaction='ignore', lineterminator='\n') self.dict_writer.writeheader() def close(self): """ Closes the Rearrangement file """ self.handle.close() def write(self, row): """ Write a row to the Rearrangement file Arguments: row (dict): row to write. 
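Example: a minimal sketch of how the column order built in __init__ and the extrasaction='ignore' setting shape the output. The lists REQUIRED and OPTIONAL and the helper order_fields are hypothetical stand-ins for the schema's required and optional fields, not the real Rearrangement definition.

```python
import csv
import io

# Hypothetical stand-ins for schema.required and schema.optional.
REQUIRED = ['sequence_id', 'sequence']
OPTIONAL = ['v_score', 'd_score']

def order_fields(fields=None):
    # Required fields come first, then requested optional fields in the
    # order given, then any fields the schema does not define.
    names = list(REQUIRED)
    extra = []
    for f in fields or []:
        if f in REQUIRED:
            continue
        elif f in OPTIONAL:
            names.append(f)
        else:
            extra.append(f)
    names.extend(extra)
    return names

names = order_fields(['my_note', 'v_score', 'sequence'])

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=names, dialect='excel-tab',
                        extrasaction='ignore', lineterminator='\n')
writer.writeheader()
# The 'unlisted' key is not in the header, so it is silently dropped.
writer.writerow({'sequence_id': 's1', 'sequence': 'ACGT', 'v_score': 1.5,
                 'my_note': 'x', 'unlisted': 'dropped'})
output = buffer.getvalue()
```

Keys of a row that are absent from the header are skipped without an error, so callers that need those columns must pass them in the fields argument.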
""" # validate row if self.debug: for field in self.schema.required: if row.get(field, None) is None: sys.stderr.write('Warning: Record is missing AIRR required field (' + field + ').\n') for f in row.keys(): # Adjust coordinates if f.endswith('_start') and self.base == 1: try: row[f] = self.schema.to_int(row[f]) + 1 except TypeError: row[f] = None # Convert types spec = self.schema.type(f) if spec == 'boolean': row[f] = self.schema.from_bool(row[f]) self.dict_writer.writerow(row) # TODO: pandas validation need if we load with pandas directly # def validate_df(df, airr_schema): # valid = True # # # check required fields # missing_fields = set(airr_schema.required) - set(df.columns) # if len(missing_fields) > 0: # print('Warning: file is missing mandatory fields: {}'.format(', '.join(missing_fields))) # valid = False # # if not valid: # raise ValueError('invalid AIRR data file') airr-1.3.1/airr/schema.py0000644000076600000240000004103413663527407016003 0ustar vandej27staff00000000000000""" AIRR Data Representation Schema """ # Imports import sys import yaml import yamlordereddictloader from pkg_resources import resource_stream class ValidationError(Exception): """ Exception raised when validation errors are encountered. """ pass class Schema: """ AIRR schema definitions Attributes: properties (collections.OrderedDict): field definitions. info (collections.OrderedDict): schema info. required (list): list of mandatory fields. optional (list): list of non-required fields. false_values (list): accepted string values for False. true_values (list): accepted values for True. 
""" # Boolean list for pandas true_values = ['True', 'true', 'TRUE', 'T', 't', '1', 1, True] false_values = ['False', 'false', 'FALSE', 'F', 'f', '0', 0, False] # Generate dicts for booleans _to_bool_map = {x: True for x in true_values} _to_bool_map.update({x: False for x in false_values}) _from_bool_map = {k: 'T' if v else 'F' for k, v in _to_bool_map.items()} def __init__(self, definition): """ Initialization Arguments: definition (string): the schema definition to load. Returns: airr.schema.Schema : schema object. """ # Info is not a valid schema if definition == 'Info': raise KeyError('Info is an invalid schema definition name') # Load object definition with resource_stream(__name__, 'specs/airr-schema.yaml') as f: spec = yaml.load(f, Loader=yamlordereddictloader.Loader) try: self.definition = spec[definition] except KeyError: raise KeyError('Schema definition %s cannot be found in the specifications' % definition) except: raise try: self.info = spec['Info'] except KeyError: raise KeyError('Info object cannot be found in the specifications') except: raise self.properties = self.definition['properties'] try: self.required = self.definition['required'] except KeyError: self.required = [] except: raise self.optional = [f for f in self.properties if f not in self.required] def spec(self, field): """ Get the properties for a field Arguments: name (str): field name. Returns: collections.OrderedDict: definition for the field. """ return self.properties.get(field, None) def type(self, field): """ Get the type for a field Arguments: name (str): field name. 
Returns: str: the type definition for the field """ field_spec = self.properties.get(field, None) field_type = field_spec.get('type', None) if field_spec else None return field_type # import numpy as np # def numpy_types(self): # type_mapping = {} # for property in self.properties: # if self.type(property) == 'boolean': # type_mapping[property] = np.bool # elif self.type(property) == 'integer': # type_mapping[property] = np.int64 # elif self.type(property) == 'number': # type_mapping[property] = np.float64 # elif self.type(property) == 'string': # type_mapping[property] = np.unicode_ # # return type_mapping def to_bool(self, value, validate=False): """ Converts a string to a boolean Arguments: value (str): logical value as a string. validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None. Returns: bool: conversion of the string to True or False. Raises: airr.ValidationError: raised if value is invalid when validate is set True. """ if value == '' or value is None: return None bool_value = self._to_bool_map.get(value, None) if bool_value is None and validate: raise ValidationError('invalid bool %s' % value) else: return bool_value def from_bool(self, value, validate=False): """ Converts a boolean to a string Arguments: value (bool): logical value. validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None. Returns: str: conversion of True or False to 'T' or 'F'. Raises: airr.ValidationError: raised if value is invalid when validate is set True. """ if value == '' or value is None: return '' str_value = self._from_bool_map.get(value, None) if str_value is None and validate: raise ValidationError('invalid bool %s' % value) else: return str_value def to_int(self, value, validate=False): """ Converts a string to an integer Arguments: value (str): integer value as a string. validate (bool): when True raise a ValidationError for an invalid value.
Otherwise, set invalid values to None. Returns: int: conversion of the string to an integer. Raises: airr.ValidationError: raised if value is invalid when validate is set True. """ if value == '' or value is None: return None if isinstance(value, int): return value try: return int(value) except ValueError: if validate: raise ValidationError('invalid int %s'% value) else: return None def to_float(self, value, validate=False): """ Converts a string to a float Arguments: value (str): float value as a string. validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None. Returns: float: conversion of the string to a float. Raises: airr.ValidationError: raised if value is invalid when validate is set True. """ if value == '' or value is None: return None if isinstance(value, float): return value try: return float(value) except ValueError: if validate: raise ValidationError('invalid float %s' % value) else: return None def validate_header(self, header): """ Validate header against the schema Arguments: header (list): list of header fields. Returns: bool: True if a ValidationError exception is not raised. Raises: airr.ValidationError: raised if header fails validation. """ # Check for missing header if header is None: raise ValidationError('missing header') # Check required fields missing_fields = [f for f in self.required if f not in header] if missing_fields: raise ValidationError('missing required fields (%s)' % ', '.join(missing_fields)) else: return True def validate_row(self, row): """ Validate Rearrangements row data against schema Arguments: row (dict): dictionary containing a single record. Returns: bool: True if a ValidationError exception is not raised. Raises: airr.ValidationError: raised if row fails validation. 
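Example: a minimal standalone sketch of the per-field type check described here. FIELD_TYPES, BOOL_STRINGS and check_row are hypothetical names with a made-up subset of fields; the real method looks up types in the loaded schema and uses the class's own conversion methods.

```python
class ValidationError(Exception):
    # Stand-in for the exception of the same name defined in this module.
    pass

# Hypothetical subset of field types; the real mapping comes from the schema.
FIELD_TYPES = {'productive': 'boolean',
               'junction_length': 'integer',
               'v_score': 'number'}
BOOL_STRINGS = {'T': True, 'F': False, 'true': True, 'false': False}

def check_row(row):
    for name, value in row.items():
        # Empty values are accepted without a type check.
        if value == '' or value is None:
            continue
        kind = FIELD_TYPES.get(name)
        try:
            if kind == 'boolean' and value not in BOOL_STRINGS:
                raise ValueError('invalid bool %s' % value)
            if kind == 'integer':
                int(value)
            if kind == 'number':
                float(value)
        except ValueError as e:
            # Report which field failed, as the method above does.
            raise ValidationError('field %s has %s' % (name, e))
    return True

good = {'productive': 'T', 'junction_length': '42', 'v_score': '', 'note': 'x'}
ok = check_row(good)
```

Fields without a known type, such as note in the sample row, pass through unchecked in this sketch.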
""" for f in row: # Empty strings are valid if row[f] == '' or row[f] is None: continue # Check types spec = self.type(f) try: if spec == 'boolean': self.to_bool(row[f], validate=True) if spec == 'integer': self.to_int(row[f], validate=True) if spec == 'number': self.to_float(row[f], validate=True) except ValidationError as e: raise ValidationError('field %s has %s' %(f, e)) return True def validate_object(self, obj, missing=True, nonairr = True, context=None): """ Validate Repertoire object data against schema Arguments: obj (dict): dictionary containing a single repertoire object. missing (bool): provides warnings for missing optional fields. nonairr (bool: provides warning for non-AIRR fields that cannot be validated. context (string): used by recursion to indicate place in object hierarchy Returns: bool: True if a ValidationError exception is not raised. Raises: airr.ValidationError: raised if object fails validation. """ # object has to be a dictionary if not isinstance(obj, dict): if context is None: raise ValidationError('object is not a dictionary') else: raise ValidationError('field %s is not a dictionary object' %(context)) # first warn about non-AIRR fields if nonairr: for f in obj: if context is None: full_field = f else: full_field = context + '.' + f if self.properties.get(f) is None: sys.stderr.write('Warning: Object has non-AIRR field that cannot be validated (' + full_field + ').\n') # now walk through schema and check types for f in self.properties: if context is None: full_field = f else: full_field = context + '.' 
+ f spec = self.spec(f) xairr = spec.get('x-airr') # check if deprecated if xairr and xairr.get('deprecated'): continue # check if null and if key is missing is_missing_key = False is_null = False if obj.get(f) is None: is_null = True if obj.get(f, 'missing') == 'missing': is_missing_key = True # check MiAIRR keys exist if xairr and xairr.get('miairr'): if is_missing_key: raise ValidationError('MiAIRR field %s is missing' %(full_field)) # check if required field if f in self.required and is_missing_key: raise ValidationError('Required field %s is missing' %(full_field)) # check if identifier field if xairr and xairr.get('identifier'): if is_missing_key: raise ValidationError('Identifier field %s is missing' %(full_field)) # check nullable requirements if is_null: if not xairr: # default is true continue if xairr.get('nullable') or xairr.get('nullable', 'missing') == 'missing': # nullable is allowed continue else: # nullable not allowed raise ValidationError('Non-nullable field %s is null or missing' %(full_field)) # if get to here, field should exist with non null value # check types field_type = self.type(f) if field_type is None: # for referenced object, recursively call validate with object and schema if spec.get('$ref') is not None: schema_name = spec['$ref'].split('/')[-1] if CachedSchema.get(schema_name): schema = CachedSchema[schema_name] else: schema = Schema(schema_name) schema.validate_object(obj[f], missing, nonairr, full_field) else: raise ValidationError('Internal error: field %s in schema not handled by validation. File a bug report.' 
%(full_field)) elif field_type == 'array': if not isinstance(obj[f], list): raise ValidationError('field %s is not an array' %(full_field)) # for array, check each object in it for row in obj[f]: if spec['items'].get('$ref') is not None: schema_name = spec['items']['$ref'].split('/')[-1] schema = Schema(schema_name) schema.validate_object(row, missing, nonairr, full_field) elif spec['items'].get('allOf') is not None: for s in spec['items']['allOf']: if s.get('$ref') is not None: schema_name = s['$ref'].split('/')[-1] if CachedSchema.get(schema_name): schema = CachedSchema[schema_name] else: schema = Schema(schema_name) schema.validate_object(row, missing, False, full_field) elif spec['items'].get('enum') is not None: if row not in spec['items']['enum']: raise ValidationError('field %s has value "%s" not among possible enumeration values' %(full_field, row)) elif spec['items'].get('type') == 'string': if not isinstance(row, str): raise ValidationError('array field %s does not have string type: %s' %(full_field, row)) elif spec['items'].get('type') == 'boolean': if not isinstance(row, bool): raise ValidationError('array field %s does not have boolean type: %s' %(full_field, row)) elif spec['items'].get('type') == 'integer': if not isinstance(row, int): raise ValidationError('array field %s does not have integer type: %s' %(full_field, row)) elif spec['items'].get('type') == 'number': if not isinstance(row, float) and not isinstance(row, int): raise ValidationError('array field %s does not have number type: %s' %(full_field, row)) else: raise ValidationError('Internal error: array field %s in schema not handled by validation. File a bug report.' %(full_field)) elif field_type == 'object': # right now all arrays of objects use $ref raise ValidationError('Internal error: field %s in schema not handled by validation. File a bug report.' 
%(full_field)) else: # check basic types if field_type == 'string': if not isinstance(obj[f], str): raise ValidationError('Field %s does not have string type: %s' %(full_field, obj[f])) elif field_type == 'boolean': if not isinstance(obj[f], bool): raise ValidationError('Field %s does not have boolean type: %s' %(full_field, obj[f])) elif field_type == 'integer': if not isinstance(obj[f], int): raise ValidationError('Field %s does not have integer type: %s' %(full_field, obj[f])) elif field_type == 'number': if not isinstance(obj[f], float) and not isinstance(obj[f], int): raise ValidationError('Field %s does not have number type: %s' %(full_field, obj[f])) else: raise ValidationError('Internal error: Field %s with type %s in schema not handled by validation. File a bug report.' %(full_field, field_type)) return True # Preloaded schema CachedSchema = { 'Alignment': Schema('Alignment'), 'Rearrangement': Schema('Rearrangement'), 'Repertoire': Schema('Repertoire'), 'Ontology': Schema('Ontology'), 'Study': Schema('Study'), 'Subject': Schema('Subject'), 'Diagnosis': Schema('Diagnosis'), 'CellProcessing': Schema('CellProcessing'), 'PCRTarget': Schema('PCRTarget'), 'NucleicAcidProcessing': Schema('NucleicAcidProcessing'), 'SequencingRun': Schema('SequencingRun'), 'RawSequenceData': Schema('RawSequenceData'), 'DataProcessing': Schema('DataProcessing'), 'SampleProcessing': Schema('SampleProcessing') } AlignmentSchema = CachedSchema['Alignment'] RearrangementSchema = CachedSchema['Rearrangement'] RepertoireSchema = CachedSchema['Repertoire'] airr-1.3.1/airr/specs/0000755000076600000240000000000013741420532015270 5ustar vandej27staff00000000000000airr-1.3.1/airr/specs/__init__.py0000644000076600000240000000000013402556313017370 0ustar vandej27staff00000000000000airr-1.3.1/airr/specs/airr-schema.yaml0000644000076600000240000032010113727716214020355 0ustar vandej27staff00000000000000# # Schema definitions for AIRR standards objects # Info: title: AIRR Schema description: Schema 
definitions for AIRR standards objects version: "1.3" contact: name: AIRR Community url: https://github.com/airr-community license: name: Creative Commons Attribution 4.0 International url: https://creativecommons.org/licenses/by/4.0/ # Properties that are based upon an ontology use this # standard schema definition Ontology: discriminator: AIRR type: object properties: id: type: string description: CURIE of the concept, encoding the ontology and the local ID label: type: string description: Label of the concept in the respective ontology CURIEResolution: - curie_prefix: NCBITAXON iri_prefix: - "http://purl.obolibrary.org/obo/NCBITaxon_" - "http://purl.bioontology.org/ontology/NCBITAXON/" - curie_prefix: NCIT iri_prefix: - "http://purl.obolibrary.org/obo/NCIT_" - "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#" - curie_prefix: UO iri_prefix: - "http://purl.obolibrary.org/obo/UO_" - curie_prefix: DOID iri_prefix: - "http://purl.obolibrary.org/obo/DOID_" - curie_prefix: UBERON iri_prefix: - "http://purl.obolibrary.org/obo/UBERON_" - curie_prefix: CL iri_prefix: - "http://purl.obolibrary.org/obo/CL_" # AIRR specification extensions # # The schema definitions for AIRR standards objects is extended to # provide a number of AIRR specific attributes. This schema definition # specifies the structure, property names and data types. These # attributes are attached to an AIRR field with the x-airr property. Attributes: discriminator: AIRR type: object properties: miairr: type: string description: MiAIRR requirement level. enum: - essential - important - defined default: useful identifier: type: boolean description: > True if the field is an identifier required to link metadata and/or individual sequence records across objects in the complete AIRR Data Model and ADC API. default: false adc-query-support: type: boolean description: > True if an ADC API implementation must support queries on the field. 
If false, query support for the field in ADC API implementations is optional. default: false nullable: type: boolean description: True if the field may have a null value. default: true deprecated: type: boolean description: True if the field has been deprecated from the schema. default: false deprecated-description: type: string description: Information regarding the deprecation of the field. deprecated-replaced-by: type: array items: type: string description: The deprecated field is replaced by this list of fields. set: type: integer description: MiAIRR set subset: type: string description: MiAIRR subset name: type: string description: MiAIRR name format: type: string description: Field format. If null then assume the full range of the field data type enum: - ontology - controlled vocabulary - physical quantity ontology: type: object description: Ontology definition for field properties: draft: type: boolean description: Indicates if ontology definition is a draft top_node: type: object description: > Concept to use as top node for ontology. Note that this must have the same CURIE namespace as the actually annotated concept. 
properties: id: type: string description: CURIE for the top node term label: type: string description: Ontology name for the top node term # The overall study with a globally unique study_id Study: discriminator: AIRR type: object required: - study_id - study_title - study_type - inclusion_exclusion_criteria - grants - collected_by - lab_name - lab_address - submitted_by - pub_ids - keywords_study properties: study_id: type: string description: Unique ID assigned by study registry title: Study ID example: PRJNA001 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study ID study_title: type: string description: Descriptive study title title: Study title example: Effects of sun light exposure of the Treg repertoire x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study title study_type: $ref: '#/Ontology' description: Type of study design title: Study type example: id: NCIT:C15197 label: Case-Control Study x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study type format: ontology ontology: draft: false top_node: id: NCIT:C63536 label: Study study_description: type: string description: Generic study description title: Study description example: Longer description x-airr: nullable: true name: Study description adc-query-support: true inclusion_exclusion_criteria: type: string description: List of criteria for inclusion/exclusion for the study title: Study inclusion/exclusion criteria example: "Include: Clinical P. 
falciparum infection; Exclude: Seropositive for HIV" x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study inclusion/exclusion criteria grants: type: string description: Funding agencies and grant numbers title: Grant funding agency example: NIH, award number R01GM987654 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Grant funding agency collected_by: type: string description: > Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address. title: Contact information (data collection) example: Dr. P. Stibbons, p.stibbons@unseenu.edu x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Contact information (data collection) lab_name: type: string description: Department of data collector title: Lab name example: Department for Planar Immunology x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Lab name lab_address: type: string description: Institution and institutional address of data collector title: Lab address example: School of Medicine, Unseen University, Ankh-Morpork, Disk World x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Lab address submitted_by: type: string description: > Full contact information of the data depositor, i.e. the person submitting the data to a repository. This is supposed to be a short-lived and technical role until the submission is released.
title: Contact information (data deposition) example: Adrian Turnipseed, a.turnipseed@unseenu.edu x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Contact information (data deposition) pub_ids: type: string description: Publications describing the rationale and/or outcome of the study title: Relevant publications example: "PMID:85642" x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Relevant publications keywords_study: type: array items: type: string enum: - contains_ig - contains_tcr - contains_single_cell - contains_paired_chain description: Keywords describing properties of one or more data sets in a study title: Keywords for study example: - contains_ig - contains_paired_chain x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Keywords for study format: controlled vocabulary # 1-to-n relationship between a study and its subjects # subject_id is unique within a study Subject: discriminator: AIRR type: object required: - subject_id - synthetic - species - sex - age_min - age_max - age_unit - age_event - ancestry_population - ethnicity - race - strain_name - linked_subjects - link_type properties: subject_id: type: string description: Subject ID assigned by submitter, unique within study title: Subject ID example: SUB856413 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Subject ID synthetic: type: boolean description: TRUE for libraries in which the diversity has been synthetically generated (e.g. 
phage display) title: Synthetic library x-airr: miairr: essential nullable: false adc-query-support: true set: 1 subset: subject name: Synthetic library species: $ref: '#/Ontology' description: Binomial designation of subject's species title: Organism example: id: NCBITAXON:9606 label: Homo sapiens x-airr: miairr: essential nullable: false adc-query-support: true set: 1 subset: subject name: Organism format: ontology ontology: draft: false top_node: id: NCBITAXON:7776 label: Gnathostomata organism: $ref: '#/Ontology' description: Binomial designation of subject's species x-airr: deprecated: true deprecated-description: Field was renamed to species for clarity. deprecated-replaced-by: - species sex: type: string enum: - male - female - pooled - hermaphrodite - intersex - "not collected" - "not applicable" description: Biological sex of subject title: Sex example: female x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Sex format: controlled vocabulary age_min: type: number description: Specific age or lower boundary of age range. title: Age minimum example: 60 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age minimum age_max: type: number description: > Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. title: Age maximum example: 80 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age maximum age_unit: $ref: '#/Ontology' description: Unit of age range title: Age unit example: id: UO:0000036 label: year x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age unit format: ontology ontology: draft: false top_node: id: UO:0000003 label: time unit age_event: type: string description: > Event in the study schedule to which `Age` refers. For NCBI BioSample this MUST be `sampling`. 
For other implementations, submitters need to be aware that there is currently no mechanism to encode the potential delta between `Age event` and `Sample collection time`, hence the chosen events should be in temporal proximity. title: Age event example: enrollment x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age event age: type: string x-airr: deprecated: true deprecated-description: Split into two fields to specify as an age range. deprecated-replaced-by: - age_min - age_max - age_unit ancestry_population: type: string description: Broad geographic origin of ancestry (continent) title: Ancestry population example: list of continents, mixed or unknown x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Ancestry population ethnicity: type: string description: Ethnic group of subject (defined as cultural/language-based membership) title: Ethnicity example: English, Kurds, Manchu, Yakuts (and other fields from Wikipedia) x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Ethnicity race: type: string description: Racial group of subject (as defined by NIH) title: Race example: White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Race strain_name: type: string description: Non-human designation of the strain or breed of animal used title: Strain name example: C57BL/6J x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Strain name linked_subjects: type: string description: Subject ID to which `Relation type` refers title: Relation to other subjects example: SUB1355648 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Relation to other subjects link_type: type: string description: Relation between subject and
`linked_subjects`, can be genetic or environmental (e.g. exposure) title: Relation type example: father, daughter, household x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Relation type diagnosis: type: array description: Diagnosis information for subject items: $ref: '#/Diagnosis' x-airr: nullable: false adc-query-support: true # 1-to-n relationship between a subject and its diagnoses Diagnosis: discriminator: AIRR type: object required: - study_group_description - disease_diagnosis - disease_length - disease_stage - prior_therapies - immunogen - intervention - medical_history properties: study_group_description: type: string description: Designation of study arm to which the subject is assigned title: Study group description example: control x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Study group description disease_diagnosis: $ref: '#/Ontology' description: Diagnosis of subject title: Diagnosis example: id: DOID:9538 label: multiple myeloma x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Diagnosis format: ontology ontology: draft: false top_node: id: DOID:4 label: disease disease_length: type: string description: Time duration between initial diagnosis and current intervention title: Length of disease example: 23 months x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Length of disease format: physical quantity disease_stage: type: string description: Stage of disease at current intervention title: Disease stage example: Stage II x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Disease stage prior_therapies: type: string description: List of all relevant previous therapies applied to subject for treatment of `Diagnosis` title: Prior therapies for primary
disease under study example: melphalan/prednisone x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Prior therapies for primary disease under study immunogen: type: string description: Antigen, vaccine or drug applied to subject at this intervention title: Immunogen/agent example: bortezomib x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Immunogen/agent intervention: type: string description: Description of intervention title: Intervention definition example: systemic chemotherapy, 6 cycles, 1.25 mg/m2 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Intervention definition medical_history: type: string description: Medical history of subject that is relevant to assess the course of disease and/or treatment title: Other relevant medical history example: MGUS, first diagnosed 5 years prior x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Other relevant medical history # 1-to-n relationship between a subject and its samples # sample_id is unique within a study Sample: discriminator: AIRR type: object required: - sample_id - sample_type - tissue - anatomic_site - disease_state_sample - collection_time_point_relative - collection_time_point_reference - biomaterial_provider properties: sample_id: type: string description: Sample ID assigned by submitter, unique within study title: Biological sample ID example: SUP52415 x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Biological sample ID sample_type: type: string description: The way the sample was obtained, e.g. 
fine-needle aspirate, organ harvest, peripheral venous puncture title: Sample type example: Biopsy x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Sample type tissue: $ref: '#/Ontology' description: The actual tissue sampled, e.g. lymph node, liver, peripheral blood title: Tissue example: id: UBERON:0002371 label: bone marrow x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Tissue format: ontology ontology: draft: false top_node: id: UBERON:0010000 label: multicellular anatomical structure anatomic_site: type: string description: The anatomic location of the tissue, e.g. Inguinal, femur title: Anatomic site example: Iliac crest x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Anatomic site disease_state_sample: type: string description: Histopathologic evaluation of the sample title: Disease state of sample example: Tumor infiltration x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Disease state of sample collection_time_point_relative: type: string description: Time point at which sample was taken, relative to `Collection time event` title: Sample collection time example: "14 d" x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Sample collection time format: physical quantity collection_time_point_reference: type: string description: Event in the study schedule to which `Sample collection time` relates to title: Collection time event example: Primary vaccination x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Collection time event biomaterial_provider: type: string description: Name and address of the entity providing the sample title: Biomaterial provider example: Tissues-R-Us, Tampa, FL, USA x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Biomaterial 
provider # 1-to-n relationship between a sample and processing of its cells CellProcessing: discriminator: AIRR type: object required: - tissue_processing - cell_subset - cell_phenotype - single_cell - cell_number - cells_per_reaction - cell_storage - cell_quality - cell_isolation - cell_processing_protocol properties: tissue_processing: type: string description: Enzymatic digestion and/or physical methods used to isolate cells from sample title: Tissue processing example: Collagenase A/Dnase I digested, followed by Percoll gradient x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Tissue processing cell_subset: $ref: '#/Ontology' description: Commonly-used designation of isolated cell population title: Cell subset example: id: CL:0000972 label: class switched memory B cell x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell subset format: ontology ontology: draft: false top_node: id: CL:0000542 label: lymphocyte cell_phenotype: type: string description: List of cellular markers and their expression levels used to isolate the cell population title: Cell subset phenotype example: CD19+ CD38+ CD27+ IgM- IgD- x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell subset phenotype cell_species: $ref: '#/Ontology' description: > Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to `species`, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g. chimeric animal models. If set, this key will overwrite the `species` information for all lower layers of the schema. 
title: Cell species example: id: NCBITAXON:9606 label: Homo sapiens x-airr: miairr: defined nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell species format: ontology ontology: draft: false top_node: id: NCBITAXON:7776 label: Gnathostomata single_cell: type: boolean description: TRUE if single cells were isolated into separate compartments title: Single-cell sort x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Single-cell sort cell_number: type: integer description: Total number of cells that went into the experiment title: Number of cells in experiment example: 1000000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Number of cells in experiment cells_per_reaction: type: integer description: Number of cells for each biological replicate title: Number of cells per sequencing reaction example: 50000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Number of cells per sequencing reaction cell_storage: type: boolean description: TRUE if cells were cryo-preserved between isolation and further processing title: Cell storage example: TRUE x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell storage cell_quality: type: string description: Relative amount of viable cells after preparation and (if applicable) thawing title: Cell quality example: 90% viability as determined by 7-AAD x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell quality cell_isolation: type: string description: Description of the procedure used for marker-based isolation or enrichment of cells title: Cell isolation / enrichment procedure example: > Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer. 
x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell isolation / enrichment procedure cell_processing_protocol: type: string description: > Description of the methods applied to the sample including cell preparation/ isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript. title: Processing protocol example: Stimulated with anti-CD3/anti-CD28 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Processing protocol # object for PCR primer targets PCRTarget: discriminator: AIRR type: object required: - pcr_target_locus - forward_pcr_primer_target_location - reverse_pcr_primer_target_location properties: pcr_target_locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRD - TRG description: > Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. 
title: Target locus for PCR example: IGK x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid [pcr]) name: Target locus for PCR format: controlled vocabulary forward_pcr_primer_target_location: type: string description: Position of the most distal nucleotide templated by the forward primer or primer mix title: Forward PCR primer target location example: IGHV, +23 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid [pcr]) name: Forward PCR primer target location reverse_pcr_primer_target_location: type: string description: Position of the most proximal nucleotide templated by the reverse primer or primer mix title: Reverse PCR primer target location example: IGHG, +57 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid [pcr]) name: Reverse PCR primer target location # generally, a 1-to-1 relationship between a CellProcessing and processing of its nucleic acid # but may be 1-to-n for technical replicates. 
NucleicAcidProcessing: discriminator: AIRR type: object required: - template_class - template_quality - template_amount - library_generation_method - library_generation_protocol - library_generation_kit_version - complete_sequences - physical_linkage properties: template_class: type: string enum: - DNA - RNA description: > The class of nucleic acid that was used as primary starting material for the following procedures title: Target substrate example: RNA x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Target substrate format: controlled vocabulary template_quality: type: string description: Description and results of the quality control performed on the template material title: Target substrate quality example: RIN 9.2 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Target substrate quality template_amount: type: string description: Amount of template that went into the process title: Template amount example: 1000 ng x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Template amount format: physical quantity library_generation_method: type: string enum: - "PCR" - "RT(RHP)+PCR" - "RT(oligo-dT)+PCR" - "RT(oligo-dT)+TS+PCR" - "RT(oligo-dT)+TS(UMI)+PCR" - "RT(specific)+PCR" - "RT(specific)+TS+PCR" - "RT(specific)+TS(UMI)+PCR" - "RT(specific+UMI)+PCR" - "RT(specific+UMI)+TS+PCR" - "RT(specific)+TS" - "other" description: Generic type of library generation title: Library generation method example: RT(oligo-dT)+TS(UMI)+PCR x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Library generation method format: controlled vocabulary library_generation_protocol: type: string description: Description of processes applied to substrate to obtain a library that is ready for sequencing title: Library generation protocol example: cDNA was 
generated using x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Library generation protocol library_generation_kit_version: type: string description: When using a library generation protocol from a commercial provider, provide the protocol version number title: Protocol IDs example: v2.1 (2016-09-15) x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Protocol IDs pcr_target: type: array description: > If a PCR step was performed that specifically targets the IG/TR loci, the target and primer locations need to be provided here. This field holds an array of PCRTarget objects, so that multiplex PCR setups amplifying multiple loci at the same time can be annotated using one record per locus. PCR setups not targeting any specific locus must not annotate this field but select the appropriate library_generation_method instead. items: $ref: '#/PCRTarget' x-airr: nullable: false adc-query-support: true complete_sequences: type: string enum: - partial - complete - "complete+untemplated" - mixed description: > To be considered `complete`, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5' of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered `complete+untemplated`, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. `mixed` should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value. 
title: Complete sequences example: partial x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Complete sequences format: controlled vocabulary physical_linkage: type: string enum: - none - "hetero_head-head" - "hetero_tail-head" - "hetero_prelinked" description: > In case an experimental setup is used that physically links nucleic acids derived from distinct `Rearrangements` before library preparation, this field describes the mode of that linkage. All `hetero_*` terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. `*_head-head` refers to techniques that link the 5' ends of transcripts in a single-cell context. `*_tail-head` refers to techniques that link the 3' end of one transcript to the 5' end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. `*_prelinked` refers to constructs in which the linkage was already present on the DNA level (e.g. scFv). 
title: Physical linkage of different rearrangements example: hetero_head-head x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Physical linkage of different rearrangements format: controlled vocabulary # 1-to-n relationship between a NucleicAcidProcessing and SequencingRun with resultant raw sequence file(s) SequencingRun: discriminator: AIRR type: object required: - sequencing_run_id - total_reads_passing_qc_filter - sequencing_platform - sequencing_facility - sequencing_run_date - sequencing_kit properties: sequencing_run_id: type: string description: ID of sequencing run assigned by the sequencing facility title: Batch number example: 160101_M01234 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Batch number total_reads_passing_qc_filter: type: integer description: Number of usable reads for analysis title: Total reads passing QC filter example: 10365118 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Total reads passing QC filter sequencing_platform: type: string description: Designation of sequencing instrument used title: Sequencing platform example: Alumina LoSeq 1000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Sequencing platform sequencing_facility: type: string description: Name and address of sequencing facility title: Sequencing facility example: Seqs-R-Us, Vancouver, BC, Canada x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Sequencing facility sequencing_run_date: type: string description: Date of sequencing run title: Date of sequencing run format: date example: 2016-12-16 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Date of sequencing run sequencing_kit: type: string description: Name, 
manufacturer, order and lot numbers of sequencing kit title: Sequencing kit example: "FullSeq 600, Alumina, #M123456C0, 789G1HK" x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Sequencing kit sequencing_files: $ref: '#/RawSequenceData' description: Set of sequencing files produced by the sequencing run x-airr: nullable: false adc-query-support: true # Resultant raw sequencing files from a SequencingRun RawSequenceData: discriminator: AIRR type: object required: - file_type - filename - read_direction - read_length - paired_filename - paired_read_direction - paired_read_length properties: file_type: type: string description: File format for the raw reads or sequences title: Raw sequencing data file type enum: - fasta - fastq x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Raw sequencing data file type format: controlled vocabulary filename: type: string description: File name for the raw reads or sequences. The first file in paired-read sequencing. title: Raw sequencing data file name example: MS10R-NMonson-C7JR9_S1_R1_001.fastq x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Raw sequencing data file name read_direction: type: string description: Read direction for the raw reads or sequences. The first file in paired-read sequencing. 
title: Read direction example: forward enum: - forward - reverse - mixed x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Read direction format: controlled vocabulary read_length: type: integer description: Read length in bases for the first file in paired-read sequencing title: Forward read length example: 300 x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: process (sequencing) name: Forward read length paired_filename: type: string description: File name for the second file in paired-read sequencing title: Paired raw sequencing data file name example: MS10R-NMonson-C7JR9_S1_R2_001.fastq x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Paired raw sequencing data file name paired_read_direction: type: string description: Read direction for the second file in paired-read sequencing title: Paired read direction example: reverse enum: - forward - reverse - mixed x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Paired read direction format: controlled vocabulary paired_read_length: type: integer description: Read length in bases for the second file in paired-read sequencing title: Paired read length example: 300 x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: process (sequencing) name: Paired read length # 1-to-n relationship between a repertoire and data processing # # Set of annotated rearrangement sequences produced by # data processing upon the raw sequence data for a repertoire. DataProcessing: discriminator: AIRR type: object required: - software_versions - paired_reads_assembly - quality_thresholds - primer_match_cutoffs - collapsing_method - data_processing_protocols - germline_database properties: data_processing_id: type: string description: Identifier for the data processing object. 
title: Data processing ID x-airr: nullable: true name: Data processing ID adc-query-support: true identifier: true primary_annotation: type: boolean default: false description: > If true, indicates this is the primary or default data processing for the repertoire and its rearrangements. If false, indicates this is a secondary or additional data processing. title: Primary annotation x-airr: nullable: false adc-query-support: true identifier: true software_versions: type: string description: Version number and / or date, include company pipelines title: Software tools and version numbers example: IgBLAST 1.6 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Software tools and version numbers paired_reads_assembly: type: string description: How paired end reads were assembled into a single receptor sequence title: Paired read assembly example: PandaSeq (minimal overlap 50, threshold 0.8) x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Paired read assembly quality_thresholds: type: string description: How sequences were removed from (4) based on base quality scores title: Quality thresholds example: Average Phred score >=20 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Quality thresholds primer_match_cutoffs: type: string description: How primers were identified in the sequences, were they removed/masked/etc? 
title: Primer match cutoffs example: Hamming distance <= 2 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Primer match cutoffs collapsing_method: type: string description: The method used for combining multiple sequences from (4) into a single sequence in (5) title: Collapsing method example: MUSCLE 3.8.31 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Collapsing method data_processing_protocols: type: string description: General description of how QC is performed title: Data processing protocols example: Data was processed using [...] x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Data processing protocols data_processing_files: type: array items: type: string description: Array of file names for data produced by this data processing. title: Processed data file names example: - 'ERR1278153_aa.txz' - 'ERR1278153_ab.txz' - 'ERR1278153_ac.txz' x-airr: nullable: true adc-query-support: true name: Processed data file names germline_database: type: string description: Source of germline V(D)J genes with version number or date accessed. title: V(D)J germline reference database example: ENSEMBL, Homo sapiens build 90, 2017-10-01 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: data (processed sequence) name: V(D)J germline reference database analysis_provenance_id: type: string description: Identifier for machine-readable PROV model of analysis provenance title: Analysis provenance ID x-airr: nullable: true adc-query-support: true SampleProcessing: discriminator: AIRR type: object properties: sample_processing_id: type: string description: > Identifier for the sample processing object. This field should be unique within the repertoire. 
This field can be used to uniquely identify the combination of sample, cell processing, nucleic acid processing and sequencing run information for the repertoire. title: Sample processing ID x-airr: nullable: true name: Sample processing ID adc-query-support: true identifier: true # The composite schema for the repertoire object # # This represents a sample repertoire as defined by the study # and experimentally observed by raw sequence data. A repertoire # can only be for one subject but may include multiple samples. Repertoire: discriminator: AIRR type: object required: - study - subject - sample - data_processing properties: repertoire_id: type: string description: > Identifier for the repertoire object. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement. 
title: Repertoire ID x-airr: nullable: true adc-query-support: true identifier: true repertoire_name: type: string description: Short generic display name for the repertoire title: Repertoire name x-airr: nullable: true name: Repertoire name adc-query-support: true repertoire_description: type: string description: Generic repertoire description title: Repertoire description x-airr: nullable: true name: Repertoire description adc-query-support: true study: $ref: '#/Study' description: Study object x-airr: nullable: false adc-query-support: true subject: $ref: '#/Subject' description: Subject object x-airr: nullable: false adc-query-support: true sample: type: array description: List of Sample objects items: allOf: - $ref: '#/SampleProcessing' - $ref: '#/Sample' - $ref: '#/CellProcessing' - $ref: '#/NucleicAcidProcessing' - $ref: '#/SequencingRun' x-airr: nullable: false adc-query-support: true data_processing: type: array description: List of Data Processing objects items: $ref: '#/DataProcessing' x-airr: nullable: false adc-query-support: true Alignment: discriminator: AIRR type: object required: - sequence_id - segment - call - score - cigar properties: sequence_id: type: string description: > Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. segment: type: string description: > The segment for this alignment. One of V, D, J or C. rev_comp: type: boolean description: > Alignment result is from the reverse complement of the query sequence. call: type: string description: > Gene assignment with allele. score: type: number description: > Alignment score. identity: type: number description: > Alignment fractional identity. 
support: type: number description: > Alignment E-value, p-value, likelihood, probability or other similar measure of support for the gene assignment as defined by the alignment tool. cigar: type: string description: > Alignment CIGAR string. sequence_start: type: integer description: > Start position of the segment in the query sequence (1-based closed interval). sequence_end: type: integer description: > End position of the segment in the query sequence (1-based closed interval). germline_start: type: integer description: > Alignment start position in the reference sequence (1-based closed interval). germline_end: type: integer description: > Alignment end position in the reference sequence (1-based closed interval). rank: type: integer description: > Alignment rank. rearrangement_id: type: string description: > Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications. x-airr: deprecated: true deprecated-description: Field has been merged with sequence_id to avoid confusion. deprecated-replaced-by: - sequence_id data_processing_id: type: string description: > Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty then the primary data processing object is assumed. germline_database: type: string description: Source of germline V(D)J genes with version number or date accessed. example: ENSEMBL, Homo sapiens build 90, 2017-10-01 x-airr: deprecated: true deprecated-description: Field was moved up to the DataProcessing level to avoid data duplication. 
deprecated-replaced-by: - "DataProcessing:germline_database" # The extended rearrangement object Rearrangement: discriminator: AIRR type: object required: - sequence_id - sequence - rev_comp - productive - v_call - d_call - j_call - sequence_alignment - germline_alignment - junction - junction_aa - v_cigar - d_cigar - j_cigar properties: sequence_id: type: string description: > Unique query sequence identifier for the Rearrangement. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. When downloaded from an AIRR Data Commons repository, this will usually be a universally unique record locator for linking with other objects in the AIRR Data Model. x-airr: adc-query-support: true identifier: true sequence: type: string description: > The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. sequence_aa: type: string description: > Amino acid translation of the query nucleotide sequence. rev_comp: type: boolean description: > True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. productive: type: boolean description: > True if the V(D)J sequence is predicted to be productive. x-airr: adc-query-support: true vj_in_frame: type: boolean description: True if the V and J gene alignments are in-frame. stop_codon: type: boolean description: True if the aligned sequence contains a stop codon. complete_vdj: type: boolean description: > True if the sequence alignment spans the entire V(D)J region. 
Meaning, sequence_alignment includes both the first V gene codon that encodes the mature polypeptide chain (i.e., after the leader sequence) and the last complete codon of the J gene (i.e., before the J-C splice site). This does not require an absence of deletions within the internal FWR and CDR regions of the alignment. locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRD - TRG description: > Gene locus (chain type). Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. title: Gene locus example: IGH x-airr: nullable: true adc-query-support: true name: Gene locus format: controlled vocabulary v_call: type: string description: > V gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01 if using IMGT/GENE-DB). title: V gene with allele example: IGHV4-59*01 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: V gene with allele d_call: type: string description: > First or only D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB). title: D gene with allele example: IGHD3-10*01 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: D gene with allele d2_call: type: string description: > Second D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB). example: IGHD3-10*01 j_call: type: string description: > J gene with allele. 
If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02 if using IMGT/GENE-DB). title: J gene with allele example: IGHJ4*02 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: J gene with allele c_call: type: string description: > Constant region gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHG1*01 if using IMGT/GENE-DB). title: C region example: IGHG1*01 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: C region sequence_alignment: type: string description: > Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. sequence_alignment_aa: type: string description: > Amino acid translation of the aligned query sequence. germline_alignment: type: string description: > Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). germline_alignment_aa: type: string description: > Amino acid translation of the assembled germline sequence. junction: type: string description: > Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. title: IMGT-JUNCTION nucleotide sequence example: TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG x-airr: miairr: important nullable: true set: 6 subset: data (processed sequence) name: IMGT-JUNCTION nucleotide sequence junction_aa: type: string description: > Amino acid translation of the junction. 
title: IMGT-JUNCTION amino acid sequence example: CARAGVYDGYTMDYW x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: IMGT-JUNCTION amino acid sequence np1: type: string description: > Nucleotide sequence of the combined N/P region between the V gene and first D gene alignment or between the V gene and J gene alignments. np1_aa: type: string description: > Amino acid translation of the np1 field. np2: type: string description: > Nucleotide sequence of the combined N/P region between either the first D gene and J gene alignments or the first D gene and second D gene alignments. np2_aa: type: string description: > Amino acid translation of the np2 field. np3: type: string description: > Nucleotide sequence of the combined N/P region between the second D gene and J gene alignments. np3_aa: type: string description: > Amino acid translation of the np3 field. cdr1: type: string description: > Nucleotide sequence of the aligned CDR1 region. cdr1_aa: type: string description: > Amino acid translation of the cdr1 field. cdr2: type: string description: > Nucleotide sequence of the aligned CDR2 region. cdr2_aa: type: string description: > Amino acid translation of the cdr2 field. cdr3: type: string description: > Nucleotide sequence of the aligned CDR3 region. cdr3_aa: type: string description: > Amino acid translation of the cdr3 field. fwr1: type: string description: > Nucleotide sequence of the aligned FWR1 region. fwr1_aa: type: string description: > Amino acid translation of the fwr1 field. fwr2: type: string description: > Nucleotide sequence of the aligned FWR2 region. fwr2_aa: type: string description: > Amino acid translation of the fwr2 field. fwr3: type: string description: > Nucleotide sequence of the aligned FWR3 region. fwr3_aa: type: string description: > Amino acid translation of the fwr3 field. fwr4: type: string description: > Nucleotide sequence of the aligned FWR4 region. 
fwr4_aa: type: string description: > Amino acid translation of the fwr4 field. v_score: type: number description: Alignment score for the V gene. v_identity: type: number description: Fractional identity for the V gene alignment. v_support: type: number description: > V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool. v_cigar: type: string description: CIGAR string for the V gene alignment. d_score: type: number description: Alignment score for the first or only D gene alignment. d_identity: type: number description: Fractional identity for the first or only D gene alignment. d_support: type: number description: > D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the first or only D gene as defined by the alignment tool. d_cigar: type: string description: CIGAR string for the first or only D gene alignment. d2_score: type: number description: Alignment score for the second D gene alignment. d2_identity: type: number description: Fractional identity for the second D gene alignment. d2_support: type: number description: > D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the second D gene as defined by the alignment tool. d2_cigar: type: string description: CIGAR string for the second D gene alignment. j_score: type: number description: Alignment score for the J gene alignment. j_identity: type: number description: Fractional identity for the J gene alignment. j_support: type: number description: > J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool. j_cigar: type: string description: CIGAR string for the J gene alignment. c_score: type: number description: Alignment score for the C gene alignment. 
c_identity: type: number description: Fractional identity for the C gene alignment. c_support: type: number description: > C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool. c_cigar: type: string description: CIGAR string for the C gene alignment. v_sequence_start: type: integer description: > Start position of the V gene in the query sequence (1-based closed interval). v_sequence_end: type: integer description: > End position of the V gene in the query sequence (1-based closed interval). v_germline_start: type: integer description: > Alignment start position in the V gene reference sequence (1-based closed interval). v_germline_end: type: integer description: > Alignment end position in the V gene reference sequence (1-based closed interval). v_alignment_start: type: integer description: > Start position of the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). v_alignment_end: type: integer description: > End position of the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d_sequence_start: type: integer description: > Start position of the first or only D gene in the query sequence. (1-based closed interval). d_sequence_end: type: integer description: > End position of the first or only D gene in the query sequence. (1-based closed interval). d_germline_start: type: integer description: > Alignment start position in the D gene reference sequence for the first or only D gene (1-based closed interval). d_germline_end: type: integer description: > Alignment end position in the D gene reference sequence for the first or only D gene (1-based closed interval). d_alignment_start: type: integer description: > Start position of the first or only D gene in both the sequence_alignment and germline_alignment fields (1-based closed interval). 
d_alignment_end: type: integer description: > End position of the first or only D gene in both the sequence_alignment and germline_alignment fields (1-based closed interval). d2_sequence_start: type: integer description: > Start position of the second D gene in the query sequence (1-based closed interval). d2_sequence_end: type: integer description: > End position of the second D gene in the query sequence (1-based closed interval). d2_germline_start: type: integer description: > Alignment start position in the second D gene reference sequence (1-based closed interval). d2_germline_end: type: integer description: > Alignment end position in the second D gene reference sequence (1-based closed interval). d2_alignment_start: type: integer description: > Start position of the second D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d2_alignment_end: type: integer description: > End position of the second D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_sequence_start: type: integer description: > Start position of the J gene in the query sequence (1-based closed interval). j_sequence_end: type: integer description: > End position of the J gene in the query sequence (1-based closed interval). j_germline_start: type: integer description: > Alignment start position in the J gene reference sequence (1-based closed interval). j_germline_end: type: integer description: > Alignment end position in the J gene reference sequence (1-based closed interval). j_alignment_start: type: integer description: > Start position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_alignment_end: type: integer description: > End position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). 
cdr1_start: type: integer description: CDR1 start position in the query sequence (1-based closed interval). cdr1_end: type: integer description: CDR1 end position in the query sequence (1-based closed interval). cdr2_start: type: integer description: CDR2 start position in the query sequence (1-based closed interval). cdr2_end: type: integer description: CDR2 end position in the query sequence (1-based closed interval). cdr3_start: type: integer description: CDR3 start position in the query sequence (1-based closed interval). cdr3_end: type: integer description: CDR3 end position in the query sequence (1-based closed interval). fwr1_start: type: integer description: FWR1 start position in the query sequence (1-based closed interval). fwr1_end: type: integer description: FWR1 end position in the query sequence (1-based closed interval). fwr2_start: type: integer description: FWR2 start position in the query sequence (1-based closed interval). fwr2_end: type: integer description: FWR2 end position in the query sequence (1-based closed interval). fwr3_start: type: integer description: FWR3 start position in the query sequence (1-based closed interval). fwr3_end: type: integer description: FWR3 end position in the query sequence (1-based closed interval). fwr4_start: type: integer description: FWR4 start position in the query sequence (1-based closed interval). fwr4_end: type: integer description: FWR4 end position in the query sequence (1-based closed interval). v_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the V gene, including any indel corrections or numbering spacers. v_sequence_alignment_aa: type: string description: > Amino acid translation of the v_sequence_alignment field. d_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the first or only D gene, including any indel corrections or numbering spacers. 
d_sequence_alignment_aa: type: string description: > Amino acid translation of the d_sequence_alignment field. d2_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the second D gene, including any indel corrections or numbering spacers. d2_sequence_alignment_aa: type: string description: > Amino acid translation of the d2_sequence_alignment field. j_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the J gene, including any indel corrections or numbering spacers. j_sequence_alignment_aa: type: string description: > Amino acid translation of the j_sequence_alignment field. c_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers. c_sequence_alignment_aa: type: string description: > Amino acid translation of the c_sequence_alignment field. v_germline_alignment: type: string description: > Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any). v_germline_alignment_aa: type: string description: > Amino acid translation of the v_germline_alignment field. d_germline_alignment: type: string description: > Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any). d_germline_alignment_aa: type: string description: > Amino acid translation of the d_germline_alignment field. d2_germline_alignment: type: string description: > Aligned D gene germline sequence spanning the same region as the d2_sequence_alignment field and including the same set of corrections and spacers (if any). d2_germline_alignment_aa: type: string description: > Amino acid translation of the d2_germline_alignment field. 
j_germline_alignment: type: string description: > Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any). j_germline_alignment_aa: type: string description: > Amino acid translation of the j_germline_alignment field. c_germline_alignment: type: string description: > Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any). c_germline_alignment_aa: type: string description: > Amino acid translation of the c_germline_alignment field. junction_length: type: integer description: Number of nucleotides in the junction sequence. junction_aa_length: type: integer description: Number of amino acids in the junction sequence. x-airr: adc-query-support: true np1_length: type: integer description: > Number of nucleotides between the V gene and first D gene alignments or between the V gene and J gene alignments. np2_length: type: integer description: > Number of nucleotides between either the first D gene and J gene alignments or the first D gene and second D gene alignments. np3_length: type: integer description: > Number of nucleotides between the second D gene and J gene alignments. n1_length: type: integer description: Number of untemplated nucleotides 5' of the first or only D gene alignment. n2_length: type: integer description: Number of untemplated nucleotides 3' of the first or only D gene alignment. n3_length: type: integer description: Number of untemplated nucleotides 3' of the second D gene alignment. p3v_length: type: integer description: Number of palindromic nucleotides 3' of the V gene alignment. p5d_length: type: integer description: Number of palindromic nucleotides 5' of the first or only D gene alignment. p3d_length: type: integer description: Number of palindromic nucleotides 3' of the first or only D gene alignment. 
p5d2_length: type: integer description: Number of palindromic nucleotides 5' of the second D gene alignment. p3d2_length: type: integer description: Number of palindromic nucleotides 3' of the second D gene alignment. p5j_length: type: integer description: Number of palindromic nucleotides 5' of the J gene alignment. consensus_count: type: integer description: > Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. duplicate_count: type: integer description: > Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs. title: Read count example: 123 x-airr: miairr: important nullable: true set: 6 subset: data (processed sequence) name: Read count cell_id: type: string description: > Identifier defining the cell of origin for the query sequence. title: Cell index example: W06_046_091 x-airr: miairr: important nullable: true adc-query-support: true identifier: true set: 6 subset: data (processed sequence) name: Cell index clone_id: type: string description: Clonal cluster assignment for the query sequence. x-airr: nullable: true adc-query-support: true identifier: true repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. x-airr: nullable: true adc-query-support: true identifier: true sample_processing_id: type: string description: > Identifier to the sample processing object in the repertoire metadata for this rearrangement. If the repertoire has a single sample then this field may be empty or missing. If the repertoire has multiple samples then this field may be empty or missing if the sample cannot be differentiated or the relationship is not maintained by the data processing. 
x-airr: nullable: true adc-query-support: true identifier: true data_processing_id: type: string description: > Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty then the primary data processing object is assumed. x-airr: nullable: true adc-query-support: true identifier: true rearrangement_id: type: string description: > Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications. x-airr: deprecated: true deprecated-description: Field has been merged with sequence_id to avoid confusion. deprecated-replaced-by: - sequence_id rearrangement_set_id: type: string description: > Identifier for grouping Rearrangement objects. x-airr: deprecated: true deprecated-description: Field has been replaced by other specialized identifiers. deprecated-replaced-by: - repertoire_id - sample_processing_id - data_processing_id germline_database: type: string description: Source of germline V(D)J genes with version number or date accessed. example: ENSEMBL, Homo sapiens build 90, 2017-10-01 x-airr: deprecated: true deprecated-description: Field was moved up to the DataProcessing level to avoid data duplication. deprecated-replaced-by: - "DataProcessing:germline_database" # A unique inferred clone object that has been constructed within a single data processing # for a single repertoire and a subset of its sequences and/or rearrangements. Clone: discriminator: AIRR type: object required: - clone_id - germline_alignment properties: clone_id: type: string description: Identifier for the clone. repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. x-airr: nullable: true adc-query-support: true data_processing_id: type: string description: Identifier of the data processing object in the repertoire metadata for this clone. 
x-airr: nullable: true adc-query-support: true sequences: type: array items: type: string description: > List of sequence_id strings that act as keys to the Rearrangement records for members of the clone. v_call: type: string description: > V gene with allele of the inferred ancestor of the clone. For example, IGHV4-59*01. example: IGHV4-59*01 d_call: type: string description: > D gene with allele of the inferred ancestor of the clone. For example, IGHD3-10*01. example: IGHD3-10*01 j_call: type: string description: > J gene with allele of the inferred ancestor of the clone. For example, IGHJ4*02. example: IGHJ4*02 junction: type: string description: > Nucleotide sequence for the junction region of the inferred ancestor of the clone, where the junction is defined as the CDR3 plus the two flanking conserved codons. junction_aa: type: string description: > Amino acid translation of the junction. junction_length: type: integer description: Number of nucleotides in the junction. junction_aa_length: type: integer description: Number of amino acids in junction_aa. germline_alignment: type: string description: > Assembled, aligned, full-length inferred ancestor of the clone spanning the same region as the sequence_alignment field of nodes (typically the V(D)J region) and including the same set of corrections and spacers (if any). germline_alignment_aa: type: string description: > Amino acid translation of germline_alignment. v_alignment_start: type: integer description: > Start position in the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). v_alignment_end: type: integer description: > End position in the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d_alignment_start: type: integer description: > Start position of the D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). 
d_alignment_end: type: integer description: > End position of the D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_alignment_start: type: integer description: > Start position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_alignment_end: type: integer description: > End position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). junction_start: type: integer description: Junction region start position in the alignment (1-based closed interval). junction_end: type: integer description: Junction region end position in the alignment (1-based closed interval). sequence_count: type: integer description: Number of Rearrangement records (sequences) included in this clone. seed_id: type: string description: sequence_id of the seed sequence. Empty string (or null) if there is no seed sequence. # 1-to-n relationship for a clone to its trees. Tree: discriminator: AIRR type: object required: - tree_id - clone_id - newick properties: tree_id: type: string description: Identifier for the tree. clone_id: type: string description: Identifier for the clone. newick: type: string description: Newick string of the tree edges. nodes: type: object description: Dictionary of nodes in the tree, keyed by sequence_id string additionalProperties: $ref: '#/Node' # 1-to-n relationship between a tree and its nodes Node: discriminator: AIRR type: object required: - sequence_id properties: sequence_id: type: string description: > Identifier for this node that matches the identifier in the newick string and, where possible, the sequence_id in the source repertoire. sequence_alignment: type: string description: > Nucleotide sequence of the node, aligned to the germline_alignment for this clone, including any indel corrections or spacers. 
junction: type: string description: > Junction region nucleotide sequence for the node, where the junction is defined as the CDR3 plus the two flanking conserved codons. junction_aa: type: string description: > Amino acid translation of the junction. # The cell object acts as point of reference for all data that can be related # to an individual cell, either by direct observation or inference. Cell: discriminator: AIRR type: object required: - cell_id #redefined cell_id > how to centralize it in the yaml - rearrangements - repertoire_id - virtual_pairing properties: cell_id: type: string description: > Identifier defining the cell of origin for the query sequence. title: Cell index example: W06_046_091 x-airr: miairr: defined nullable: false adc-query-support: true name: Cell index rearrangements: type: array description: > Array of sequence identifiers defined for the Rearrangement object title: Cell-associated rearrangements items: type: string example: [id1, id2] #empty vs NULL? x-airr: miairr: defined nullable: true adc-query-support: true name: Cell-associated rearrangements receptors: type: array description: > Array of receptor identifiers defined for the Receptor object title: Cell-associated receptors items: type: string example: [id1, id2] #empty vs NULL? x-airr: miairr: defined nullable: true adc-query-support: true name: Cell-associated receptors repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. title: Parental repertoire of cell x-airr: miairr: defined nullable: true adc-query-support: true name: Parental repertoire of cell data_processing_id: type: string description: Identifier of the data processing object in the repertoire metadata for this cell. 
title: Data processing for cell x-airr: miairr: defined nullable: true adc-query-support: true name: Data processing for cell expression_study_method: type: string enum: flow cytometry single-cell transcriptome description: > Keyword describing the methodology used to assess expression. The values for this field MUST come from a controlled vocabulary. x-airr: miairr: defined nullable: true adc-api-optional: true expression_raw_doi: type: string description: > DOI of the raw data set containing the current event. x-airr: miairr: defined nullable: true adc-api-optional: true expression_index: type: string description: > Index addressing the current event within the raw data set. x-airr: miairr: defined nullable: true adc-api-optional: true expression_tabular: type: array description: > Expression definitions for single-cell items: type: object properties: expression_marker: type: string description: > Standardized designation of the transcript or epitope. example: CD27 expression_value: type: integer description: > Transformed and normalized expression level. example: 14567 virtual_pairing: type: boolean description: > Boolean to indicate if pairing was inferred. 
title: Virtual pairing x-airr: miairr: defined nullable: true # assuming only done for sc experiments, otherwise does not exist adc-query-support: true name: Virtual pairing airr-1.3.1/airr/specs/blank.airr.yaml0000644000076600000240000000625413741416302020206 0ustar vandej27staff00000000000000# # blank metadata template # Repertoire: - repertoire_id: null study: study_id: null study_title: null study_type: id: null label: null study_description: null inclusion_exclusion_criteria: null lab_name: null lab_address: null submitted_by: null collected_by: null grants: null pub_ids: null keywords_study: null subject: subject_id: null synthetic: false species: id: null label: null sex: null age_min: null age_max: null age_unit: id: null label: null age_event: null ancestry_population: null ethnicity: null race: null strain_name: null linked_subjects: null link_type: null diagnosis: - study_group_description: null disease_diagnosis: id: null label: null disease_length: null disease_stage: null prior_therapies: null immunogen: null intervention: null medical_history: null sample: - sample_processing_id: null sample_id: null sample_type: null tissue: id: null label: null anatomic_site: null disease_state_sample: null collection_time_point_relative: null collection_time_point_reference: null biomaterial_provider: null # cell processing tissue_processing: null cell_subset: id: null label: null cell_phenotype: null cell_species: id: null label: null single_cell: false cell_number: null cells_per_reaction: null cell_storage: false cell_quality: null cell_isolation: null cell_processing_protocol: null # nucleic acid processing template_class: "" template_quality: null template_amount: null library_generation_method: "" library_generation_protocol: null library_generation_kit_version: null pcr_target: - pcr_target_locus: null forward_pcr_primer_target_location: null reverse_pcr_primer_target_location: null complete_sequences: "partial" physical_linkage: "none" # sequencing run 
sequencing_run_id: null total_reads_passing_qc_filter: null sequencing_platform: null sequencing_facility: null sequencing_run_date: null sequencing_kit: null # raw data sequencing_files: file_type: null filename: null read_direction: null read_length: null paired_filename: null paired_read_direction: null paired_read_length: null data_processing: - data_processing_id: null primary_annotation: false software_versions: null paired_reads_assembly: null quality_thresholds: null primer_match_cutoffs: null collapsing_method: null data_processing_protocols: null data_processing_files: null germline_database: null analysis_provenance_id: null airr-1.3.1/airr/tools.py0000644000076600000240000002042213557414241015672 0ustar vandej27staff00000000000000""" AIRR tools and utilities """ # Copyright (c) 2018 AIRR Community # # This file is part of the AIRR Community Standards. # # Author: Scott Christley # Author: Jason Anthony Vander Heiden # Date: March 29, 2018 # # This library is free software; you can redistribute it and/or modify # it under the terms of the Creative Commons Attribution 4.0 License. # # This library is distributed in the hope that it will be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # Creative Commons Attribution 4.0 License for more details. # System imports import argparse import sys # Local imports from airr import __version__ import airr.interface # internal wrapper function before calling merge interface method def merge_cmd(out_file, airr_files, drop=False, debug=False): """ Merge one or more AIRR rearrangements files Arguments: out_file (str): output file name. airr_files (list): list of input files to merge. drop (bool): drop flag. If True then drop fields that do not exist in all input files, otherwise combine fields from all input files. debug (bool): debug flag. If True print debugging information to standard error. 
Returns: bool: True if files were successfully merged, otherwise False. """ return airr.interface.merge_rearrangement(out_file, airr_files, drop=drop, debug=debug) # internal wrapper function before calling validate interface method def validate_cmd(airr_files, debug=True): """ Validates one or more AIRR rearrangements files Arguments: airr_files (list): list of input files to validate. debug (bool): debug flag. If True print debugging information to standard error. Returns: boolean: True if all files passed validation, otherwise False """ try: valid = [airr.interface.validate_rearrangement(f, debug=debug) for f in airr_files] return all(valid) except Exception as err: sys.stderr.write('Error occurred while validating AIRR rearrangement files: ' + str(err) + '\n') return False # internal wrapper function before calling validate interface method def validate_repertoire_cmd(airr_files, debug=True): """ Validates one or more AIRR repertoire metadata files Arguments: airr_files (list): list of input files to validate. debug (bool): debug flag. If True print debugging information to standard error. Returns: boolean: True if all files passed validation, otherwise False """ try: valid = [airr.interface.validate_repertoire(f, debug=debug) for f in airr_files] return all(valid) except Exception as err: sys.stderr.write('Error occurred while validating AIRR repertoire metadata files: ' + str(err) + '\n') return False def define_args(): """ Define commandline arguments Returns: argparse.ArgumentParser: argument parser. 
""" parser = argparse.ArgumentParser(add_help=False, description='AIRR Community Standards utility commands.') group_help = parser.add_argument_group('help') group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s' % __version__) # Setup subparsers subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='Database operation') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Define arguments common to all subcommands common_parser = argparse.ArgumentParser(add_help=False) common_help = common_parser.add_argument_group('help') common_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s' % __version__) common_help.add_argument('-h', '--help', action='help', help='show this help message and exit') # TODO: workflow provenance # group_prov = common_parser.add_argument_group('provenance') # group_prov.add_argument('-p', '--provenance', action='store', dest='prov_file', default=None, # help='''File name for storing workflow provenance. If specified, airr-tools # will record provenance for all activities performed.''') # TODO: study metadata # group_meta = common_parser.add_argument_group('study metadata') # group_meta.add_argument('-m', '--metadata', action='store', dest='metadata_file', default=None, # help='''File name containing study metadata.''') # Subparser to merge files parser_merge = subparsers.add_parser('merge', parents=[common_parser], add_help=False, help='Merge AIRR rearrangement files.', description='Merge AIRR rearrangement files.') group_merge = parser_merge.add_argument_group('merge arguments') group_merge.add_argument('-o', action='store', dest='out_file', required=True, help='''Output file name.''') group_merge.add_argument('--drop', action='store_true', dest='drop', help='''If specified, drop fields that do not exist in all input files. 
Otherwise, include all columns in all files and fill missing data with empty strings.''') group_merge.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR rearrangement files.') parser_merge.set_defaults(func=merge_cmd) # Subparser to validate files parser_validate = subparsers.add_parser('validate', parents=[common_parser], add_help=False, help='Validate AIRR files.', description='Validate AIRR files.') validate_subparser = parser_validate.add_subparsers(title='subcommands', metavar='', help='Database operation') # Subparser to validate repertoire files parser_validate = validate_subparser.add_parser('repertoire', parents=[common_parser], add_help=False, help='Validate AIRR repertoire metadata files.', description='Validate AIRR repertoire metadata files.') group_validate = parser_validate.add_argument_group('validate arguments') group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR repertoire metadata files.') parser_validate.set_defaults(func=validate_repertoire_cmd) # Subparser to validate rearrangement files parser_validate = validate_subparser.add_parser('rearrangement', parents=[common_parser], add_help=False, help='Validate AIRR rearrangement files.', description='Validate AIRR rearrangement files.') group_validate = parser_validate.add_argument_group('validate arguments') group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR rearrangement files.') parser_validate.set_defaults(func=validate_cmd) return parser def main(): """ Utility commands for AIRR Community Standards files """ # Define argument parsers and print help if subcommand not specified parser = define_args() if len(sys.argv) == 1: parser.print_help() sys.exit(1) # Parse arguments args = parser.parse_args() args_dict = args.__dict__.copy() del args_dict['command'] del args_dict['func'] # Call tool function result = 
args.func(**args_dict) # set return code to non-zero if error occurred if args.__dict__['command'] == 'validate' or args.__dict__['command'] == 'merge': if not result: sys.exit(1) airr-1.3.1/requirements.txt0000644000076600000240000000011213402556313016475 0ustar vandej27staff00000000000000pandas>= 0.18.0 pyyaml>=3.12 yamlordereddictloader>=0.4.0 setuptools>=2.0 airr-1.3.1/setup.cfg0000644000076600000240000000030013741420532015030 0ustar vandej27staff00000000000000[versioneer] vcs = git style = pep440 versionfile_source = airr/_version.py versionfile_build = airr/_version.py tag_prefix = v parentdir_prefix = airr- [egg_info] tag_build = tag_date = 0 airr-1.3.1/setup.py0000644000076600000240000000322513663532352014741 0ustar vandej27staff00000000000000""" AIRR community formats for adaptive immune receptor data. """ import sys import os import versioneer try: from setuptools import setup, find_packages except ImportError: sys.exit('setuptools is required.') with open('README.rst', 'r') as ip: long_description = ip.read() # Parse requirements if os.environ.get('READTHEDOCS', None) == 'True': # Set empty install_requires to get install to work on readthedocs install_requires = [] else: with open('requirements.txt') as req: install_requires = req.read().splitlines() # Setup setup(name='airr', version=versioneer.get_version(), cmdclass=versioneer.get_cmdclass(), author='AIRR Community', author_email='', description='AIRR Community Data Representation Standard reference library for antibody and TCR sequencing data.', long_description=long_description, zip_safe=False, license='CC BY 4.0', url='http://docs.airr-community.org', keywords=['AIRR', 'bioinformatics', 'sequencing', 'immunoglobulin', 'antibody', 'adaptive immunity', 'T cell', 'B cell', 'BCR', 'TCR'], install_requires=install_requires, packages=find_packages(), package_data={'airr': ['specs/*.yaml']}, entry_points={'console_scripts': ['airr-tools=airr.tools:main']}, classifiers=['Intended Audience :: 
Science/Research', 'Natural Language :: English', 'Operating System :: OS Independent', 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', 'Topic :: Scientific/Engineering :: Bio-Informatics']) airr-1.3.1/tests/0000755000076600000240000000000013741420532014360 5ustar vandej27staff00000000000000airr-1.3.1/tests/__init__.py0000644000076600000240000000000013402556313016460 0ustar vandej27staff00000000000000airr-1.3.1/tests/test_interface.py0000644000076600000240000000570713741416302017742 0ustar vandej27staff00000000000000""" Unit tests for interface """ # System imports import os import time import unittest # Load imports import airr from airr.schema import ValidationError # Paths test_path = os.path.dirname(os.path.realpath(__file__)) data_path = os.path.join(test_path, 'data') class TestInferface(unittest.TestCase): def setUp(self): print('-------> %s()' % self.id()) # Test data self.data_good = os.path.join(data_path, 'good_data.tsv') self.data_bad = os.path.join(data_path, 'bad_data.tsv') self.rep_good = os.path.join(data_path, 'good_repertoire.airr.yaml') self.rep_bad = os.path.join(data_path, 'bad_repertoire.airr.yaml') # Expected output self.shape_good = (9, 44) self.shape_bad = (9, 44) # Start timer self.start = time.time() def tearDown(self): t = time.time() - self.start print('<- %.3f %s()' % (t, self.id())) # @unittest.skip('-> load(): skipped\n') def test_load(self): # Good data result = airr.load_rearrangement(self.data_good) self.assertTupleEqual(result.shape, self.shape_good, 'load(): good data failed') # Bad data result = airr.load_rearrangement(self.data_bad) self.assertTupleEqual(result.shape, self.shape_bad, 'load(): bad data failed') # @unittest.skip('-> repertoire_template(): skipped\n') def test_repertoire_template(self): try: rep = airr.repertoire_template() result = airr.schema.RepertoireSchema.validate_object(rep) self.assertTrue(result, 'repertoire_template(): repertoire template failed validation') except: 
self.assertTrue(False, 'repertoire_template(): repertoire template failed validation') # @unittest.skip('-> validate(): skipped\n') def test_validate(self): # Good data try: result = airr.validate_rearrangement(self.data_good) self.assertTrue(result, 'validate(): good data failed') except: self.assertTrue(False, 'validate(): good data failed') # Bad data try: result = airr.validate_rearrangement(self.data_bad) self.assertFalse(result, 'validate(): bad data failed') except Exception as inst: print(type(inst)) raise inst # @unittest.skip('-> load_repertoire(): skipped\n') def test_load_repertoire(self): # Good data try: data = airr.load_repertoire(self.rep_good, validate=True) except: self.assertTrue(False, 'load_repertoire(): good data failed') # Bad data try: data = airr.load_repertoire(self.rep_bad, validate=True, debug=True) self.assertFalse(True, 'load_repertoire(): bad data failed') except ValidationError: pass except Exception as inst: print(type(inst)) raise inst if __name__ == '__main__': unittest.main() airr-1.3.1/tests/test_io.py0000644000076600000240000000370413741416302016404 0ustar vandej27staff00000000000000""" Unit tests for interface """ # System imports import os import time import unittest # Load imports from airr.io import * # Paths test_path = os.path.dirname(os.path.realpath(__file__)) data_path = os.path.join(test_path, 'data') class TestRearrangementReader(unittest.TestCase): def setUp(self): print('-------> %s()' % self.id()) # Test data self.data_good = os.path.join(data_path, 'good_data.tsv') self.data_bad = os.path.join(data_path, 'bad_data.tsv') self.data_extra = os.path.join(data_path, 'extra_data.tsv') # Start timer self.start = time.time() def tearDown(self): t = time.time() - self.start print('<- %.3f %s()' % (t, self.id())) # @unittest.skip('-> validate(): skipped\n') def test_validate(self): # Good data try: with open(self.data_good, 'r') as handle: reader = RearrangementReader(handle, validate=True) for r in reader: pass except: 
self.assertTrue(False, 'validate(): good data failed') # Bad data try: with open(self.data_bad, 'r') as handle: reader = RearrangementReader(handle, validate=True) for r in reader: pass self.assertFalse(True, 'validate(): bad data failed') except ValidationError: pass except Exception as inst: print(type(inst)) raise inst # Extra data try: with open(self.data_extra, 'r') as handle: reader = RearrangementReader(handle, validate=False) for r in reader: pass self.assertFalse(True, 'validate(): extra data failed') except ValueError: pass except Exception as inst: print(type(inst)) raise inst if __name__ == '__main__': unittest.main() airr-1.3.1/versioneer.py0000644000076600000240000020600313402556313015753 0ustar vandej27staff00000000000000 # Version: 0.18 """The Versioneer - like a rocketeer, but for versions. The Versioneer ============== * like a rocketeer, but for versions! * https://github.com/warner/python-versioneer * Brian Warner * License: Public Domain * Compatible With: python2.6, 2.7, 3.2, 3.3, 3.4, 3.5, 3.6, and pypy * [![Latest Version] (https://pypip.in/version/versioneer/badge.svg?style=flat) ](https://pypi.python.org/pypi/versioneer/) * [![Build Status] (https://travis-ci.org/warner/python-versioneer.png?branch=master) ](https://travis-ci.org/warner/python-versioneer) This is a tool for managing a recorded version number in distutils-based python projects. The goal is to remove the tedious and error-prone "update the embedded version string" step from your release process. Making a new release should be as easy as recording a new tag in your version-control system, and maybe making new tarballs. 
## Quick Install

* `pip install versioneer` to somewhere on your $PATH
* add a `[versioneer]` section to your setup.cfg (see below)
* run `versioneer install` in your source tree, commit the results

## Version Identifiers

Source trees come from a variety of places:

* a version-control system checkout (mostly used by developers)
* a nightly tarball, produced by build automation
* a snapshot tarball, produced by a web-based VCS browser, like github's
  "tarball from tag" feature
* a release tarball, produced by "setup.py sdist", distributed through PyPI

Within each source tree, the version identifier (either a string or a number,
this tool is format-agnostic) can come from a variety of places:

* ask the VCS tool itself, e.g. "git describe" (for checkouts), which knows
  about recent "tags" and an absolute revision-id
* the name of the directory into which the tarball was unpacked
* an expanded VCS keyword ($Id$, etc)
* a `_version.py` created by some earlier build step

For released software, the version identifier is closely related to a VCS
tag. Some projects use tag names that include more than just the version
string (e.g. "myproject-1.2" instead of just "1.2"), in which case the tool
needs to strip the tag prefix to extract the version identifier. For
unreleased software (between tags), the version identifier should provide
enough information to help developers recreate the same tree, while also
giving them an idea of roughly how old the tree is (after version 1.2, before
version 1.3). Many VCS systems can report a description that captures this,
for example `git describe --tags --dirty --always` reports things like
"0.7-1-g574ab98-dirty" to indicate that the checkout is one revision past the
0.7 tag, has a unique revision id of "574ab98", and is "dirty" (it has
uncommitted changes).
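
As a rough illustration of the description format above, the following sketch
splits a string such as "0.7-1-g574ab98-dirty" into its parts. The helper name
`split_describe` and the keys of the returned dictionary are assumptions made
for this example; they are not part of Versioneer, which does its own parsing
internally.

```python
def split_describe(described):
    # Hypothetical helper for illustration only; not part of Versioneer.
    # Expects output shaped like TAG-DISTANCE-gHEX[-dirty].
    dirty = described.endswith("-dirty")
    if dirty:
        described = described[:-len("-dirty")]
    # Split from the right so that tags containing hyphens stay intact.
    parts = described.rsplit("-", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].startswith("g"):
        return {"tag": parts[0], "distance": int(parts[1]),
                "short": parts[2][1:], "dirty": dirty}
    # Anything else (a bare tag or a bare hash) is returned unparsed.
    return {"raw": described, "dirty": dirty}


print(split_describe("0.7-1-g574ab98-dirty"))
```

Splitting from the right is a deliberate choice here: a tag such as
"myproject-1.2" contains a hyphen of its own, so only the last two fields can
be trusted to be the distance and the revision id.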
The version identifier is used for multiple purposes:

* to allow the module to self-identify its version: `myproject.__version__`
* to choose a name and prefix for a 'setup.py sdist' tarball

## Theory of Operation

Versioneer works by adding a special `_version.py` file into your source
tree, where your `__init__.py` can import it. This `_version.py` knows how to
dynamically ask the VCS tool for version information at import time.

`_version.py` also contains `$Revision$` markers, and the installation
process marks `_version.py` to have this marker rewritten with a tag name
during the `git archive` command. As a result, generated tarballs will
contain enough information to get the proper version.

To allow `setup.py` to compute a version too, a `versioneer.py` is added to
the top level of your source tree, next to `setup.py` and the `setup.cfg`
that configures it. This overrides several distutils/setuptools commands to
compute the version when invoked, and changes `setup.py build` and `setup.py
sdist` to replace `_version.py` with a small static file that contains just
the generated version data.

## Installation

See [INSTALL.md](./INSTALL.md) for detailed installation instructions.

## Version-String Flavors

Code which uses Versioneer can learn about its version string at runtime by
importing `_version` from your main `__init__.py` file and running the
`get_versions()` function. From the "outside" (e.g. in `setup.py`), you can
import the top-level `versioneer.py` and run `get_versions()`.

Both functions return a dictionary with different flavors of version
information:

* `['version']`: A condensed version string, rendered using the selected
  style. This is the most commonly used value for the project's version
  string. The default "pep440" style yields strings like `0.11`,
  `0.11+2.g1076c97`, or `0.11+2.g1076c97.dirty`. See the "Styles" section
  below for alternative styles.

* `['full-revisionid']`: detailed revision identifier.
  For Git, this is the full SHA1 commit id, e.g.
  "1076c978a8d3cfc70f408fe5974aa6c092c949ac".

* `['date']`: Date and time of the latest `HEAD` commit. For Git, it is the
  commit date in ISO 8601 format. This will be None if the date is not
  available.

* `['dirty']`: a boolean, True if the tree has uncommitted changes. Note that
  this is only accurate if run in a VCS checkout, otherwise it is likely to
  be False or None.

* `['error']`: if the version string could not be computed, this will be set
  to a string describing the problem, otherwise it will be None. It may be
  useful to throw an exception in setup.py if this is set, to avoid e.g.
  creating tarballs with a version string of "unknown".

Some variants are more useful than others. Including `full-revisionid` in a
bug report should allow developers to reconstruct the exact code being tested
(or indicate the presence of local changes that should be shared with the
developers). `version` is suitable for display in an "about" box or a CLI
`--version` output: it can be easily compared against release notes and lists
of bugs fixed in various releases.

The installer adds the following text to your `__init__.py` to place a basic
version in `YOURPROJECT.__version__`:

    from ._version import get_versions
    __version__ = get_versions()['version']
    del get_versions

## Styles

The setup.cfg `style=` configuration controls how the VCS information is
rendered into a version string.

The default style, "pep440", produces a PEP440-compliant string, equal to the
un-prefixed tag name for actual releases, and containing an additional "local
version" section with more detail for in-between builds. For Git, this is
TAG[+DISTANCE.gHEX[.dirty]], using information from `git describe --tags
--dirty --always`. For example "0.11+2.g1076c97.dirty" indicates that the
tree is like the "1076c97" commit but has uncommitted changes (".dirty"), and
that this commit is two revisions ("+2") beyond the "0.11" tag.
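
The TAG[+DISTANCE.gHEX[.dirty]] pattern can be sketched in a few lines. This
is a simplified illustration, not the renderer Versioneer ships: the function
name `render_pep440_sketch` and its arguments are assumptions made for the
example, and it leaves out the untagged case and tags that already contain a
"+".

```python
def render_pep440_sketch(tag, distance, short_hex, dirty):
    # Hypothetical, simplified renderer for illustration only.
    # Produces TAG[+DISTANCE.gHEX[.dirty]] for a tree that has a tag.
    version = tag
    if distance or dirty:
        # The local version part is added only for in-between or dirty builds.
        version += "+%d.g%s" % (distance, short_hex)
        if dirty:
            version += ".dirty"
    return version


print(render_pep440_sketch("0.11", 2, "1076c97", True))
```

A clean checkout sitting exactly on the tag falls through both conditions, so
only the stripped tag is returned.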
For released software (exactly equal to a known tag), the identifier will
only contain the stripped tag, e.g. "0.11".

Other styles are available. See [details.md](details.md) in the Versioneer
source tree for descriptions.

## Debugging

Versioneer tries to avoid fatal errors: if something goes wrong, it will tend
to return a version of "0+unknown". To investigate the problem, run `setup.py
version`, which will run the version-lookup code in a verbose mode, and will
display the full contents of `get_versions()` (including the `error` string,
which may help identify what went wrong).

## Known Limitations

Some situations are known to cause problems for Versioneer. This section
details the most significant ones. More can be found on the GitHub
[issues page](https://github.com/warner/python-versioneer/issues).

### Subprojects

Versioneer has limited support for source trees in which `setup.py` is not in
the root directory (e.g. `setup.py` and `.git/` are *not* siblings). There are
two common reasons why `setup.py` might not be in the root:

* Source trees which contain multiple subprojects, such as
  [Buildbot](https://github.com/buildbot/buildbot), which contains both
  "master" and "slave" subprojects, each with their own `setup.py`,
  `setup.cfg`, and `tox.ini`. Projects like these produce multiple PyPI
  distributions (and upload multiple independently-installable tarballs).
* Source trees whose main purpose is to contain a C library, but which also
  provide bindings to Python (and perhaps other languages) in subdirectories.

Versioneer will look for `.git` in parent directories, and most operations
should get the right version string. However `pip` and `setuptools` have bugs
and implementation details which frequently cause `pip install .` from a
subproject directory to fail to find a correct version string (so it usually
defaults to `0+unknown`).

`pip install --editable .` should work correctly. `setup.py install` might
work too.
Pip-8.1.1 is known to have this problem, but hopefully it will get fixed in
some later version.

[Bug #38](https://github.com/warner/python-versioneer/issues/38) is tracking
this issue. The discussion in
[PR #61](https://github.com/warner/python-versioneer/pull/61) describes the
issue from the Versioneer side in more detail.
[pip PR#3176](https://github.com/pypa/pip/pull/3176) and
[pip PR#3615](https://github.com/pypa/pip/pull/3615) contain work to improve
pip to let Versioneer work correctly.

Versioneer-0.16 and earlier only looked for a `.git` directory next to the
`setup.cfg`, so subprojects were completely unsupported with those releases.

### Editable installs with setuptools <= 18.5

`setup.py develop` and `pip install --editable .` allow you to install a
project into a virtualenv once, then continue editing the source code (and
test) without re-installing after every change.

"Entry-point scripts" (`setup(entry_points={"console_scripts": ..})`) are a
convenient way to specify executable scripts that should be installed along
with the python package.

These both work as expected when using modern setuptools. When using
setuptools-18.5 or earlier, however, certain operations will cause
`pkg_resources.DistributionNotFound` errors when running the entrypoint
script, which must be resolved by re-installing the package. This happens
when the install happens with one version, then the egg_info data is
regenerated while a different version is checked out. Many setup.py commands
cause egg_info to be rebuilt (including `sdist`, `wheel`, and installing into
a different virtualenv), so this can be surprising.

[Bug #83](https://github.com/warner/python-versioneer/issues/83) describes
this one, but upgrading to a newer version of setuptools should probably
resolve it.

### Unicode version strings

While Versioneer works (and is continually tested) with both Python 2 and
Python 3, it is not entirely consistent with bytes-vs-unicode distinctions.
Newer releases probably generate unicode version strings on py2. It's not
clear that this is wrong, but it may be surprising for applications which
then write these strings to a network connection or include them in
bytes-oriented APIs like cryptographic checksums.

[Bug #71](https://github.com/warner/python-versioneer/issues/71) investigates
this question.

## Updating Versioneer

To upgrade your project to a new release of Versioneer, do the following:

* install the new Versioneer (`pip install -U versioneer` or equivalent)
* edit `setup.cfg`, if necessary, to include any new configuration settings
  indicated by the release notes. See [UPGRADING](./UPGRADING.md) for details.
* re-run `versioneer install` in your source tree, to replace
  `SRC/_version.py`
* commit any changed files

## Future Directions

This tool is designed to be easily extended to other version-control
systems: all VCS-specific components are in separate directories like
src/git/ . The top-level `versioneer.py` script is assembled from these
components by running make-versioneer.py . In the future, make-versioneer.py
will take a VCS name as an argument, and will construct a version of
`versioneer.py` that is specific to the given VCS. It might also take the
configuration arguments that are currently provided manually during
installation by editing setup.py . Alternatively, it might go the other
direction and include code from all supported VCS systems, reducing the
number of intermediate scripts.

## License

To make Versioneer easier to embed, all its code is dedicated to the public
domain. The `_version.py` that it creates is also in the public domain.
Specifically, both are released under the Creative Commons "Public Domain
Dedication" license (CC0-1.0), as described in
https://creativecommons.org/publicdomain/zero/1.0/ .
""" from __future__ import print_function try: import configparser except ImportError: import ConfigParser as configparser import errno import json import os import re import subprocess import sys class VersioneerConfig: """Container for Versioneer configuration parameters.""" def get_root(): """Get the project root directory. We require that all commands are run from the project root, i.e. the directory that contains setup.py, setup.cfg, and versioneer.py . """ root = os.path.realpath(os.path.abspath(os.getcwd())) setup_py = os.path.join(root, "setup.py") versioneer_py = os.path.join(root, "versioneer.py") if not (os.path.exists(setup_py) or os.path.exists(versioneer_py)): # allow 'python path/to/setup.py COMMAND' root = os.path.dirname(os.path.realpath(os.path.abspath(sys.argv[0]))) setup_py = os.path.join(root, "setup.py") versioneer_py = os.path.join(root, "versioneer.py") if not (os.path.exists(setup_py) or os.path.exists(versioneer_py)): err = ("Versioneer was unable to run the project root directory. " "Versioneer requires setup.py to be executed from " "its immediate directory (like 'python setup.py COMMAND'), " "or in a way that lets it use sys.argv[0] to find the root " "(like 'python path/to/setup.py COMMAND').") raise VersioneerBadRootError(err) try: # Certain runtime workflows (setup.py install/develop in a setuptools # tree) execute all dependencies in a single python process, so # "versioneer" may be imported multiple times, and python's shared # module-import table will cache the first one. So we can't use # os.path.dirname(__file__), as that will find whichever # versioneer.py was first imported, even in later projects. 
me = os.path.realpath(os.path.abspath(__file__)) me_dir = os.path.normcase(os.path.splitext(me)[0]) vsr_dir = os.path.normcase(os.path.splitext(versioneer_py)[0]) if me_dir != vsr_dir: print("Warning: build in %s is using versioneer.py from %s" % (os.path.dirname(me), versioneer_py)) except NameError: pass return root def get_config_from_root(root): """Read the project setup.cfg file to determine Versioneer config.""" # This might raise EnvironmentError (if setup.cfg is missing), or # configparser.NoSectionError (if it lacks a [versioneer] section), or # configparser.NoOptionError (if it lacks "VCS="). See the docstring at # the top of versioneer.py for instructions on writing your setup.cfg . setup_cfg = os.path.join(root, "setup.cfg") parser = configparser.SafeConfigParser() with open(setup_cfg, "r") as f: parser.readfp(f) VCS = parser.get("versioneer", "VCS") # mandatory def get(parser, name): if parser.has_option("versioneer", name): return parser.get("versioneer", name) return None cfg = VersioneerConfig() cfg.VCS = VCS cfg.style = get(parser, "style") or "" cfg.versionfile_source = get(parser, "versionfile_source") cfg.versionfile_build = get(parser, "versionfile_build") cfg.tag_prefix = get(parser, "tag_prefix") if cfg.tag_prefix in ("''", '""'): cfg.tag_prefix = "" cfg.parentdir_prefix = get(parser, "parentdir_prefix") cfg.verbose = get(parser, "verbose") return cfg class NotThisMethod(Exception): """Exception raised if a method is not valid for the current scenario.""" # these dictionaries contain VCS-specific tools LONG_VERSION_PY = {} HANDLERS = {} def register_vcs_handler(vcs, method): # decorator """Decorator to mark a method as the handler for a particular VCS.""" def decorate(f): """Store f in HANDLERS[vcs][method].""" if vcs not in HANDLERS: HANDLERS[vcs] = {} HANDLERS[vcs][method] = f return f return decorate def run_command(commands, args, cwd=None, verbose=False, hide_stderr=False, env=None): """Call the given command(s).""" assert 
isinstance(commands, list) p = None for c in commands: try: dispcmd = str([c] + args) # remember shell=False, so use git.cmd on windows, not just git p = subprocess.Popen([c] + args, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=(subprocess.PIPE if hide_stderr else None)) break except EnvironmentError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue if verbose: print("unable to run %s" % dispcmd) print(e) return None, None else: if verbose: print("unable to find command, tried %s" % (commands,)) return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: stdout = stdout.decode() if p.returncode != 0: if verbose: print("unable to run %s (error)" % dispcmd) print("stdout was %s" % stdout) return None, p.returncode return stdout, p.returncode LONG_VERSION_PY['git'] = ''' # This file helps to compute a version number in source trees obtained from # git-archive tarball (such as those provided by githubs download-from-tag # feature). Distribution tarballs (built by setup.py sdist) and build # directories (produced by setup.py build) will contain a much shorter file # that just contains the computed version number. # This file is released into the public domain. Generated by # versioneer-0.18 (https://github.com/warner/python-versioneer) """Git implementation of _version.py.""" import errno import os import re import subprocess import sys def get_keywords(): """Get the keywords needed to look up the version information.""" # these strings will be replaced by git during git-archive. # setup.py/versioneer.py will grep for the variable names, so they must # each be defined on a line of their own. _version.py will just call # get_keywords(). 
git_refnames = "%(DOLLAR)sFormat:%%d%(DOLLAR)s" git_full = "%(DOLLAR)sFormat:%%H%(DOLLAR)s" git_date = "%(DOLLAR)sFormat:%%ci%(DOLLAR)s" keywords = {"refnames": git_refnames, "full": git_full, "date": git_date} return keywords class VersioneerConfig: """Container for Versioneer configuration parameters.""" def get_config(): """Create, populate and return the VersioneerConfig() object.""" # these strings are filled in when 'setup.py versioneer' creates # _version.py cfg = VersioneerConfig() cfg.VCS = "git" cfg.style = "%(STYLE)s" cfg.tag_prefix = "%(TAG_PREFIX)s" cfg.parentdir_prefix = "%(PARENTDIR_PREFIX)s" cfg.versionfile_source = "%(VERSIONFILE_SOURCE)s" cfg.verbose = False return cfg class NotThisMethod(Exception): """Exception raised if a method is not valid for the current scenario.""" LONG_VERSION_PY = {} HANDLERS = {} def register_vcs_handler(vcs, method): # decorator """Decorator to mark a method as the handler for a particular VCS.""" def decorate(f): """Store f in HANDLERS[vcs][method].""" if vcs not in HANDLERS: HANDLERS[vcs] = {} HANDLERS[vcs][method] = f return f return decorate def run_command(commands, args, cwd=None, verbose=False, hide_stderr=False, env=None): """Call the given command(s).""" assert isinstance(commands, list) p = None for c in commands: try: dispcmd = str([c] + args) # remember shell=False, so use git.cmd on windows, not just git p = subprocess.Popen([c] + args, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=(subprocess.PIPE if hide_stderr else None)) break except EnvironmentError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue if verbose: print("unable to run %%s" %% dispcmd) print(e) return None, None else: if verbose: print("unable to find command, tried %%s" %% (commands,)) return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: stdout = stdout.decode() if p.returncode != 0: if verbose: print("unable to run %%s (error)" %% dispcmd) print("stdout was %%s" %% stdout) return None, 
p.returncode return stdout, p.returncode def versions_from_parentdir(parentdir_prefix, root, verbose): """Try to determine the version from the parent directory name. Source tarballs conventionally unpack into a directory that includes both the project name and a version string. We will also support searching up two directory levels for an appropriately named parent directory """ rootdirs = [] for i in range(3): dirname = os.path.basename(root) if dirname.startswith(parentdir_prefix): return {"version": dirname[len(parentdir_prefix):], "full-revisionid": None, "dirty": False, "error": None, "date": None} else: rootdirs.append(root) root = os.path.dirname(root) # up a level if verbose: print("Tried directories %%s but none started with prefix %%s" %% (str(rootdirs), parentdir_prefix)) raise NotThisMethod("rootdir doesn't start with parentdir_prefix") @register_vcs_handler("git", "get_keywords") def git_get_keywords(versionfile_abs): """Extract version information from the given file.""" # the code embedded in _version.py can just fetch the value of these # keywords. When used from setup.py, we don't want to import _version.py, # so we do it with a regexp instead. This function is not used from # _version.py. 
keywords = {} try: f = open(versionfile_abs, "r") for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["refnames"] = mo.group(1) if line.strip().startswith("git_full ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["full"] = mo.group(1) if line.strip().startswith("git_date ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["date"] = mo.group(1) f.close() except EnvironmentError: pass return keywords @register_vcs_handler("git", "keywords") def git_versions_from_keywords(keywords, tag_prefix, verbose): """Get version information from git keywords.""" if not keywords: raise NotThisMethod("no keywords at all, weird") date = keywords.get("date") if date is not None: # git-2.2.0 added "%%cI", which expands to an ISO-8601 -compliant # datestamp. However we prefer "%%ci" (which expands to an "ISO-8601 # -like" string, which we must then edit to make compliant), because # it's been around since git-1.5.3, and it's too difficult to # discover which version we're using, or to work around using an # older one. date = date.strip().replace(" ", "T", 1).replace(" ", "", 1) refnames = keywords["refnames"].strip() if refnames.startswith("$Format"): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") refs = set([r.strip() for r in refnames.strip("()").split(",")]) # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " tags = set([r[len(TAG):] for r in refs if r.startswith(TAG)]) if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %%d # expansion behaves like git log --decorate=short and strips out the # refs/heads/ and refs/tags/ prefixes that would let us distinguish # between branches and tags. 
By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". tags = set([r for r in refs if re.search(r'\d', r)]) if verbose: print("discarding '%%s', no digits" %% ",".join(refs - tags)) if verbose: print("likely tags: %%s" %% ",".join(sorted(tags))) for ref in sorted(tags): # sorting will prefer e.g. "2.0" over "2.0rc1" if ref.startswith(tag_prefix): r = ref[len(tag_prefix):] if verbose: print("picking %%s" %% r) return {"version": r, "full-revisionid": keywords["full"].strip(), "dirty": False, "error": None, "date": date} # no suitable tags, so version is "0+unknown", but full hex is still there if verbose: print("no suitable tags, using unknown + full revision id") return {"version": "0+unknown", "full-revisionid": keywords["full"].strip(), "dirty": False, "error": "no suitable tags", "date": None} @register_vcs_handler("git", "pieces_from_vcs") def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): """Get version from 'git describe' in the root of the source tree. This only gets called if the git-archive 'subst' keywords were *not* expanded, and _version.py hasn't already been rewritten with a short version string, meaning we're inside a checked out source tree. 
""" GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] out, rc = run_command(GITS, ["rev-parse", "--git-dir"], cwd=root, hide_stderr=True) if rc != 0: if verbose: print("Directory %%s not under git control" %% root) raise NotThisMethod("'git rev-parse --git-dir' returned error") # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty] # if there isn't one, this yields HEX[-dirty] (no NUM) describe_out, rc = run_command(GITS, ["describe", "--tags", "--dirty", "--always", "--long", "--match", "%%s*" %% tag_prefix], cwd=root) # --long was added in git-1.5.5 if describe_out is None: raise NotThisMethod("'git describe' failed") describe_out = describe_out.strip() full_out, rc = run_command(GITS, ["rev-parse", "HEAD"], cwd=root) if full_out is None: raise NotThisMethod("'git rev-parse' failed") full_out = full_out.strip() pieces = {} pieces["long"] = full_out pieces["short"] = full_out[:7] # maybe improved later pieces["error"] = None # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty] # TAG might have hyphens. git_describe = describe_out # look for -dirty suffix dirty = git_describe.endswith("-dirty") pieces["dirty"] = dirty if dirty: git_describe = git_describe[:git_describe.rindex("-dirty")] # now we have TAG-NUM-gHEX or HEX if "-" in git_describe: # TAG-NUM-gHEX mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe) if not mo: # unparseable. Maybe git-describe is misbehaving? 
pieces["error"] = ("unable to parse git-describe output: '%%s'" %% describe_out) return pieces # tag full_tag = mo.group(1) if not full_tag.startswith(tag_prefix): if verbose: fmt = "tag '%%s' doesn't start with prefix '%%s'" print(fmt %% (full_tag, tag_prefix)) pieces["error"] = ("tag '%%s' doesn't start with prefix '%%s'" %% (full_tag, tag_prefix)) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix):] # distance: number of commits since tag pieces["distance"] = int(mo.group(2)) # commit: short hex revision ID pieces["short"] = mo.group(3) else: # HEX: no tags pieces["closest-tag"] = None count_out, rc = run_command(GITS, ["rev-list", "HEAD", "--count"], cwd=root) pieces["distance"] = int(count_out) # total number of commits # commit date: see ISO-8601 comment in git_versions_from_keywords() date = run_command(GITS, ["show", "-s", "--format=%%ci", "HEAD"], cwd=root)[0].strip() pieces["date"] = date.strip().replace(" ", "T", 1).replace(" ", "", 1) return pieces def plus_or_dot(pieces): """Return a + if we don't already have one, else return a .""" if "+" in pieces.get("closest-tag", ""): return "." return "+" def render_pep440(pieces): """Build up version string, with post-release "local version identifier". Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you'll get TAG+0.gHEX.dirty Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += plus_or_dot(pieces) rendered += "%%d.g%%s" %% (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" else: # exception #1 rendered = "0+untagged.%%d.g%%s" %% (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" return rendered def render_pep440_pre(pieces): """TAG[.post.devDISTANCE] -- No -dirty. Exceptions: 1: no tags. 
0.post.devDISTANCE """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += ".post.dev%%d" %% pieces["distance"] else: # exception #1 rendered = "0.post.dev%%d" %% pieces["distance"] return rendered def render_pep440_post(pieces): """TAG[.postDISTANCE[.dev0]+gHEX] . The ".dev0" means dirty. Note that .dev0 sorts backwards (a dirty tree will appear "older" than the corresponding clean one), but you shouldn't be releasing software with -dirty anyways. Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%%d" %% pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%%s" %% pieces["short"] else: # exception #1 rendered = "0.post%%d" %% pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += "+g%%s" %% pieces["short"] return rendered def render_pep440_old(pieces): """TAG[.postDISTANCE[.dev0]] . The ".dev0" means dirty. Eexceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%%d" %% pieces["distance"] if pieces["dirty"]: rendered += ".dev0" else: # exception #1 rendered = "0.post%%d" %% pieces["distance"] if pieces["dirty"]: rendered += ".dev0" return rendered def render_git_describe(pieces): """TAG[-DISTANCE-gHEX][-dirty]. Like 'git describe --tags --dirty --always'. Exceptions: 1: no tags. HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += "-%%d-g%%s" %% (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render_git_describe_long(pieces): """TAG-DISTANCE-gHEX[-dirty]. Like 'git describe --tags --dirty --always -long'. The distance/hash is unconditional. 
Exceptions: 1: no tags. HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] rendered += "-%%d-g%%s" %% (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render(pieces, style): """Render the given version pieces into the requested style.""" if pieces["error"]: return {"version": "unknown", "full-revisionid": pieces.get("long"), "dirty": None, "error": pieces["error"], "date": None} if not style or style == "default": style = "pep440" # the default if style == "pep440": rendered = render_pep440(pieces) elif style == "pep440-pre": rendered = render_pep440_pre(pieces) elif style == "pep440-post": rendered = render_pep440_post(pieces) elif style == "pep440-old": rendered = render_pep440_old(pieces) elif style == "git-describe": rendered = render_git_describe(pieces) elif style == "git-describe-long": rendered = render_git_describe_long(pieces) else: raise ValueError("unknown style '%%s'" %% style) return {"version": rendered, "full-revisionid": pieces["long"], "dirty": pieces["dirty"], "error": None, "date": pieces.get("date")} def get_versions(): """Get version information or return default if unable to do so.""" # I am in _version.py, which lives at ROOT/VERSIONFILE_SOURCE. If we have # __file__, we can work backwards from there to the root. Some # py2exe/bbfreeze/non-CPython implementations don't do __file__, in which # case we can only use expanded keywords. cfg = get_config() verbose = cfg.verbose try: return git_versions_from_keywords(get_keywords(), cfg.tag_prefix, verbose) except NotThisMethod: pass try: root = os.path.realpath(__file__) # versionfile_source is the relative path from the top of the source # tree (where the .git directory might live) to this file. Invert # this to find the root from __file__. 
for i in cfg.versionfile_source.split('/'): root = os.path.dirname(root) except NameError: return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to find root of source tree", "date": None} try: pieces = git_pieces_from_vcs(cfg.tag_prefix, root, verbose) return render(pieces, cfg.style) except NotThisMethod: pass try: if cfg.parentdir_prefix: return versions_from_parentdir(cfg.parentdir_prefix, root, verbose) except NotThisMethod: pass return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to compute version", "date": None} ''' @register_vcs_handler("git", "get_keywords") def git_get_keywords(versionfile_abs): """Extract version information from the given file.""" # the code embedded in _version.py can just fetch the value of these # keywords. When used from setup.py, we don't want to import _version.py, # so we do it with a regexp instead. This function is not used from # _version.py. keywords = {} try: f = open(versionfile_abs, "r") for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["refnames"] = mo.group(1) if line.strip().startswith("git_full ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["full"] = mo.group(1) if line.strip().startswith("git_date ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["date"] = mo.group(1) f.close() except EnvironmentError: pass return keywords @register_vcs_handler("git", "keywords") def git_versions_from_keywords(keywords, tag_prefix, verbose): """Get version information from git keywords.""" if not keywords: raise NotThisMethod("no keywords at all, weird") date = keywords.get("date") if date is not None: # git-2.2.0 added "%cI", which expands to an ISO-8601 -compliant # datestamp. 
However we prefer "%ci" (which expands to an "ISO-8601 # -like" string, which we must then edit to make compliant), because # it's been around since git-1.5.3, and it's too difficult to # discover which version we're using, or to work around using an # older one. date = date.strip().replace(" ", "T", 1).replace(" ", "", 1) refnames = keywords["refnames"].strip() if refnames.startswith("$Format"): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") refs = set([r.strip() for r in refnames.strip("()").split(",")]) # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " tags = set([r[len(TAG):] for r in refs if r.startswith(TAG)]) if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d # expansion behaves like git log --decorate=short and strips out the # refs/heads/ and refs/tags/ prefixes that would let us distinguish # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". tags = set([r for r in refs if re.search(r'\d', r)]) if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: print("likely tags: %s" % ",".join(sorted(tags))) for ref in sorted(tags): # sorting will prefer e.g. 
"2.0" over "2.0rc1" if ref.startswith(tag_prefix): r = ref[len(tag_prefix):] if verbose: print("picking %s" % r) return {"version": r, "full-revisionid": keywords["full"].strip(), "dirty": False, "error": None, "date": date} # no suitable tags, so version is "0+unknown", but full hex is still there if verbose: print("no suitable tags, using unknown + full revision id") return {"version": "0+unknown", "full-revisionid": keywords["full"].strip(), "dirty": False, "error": "no suitable tags", "date": None} @register_vcs_handler("git", "pieces_from_vcs") def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): """Get version from 'git describe' in the root of the source tree. This only gets called if the git-archive 'subst' keywords were *not* expanded, and _version.py hasn't already been rewritten with a short version string, meaning we're inside a checked out source tree. """ GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] out, rc = run_command(GITS, ["rev-parse", "--git-dir"], cwd=root, hide_stderr=True) if rc != 0: if verbose: print("Directory %s not under git control" % root) raise NotThisMethod("'git rev-parse --git-dir' returned error") # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty] # if there isn't one, this yields HEX[-dirty] (no NUM) describe_out, rc = run_command(GITS, ["describe", "--tags", "--dirty", "--always", "--long", "--match", "%s*" % tag_prefix], cwd=root) # --long was added in git-1.5.5 if describe_out is None: raise NotThisMethod("'git describe' failed") describe_out = describe_out.strip() full_out, rc = run_command(GITS, ["rev-parse", "HEAD"], cwd=root) if full_out is None: raise NotThisMethod("'git rev-parse' failed") full_out = full_out.strip() pieces = {} pieces["long"] = full_out pieces["short"] = full_out[:7] # maybe improved later pieces["error"] = None # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty] # TAG might have hyphens. 
git_describe = describe_out # look for -dirty suffix dirty = git_describe.endswith("-dirty") pieces["dirty"] = dirty if dirty: git_describe = git_describe[:git_describe.rindex("-dirty")] # now we have TAG-NUM-gHEX or HEX if "-" in git_describe: # TAG-NUM-gHEX mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe) if not mo: # unparseable. Maybe git-describe is misbehaving? pieces["error"] = ("unable to parse git-describe output: '%s'" % describe_out) return pieces # tag full_tag = mo.group(1) if not full_tag.startswith(tag_prefix): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) pieces["error"] = ("tag '%s' doesn't start with prefix '%s'" % (full_tag, tag_prefix)) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix):] # distance: number of commits since tag pieces["distance"] = int(mo.group(2)) # commit: short hex revision ID pieces["short"] = mo.group(3) else: # HEX: no tags pieces["closest-tag"] = None count_out, rc = run_command(GITS, ["rev-list", "HEAD", "--count"], cwd=root) pieces["distance"] = int(count_out) # total number of commits # commit date: see ISO-8601 comment in git_versions_from_keywords() date = run_command(GITS, ["show", "-s", "--format=%ci", "HEAD"], cwd=root)[0].strip() pieces["date"] = date.strip().replace(" ", "T", 1).replace(" ", "", 1) return pieces def do_vcs_install(manifest_in, versionfile_source, ipy): """Git-specific installation logic for Versioneer. For Git, this means creating/changing .gitattributes to mark _version.py for export-subst keyword substitution. 
""" GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] files = [manifest_in, versionfile_source] if ipy: files.append(ipy) try: me = __file__ if me.endswith(".pyc") or me.endswith(".pyo"): me = os.path.splitext(me)[0] + ".py" versioneer_file = os.path.relpath(me) except NameError: versioneer_file = "versioneer.py" files.append(versioneer_file) present = False try: f = open(".gitattributes", "r") for line in f.readlines(): if line.strip().startswith(versionfile_source): if "export-subst" in line.strip().split()[1:]: present = True f.close() except EnvironmentError: pass if not present: f = open(".gitattributes", "a+") f.write("%s export-subst\n" % versionfile_source) f.close() files.append(".gitattributes") run_command(GITS, ["add", "--"] + files) def versions_from_parentdir(parentdir_prefix, root, verbose): """Try to determine the version from the parent directory name. Source tarballs conventionally unpack into a directory that includes both the project name and a version string. We will also support searching up two directory levels for an appropriately named parent directory """ rootdirs = [] for i in range(3): dirname = os.path.basename(root) if dirname.startswith(parentdir_prefix): return {"version": dirname[len(parentdir_prefix):], "full-revisionid": None, "dirty": False, "error": None, "date": None} else: rootdirs.append(root) root = os.path.dirname(root) # up a level if verbose: print("Tried directories %s but none started with prefix %s" % (str(rootdirs), parentdir_prefix)) raise NotThisMethod("rootdir doesn't start with parentdir_prefix") SHORT_VERSION_PY = """ # This file was generated by 'versioneer.py' (0.18) from # revision-control system data, or from the parent directory name of an # unpacked source archive. Distribution tarballs contain a pre-generated copy # of this file. 
import json version_json = ''' %s ''' # END VERSION_JSON def get_versions(): return json.loads(version_json) """ def versions_from_file(filename): """Try to determine the version from _version.py if present.""" try: with open(filename) as f: contents = f.read() except EnvironmentError: raise NotThisMethod("unable to read _version.py") mo = re.search(r"version_json = '''\n(.*)''' # END VERSION_JSON", contents, re.M | re.S) if not mo: mo = re.search(r"version_json = '''\r\n(.*)''' # END VERSION_JSON", contents, re.M | re.S) if not mo: raise NotThisMethod("no version_json in _version.py") return json.loads(mo.group(1)) def write_to_version_file(filename, versions): """Write the given version number to the given _version.py file.""" os.unlink(filename) contents = json.dumps(versions, sort_keys=True, indent=1, separators=(",", ": ")) with open(filename, "w") as f: f.write(SHORT_VERSION_PY % contents) print("set %s to '%s'" % (filename, versions["version"])) def plus_or_dot(pieces): """Return a + if we don't already have one, else return a .""" if "+" in pieces.get("closest-tag", ""): return "." return "+" def render_pep440(pieces): """Build up version string, with post-release "local version identifier". Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you'll get TAG+0.gHEX.dirty Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += plus_or_dot(pieces) rendered += "%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" else: # exception #1 rendered = "0+untagged.%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" return rendered def render_pep440_pre(pieces): """TAG[.post.devDISTANCE] -- No -dirty. Exceptions: 1: no tags. 
0.post.devDISTANCE """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += ".post.dev%d" % pieces["distance"] else: # exception #1 rendered = "0.post.dev%d" % pieces["distance"] return rendered def render_pep440_post(pieces): """TAG[.postDISTANCE[.dev0]+gHEX] . The ".dev0" means dirty. Note that .dev0 sorts backwards (a dirty tree will appear "older" than the corresponding clean one), but you shouldn't be releasing software with -dirty anyways. Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%s" % pieces["short"] else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += "+g%s" % pieces["short"] return rendered def render_pep440_old(pieces): """TAG[.postDISTANCE[.dev0]] . The ".dev0" means dirty. Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" return rendered def render_git_describe(pieces): """TAG[-DISTANCE-gHEX][-dirty]. Like 'git describe --tags --dirty --always'. Exceptions: 1: no tags. HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += "-%d-g%s" % (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render_git_describe_long(pieces): """TAG-DISTANCE-gHEX[-dirty]. Like 'git describe --tags --dirty --always --long'. The distance/hash is unconditional. Exceptions: 1: no tags. 
HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] rendered += "-%d-g%s" % (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render(pieces, style): """Render the given version pieces into the requested style.""" if pieces["error"]: return {"version": "unknown", "full-revisionid": pieces.get("long"), "dirty": None, "error": pieces["error"], "date": None} if not style or style == "default": style = "pep440" # the default if style == "pep440": rendered = render_pep440(pieces) elif style == "pep440-pre": rendered = render_pep440_pre(pieces) elif style == "pep440-post": rendered = render_pep440_post(pieces) elif style == "pep440-old": rendered = render_pep440_old(pieces) elif style == "git-describe": rendered = render_git_describe(pieces) elif style == "git-describe-long": rendered = render_git_describe_long(pieces) else: raise ValueError("unknown style '%s'" % style) return {"version": rendered, "full-revisionid": pieces["long"], "dirty": pieces["dirty"], "error": None, "date": pieces.get("date")} class VersioneerBadRootError(Exception): """The project root directory is unknown or missing key files.""" def get_versions(verbose=False): """Get the project version from whatever source is available. Returns a dict with the keys 'version', 'full-revisionid', 'dirty', 'error', and 'date'. 
""" if "versioneer" in sys.modules: # see the discussion in cmdclass.py:get_cmdclass() del sys.modules["versioneer"] root = get_root() cfg = get_config_from_root(root) assert cfg.VCS is not None, "please set [versioneer]VCS= in setup.cfg" handlers = HANDLERS.get(cfg.VCS) assert handlers, "unrecognized VCS '%s'" % cfg.VCS verbose = verbose or cfg.verbose assert cfg.versionfile_source is not None, \ "please set versioneer.versionfile_source" assert cfg.tag_prefix is not None, "please set versioneer.tag_prefix" versionfile_abs = os.path.join(root, cfg.versionfile_source) # extract version from first of: _version.py, VCS command (e.g. 'git # describe'), parentdir. This is meant to work for developers using a # source checkout, for users of a tarball created by 'setup.py sdist', # and for users of a tarball/zipball created by 'git archive' or github's # download-from-tag feature or the equivalent in other VCSes. get_keywords_f = handlers.get("get_keywords") from_keywords_f = handlers.get("keywords") if get_keywords_f and from_keywords_f: try: keywords = get_keywords_f(versionfile_abs) ver = from_keywords_f(keywords, cfg.tag_prefix, verbose) if verbose: print("got version from expanded keyword %s" % ver) return ver except NotThisMethod: pass try: ver = versions_from_file(versionfile_abs) if verbose: print("got version from file %s %s" % (versionfile_abs, ver)) return ver except NotThisMethod: pass from_vcs_f = handlers.get("pieces_from_vcs") if from_vcs_f: try: pieces = from_vcs_f(cfg.tag_prefix, root, verbose) ver = render(pieces, cfg.style) if verbose: print("got version from VCS %s" % ver) return ver except NotThisMethod: pass try: if cfg.parentdir_prefix: ver = versions_from_parentdir(cfg.parentdir_prefix, root, verbose) if verbose: print("got version from parentdir %s" % ver) return ver except NotThisMethod: pass if verbose: print("unable to compute version") return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to compute 
version", "date": None} def get_version(): """Get the short version string for this project.""" return get_versions()["version"] def get_cmdclass(): """Get the custom setuptools/distutils subclasses used by Versioneer.""" if "versioneer" in sys.modules: del sys.modules["versioneer"] # this fixes the "python setup.py develop" case (also 'install' and # 'easy_install .'), in which subdependencies of the main project are # built (using setup.py bdist_egg) in the same python process. Assume # a main project A and a dependency B, which use different versions # of Versioneer. A's setup.py imports A's Versioneer, leaving it in # sys.modules by the time B's setup.py is executed, causing B to run # with the wrong versioneer. Setuptools wraps the sub-dep builds in a # sandbox that restores sys.modules to its pre-build state, so the # parent is protected against the child's "import versioneer". By # removing ourselves from sys.modules here, before the child build # happens, we protect the child from the parent's versioneer too. # Also see https://github.com/warner/python-versioneer/issues/52 cmds = {} # we add "version" to both distutils and setuptools from distutils.core import Command class cmd_version(Command): description = "report generated version string" user_options = [] boolean_options = [] def initialize_options(self): pass def finalize_options(self): pass def run(self): vers = get_versions(verbose=True) print("Version: %s" % vers["version"]) print(" full-revisionid: %s" % vers.get("full-revisionid")) print(" dirty: %s" % vers.get("dirty")) print(" date: %s" % vers.get("date")) if vers["error"]: print(" error: %s" % vers["error"]) cmds["version"] = cmd_version # we override "build_py" in both distutils and setuptools # # most invocation pathways end up running build_py: # distutils/build -> build_py # distutils/install -> distutils/build ->.. # setuptools/bdist_wheel -> distutils/install ->.. 
# setuptools/bdist_egg -> distutils/install_lib -> build_py # setuptools/install -> bdist_egg ->.. # setuptools/develop -> ? # pip install: # copies source tree to a tempdir before running egg_info/etc # if .git isn't copied too, 'git describe' will fail # then does setup.py bdist_wheel, or sometimes setup.py install # setup.py egg_info -> ? # we override different "build_py" commands for both environments if "setuptools" in sys.modules: from setuptools.command.build_py import build_py as _build_py else: from distutils.command.build_py import build_py as _build_py class cmd_build_py(_build_py): def run(self): root = get_root() cfg = get_config_from_root(root) versions = get_versions() _build_py.run(self) # now locate _version.py in the new build/ directory and replace # it with an updated value if cfg.versionfile_build: target_versionfile = os.path.join(self.build_lib, cfg.versionfile_build) print("UPDATING %s" % target_versionfile) write_to_version_file(target_versionfile, versions) cmds["build_py"] = cmd_build_py if "cx_Freeze" in sys.modules: # cx_freeze enabled? from cx_Freeze.dist import build_exe as _build_exe # nczeczulin reports that py2exe won't like the pep440-style string # as FILEVERSION, but it can be used for PRODUCTVERSION, e.g. # setup(console=[{ # "version": versioneer.get_version().split("+", 1)[0], # FILEVERSION # "product_version": versioneer.get_version(), # ... 
class cmd_build_exe(_build_exe): def run(self): root = get_root() cfg = get_config_from_root(root) versions = get_versions() target_versionfile = cfg.versionfile_source print("UPDATING %s" % target_versionfile) write_to_version_file(target_versionfile, versions) _build_exe.run(self) os.unlink(target_versionfile) with open(cfg.versionfile_source, "w") as f: LONG = LONG_VERSION_PY[cfg.VCS] f.write(LONG % {"DOLLAR": "$", "STYLE": cfg.style, "TAG_PREFIX": cfg.tag_prefix, "PARENTDIR_PREFIX": cfg.parentdir_prefix, "VERSIONFILE_SOURCE": cfg.versionfile_source, }) cmds["build_exe"] = cmd_build_exe del cmds["build_py"] if 'py2exe' in sys.modules: # py2exe enabled? try: from py2exe.distutils_buildexe import py2exe as _py2exe # py3 except ImportError: from py2exe.build_exe import py2exe as _py2exe # py2 class cmd_py2exe(_py2exe): def run(self): root = get_root() cfg = get_config_from_root(root) versions = get_versions() target_versionfile = cfg.versionfile_source print("UPDATING %s" % target_versionfile) write_to_version_file(target_versionfile, versions) _py2exe.run(self) os.unlink(target_versionfile) with open(cfg.versionfile_source, "w") as f: LONG = LONG_VERSION_PY[cfg.VCS] f.write(LONG % {"DOLLAR": "$", "STYLE": cfg.style, "TAG_PREFIX": cfg.tag_prefix, "PARENTDIR_PREFIX": cfg.parentdir_prefix, "VERSIONFILE_SOURCE": cfg.versionfile_source, }) cmds["py2exe"] = cmd_py2exe # we override different "sdist" commands for both environments if "setuptools" in sys.modules: from setuptools.command.sdist import sdist as _sdist else: from distutils.command.sdist import sdist as _sdist class cmd_sdist(_sdist): def run(self): versions = get_versions() self._versioneer_generated_versions = versions # unless we update this, the command will keep using the old # version self.distribution.metadata.version = versions["version"] return _sdist.run(self) def make_release_tree(self, base_dir, files): root = get_root() cfg = get_config_from_root(root) _sdist.make_release_tree(self, base_dir, 
files) # now locate _version.py in the new base_dir directory # (remembering that it may be a hardlink) and replace it with an # updated value target_versionfile = os.path.join(base_dir, cfg.versionfile_source) print("UPDATING %s" % target_versionfile) write_to_version_file(target_versionfile, self._versioneer_generated_versions) cmds["sdist"] = cmd_sdist return cmds CONFIG_ERROR = """ setup.cfg is missing the necessary Versioneer configuration. You need a section like: [versioneer] VCS = git style = pep440 versionfile_source = src/myproject/_version.py versionfile_build = myproject/_version.py tag_prefix = parentdir_prefix = myproject- You will also need to edit your setup.py to use the results: import versioneer setup(version=versioneer.get_version(), cmdclass=versioneer.get_cmdclass(), ...) Please read the docstring in ./versioneer.py for configuration instructions, edit setup.cfg, and re-run the installer or 'python versioneer.py setup'. """ SAMPLE_CONFIG = """ # See the docstring in versioneer.py for instructions. Note that you must # re-run 'versioneer.py setup' after changing this section, and commit the # resulting files. 
[versioneer] #VCS = git #style = pep440 #versionfile_source = #versionfile_build = #tag_prefix = #parentdir_prefix = """ INIT_PY_SNIPPET = """ from ._version import get_versions __version__ = get_versions()['version'] del get_versions """ def do_setup(): """Main VCS-independent setup function for installing Versioneer.""" root = get_root() try: cfg = get_config_from_root(root) except (EnvironmentError, configparser.NoSectionError, configparser.NoOptionError) as e: if isinstance(e, (EnvironmentError, configparser.NoSectionError)): print("Adding sample versioneer config to setup.cfg", file=sys.stderr) with open(os.path.join(root, "setup.cfg"), "a") as f: f.write(SAMPLE_CONFIG) print(CONFIG_ERROR, file=sys.stderr) return 1 print(" creating %s" % cfg.versionfile_source) with open(cfg.versionfile_source, "w") as f: LONG = LONG_VERSION_PY[cfg.VCS] f.write(LONG % {"DOLLAR": "$", "STYLE": cfg.style, "TAG_PREFIX": cfg.tag_prefix, "PARENTDIR_PREFIX": cfg.parentdir_prefix, "VERSIONFILE_SOURCE": cfg.versionfile_source, }) ipy = os.path.join(os.path.dirname(cfg.versionfile_source), "__init__.py") if os.path.exists(ipy): try: with open(ipy, "r") as f: old = f.read() except EnvironmentError: old = "" if INIT_PY_SNIPPET not in old: print(" appending to %s" % ipy) with open(ipy, "a") as f: f.write(INIT_PY_SNIPPET) else: print(" %s unmodified" % ipy) else: print(" %s doesn't exist, ok" % ipy) ipy = None # Make sure both the top-level "versioneer.py" and versionfile_source # (PKG/_version.py, used by runtime code) are in MANIFEST.in, so # they'll be copied into source distributions. Pip won't be able to # install the package without this. 
manifest_in = os.path.join(root, "MANIFEST.in") simple_includes = set() try: with open(manifest_in, "r") as f: for line in f: if line.startswith("include "): for include in line.split()[1:]: simple_includes.add(include) except EnvironmentError: pass # That doesn't cover everything MANIFEST.in can do # (http://docs.python.org/2/distutils/sourcedist.html#commands), so # it might give some false negatives. Appending redundant 'include' # lines is safe, though. if "versioneer.py" not in simple_includes: print(" appending 'versioneer.py' to MANIFEST.in") with open(manifest_in, "a") as f: f.write("include versioneer.py\n") else: print(" 'versioneer.py' already in MANIFEST.in") if cfg.versionfile_source not in simple_includes: print(" appending versionfile_source ('%s') to MANIFEST.in" % cfg.versionfile_source) with open(manifest_in, "a") as f: f.write("include %s\n" % cfg.versionfile_source) else: print(" versionfile_source already in MANIFEST.in") # Make VCS-specific changes. For git, this means creating/changing # .gitattributes to mark _version.py for export-subst keyword # substitution. do_vcs_install(manifest_in, cfg.versionfile_source, ipy) return 0 def scan_setup_py(): """Validate the contents of setup.py against Versioneer's expectations.""" found = set() setters = False errors = 0 with open("setup.py", "r") as f: for line in f.readlines(): if "import versioneer" in line: found.add("import") if "versioneer.get_cmdclass()" in line: found.add("cmdclass") if "versioneer.get_version()" in line: found.add("get_version") if "versioneer.VCS" in line: setters = True if "versioneer.versionfile_source" in line: setters = True if len(found) != 3: print("") print("Your setup.py appears to be missing some important items") print("(but I might be wrong). 
Please make sure it has something") print("roughly like the following:") print("") print(" import versioneer") print(" setup( version=versioneer.get_version(),") print(" cmdclass=versioneer.get_cmdclass(), ...)") print("") errors += 1 if setters: print("You should remove lines like 'versioneer.VCS = ' and") print("'versioneer.versionfile_source = ' . This configuration") print("now lives in setup.cfg, and should be removed from setup.py") print("") errors += 1 return errors if __name__ == "__main__": cmd = sys.argv[1] if cmd == "setup": errors = do_setup() errors += scan_setup_py() if errors: sys.exit(1)