././@PaxHeader0000000000000000000000000000003200000000000010210 xustar0026 mtime=1717370576.63572 airr-1.5.1/0000755000076500000240000000000014627177321012030 5ustar00vandej27staff
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1616358979.0 airr-1.5.1/.gitattributes0000644000076500000240000000003614025727103014711 0ustar00vandej27staff
airr/_version.py export-subst
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1616358979.0 airr-1.5.1/MANIFEST.in0000644000076500000240000000020414025727103013551 0ustar00vandej27staff
include requirements.txt
include README.rst
include NEWS.rst

# versioneer-generated
include versioneer.py
include airr/_version.py
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/NEWS.rst0000644000076500000240000001162114627177010013332 0ustar00vandej27staff
Version 1.5.1: June 2, 2024
--------------------------------------------------------------------------------

1. Updated versioneer to v0.29.

Version 1.5.0: August 29, 2023
--------------------------------------------------------------------------------

1. Updated schema set and examples to v1.5.
2. Officially dropped support for Python 2.
3. Added check for valid enum values to schema validation routines.
4. Set enum values to first defined value during template generation routines.
5. Removed mock dependency installation in ReadTheDocs environments from setup.
6. Improved package import time.

Version 1.4.1: August 27, 2022
--------------------------------------------------------------------------------

General:

1. Updated pandas requirement to 0.24.0 or higher.
2. Added support for missing integer values (``NaN``) in ``load_rearrangement`` by
   casting to the pandas ``Int64`` data type.
3. Added gzip support to ``read_rearrangement``.
4.
Significant internal refactoring to improve schema generalizability,
harmonize behavior between the Python and R libraries, and prepare for
AIRR Standards v2.0.
5. Fixed a bug in the ``validate`` subcommand of ``airr-tools`` causing validation
   errors to be reported only for the first invalid file when multiple files were
   specified on the command line.

Data Model and Schema:

1. Added support for arrays of objects in a single JSON or YAML file.
2. Added support for the AIRR Data File and associated schema (DataFile, Info).
   The Data File data format holds AIRR objects of multiple types and is backwards
   compatible with Repertoire metadata.
3. Added support for the new germline and genotyping schema (GermlineSet,
   GenotypeSet) and associated schema.
4. Renamed ``schema.CachedSchema`` to ``schema.AIRRSchema``.
5. Removed ``specs/blank.airr.yaml``.

Deprecations:

1. Deprecated ``load_repertoire``. Use ``read_airr`` instead.
2. Deprecated ``write_repertoire``. Use ``write_airr`` instead.
3. Deprecated ``validate_repertoire``. Use ``validate_airr`` instead.
4. Deprecated ``repertoire_template``. Use ``schema.RepertoireSchema.template`` instead.
5. Deprecated the commandline tool ``airr-tools validate repertoire``.
   Use ``airr-tools validate airr`` instead.

Version 1.3.1: October 13, 2020
--------------------------------------------------------------------------------

1. Refactored ``merge_rearrangement`` to allow for larger number of files.
2. Improved error handling in format validation operations.

Version 1.3.0: May 30, 2020
--------------------------------------------------------------------------------

1. Updated schema set to v1.3.
2. Added ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire``
   to ``airr.interface`` to read, write and validate Repertoire metadata,
   respectively.
3. Added ``repertoire_template`` to ``airr.interface`` which will return a
   complete repertoire object where all fields have ``null`` values.
4.
Added ``validate_object`` to ``airr.schema`` that will validate a single
repertoire object against the schema.
5. Extended the ``airr-tools`` commandline program to validate both rearrangement
   and repertoire files.

Version 1.2.1: October 5, 2018
--------------------------------------------------------------------------------

1. Fixed a bug in the Python reference library causing start coordinate values
   to be empty in some cases when writing data.

Version 1.2.0: August 17, 2018
--------------------------------------------------------------------------------

1. Updated schema set to v1.2.
2. Several improvements to the ``validate_rearrangement`` function.
3. Changed behavior of all ``airr.interface`` functions to accept a file path
   (string) to a single Rearrangement TSV, instead of requiring a file handle
   as input.
4. Added ``base`` argument to ``RearrangementReader`` and ``RearrangementWriter``
   to support optional conversion of 1-based closed intervals in the TSV
   to Python-style 0-based half-open intervals. Defaults to conversion.
5. Added the custom exception ``ValidationError`` for handling validation checks.
6. Added the ``validate`` argument to ``RearrangementReader`` which will raise a
   ``ValidationError`` exception when reading files with missing required
   fields or invalid values for known field types.
7. Added ``validate`` argument to all type conversion methods in ``Schema``,
   which will now raise a ``ValidationError`` exception for values that cannot
   be converted when set to ``True``. When set to ``False`` (default), the
   previous behavior of assigning ``None`` as the converted value is retained.
8. Added ``validate_header`` and ``validate_row`` methods to ``Schema``
   and removed validation methods from ``RearrangementReader``.
9. Removed automatic closure of file handle upon reaching the iterator end in
   ``RearrangementReader``.
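The interval conversion mentioned in item 4 of the 1.2.0 notes can be sketched with plain arithmetic. This is only an illustration of the coordinate convention, not the library's implementation; the helper names ``to_half_open`` and ``to_closed`` are hypothetical and are not part of ``airr``.

```python
def to_half_open(start, end):
    """Convert a 1-based closed interval to a 0-based half-open interval."""
    # Only the start shifts: the closed end index equals the half-open end.
    return start - 1, end


def to_closed(start, end):
    """Convert a 0-based half-open interval back to a 1-based closed interval."""
    return start + 1, end


sequence = "ACGTACGT"

# 1-based closed coordinates 3..5 select the 3rd through 5th characters.
start, end = to_half_open(3, 5)
segment = sequence[start:end]
print(start, end, segment)
```

The converted pair can be used directly as a Python slice, which is the practical reason for preferring 0-based half-open coordinates in memory while the TSV keeps 1-based closed coordinates.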
Version 1.1.0: May 1, 2018
--------------------------------------------------------------------------------

Initial release.
././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1717370576.635809 airr-1.5.1/PKG-INFO0000644000076500000240000002114014627177321013123 0ustar00vandej27staff
Metadata-Version: 2.1
Name: airr
Version: 1.5.1
Summary: AIRR Community Data Representation Standard reference library for antibody and TCR sequencing data.
Home-page: http://docs.airr-community.org
Author: AIRR Community
Author-email:
License: CC BY 4.0
Keywords: AIRR,bioinformatics,sequencing,immunoglobulin,antibody,adaptive immunity,T cell,B cell,BCR,TCR
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics

Installation
------------------------------------------------------------------------------

Install in the usual manner from PyPI::

    > pip3 install airr --user

Or from the `downloaded `__ source code directory::

    > python3 setup.py install --user

Quick Start
------------------------------------------------------------------------------

Deprecation Notice
^^^^^^^^^^^^^^^^^^^^

The ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire`` functions
have been deprecated in favor of the new generic ``read_airr``, ``write_airr``, and
``validate_airr`` functions. These new functions are backwards compatible with
the Repertoire metadata format but also support the new AIRR objects such as GermlineSet,
RepertoireGroup, GenotypeSet, Cell and Clone. This new format is defined by the DataFile Schema,
which describes a standard set of objects included in a file containing
AIRR Data Model representations. Currently, the AIRR DataFile does not completely support
Rearrangement, so users should continue using AIRR TSV files and their specific functions.
Also, the ``repertoire_template`` function has been deprecated in favor of the ``Schema.template``
method, which can now be called on any AIRR Schema to create a blank object.

Reading AIRR Data Files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR Data
Model files. The file format is either YAML or JSON, and the package provides a
light wrapper over the standard parsers. The file needs a ``json``, ``yaml``, or ``yml``
file extension so that the proper parser is utilized. All of the AIRR objects
are loaded into memory at once and no streaming interface is provided::

    import airr

    # Load the AIRR data
    data = airr.read_airr('input.airr.json')
    # loop through the repertoires
    for rep in data['Repertoire']:
        print(rep)

Why are the AIRR objects, such as Repertoire, GermlineSet, etc., in a list
rather than in a dictionary keyed by their identifier (e.g., ``repertoire_id``)?
There are two primary reasons for this. First, the identifier might not have been
assigned yet. Some systems might allow MiAIRR metadata to be entered but the
identifier is assigned to that data later by another process. Without the
identifier, the data could not be stored in a dictionary. Second, the list allows
the data to have a default ordering. If you know that the data has a unique
identifier then you can quickly create a dictionary object using a comprehension.
For example, with repertoires::

    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

Another example, with germline sets::

    germline_dict = { obj['germline_set_id'] : obj for obj in data['GermlineSet'] }

Writing AIRR Data Files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Writing an AIRR Data File is also a light wrapper over standard YAML or JSON
parsers. Multiple AIRR objects, such as Repertoire, GermlineSet, etc., can be
written together into the same file.
In this example, we use the ``airr`` library ``template``
method to create some blank Repertoire objects, and write them to a file.
As with the read function, the complete list of repertoires is written at once;
there is no streaming interface::

    import airr

    # Create some blank repertoire objects in a list
    data = { 'Repertoire': [] }
    for i in range(5):
        data['Repertoire'].append(airr.schema.RepertoireSchema.template())

    # Write the AIRR Data
    airr.write_airr('output.airr.json', data)

Reading AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR Rearrangement
TSV files as either iterables or pandas data frames. The usage is straightforward,
as the file format is a typical tab delimited file, but the package
performs some additional validation and type conversion beyond using a
standard CSV reader::

    import airr

    # Create an iterable that returns a dictionary for each row
    reader = airr.read_rearrangement('input.tsv')
    for row in reader: print(row)

    # Load the entire file into a pandas data frame
    df = airr.load_rearrangement('input.tsv')

Writing AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to the read operations, write functions are provided for either creating
a writer class to perform row-wise output or writing the entire contents of
a pandas data frame to a file. Again, usage is straightforward with the ``airr``
output functions simply performing some type conversion and field ordering
operations::

    import airr

    # Create a writer class for iterative row output
    writer = airr.create_rearrangement('output.tsv')
    for row in reader:  writer.write(row)

    # Write an entire pandas data frame to a file
    airr.dump_rearrangement(df, 'file.tsv')

By default, ``create_rearrangement`` will only write the ``required`` fields
in the output file.
Additional fields can be included in the output file by
providing the ``fields`` parameter with a list of additional field names::

    # Specify additional fields in the output
    fields = ['new_calc', 'another_field']
    writer = airr.create_rearrangement('output.tsv', fields=fields)

A common operation is to read an AIRR rearrangement file, and then write an
AIRR rearrangement file with additional fields in it while keeping all of the
existing fields from the original file. The ``derive_rearrangement`` function
provides this capability::

    import airr

    # Read rearrangement data and write new file with additional fields
    reader = airr.read_rearrangement('input.tsv')
    fields = ['new_calc']
    writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields)
    for row in reader:
        row['new_calc'] = 'a value'
        writer.write(row)

Validating AIRR data files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package can validate AIRR Data Model JSON/YAML files and Rearrangement
TSV files to ensure that they contain all required fields and that the field types
match the AIRR Schema. This can be done with the ``airr-tools`` command
line program, or by calling the validate functions in the library::

    # Validate a rearrangement TSV file
    airr-tools validate rearrangement -a input.tsv

    # Validate an AIRR DataFile
    airr-tools validate airr -a input.airr.json

Combining Repertoire metadata and Rearrangement files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package does not currently keep track of which AIRR Data Model files
are associated with which Rearrangement TSV files, though there is ongoing work to define
a standardized manifest, so users will need to handle those
associations themselves. However, in the data, AIRR identifier fields, such as ``repertoire_id``,
form the link between objects in the AIRR Data Model.
The typical usage is that a program is going to perform some
computation on the Rearrangements, and it needs access to the Repertoire metadata
as part of the computation logic. This example code shows the basic framework
for doing that, in this case doing sex-specific computation::

    import airr

    # Load AIRR data containing repertoires
    data = airr.read_airr('input.airr.json')

    # Put repertoires in dictionary keyed by repertoire_id
    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

    # Create an iterable for rearrangement data
    reader = airr.read_rearrangement('input.tsv')
    for row in reader:
        # get repertoire metadata with this rearrangement
        rep = rep_dict[row['repertoire_id']]

        # check the subject's sex
        if rep['subject']['sex'] == 'male':
            # do male specific computation
            pass
        elif rep['subject']['sex'] == 'female':
            # do female specific computation
            pass
        else:
            # do other specific computation
            pass
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1661708083.0 airr-1.5.1/README.rst0000644000076500000240000001777614302723463013533 0ustar00vandej27staff
Installation
------------------------------------------------------------------------------

Install in the usual manner from PyPI::

    > pip3 install airr --user

Or from the `downloaded `__ source code directory::

    > python3 setup.py install --user

Quick Start
------------------------------------------------------------------------------

Deprecation Notice
^^^^^^^^^^^^^^^^^^^^

The ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire`` functions
have been deprecated in favor of the new generic ``read_airr``, ``write_airr``, and
``validate_airr`` functions. These new functions are backwards compatible with
the Repertoire metadata format but also support the new AIRR objects such as GermlineSet,
RepertoireGroup, GenotypeSet, Cell and Clone.
This new format is defined by the DataFile Schema,
which describes a standard set of objects included in a file containing
AIRR Data Model representations. Currently, the AIRR DataFile does not completely support
Rearrangement, so users should continue using AIRR TSV files and their specific functions.
Also, the ``repertoire_template`` function has been deprecated in favor of the ``Schema.template``
method, which can now be called on any AIRR Schema to create a blank object.

Reading AIRR Data Files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR Data
Model files. The file format is either YAML or JSON, and the package provides a
light wrapper over the standard parsers. The file needs a ``json``, ``yaml``, or ``yml``
file extension so that the proper parser is utilized. All of the AIRR objects
are loaded into memory at once and no streaming interface is provided::

    import airr

    # Load the AIRR data
    data = airr.read_airr('input.airr.json')
    # loop through the repertoires
    for rep in data['Repertoire']:
        print(rep)

Why are the AIRR objects, such as Repertoire, GermlineSet, etc., in a list
rather than in a dictionary keyed by their identifier (e.g., ``repertoire_id``)?
There are two primary reasons for this. First, the identifier might not have been
assigned yet. Some systems might allow MiAIRR metadata to be entered but the
identifier is assigned to that data later by another process. Without the
identifier, the data could not be stored in a dictionary. Second, the list allows
the data to have a default ordering. If you know that the data has a unique
identifier then you can quickly create a dictionary object using a comprehension.
For example, with repertoires::

    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

Another example, with germline sets::

    germline_dict = { obj['germline_set_id'] : obj for obj in data['GermlineSet'] }

Writing AIRR Data Files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Writing an AIRR Data File is also a light wrapper over standard YAML or JSON
parsers. Multiple AIRR objects, such as Repertoire, GermlineSet, etc., can be
written together into the same file. In this example, we use the ``airr`` library ``template``
method to create some blank Repertoire objects, and write them to a file.
As with the read function, the complete list of repertoires is written at once;
there is no streaming interface::

    import airr

    # Create some blank repertoire objects in a list
    data = { 'Repertoire': [] }
    for i in range(5):
        data['Repertoire'].append(airr.schema.RepertoireSchema.template())

    # Write the AIRR Data
    airr.write_airr('output.airr.json', data)

Reading AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR Rearrangement
TSV files as either iterables or pandas data frames. The usage is straightforward,
as the file format is a typical tab delimited file, but the package
performs some additional validation and type conversion beyond using a
standard CSV reader::

    import airr

    # Create an iterable that returns a dictionary for each row
    reader = airr.read_rearrangement('input.tsv')
    for row in reader: print(row)

    # Load the entire file into a pandas data frame
    df = airr.load_rearrangement('input.tsv')

Writing AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to the read operations, write functions are provided for either creating
a writer class to perform row-wise output or writing the entire contents of
a pandas data frame to a file.
Again, usage is straightforward with the ``airr``
output functions simply performing some type conversion and field ordering
operations::

    import airr

    # Create a writer class for iterative row output
    writer = airr.create_rearrangement('output.tsv')
    for row in reader:  writer.write(row)

    # Write an entire pandas data frame to a file
    airr.dump_rearrangement(df, 'file.tsv')

By default, ``create_rearrangement`` will only write the ``required`` fields
in the output file. Additional fields can be included in the output file by
providing the ``fields`` parameter with a list of additional field names::

    # Specify additional fields in the output
    fields = ['new_calc', 'another_field']
    writer = airr.create_rearrangement('output.tsv', fields=fields)

A common operation is to read an AIRR rearrangement file, and then write an
AIRR rearrangement file with additional fields in it while keeping all of the
existing fields from the original file. The ``derive_rearrangement`` function
provides this capability::

    import airr

    # Read rearrangement data and write new file with additional fields
    reader = airr.read_rearrangement('input.tsv')
    fields = ['new_calc']
    writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields)
    for row in reader:
        row['new_calc'] = 'a value'
        writer.write(row)

Validating AIRR data files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package can validate AIRR Data Model JSON/YAML files and Rearrangement
TSV files to ensure that they contain all required fields and that the field types
match the AIRR Schema.
This can be done with the ``airr-tools`` command
line program, or by calling the validate functions in the library::

    # Validate a rearrangement TSV file
    airr-tools validate rearrangement -a input.tsv

    # Validate an AIRR DataFile
    airr-tools validate airr -a input.airr.json

Combining Repertoire metadata and Rearrangement files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package does not currently keep track of which AIRR Data Model files
are associated with which Rearrangement TSV files, though there is ongoing work to define
a standardized manifest, so users will need to handle those
associations themselves. However, in the data, AIRR identifier fields, such as ``repertoire_id``,
form the link between objects in the AIRR Data Model. The typical usage is that a program is going to perform some
computation on the Rearrangements, and it needs access to the Repertoire metadata
as part of the computation logic. This example code shows the basic framework
for doing that, in this case doing sex-specific computation::

    import airr

    # Load AIRR data containing repertoires
    data = airr.read_airr('input.airr.json')

    # Put repertoires in dictionary keyed by repertoire_id
    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

    # Create an iterable for rearrangement data
    reader = airr.read_rearrangement('input.tsv')
    for row in reader:
        # get repertoire metadata with this rearrangement
        rep = rep_dict[row['repertoire_id']]

        # check the subject's sex
        if rep['subject']['sex'] == 'male':
            # do male specific computation
            pass
        elif rep['subject']['sex'] == 'female':
            # do female specific computation
            pass
        else:
            # do other specific computation
            pass
././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1717370576.6323764 airr-1.5.1/airr/0000755000076500000240000000000014627177321012765 5ustar00vandej27staff
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1716315049.0
airr-1.5.1/airr/__init__.py0000644000076500000240000000035114623161651015070 0ustar00vandej27staff
"""
Reference library for AIRR schema for Ig/TCR rearrangements
"""
from airr.interface import *
from airr.schema import ValidationError

# versioneer-generated
from . import _version
__version__ = _version.get_versions()['version']
././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1717370576.6361873 airr-1.5.1/airr/_version.py0000644000076500000240000000076114627177321015167 0ustar00vandej27staff
# This file was generated by 'versioneer.py' (0.29) from
# revision-control system data, or from the parent directory name of an
# unpacked source archive. Distribution tarballs contain a pre-generated copy
# of this file.

import json

version_json = '''
{
 "date": "2024-06-02T16:08:01-0700",
 "dirty": false,
 "error": null,
 "full-revisionid": "2ccc341fc420705b1f1447bd732f9084f160fc0e",
 "version": "1.5.1"
}
'''  # END VERSION_JSON


def get_versions():
    return json.loads(version_json)
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/airr/interface.py0000644000076500000240000004561414627177010015304 0ustar00vandej27staff
"""
Interface functions for file operations
"""
from __future__ import absolute_import

# System imports
import gzip
import json
import sys
import pandas as pd
import yaml
import yamlordereddictloader
from collections import OrderedDict
from itertools import chain
from io import open
from warnings import warn

if (sys.version_info > (3, 0)):
    from io import StringIO
else:
    # Python 2 code in this block
    from io import BytesIO as StringIO

# Load imports
from airr.io import RearrangementReader, RearrangementWriter
from airr.schema import Schema, RearrangementSchema, RepertoireSchema, AIRRSchema, DataFileSchema, ValidationError


#### Rearrangement ####

def read_rearrangement(filename, validate=False, debug=False):
    """
    Open an iterator to read an AIRR rearrangements file

    Arguments:
      filename (str):
path to the input file.
      validate (bool): whether to validate data as it is read, raising a ValidationError
                       exception in the event of an error.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      airr.io.RearrangementReader: iterable reader class.
    """
    if filename.endswith(".gz"):
        # Open in text mode; gzip.open treats a bare 'r' as binary
        handle = gzip.open(filename, 'rt')
    else:
        handle = open(filename, 'r')

    return RearrangementReader(handle, validate=validate, debug=debug)


def create_rearrangement(filename, fields=None, debug=False):
    """
    Create an empty AIRR rearrangements file writer

    Arguments:
      filename (str): output file path.
      fields (list): additional non-required fields to add to the output.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      airr.io.RearrangementWriter: open writer class.
    """
    return RearrangementWriter(open(filename, 'w+'), fields=fields, debug=debug)


def derive_rearrangement(out_filename, in_filename, fields=None, debug=False):
    """
    Create an empty AIRR rearrangements file with fields derived from an existing file

    Arguments:
      out_filename (str): output file path.
      in_filename (str): existing file to derive fields from.
      fields (list): additional non-required fields to add to the output.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      airr.io.RearrangementWriter: open writer class.
    """
    # Read the existing field names, closing the input file afterwards
    with open(in_filename, 'r') as handle:
        reader = RearrangementReader(handle)
        in_fields = list(reader.fields)
    if fields is not None:
        in_fields.extend([f for f in fields if f not in in_fields])

    return RearrangementWriter(open(out_filename, 'w+'), fields=in_fields, debug=debug)


def load_rearrangement(filename, validate=False, debug=False):
    """
    Load the contents of an AIRR rearrangements file into a data frame

    Arguments:
      filename (str): input file path.
      validate (bool): whether to validate data as it is read, raising a ValidationError
                       exception in the event of an error.
      debug (bool): debug flag.
If True print debugging information to standard error.

    Returns:
      pandas.DataFrame: Rearrangement records as rows of a data frame.
    """
    # TODO: test pandas.read_csv with converters argument as an alternative
    schema = RearrangementSchema

    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
                     dtype=schema.pandas_types(), true_values=schema.true_values,
                     false_values=schema.false_values)
    # added to use RearrangementReader without modifying it:
    buffer = StringIO()  # create an empty buffer
    df.to_csv(buffer, sep='\t', index=False)  # fill buffer
    buffer.seek(0)  # set to the start of the stream

    reader = RearrangementReader(buffer, validate=validate, debug=debug)

    df = pd.DataFrame(list(reader))
    return df


def dump_rearrangement(dataframe, filename, debug=False):
    """
    Write the contents of a data frame to an AIRR rearrangements file

    Arguments:
      dataframe (pandas.DataFrame): data frame of rearrangement data.
      filename (str): output file path.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if the file is written without error.
    """
    # TODO: test pandas.DataFrame.to_csv with converters argument as an alternative
    # dataframe.to_csv(handle, sep='\t', header=True, index=False, encoding='utf-8')

    fields = dataframe.columns.tolist()
    with open(filename, 'w+') as handle:
        writer = RearrangementWriter(handle, fields=fields, debug=debug)
        for __, row in dataframe.iterrows():
            writer.write(row.to_dict())

    return True


def merge_rearrangement(out_filename, in_filenames, drop=False, debug=False):
    """
    Merge one or more AIRR rearrangements files

    Arguments:
      out_filename (str): output file path.
      in_filenames (list): list of input files to merge.
      drop (bool): drop flag. If True then drop fields that do not exist in all input
                   files, otherwise combine fields from all input files.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if files were successfully merged, otherwise False.
    """
    try:
        # gather fields from input files
        readers = (RearrangementReader(open(f, 'r'), debug=False) for f in in_filenames)
        field_list = [x.fields for x in readers]
        if drop:
            field_set = set.intersection(*map(set, field_list))
        else:
            field_set = set.union(*map(set, field_list))
        field_order = OrderedDict([(f, None) for f in chain(*field_list)])
        out_fields = [f for f in field_order if f in field_set]

        # write input files to output file sequentially
        readers = (RearrangementReader(open(f, 'r'), debug=debug) for f in in_filenames)
        with open(out_filename, 'w+') as handle:
            writer = RearrangementWriter(handle, fields=out_fields, debug=debug)
            for reader in readers:
                for r in reader:
                    writer.write(r)
                reader.close()
    except Exception as e:
        sys.stderr.write('Error occurred while merging AIRR rearrangement files: %s\n' % e)
        return False

    return True


def validate_rearrangement(filename, debug=False):
    """
    Validates an AIRR rearrangements file

    Arguments:
      filename (str): path of the file to validate.
      debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
      bool: True if files passed validation, otherwise False.
    """
    valid = True
    if debug:
        sys.stderr.write('Validating: %s\n' % filename)

    # Open reader
    handle = open(filename, 'r')
    reader = RearrangementReader(handle, validate=True)

    # Validate header
    try:
        iter(reader)
    except ValidationError as e:
        valid = False
        if debug:
            sys.stderr.write('%s has validation error: %s\n' % (filename, e))

    # Validate each row
    i = 0
    while True:
        try:
            i = i + 1
            next(reader)
        except StopIteration:
            break
        except ValidationError as e:
            valid = False
            if debug:
                sys.stderr.write('%s at record %i has validation error: %s\n' % (filename, i, e))

    # Close
    handle.close()

    return valid


#### AIRR Data Model ####

def read_airr(filename, format=None, validate=False, model=True, debug=False):
    """
    Load an AIRR Data file

    Arguments:
      filename (str): path to the input file.
      format (str): input file format; valid strings are "yaml" or "json".
If set to None, the file format will be automatically detected from the file extension. validate (bool): whether to validate data as it is read, raising a ValidationError exception in the event of a validation failure. model (bool): If True only validate objects defined in the AIRR DataFile schema. If False, attempt validation of all top-level objects. Ignored if validate=False. debug (bool): debug flag. If True print debugging information to standard error. Returns: dict: dictionary of AIRR Data objects. """ # Because the AIRR Data File is read in completely, we do not bother with a reader class. # Determine file type from extension and use appropriate loader ext = str.lower(filename.split('.')[-1]) if not format else format if ext in ('yaml', 'yml'): with open(filename, 'r', encoding='utf-8') as handle: data = yaml.load(handle, Loader=yamlordereddictloader.Loader) elif ext == 'json': with open(filename, 'r', encoding='utf-8') as handle: data = json.load(handle) else: if debug: sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext) raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext) data = None # Validate if requested if validate: if debug: sys.stderr.write('Validating: %s\n' % filename) try: valid = validate_airr(data, model=model, debug=debug) except ValidationError as e: if debug: sys.stderr.write('%s failed validation\n' % filename) raise ValidationError(e) # We do not perform any additional processing return data def validate_airr(data, model=True, debug=False): """ Validates an AIRR Data file Arguments: data (dict): dictionary containing AIRR Data Model objects model (bool): If True only validate objects defined in the AIRR DataFile schema. If False, attempt validation of all top-level objects debug (bool): debug flag. If True print debugging information to standard error. Returns: bool: True if files passed validation, otherwise False. 
    """
    # Type check that input type is either dict or OrderedDict
    if not hasattr(data, 'items'):
        if debug:
            sys.stderr.write('Data parameter is not a dictionary\n')
        raise TypeError('Data parameter is not a dictionary')

    # Loop through each AIRR object and validate
    valid = True
    for k, object in data.items():
        if k in ('Info', 'DataFile'):
            continue
        if not object:
            continue

        # Check for DataFile schema
        if model and k not in DataFileSchema.properties:
            if debug:
                sys.stderr.write('Skipping non-DataFile object: %s\n' % k)
            continue

        # Get Schema
        schema = AIRRSchema.get(k, Schema(k))

        # Determine input type and set appropriate iterator
        if hasattr(object, 'items'):
            # Validate named array (dict)
            obj_iter = object.items()
            # Validate named array (dict) or a single object (dict)
            # obj_iter = object.items() if 'definition' not in object.keys() else [0, object]
        elif isinstance(object, list):
            # Validate array
            obj_iter = enumerate(object)
        else:
            # Unrecognized data structure; report the key and the offending type
            valid = False
            if debug:
                sys.stderr.write('%s is an unrecognized data structure: %s\n' % (k, type(object).__name__))
            continue

        # Validate each record in array
        for i, record in obj_iter:
            try:
                schema.validate_object(record)
            except ValidationError as e:
                valid = False
                if debug:
                    sys.stderr.write('%s at array position %s with validation error: %s\n' % (k, i, e))

    if not valid:
        raise ValidationError('AIRR Data Model has validation failures')

    return valid


def write_airr(filename, data, format=None, info=None, validate=False, model=True, debug=False):
    """
    Write an AIRR Data file

    Arguments:
        filename (str): path to the output file.
        data (dict): dictionary of AIRR Data Model objects.
        format (str): output file format; valid strings are "yaml" or "json". If set to None, the file format will be automatically detected from the file extension.
        info (object): info object to write. Will write current AIRR Schema info if not specified.
        validate (bool): whether to validate data before it is written, raising a ValidationError exception in the event of a validation failure.
        model (bool): If True only validate and write objects defined in the AIRR DataFile schema. If False, attempt validation and write of all top-level objects.
        debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
        bool: True if the file is written without error.
    """
    # Type check that input type is either dict or OrderedDict
    if not hasattr(data, 'items'):
        if debug:
            sys.stderr.write('Data parameter is not a dictionary\n')
        raise TypeError('Data parameter is not a dictionary')

    # Validate if requested
    if validate:
        if debug:
            sys.stderr.write('Validating: %s\n' % filename)
        try:
            valid = validate_airr(data, model=model, debug=debug)
        except ValidationError as e:
            # stderr.write requires a string, so format the exception first
            if debug:
                sys.stderr.write('%s\n' % e)
            raise ValidationError(e)

    md = OrderedDict()
    if info is None:
        info = RearrangementSchema.info.copy()
        info['title'] = 'AIRR Data File'
        info['description'] = 'AIRR Data File written by AIRR Standards Python Library'
    md['Info'] = info

    # Loop through each object and add them to the output dict
    for k, obj in data.items():
        if k in ('Info', 'DataFile'):
            continue
        if not obj:
            continue
        if model and k not in DataFileSchema.properties:
            if debug:
                sys.stderr.write('Skipping non-DataFile object: %s\n' % k)
            continue
        md[k] = obj

    # Determine file type from extension and use appropriate writer
    ext = str.lower(filename.split('.')[-1]) if not format else format
    if ext in ('yaml', 'yml'):
        with open(filename, 'w') as handle:
            yaml.dump(md, handle, default_flow_style=False)
    elif ext == 'json':
        with open(filename, 'w') as handle:
            json.dump(md, handle, sort_keys=False, indent=2)
    else:
        if debug:
            sys.stderr.write('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext)
        raise TypeError('Unknown file type: %s. Supported file extensions are "yaml", "yml" or "json"\n' % ext)

    return True


#### Deprecated ####

def repertoire_template():
    """
    Return a blank repertoire object from the template.
This object has the complete structure with all of the fields and all values set to None or empty string. Returns: object: empty repertoire object. .. deprecated:: 1.4 Use :meth:`schema.Schema.template` instead. """ # Deprecation warn('repertoire_template is deprecated and will be removed in a future release.\nUse schema.Schema.template instead.\n', DeprecationWarning, stacklevel=2) # Build template object = RepertoireSchema.template() return object def load_repertoire(filename, validate=False, debug=False): """ Load an AIRR repertoire metadata file Arguments: filename (str): path to the input file. validate (bool): whether to validate data as it is read, raising a ValidationError exception in the event of an error. debug (bool): debug flag. If True print debugging information to standard error. Returns: dict: dictionary of AIRR Data objects. .. deprecated:: 1.4 Use :func:`read_airr` instead. """ # Deprecation warn('load_repertoire is deprecated and will be removed in a future release.\nUse read_airr instead.\n', DeprecationWarning, stacklevel=2) # use standard load function, we only validate Repertoire if requested md = read_airr(filename, validate=validate, debug=debug) if md.get('Repertoire') is None: if debug: sys.stderr.write('%s is missing "Repertoire" key\n' % (filename)) raise KeyError('Repertoire object cannot be found in the file') # validate if requested if validate: valid = True reps = md['Repertoire'] i = 0 for r in reps: try: RepertoireSchema.validate_object(r) except ValidationError as e: valid = False if debug: sys.stderr.write('%s has repertoire at array position %i with validation error: %s\n' % (filename, i, e)) i = i + 1 if not valid: raise ValidationError('Repertoire file %s has validation errors\n' % (filename)) # we do not perform any additional processing return md def validate_repertoire(filename, debug=False): """ Validates an AIRR repertoire metadata file Arguments: filename (str): path of the file to validate. debug (bool): debug flag. 
            If True print debugging information to standard error.

    Returns:
        bool: True if the file passed validation, otherwise False.

    .. deprecated:: 1.4
        Use :func:`validate_airr` instead.
    """
    # Deprecation
    warn('validate_repertoire is deprecated and will be removed in a future release.\nUse validate_airr instead.\n',
         DeprecationWarning, stacklevel=2)

    valid = True
    if debug:
        sys.stderr.write('Validating: %s\n' % filename)

    # load with validate
    try:
        data = load_repertoire(filename, validate=True, debug=debug)
    except TypeError:
        valid = False
    except KeyError:
        valid = False
    except ValidationError as e:
        valid = False
        if debug:
            sys.stderr.write('%s has validation error: %s\n' % (filename, e))

    return valid


def write_repertoire(filename, repertoires, info=None, debug=False):
    """
    Write an AIRR repertoire metadata file

    Arguments:
        filename (str): path to the output file.
        repertoires (list): array of repertoire objects.
        info (object): info object to write. Will write current AIRR Schema info if not specified.
        debug (bool): debug flag. If True print debugging information to standard error.

    Returns:
        bool: True if the file is written without error.

    .. deprecated:: 1.4
        Use :func:`write_airr` instead.
""" # Deprecation warn('write_repertoire is deprecated and will be removed in a future release.\nUse write_airr instead.\n', DeprecationWarning, stacklevel=2) if not isinstance(repertoires, list): if debug: sys.stderr.write('Repertoires parameter is not a list\n') raise TypeError('Repertoires parameter is not a list') md = OrderedDict() if info is None: info = RearrangementSchema.info.copy() info['title'] = 'Repertoire metadata' info['description'] = 'Repertoire metadata written by AIRR Standards Python Library' md['Info'] = info md['Repertoire'] = repertoires return write_airr(filename, md, info=info, debug=debug) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1616358979.0 airr-1.5.1/airr/io.py0000644000076500000240000001740414025727103013743 0ustar00vandej27staff""" Reference library for AIRR schema for Ig/TCR rearrangements """ from __future__ import print_function import sys import csv from airr.schema import RearrangementSchema, ValidationError class RearrangementReader: """ Iterator for reading Rearrangement objects in TSV format Attributes: fields (list): field names in the input Rearrangement file. external_fields (list): list of fields in the input file that are not part of the Rearrangement definition. """ @property def fields(self): """ Get list of fields Returns: list : field names. """ return self.dict_reader.fieldnames @property def external_fields(self): """ Get list of field that are not in the Rearrangement schema Returns: list : field names. """ return [f for f in self.dict_reader.fieldnames \ if f not in self.schema.properties] def __init__(self, handle, base=1, validate=False, debug=False): """ Initialization Arguments: handle (file): file handle of the open Rearrangement file. base (int): one of 0 or 1 specifying the coordinate schema in the input file. If 1, then the file is assumed to contain 1-based closed intervals that will be converted to python style 0-based half-open intervals for known fields. 
            If 0, then values will be unchanged.
          validate (bool): perform validation. If True then basic validation will be performed while reading the data. A ValidationError exception will be raised if an error is found.
          debug (bool): debug state. If True prints debug information.

        Returns:
          airr.io.RearrangementReader: reader object.
        """
        # arguments
        self.handle = handle
        self.base = base
        self.debug = debug
        self.validate = validate
        self.schema = RearrangementSchema

        # data reader, collect field names
        self.dict_reader = csv.DictReader(self.handle, dialect='excel-tab')

    def __iter__(self):
        """
        Iterator initializer

        Returns:
          airr.io.RearrangementReader
        """
        # Validate fields
        if (self.validate):
            self.schema.validate_header(self.dict_reader.fieldnames)

        return self

    def __next__(self):
        """
        Next method

        Returns:
          dict: parsed Rearrangement data.
        """
        try:
            row = next(self.dict_reader)
        except StopIteration:
            raise StopIteration

        for f in row:
            # row entry with no header
            if f is None:
                if self.validate:
                    raise ValidationError('row has extra data')
                else:
                    raise ValueError('row has extra data')

            # Convert types
            spec = self.schema.type(f)
            try:
                if spec == 'boolean':
                    row[f] = self.schema.to_bool(row[f], validate=self.validate)
                if spec == 'integer':
                    row[f] = self.schema.to_int(row[f], validate=self.validate)
                if spec == 'number':
                    row[f] = self.schema.to_float(row[f], validate=self.validate)
            except ValidationError as e:
                raise ValidationError('field %s has %s' %(f, e))

            # Adjust coordinates
            if f and f.endswith('_start') and self.base == 1:
                try:
                    row[f] = row[f] - 1
                except TypeError:
                    row[f] = None

        return row

    def close(self):
        """
        Closes the Rearrangement file
        """
        self.handle.close()

    def next(self):
        """
        Next method
        """
        return self.__next__()


class RearrangementWriter:
    """
    Writer class for Rearrangement objects in TSV format

    Attributes:
      fields (list): field names in the output Rearrangement file.
      external_fields (list): list of fields in the output file that are not part of the Rearrangement definition.
    """
    @property
    def fields(self):
        """
        Get list of fields

        Returns:
          list : field names.
        """
        return self.dict_writer.fieldnames

    @property
    def external_fields(self):
        """
        Get list of fields that are not in the Rearrangement schema

        Returns:
          list : field names.
        """
        return [f for f in self.dict_writer.fieldnames \
                if f not in self.schema.properties]

    def __init__(self, handle, fields=None, base=1, debug=False):
        """
        Initialization

        Arguments:
          handle (file): file handle of the open Rearrangement file.
          fields (list) : list of non-required fields to add. May include fields undefined by the schema.
          base (int): one of 0 or 1 specifying the coordinate schema in the output file. Data provided to the writer is assumed to be in python style 0-based half-open intervals. If 1, then data will be converted to 1-based closed intervals for known fields before writing. If 0, then values will be unchanged.
          debug (bool): debug state. If True prints debug information.

        Returns:
          airr.io.RearrangementWriter: writer object.
        """
        # arguments
        self.handle = handle
        self.base = base
        self.debug = debug
        self.schema = RearrangementSchema

        # order fields according to spec
        field_names = list(self.schema.required)
        if fields is not None:
            additional_fields = []
            for f in fields:
                if f in self.schema.required:
                    continue
                elif f in self.schema.optional:
                    field_names.append(f)
                else:
                    additional_fields.append(f)
            field_names.extend(additional_fields)

        # open writer and write header
        self.dict_writer = csv.DictWriter(self.handle, fieldnames=field_names, dialect='excel-tab',
                                          extrasaction='ignore', lineterminator='\n')
        self.dict_writer.writeheader()

    def close(self):
        """
        Closes the Rearrangement file
        """
        self.handle.close()

    def write(self, row):
        """
        Write a row to the Rearrangement file

        Arguments:
          row (dict): row to write.
""" # validate row if self.debug: for field in self.schema.required: if row.get(field, None) is None: sys.stderr.write('Warning: Record is missing AIRR required field (' + field + ').\n') for f in row.keys(): # Adjust coordinates if f.endswith('_start') and self.base == 1: try: row[f] = self.schema.to_int(row[f]) + 1 except TypeError: row[f] = None # Convert types spec = self.schema.type(f) if spec == 'boolean': row[f] = self.schema.from_bool(row[f]) self.dict_writer.writerow(row) # TODO: pandas validation need if we load with pandas directly # def validate_df(df, airr_schema): # valid = True # # # check required fields # missing_fields = set(airr_schema.required) - set(df.columns) # if len(missing_fields) > 0: # print('Warning: file is missing mandatory fields: {}'.format(', '.join(missing_fields))) # valid = False # # if not valid: # raise ValueError('invalid AIRR data file') ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/airr/schema.py0000644000076500000240000005206314627177010014600 0ustar00vandej27staff""" AIRR Data Representation Schema """ # Imports import sys import yaml import yamlordereddictloader from collections import OrderedDict from pkg_resources import resource_stream with resource_stream(__name__, 'specs/airr-schema.yaml') as f: DEFAULT_SPEC = yaml.load(f, Loader=yamlordereddictloader.Loader) class ValidationError(Exception): """ Exception raised when validation errors are encountered. """ pass class Schema: """ AIRR schema definitions Attributes: definition: name of the schema definition. info (collections.OrderedDict): schema info. properties (collections.OrderedDict): field definitions. required (list): list of mandatory fields. optional (list): list of non-required fields. false_values (list): accepted string values for False. true_values (list): accepted values for True. 
""" # Boolean list for pandas true_values = ['True', 'true', 'TRUE', 'T', 't', '1'] false_values = ['False', 'false', 'FALSE', 'F', 'f', '0'] # Generate dicts for booleans _to_bool_map = {x: True for x in true_values + [1, True]} _to_bool_map.update({x: False for x in false_values + [0, False]}) _from_bool_map = {k: 'T' if v else 'F' for k, v in _to_bool_map.items()} def __init__(self, definition): """ Initialization Arguments: definition (string): the schema definition to load. Returns: airr.schema.Schema : schema object. """ # Info is not a valid schema if definition == 'Info': raise KeyError('Info is an invalid schema definition name') # Load object definition if isinstance(definition, dict): # on-the-fly definition of a nested object self.definition = definition spec = {'Info': []} else: spec = DEFAULT_SPEC try: self.definition = spec[definition] except KeyError: raise KeyError('Schema definition %s cannot be found in the specifications' % definition) except: raise try: self.info = spec['Info'] except KeyError: raise KeyError('Info object cannot be found in the specifications') except: raise if self.definition.get('properties') is not None: self.properties = self.definition['properties'] try: self.required = self.definition['required'] except KeyError: self.required = [] except: raise elif self.definition.get('allOf') is not None: self.properties = {} self.required = [] for s in self.definition['allOf']: if s.get('$ref') is not None: schema_name = s['$ref'].split('/')[-1] # cannot use cache here schema = Schema(schema_name) # no nested allOf ... 
                    self.properties.update(schema.properties)
                    self.required.extend(schema.required)
                elif s.get('properties') is not None:
                    self.properties.update(s.get('properties'))
                    if s.get('required') is not None:
                        self.required.extend(s.get('required'))
        else:
            raise KeyError('Cannot find properties for schema definition %s' % definition)

        self.optional = [f for f in self.properties if f not in self.required]

    def spec(self, field):
        """
        Get the properties for a field

        Arguments:
          field (str): field name.

        Returns:
          collections.OrderedDict: definition for the field, or None if the field is not in the schema.
        """
        return self.properties.get(field, None)

    def type(self, field):
        """
        Get the type for a field

        Arguments:
          field (str): field name.

        Returns:
          str: the type definition for the field, or None if the field or its type is undefined.
        """
        field_spec = self.properties.get(field, None)
        field_type = field_spec.get('type', None) if field_spec else None
        return field_type

    def pandas_types(self):
        """
        Map of schema types to pandas types

        Returns:
          dict: mapping dictionary for pandas types
        """
        type_mapping = {}
        for property in self.properties:
            if self.type(property) == 'boolean':
                type_mapping[property] = bool
            elif self.type(property) == 'integer':
                type_mapping[property] = 'Int64'
            elif self.type(property) == 'number':
                type_mapping[property] = 'float64'
            elif self.type(property) == 'string':
                type_mapping[property] = str

        return type_mapping

    def to_bool(self, value, validate=False):
        """
        Convert a string to a boolean

        Arguments:
          value (str): logical value as a string.
          validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None.

        Returns:
          bool: conversion of the string to True or False.

        Raises:
          airr.ValidationError: raised if value is invalid when validate is set True.
        """
        if value == '' or value is None:
            return None

        bool_value = self._to_bool_map.get(value, None)
        if bool_value is None and validate:
            raise ValidationError('invalid bool %s' % value)
        else:
            return bool_value

    def from_bool(self, value, validate=False):
        """
        Converts a boolean to a string

        Arguments:
          value (bool): logical value.
          validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None.

        Returns:
          str: conversion of True or False to 'T' or 'F'. Empty or None input yields an empty string.

        Raises:
          airr.ValidationError: raised if value is invalid when validate is set True.
        """
        if value == '' or value is None:
            return ''

        str_value = self._from_bool_map.get(value, None)
        if str_value is None and validate:
            raise ValidationError('invalid bool %s' % value)
        else:
            return str_value

    def to_int(self, value, validate=False):
        """
        Converts a string to an integer

        Arguments:
          value (str): integer value as a string.
          validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None.

        Returns:
          int: conversion of the string to an integer.

        Raises:
          airr.ValidationError: raised if value is invalid when validate is set True.
        """
        if value == '' or value is None:
            return None
        if isinstance(value, int):
            return value

        try:
            return int(value)
        except ValueError:
            if validate:
                raise ValidationError('invalid int %s' % value)
            else:
                return None

    def to_float(self, value, validate=False):
        """
        Converts a string to a float

        Arguments:
          value (str): float value as a string.
          validate (bool): when True raise a ValidationError for an invalid value. Otherwise, set invalid values to None.

        Returns:
          float: conversion of the string to a float.

        Raises:
          airr.ValidationError: raised if value is invalid when validate is set True.
        """
        if value == '' or value is None:
            return None
        if isinstance(value, float):
            return value

        try:
            return float(value)
        except ValueError:
            if validate:
                raise ValidationError('invalid float %s' % value)
            else:
                return None

    def validate_header(self, header):
        """
        Validate header against the schema

        Arguments:
          header (list): list of header fields.

        Returns:
          bool: True if a ValidationError exception is not raised.

        Raises:
          airr.ValidationError: raised if header fails validation.
        """
        # Check for missing header
        if header is None:
            raise ValidationError('missing header')

        # Check required fields
        missing_fields = [f for f in self.required if f not in header]

        if missing_fields:
            raise ValidationError('missing required fields (%s)' % ', '.join(missing_fields))
        else:
            return True

    def validate_row(self, row):
        """
        Validate Rearrangement row data against schema

        Arguments:
          row (dict): dictionary containing a single record.

        Returns:
          bool: True if a ValidationError exception is not raised.

        Raises:
          airr.ValidationError: raised if row fails validation.
        """
        for f in row:
            # Empty strings are valid
            if row[f] == '' or row[f] is None:
                continue

            # Check types
            spec = self.type(f)
            try:
                if spec == 'boolean':
                    self.to_bool(row[f], validate=True)
                if spec == 'integer':
                    self.to_int(row[f], validate=True)
                if spec == 'number':
                    self.to_float(row[f], validate=True)
            except ValidationError as e:
                raise ValidationError('field %s has %s' % (f, e))

        return True

    def validate_object(self, obj, missing=True, nonairr=True, context=None):
        """
        Validate Repertoire object data against schema

        Arguments:
          obj (dict): dictionary containing a single repertoire object.
          missing (bool): provides warnings for missing optional fields.
          nonairr (bool): provides warnings for non-AIRR fields that cannot be validated.
          context (string): used by recursion to indicate place in object hierarchy.

        Returns:
          bool: True if a ValidationError exception is not raised.

        Raises:
          airr.ValidationError: raised if object fails validation.
""" # object has to be a dictionary if not hasattr(obj, 'items'): if context is None: raise ValidationError('object is not a dictionary') else: raise ValidationError('field "%s" is not a dictionary object' % context) # first warn about non-AIRR fields if nonairr: for f in obj: if context is None: full_field = f else: full_field = context + '.' + f if self.properties.get(f) is None: sys.stderr.write('Warning: Object has non-AIRR field that cannot be validated (' + full_field + ').\n') # now walk through schema and check types for f in self.properties: if context is None: full_field = f else: full_field = context + '.' + f spec = self.spec(f) xairr = spec.get('x-airr') # check if deprecated if xairr and xairr.get('deprecated'): continue # check if null and if key is missing is_missing_key = False is_null = False if obj.get(f) is None: is_null = True if obj.get(f, 'missing') == 'missing': is_missing_key = True # check MiAIRR keys exist if xairr and xairr.get('miairr'): if is_missing_key: raise ValidationError('MiAIRR field "%s" is missing' % full_field) # check if required field if f in self.required and is_missing_key: raise ValidationError('Required field "%s" is missing' % full_field) # check if identifier field if xairr and xairr.get('identifier'): if is_missing_key: if xairr.get('nullable'): sys.stderr.write( 'Warning: Nullable identifier field "%s" is missing.\n' % full_field) else: raise ValidationError('Not-nullable identifier field "%s" is missing' % full_field) # check nullable requirements if is_null: if not xairr: # default is true continue if xairr.get('nullable') or xairr.get('nullable', 'missing') == 'missing': # nullable is allowed continue else: # nullable not allowed raise ValidationError('Non-nullable field "%s" is null or missing' % full_field) # if get to here, field should exist with non null value # check types field_type = self.type(f) if field_type is None: # for referenced object, recursively call validate with object and schema if 
spec.get('$ref') is not None: schema_name = spec['$ref'].split('/')[-1] if AIRRSchema.get(schema_name): schema = AIRRSchema[schema_name] else: schema = Schema(schema_name) schema.validate_object(obj[f], missing, nonairr, full_field) else: raise ValidationError('Internal error: field "%s" in schema not handled by validation. File a bug report.' % full_field) elif field_type == 'array': if not isinstance(obj[f], list): raise ValidationError('field "%s" is not an array' % full_field) # for array, check each object in it for row in obj[f]: if spec['items'].get('$ref') is not None: schema_name = spec['items']['$ref'].split('/')[-1] schema = Schema(schema_name) schema.validate_object(row, missing, nonairr, full_field) elif spec['items'].get('allOf') is not None: for s in spec['items']['allOf']: if s.get('$ref') is not None: schema_name = s['$ref'].split('/')[-1] if AIRRSchema.get(schema_name): schema = AIRRSchema[schema_name] else: schema = Schema(schema_name) schema.validate_object(row, missing, False, full_field) elif spec['items'].get('enum') is not None: if row not in spec['items']['enum']: raise ValidationError('field "%s" has value "%s" not among possible enumeration values' % (full_field, row)) elif spec['items'].get('type') == 'string': if not isinstance(row, str): raise ValidationError('array field "%s" does not have string type: %s' % (full_field, row)) elif spec['items'].get('type') == 'boolean': if not isinstance(row, bool): raise ValidationError('array field "%s" does not have boolean type: %s' % (full_field, row)) elif spec['items'].get('type') == 'integer': if not isinstance(row, int): raise ValidationError('array field "%s" does not have integer type: %s' % (full_field, row)) elif spec['items'].get('type') == 'number': if not isinstance(row, float) and not isinstance(row, int): raise ValidationError('array field "%s" does not have number type: %s' % (full_field, row)) elif spec['items'].get('type') == 'object': sub_schema = Schema({'properties': 
spec['items'].get('properties')}) sub_schema.validate_object(row, missing, nonairr, context) else: raise ValidationError('Internal error: array field "%s" in schema not handled by validation. File a bug report.' % full_field) elif field_type == 'object': # right now all arrays of objects use $ref raise ValidationError('Internal error: field "%s" in schema not handled by validation. File a bug report.' % full_field) else: # check basic types if field_type == 'string': if not isinstance(obj[f], str): raise ValidationError('Field "%s" does not have string type: %s' % (full_field, obj[f])) elif field_type == 'boolean': if not isinstance(obj[f], bool): raise ValidationError('Field "%s" does not have boolean type: %s' % (full_field, obj[f])) elif field_type == 'integer': if not isinstance(obj[f], int): raise ValidationError('Field "%s" does not have integer type: %s' % (full_field, obj[f])) elif field_type == 'number': if not isinstance(obj[f], float) and not isinstance(obj[f], int): raise ValidationError('Field "%s" does not have number type: %s' % (full_field, obj[f])) else: raise ValidationError('Internal error: Field "%s" with type %s in schema not handled by validation. File a bug report.' % (full_field, field_type)) # check basic types enums enums = spec.get('enum') if enums is not None: field_value = obj[f] if field_value not in enums: raise ValidationError( 'field "%s" has value "%s" not among possible enumeration values %s' % (full_field, field_value, enums) ) return True def template(self): """ Create an empty template object Returns: collections.OrderedDict: dictionary with all schema properties set as None or an empty list. 
""" # Set defaults for each data type type_default = {'boolean': False, 'integer': 0, 'number': 0.0, 'string': '', 'array':[]} # Fetch schema template definition for a $ref string def _reference(ref): x = ref.split('/')[-1] schema = AIRRSchema.get(x, Schema(x)) return(schema.template()) # Get default value def _default(spec): if 'nullable' in spec['x-airr'] and not spec['x-airr']['nullable']: if 'enum' in spec: return spec['enum'][0] else: return type_default.get(spec['type'], None) else: return None # Populate empty object object = OrderedDict() for k, spec in self.properties.items(): # Skip deprecated if 'x-airr' in spec and spec['x-airr'].get('deprecated', False): continue # Population values if '$ref' in spec: object[k] = _reference(spec['$ref']) elif spec['type'] == 'array': if '$ref' in spec['items']: object[k] = [_reference(spec['items']['$ref'])] else: object[k] = [] elif 'x-airr' in spec: object[k] = _default(spec) else: object[k] = None return(object) # Preloaded schema AIRRSchema = { 'Info': Schema('InfoObject'), 'DataFile': Schema('DataFile'), 'Alignment': Schema('Alignment'), 'Rearrangement': Schema('Rearrangement'), 'Repertoire': Schema('Repertoire'), 'RepertoireGroup': Schema('RepertoireGroup'), 'Ontology': Schema('Ontology'), 'Study': Schema('Study'), 'Subject': Schema('Subject'), 'Diagnosis': Schema('Diagnosis'), 'SampleProcessing': Schema('SampleProcessing'), 'CellProcessing': Schema('CellProcessing'), 'PCRTarget': Schema('PCRTarget'), 'NucleicAcidProcessing': Schema('NucleicAcidProcessing'), 'SequencingRun': Schema('SequencingRun'), 'SequencingData': Schema('SequencingData'), 'DataProcessing': Schema('DataProcessing'), 'GermlineSet': Schema('GermlineSet'), 'Acknowledgement': Schema('Acknowledgement'), 'RearrangedSequence': Schema('RearrangedSequence'), 'UnrearrangedSequence': Schema('UnrearrangedSequence'), 'SequenceDelineationV': Schema('SequenceDelineationV'), 'AlleleDescription': Schema('AlleleDescription'), 'GenotypeSet': 
Schema('GenotypeSet'), 'Genotype': Schema('Genotype'), 'Cell': Schema('Cell'), 'Clone': Schema('Clone') } InfoSchema = AIRRSchema['Info'] DataFileSchema = AIRRSchema['DataFile'] AlignmentSchema = AIRRSchema['Alignment'] RearrangementSchema = AIRRSchema['Rearrangement'] RepertoireSchema = AIRRSchema['Repertoire'] GermlineSetSchema = AIRRSchema['GermlineSet'] GenotypeSetSchema = AIRRSchema['GenotypeSet'] ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1717370576.6335478 airr-1.5.1/airr/specs/0000755000076500000240000000000014627177321014102 5ustar00vandej27staff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1616358979.0 airr-1.5.1/airr/specs/__init__.py0000644000076500000240000000000014025727103016170 0ustar00vandej27staff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/airr/specs/airr-schema.yaml0000644000076500000240000052523114627177010017164 0ustar00vandej27staff# # Schema definitions for AIRR standards objects # Info: title: AIRR Schema description: Schema definitions for AIRR standards objects version: 1.5 contact: name: AIRR Community url: https://github.com/airr-community license: name: Creative Commons Attribution 4.0 International url: https://creativecommons.org/licenses/by/4.0/ # Properties that are based upon an ontology use this # standard schema definition Ontology: type: object properties: id: type: string description: CURIE of the concept, encoding the ontology and the local ID label: type: string description: Label of the concept in the respective ontology # Map to expand CURIE prefixes to full IRIs CURIEMap: ABREG: type: identifier default: map: ABREG map: ABREG: iri_prefix: "http://antibodyregistry.org/AB_" CHEBI: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/CHEBI_" CL: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: 
"http://purl.obolibrary.org/obo/CL_" DOI: type: identifier default: map: DOI map: DOI: iri_prefix: "https://doi.org/" DOID: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/DOID_" ENA: type: identifier default: map: ENA map: ENA: iri_prefix: "https://www.ebi.ac.uk/ena/browser/view/" ENSG: type: identifier default: map: ENSG map: ENSG: iri_prefix: "https://www.ensembl.org/Multi/Search/Results?q=" IEDB_RECEPTOR: type: identifier default: map: IEDB provider: IEDB map: IEDB: iri_prefix: "https://www.iedb.org/receptor/" MRO: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/MRO_" NCBITAXON: type: taxonomy default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/NCBITaxon_" BioPortal: iri_prefix: "http://purl.bioontology.org/ontology/NCBITAXON/" NCIT: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/NCIT_" ORCID: type: catalog default: map: ORCID provider: ORCID map: ORCID: iri_prefix: "https://orcid.org/" ROR: type: catalog default: map: ROR provider: ROR map: ROR: iri_prefix: "https://ror.org/" SRA: type: identifier default: map: SRA map: SRA: iri_prefix: "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=" UBERON: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/UBERON_" UNIPROT: type: identifier default: map: UNIPROT map: UniProt: iri_prefix: "http://purl.uniprot.org/uniprot/" UO: type: ontology default: map: OBO provider: OLS map: OBO: iri_prefix: "http://purl.obolibrary.org/obo/UO_" InformationProvider: provider: ENA: request: url: "{iri}" response: text/html IEDB: request: url: "https://query-api.iedb.org/tcr_search?receptor_group_id=eq.{local_id}" response: application/json OLS: request: url: "https://www.ebi.ac.uk/ols/api/ontologies/{ontology_id}/terms?iri={iri}" response: application/json Ontobee: request: url: 
"http://www.ontobee.org/ontology/rdf/{ontology_id}?iri={iri}" response: application/rdf+xml ORCID: request: url: "https://pub.orcid.org/v2.1/{local_id}" header: Accept: application/json response: application/json ROR: request: url: "https://api.ror.org/organizations/{iri}" response: application/json SRA: request: url: "{iri}" response: text/html parameter: CHEBI: Ontobee: ontology_id: CHEBI OLS: ontology_id: chebi CL: Ontobee: ontology_id: CL OLS: ontology_id: cl DOID: Ontobee: ontology_id: DOID OLS: ontology_id: doid MRO: Ontobee: ontology_id: MRO OLS: ontology_id: mro NCBITAXON: Ontobee: ontology_id: NCBITaxon OLS: ontology_id: ncbitaxon BioPortal: ontology_id: NCBITAXON NCIT: Ontobee: ontology_id: NCIT OLS: ontology_id: ncit UBERON: Ontobee: ontology_id: UBERON OLS: ontology_id: uberon UO: Ontobee: ontology_id: UO OLS: ontology_id: uo # AIRR specification extensions # # The schema definitions for AIRR standards objects is extended to # provide a number of AIRR specific attributes. This schema definition # specifies the structure, property names and data types. These # attributes are attached to an AIRR field with the x-airr property. Attributes: type: object properties: miairr: type: string description: MiAIRR requirement level. enum: - essential - important - defined default: defined identifier: type: boolean description: > True if the field is an identifier required to link metadata and/or individual sequence records across objects in the complete AIRR Data Model and ADC API. default: false adc-query-support: type: boolean description: > True if an ADC API implementation must support queries on the field. If false, query support for the field in ADC API implementations is optional. default: false nullable: type: boolean description: True if the field may have a null value. default: true deprecated: type: boolean description: True if the field has been deprecated from the schema. 
default: false deprecated-description: type: string description: Information regarding the deprecation of the field. deprecated-replaced-by: type: array items: type: string description: The deprecated field is replaced by this list of fields. set: type: integer description: MiAIRR set subset: type: string description: MiAIRR subset name: type: string description: MiAIRR name format: type: string description: Field format. If null then assume the full range of the field data type enum: - ontology - controlled_vocabulary - physical_quantity - CURIE ontology: type: object description: Ontology definition for field properties: draft: type: boolean description: Indicates if ontology definition is a draft top_node: type: object description: > Concept to use as top node for ontology. Note that this must have the same CURIE namespace as the actually annotated concept. properties: id: type: string description: CURIE for the top node term label: type: string description: Ontology name for the top node term # AIRR Data File # # A JSON data file that holds Repertoire metadata, data processing # analysis objects, or any object in the AIRR Data Model. # # It is presumed that the objects gathered together in an AIRR Data File are related # or relevant to each other, e.g. part of the same study; thus, the ID fields can be # internally resolved unless the ID contains an external PID. This implies that AIRR # Data Files cannot be merged simply by concatenating arrays; any merge program # would need to manage duplicate or conflicting ID values. # # While the properties in an AIRR Data File are not required, if one is provided then # the value should not be null. 
DataFile: type: object properties: Info: $ref: '#/InfoObject' x-airr: nullable: false Repertoire: type: array description: List of repertoires items: $ref: '#/Repertoire' x-airr: nullable: false RepertoireGroup: type: array description: List of repertoire collections items: $ref: '#/RepertoireGroup' x-airr: nullable: false Rearrangement: type: array description: List of rearrangement records items: $ref: '#/Rearrangement' x-airr: nullable: false Cell: type: array description: List of cells items: $ref: '#/Cell' x-airr: nullable: false Clone: type: array description: List of clones items: $ref: '#/Clone' x-airr: nullable: false GermlineSet: type: array description: List of germline sets items: $ref: '#/GermlineSet' x-airr: nullable: false GenotypeSet: type: array description: List of genotype sets items: $ref: '#/GenotypeSet' x-airr: nullable: false # AIRR Info object, should be similar to openapi # should we point to an openapi schema? InfoObject: type: object description: Provides information about data and API responses. required: - title - version properties: title: type: string x-airr: nullable: false version: type: string x-airr: nullable: false description: type: string contact: type: object properties: name: type: string url: type: string email: type: string license: type: object required: - name properties: name: type: string x-airr: nullable: false url: type: string # A time point TimePoint: description: Time point at which an observation or other action was performed. 
type: object properties: label: type: string description: Informative label for the time point example: Pre-operative sampling of cancer tissue x-airr: nullable: true adc-query-support: true value: type: number description: Value of the time point example: -5.0 x-airr: nullable: true adc-query-support: true unit: $ref: '#/Ontology' description: Unit of the time point title: Unit of immunization schedule example: id: UO:0000033 label: day x-airr: nullable: true adc-query-support: true format: ontology ontology: draft: false top_node: id: UO:0000003 label: time unit # # General objects # TODO: link to global schema with JSON-LD? # # An individual Acknowledgement: description: Individual whose contribution to this work should be acknowledged type: object required: - acknowledgement_id - name - institution_name properties: acknowledgement_id: type: string description: unique identifier of this Acknowledgement within the file x-airr: identifier: true miairr: important name: type: string description: Full name of individual institution_name: type: string description: Individual's department and institution name orcid_id: type: string description: Individual's ORCID identifier # # Germline gene schema # # Rearranged and genomic germline sequences RearrangedSequence: type: object description: > Details of a directly observed rearranged sequence or an inference from rearranged sequences contributing support for a gene or allele. required: - sequence_id - sequence - derivation - observation_type - repository_name - repository_ref - deposited_version - sequence_start - sequence_end properties: sequence_id: type: string description: > Unique identifier of this RearrangedSequence within the file, typically generated by the repository hosting the schema, for example from the underlying ID of the database record.
x-airr: identifier: true miairr: important sequence: type: string description: nucleotide sequence x-airr: miairr: essential nullable: false derivation: type: string enum: - DNA - RNA - null description: The class of nucleic acid that was used as primary starting material x-airr: miairr: important nullable: true observation_type: type: string enum: - direct_sequencing - inference_from_repertoire description: > The type of observation from which this sequence was drawn, such as direct sequencing or inference from repertoire sequencing data. x-airr: miairr: essential nullable: false curation: type: string description: Curational notes on the sequence repository_name: type: string description: Name of the repository in which the sequence has been deposited x-airr: miairr: defined repository_ref: type: string description: Queryable id or accession number of the sequence published by the repository x-airr: miairr: defined deposited_version: type: string description: Version number of the sequence within the repository x-airr: miairr: defined sequence_start: type: integer description: Start co-ordinate of the sequence detailed in this record, within the sequence deposited x-airr: miairr: essential nullable: false sequence_end: type: integer description: End co-ordinate of the sequence detailed in this record, within the sequence deposited x-airr: miairr: essential nullable: false UnrearrangedSequence: description: Details of an unrearranged sequence contributing support for a gene or allele type: object required: - sequence_id - sequence - repository_name - assembly_id - gff_seqid - gff_start - gff_end - strand properties: sequence_id: type: string description: unique identifier of this UnrearrangedSequence within the file x-airr: identifier: true miairr: important sequence: type: string description: > Sequence of interest described in this record. Typically, this will include gene and promoter region. 
x-airr: miairr: essential nullable: false curation: type: string description: Curational notes on the sequence repository_name: type: string description: Name of the repository in which the assembly or contig is deposited x-airr: miairr: defined repository_ref: type: string description: Queryable id or accession number of the sequence published by the repository x-airr: miairr: defined patch_no: type: string description: Genome assembly patch number in which this gene was determined gff_seqid: type: string description: > Sequence (from the assembly) of a window including the gene and preferably also the promoter region. gff_start: type: integer description: > Genomic co-ordinates of the start of the sequence of interest described in this record in Ensembl GFF version 3. gff_end: type: integer description: > Genomic co-ordinates of the end of the sequence of interest described in this record in Ensembl GFF version 3. strand: type: string enum: - "+" - "-" - null description: sense (+ or -) x-airr: nullable: true # V gene delineation SequenceDelineationV: description: Delineation of a V-gene in a particular system type: object required: - sequence_delineation_id - delineation_scheme - fwr1_start - fwr1_end - cdr1_start - cdr1_end - fwr2_start - fwr2_end - cdr2_start - cdr2_end - fwr3_start - fwr3_end - cdr3_start properties: sequence_delineation_id: type: string description: > Unique identifier of this SequenceDelineationV within the file. Typically, generated by the repository hosting the record. x-airr: identifier: true miairr: important delineation_scheme: type: string description: Name of the delineation scheme example: Chothia x-airr: miairr: important unaligned_sequence: type: string x-airr: miairr: important description: entire V-sequence covered by this delineation aligned_sequence: type: string description: > Aligned sequence if this delineation provides an alignment. An aligned sequence should always be provided for IMGT delineations.
fwr1_start: type: integer description: FWR1 start co-ordinate in the 'unaligned sequence' field x-airr: miairr: important fwr1_end: type: integer description: FWR1 end co-ordinate in the 'unaligned sequence' field x-airr: miairr: important cdr1_start: type: integer description: CDR1 start co-ordinate in the 'unaligned sequence' field x-airr: miairr: important cdr1_end: type: integer description: CDR1 end co-ordinate in the 'unaligned sequence' field x-airr: miairr: important fwr2_start: type: integer description: FWR2 start co-ordinate in the 'unaligned sequence' field x-airr: miairr: important fwr2_end: type: integer description: FWR2 end co-ordinate in the 'unaligned sequence' field x-airr: miairr: important cdr2_start: type: integer description: CDR2 start co-ordinate in the 'unaligned sequence' field x-airr: miairr: important cdr2_end: type: integer description: CDR2 end co-ordinate in the 'unaligned sequence' field x-airr: miairr: important fwr3_start: type: integer description: FWR3 start co-ordinate in the 'unaligned sequence' field x-airr: miairr: important fwr3_end: type: integer description: FWR3 end co-ordinate in the 'unaligned sequence' field x-airr: miairr: important cdr3_start: type: integer description: CDR3 start co-ordinate in the 'unaligned sequence' field x-airr: miairr: important alignment_labels: type: array items: type: string description: > One string for each codon in the aligned_sequence indicating the label of that codon according to the numbering of the delineation scheme if it provides one. 
# Description of a putative or confirmed Ig receptor gene/allele AlleleDescription: description: Details of a putative or confirmed Ig receptor gene/allele inferred from one or more observations type: object required: - allele_description_id - maintainer - lab_address - release_version - release_date - release_description - sequence - coding_sequence - locus - sequence_type - functional - inference_type - species properties: allele_description_id: type: string description: > Unique identifier of this AlleleDescription within the file. Typically, generated by the repository hosting the record. x-airr: identifier: true miairr: important allele_description_ref: type: string description: Unique reference to the allele description, in standardized form (Repo:Label:Version) example: OGRDB:Human_IGH:IGHV1-69*01.001 x-airr: miairr: important maintainer: type: string description: Maintainer of this sequence record x-airr: miairr: defined acknowledgements: type: array description: List of individuals whose contribution to the gene description should be acknowledged items: $ref: '#/Acknowledgement' lab_address: type: string description: Institution and full address of corresponding author x-airr: miairr: defined release_version: type: integer description: Version number of this record, updated whenever a revised version is published or released x-airr: miairr: important release_date: type: string format: date-time description: Date of this release title: Release Date example: "2021-02-02" x-airr: miairr: important release_description: type: string description: Brief descriptive notes of the reason for this release and the changes embodied x-airr: miairr: important label: type: string description: > The accepted name for this gene or allele following the relevant nomenclature. The value in this field should correspond to values in acceptable name fields of other schemas, such as v_call, d_call, and j_call fields. 
example: IGHV1-69*01 x-airr: miairr: important sequence: type: string description: > Nucleotide sequence of the gene. This should cover the full length that is available, including where possible RSS, and 5' UTR and lead-in for V-gene sequences. x-airr: miairr: essential nullable: false coding_sequence: type: string description: > Nucleotide sequence of the core coding region, such as the coding region of a D-, J- or C- gene or the coding region of a V-gene excluding the leader. x-airr: miairr: important aliases: type: array items: type: string description: Alternative names for this sequence locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRG - TRD description: Gene locus x-airr: miairr: essential nullable: false chromosome: type: integer description: chromosome on which the gene is located sequence_type: type: string enum: - V - D - J - C description: Sequence type (V, D, J, C) x-airr: miairr: essential nullable: false functional: type: boolean description: True if the gene is functional, false if it is a pseudogene x-airr: miairr: important inference_type: type: string enum: - genomic_and_rearranged - genomic_only - rearranged_only - null description: Type of inference(s) from which this gene sequence was inferred x-airr: miairr: important nullable: true species: $ref: '#/Ontology' description: Binomial designation of subject's species title: Organism example: id: NCBITAXON:9606 label: Homo sapiens x-airr: miairr: essential nullable: false species_subgroup: type: string description: Race, strain or other species subgroup to which this subject belongs example: BALB/c species_subgroup_type: type: string enum: - breed - strain - inbred - outbred - locational - null x-airr: nullable: true status: type: string enum: - active - draft - retired - withdrawn - null description: Status of record, assumed active if the field is not present x-airr: nullable: true subgroup_designation: type: string description: Identifier of the gene subgroup or clade, as 
(and if) defined gene_designation: type: string description: Gene number or other identifier, as (and if) defined allele_designation: type: string description: Allele number or other identifier, as (and if) defined allele_similarity_cluster_designation: type: string description: ID of the similarity cluster used in this germline set, if designated allele_similarity_cluster_member_id: type: string description: Membership ID of the allele within the similarity cluster, if a cluster is designated j_codon_frame: type: integer enum: - 1 - 2 - 3 - null description: > Codon position of the first nucleotide in the 'coding_sequence' field. Mandatory for J genes. Not used for V or D genes. '1' means the sequence is in-frame, '2' means that the first bp is missing from the first codon, and '3' means that the first 2 bp are missing. x-airr: nullable: true gene_start: type: integer description: > Co-ordinate in the sequence field of the first nucleotide in the coding_sequence field. x-airr: miairr: important gene_end: type: integer description: > Co-ordinate in the sequence field of the last gene-coding nucleotide in the coding_sequence field. x-airr: miairr: important utr_5_prime_start: type: integer description: Start co-ordinate in the sequence field of the 5 prime UTR (V-genes only). utr_5_prime_end: type: integer description: End co-ordinate in the sequence field of the 5 prime UTR (V-genes only). leader_1_start: type: integer description: Start co-ordinate in the sequence field of L-PART1 (V-genes only). leader_1_end: type: integer description: End co-ordinate in the sequence field of L-PART1 (V-genes only). leader_2_start: type: integer description: Start co-ordinate in the sequence field of L-PART2 (V-genes only). leader_2_end: type: integer description: End co-ordinate in the sequence field of L-PART2 (V-genes only). v_rs_start: type: integer description: Start co-ordinate in the sequence field of the V recombination site (V-genes only). 
v_rs_end: type: integer description: End co-ordinate in the sequence field of the V recombination site (V-genes only). d_rs_3_prime_start: type: integer description: Start co-ordinate in the sequence field of the 3 prime D recombination site (D-genes only). d_rs_3_prime_end: type: integer description: End co-ordinate in the sequence field of the 3 prime D recombination site (D-genes only). d_rs_5_prime_start: type: integer description: Start co-ordinate in the sequence field of the 5 prime D recombination site (D-genes only). d_rs_5_prime_end: type: integer description: End co-ordinate in the sequence field of the 5 prime D recombination site (D-genes only). j_cdr3_end: type: integer description: > In the case of a J-gene, the co-ordinate in the sequence field of the first nucleotide of the conserved PHE or TRP (IMGT codon position 118). j_rs_start: type: integer description: Start co-ordinate in the sequence field of J recombination site (J-genes only). j_rs_end: type: integer description: End co-ordinate in the sequence field of J recombination site (J-genes only). j_donor_splice: type: integer description: Co-ordinate in the sequence field of the final 3' nucleotide of the J-REGION (J-genes only). v_gene_delineations: type: array items: $ref: '#/SequenceDelineationV' unrearranged_support: type: array items: $ref: '#/UnrearrangedSequence' rearranged_support: type: array items: $ref: '#/RearrangedSequence' paralogs: type: array items: type: string description: Gene symbols of any paralogs curation: type: string description: > Curational notes on the AlleleDescription. This can be used to give more extensive notes on the decisions taken than are provided in the release_description.
curational_tags: type: array items: type: string enum: - likely_truncated - likely_full_length description: Controlled-vocabulary tags applied to this description x-airr: nullable: true # Collection of gene descriptions into a germline set GermlineSet: type: object description: > A germline object set bringing together multiple AlleleDescriptions from the same strain or species. All genes in a GermlineSet should be from a single locus. required: - germline_set_id - author - lab_name - lab_address - release_version - release_description - release_date - germline_set_name - germline_set_ref - species - locus - allele_descriptions properties: germline_set_id: type: string description: > Unique identifier of the GermlineSet within this file. Typically, generated by the repository hosting the record. x-airr: identifier: true miairr: important author: type: string description: Corresponding author x-airr: miairr: important lab_name: type: string description: Department of corresponding author x-airr: miairr: important lab_address: type: string description: Institutional address of corresponding author x-airr: miairr: important acknowledgements: type: array description: List of individuals whose contribution to the germline set should be acknowledged items: $ref: '#/Acknowledgement' release_version: type: number description: Version number of this record, allocated automatically x-airr: miairr: important release_description: type: string description: Brief descriptive notes of the reason for this release and the changes embodied x-airr: miairr: important release_date: type: string format: date-time description: Date of this release title: Release Date example: "2021-02-02" x-airr: miairr: important germline_set_name: type: string description: descriptive name of this germline set x-airr: miairr: important germline_set_ref: type: string description: Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) example: OGRDB:Human_IGH:2021.11 
x-airr: miairr: important pub_ids: type: string description: Publications describing the germline set example: "PMID:85642,PMID:12345" species: $ref: '#/Ontology' description: Binomial designation of subject's species title: Organism example: id: NCBITAXON:9606 label: Homo sapiens x-airr: miairr: essential nullable: false species_subgroup: type: string description: Race, strain or other species subgroup to which this subject belongs example: BALB/c species_subgroup_type: type: string enum: - breed - strain - inbred - outbred - locational - null x-airr: nullable: true locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRG - TRD description: Gene locus x-airr: miairr: essential nullable: false allele_descriptions: type: array items: $ref: '#/AlleleDescription' description: list of allele_descriptions in the germline set x-airr: miairr: important curation: type: string description: > Curational notes on the GermlineSet. This can be used to give more extensive notes on the decisions taken than are provided in the release_description. # # Genotype schema # # GenotypeSet lists the Genotypes (describing different loci) inferred for this subject GenotypeSet: type: object required: - receptor_genotype_set_id properties: receptor_genotype_set_id: type: string description: > A unique identifier for this Receptor Genotype Set, typically generated by the repository hosting the schema, for example from the underlying ID of the database record. x-airr: identifier: true miairr: important genotype_class_list: description: List of Genotypes included in this Receptor Genotype Set. type: array items: $ref: '#/Genotype' # This enumerates the alleles and gene deletions inferred in a single subject. 
Included alleles may either be listed by reference to a GermlineSet, or # listed as 'undocumented', in which case the inferred sequence is provided # Genotype of adaptive immune receptors Genotype: type: object required: - receptor_genotype_id - locus properties: receptor_genotype_id: type: string description: > A unique identifier within the file for this Receptor Genotype, typically generated by the repository hosting the schema, for example from the underlying ID of the database record. x-airr: identifier: true miairr: important locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRD - TRG description: Gene locus example: IGH x-airr: miairr: essential nullable: false adc-query-support: true format: controlled_vocabulary documented_alleles: type: array description: List of alleles documented in reference set(s) items: $ref: '#/DocumentedAllele' x-airr: miairr: important undocumented_alleles: type: array description: List of alleles inferred to be present and not documented in an identified GermlineSet items: $ref: '#/UndocumentedAllele' x-airr: adc-query-support: true deleted_genes: type: array description: Array of genes identified as being deleted in this genotype items: $ref: '#/DeletedGene' x-airr: adc-query-support: true inference_process: type: string enum: - genomic_sequencing - repertoire_sequencing - null description: Information on how the genotype was acquired. Controlled vocabulary. 
title: Genotype acquisition process example: repertoire_sequencing x-airr: adc-query-support: true format: controlled_vocabulary nullable: true # Documented Allele # This describes a 'known' allele found in a genotype # It is 'known' in the sense that it is documented in a reference set DocumentedAllele: type: object required: - label - germline_set_ref properties: label: type: string x-airr: miairr: important description: The accepted name for this allele, taken from the GermlineSet germline_set_ref: type: string x-airr: miairr: important description: GermlineSet from which it was taken, referenced in standardized form (Repo:Label:Version) example: OGRDB:Human_IGH:2021.11 phasing: type: integer nullable: true description: > Chromosomal phasing indicator. Alleles with the same value are inferred to be located on the same chromosome. # Undocumented Allele # This describes an 'undocumented' allele found in a genotype # It is 'undocumented' in the sense that it was not found in reference sets consulted for the analysis UndocumentedAllele: required: - allele_name - sequence type: object properties: allele_name: type: string x-airr: miairr: important description: Allele name as allocated by the inference pipeline sequence: type: string x-airr: miairr: essential nullable: false description: nt sequence of the allele, as provided by the inference pipeline phasing: type: integer nullable: true description: > Chromosomal phasing indicator. Alleles with the same value are inferred to be located on the same chromosome.
# Deleted Gene # It is regarded as 'deleted' in the sense that it was not identified during inference of the genotype DeletedGene: required: - label - germline_set_ref type: object properties: label: type: string x-airr: miairr: essential nullable: false description: The accepted name for this gene, taken from the GermlineSet germline_set_ref: type: string x-airr: miairr: important description: GermlineSet from which it was taken (issuer/name/version) phasing: type: integer nullable: true description: > Chromosomal phasing indicator. Alleles with the same value are inferred to be located on the same chromosome. # List of MHCGenotypes describing a subject's genotype MHCGenotypeSet: type: object required: - mhc_genotype_set_id - mhc_genotype_list properties: mhc_genotype_set_id: type: string description: A unique identifier for this MHCGenotypeSet x-airr: identifier: true miairr: important mhc_genotype_list: description: List of MHCGenotypes included in this set type: array items: $ref: '#/MHCGenotype' x-airr: miairr: important # Genotype of major histocompatibility complex (MHC) class I, class II and non-classical loci MHCGenotype: type: object required: - mhc_genotype_id - mhc_class - mhc_alleles properties: mhc_genotype_id: type: string description: A unique identifier for this MHCGenotype, assumed to be unique in the context of the study x-airr: identifier: true miairr: important mhc_class: type: string enum: - MHC-I - MHC-II - MHC-nonclassical description: Class of MHC alleles described by the MHCGenotype example: MHC-I x-airr: miairr: essential nullable: false adc-query-support: true format: controlled_vocabulary mhc_alleles: type: array description: List of MHC alleles of the indicated mhc_class identified in an individual items: $ref: '#/MHCAllele' x-airr: miairr: important adc-query-support: true mhc_genotyping_method: type: string description: > Information on how the genotype was determined. 
The content of this field should come from a list of recommended terms provided in the AIRR Schema documentation. title: MHC genotyping method example: pcr_low_resolution x-airr: miairr: important adc-query-support: true # Allele of an MHC gene MHCAllele: type: object properties: allele_designation: type: string description: > The accepted designation of an allele, usually its gene symbol plus allele/sub-allele/etc identifiers, if provided by the mhc_typing method x-airr: miairr: important gene: $ref: '#/Ontology' description: The MHC gene to which the described allele belongs title: MHC gene example: id: MRO:0000046 label: HLA-A x-airr: miairr: important adc-query-support: false format: ontology ontology: draft: true top_node: id: MRO:0000004 label: MHC gene reference_set_ref: type: string description: Repository and list from which it was taken (issuer/name/version) x-airr: miairr: important # # Repertoire metadata schema # # The overall study with a globally unique study_id Study: type: object required: - study_id - study_title - study_type - inclusion_exclusion_criteria - grants - collected_by - lab_name - lab_address - submitted_by - pub_ids - keywords_study properties: study_id: type: string description: > Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories. 
title: Study ID example: PRJNA001 x-airr: identifier: true miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study ID study_title: type: string description: Descriptive study title title: Study title example: Effects of sun light exposure of the Treg repertoire x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study title study_type: $ref: '#/Ontology' description: Type of study design title: Study type example: id: NCIT:C15197 label: Case-Control Study x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study type format: ontology ontology: draft: false top_node: id: NCIT:C63536 label: Study study_description: type: string description: Generic study description title: Study description example: Longer description x-airr: nullable: true name: Study description adc-query-support: true inclusion_exclusion_criteria: type: string description: List of criteria for inclusion/exclusion for the study title: Study inclusion/exclusion criteria example: "Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV" x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Study inclusion/exclusion criteria grants: type: string description: Funding agencies and grant numbers title: Grant funding agency example: NIH, award number R01GM987654 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Grant funding agency study_contact: type: string description: > Full contact information of the contact persons for this study. This should include an e-mail address and a persistent identifier such as an ORCID ID. title: Contact information (study) example: Dr. P.
Stibbons, p.stibbons@unseenu.edu, https://orcid.org/0000-0002-1825-0097 x-airr: nullable: true adc-query-support: true name: Contact information (study) collected_by: type: string description: > Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address and a persistent identifier such as an ORCID ID. title: Contact information (data collection) example: Dr. P. Stibbons, p.stibbons@unseenu.edu, https://orcid.org/0000-0002-1825-0097 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Contact information (data collection) lab_name: type: string description: Department of data collector title: Lab name example: Department for Planar Immunology x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Lab name lab_address: type: string description: Institution and institutional address of data collector title: Lab address example: School of Medicine, Unseen University, Ankh-Morpork, Disk World x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Lab address submitted_by: type: string description: > Full contact information of the data depositor, i.e., the person submitting the data to a repository. This should include an e-mail address and a persistent identifier such as an ORCID ID. This is supposed to be a short-lived and technical role until the submission is released. title: Contact information (data deposition) example: Adrian Turnipseed, a.turnipseed@unseenu.edu, https://orcid.org/0000-0002-1825-0097 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Contact information (data deposition) pub_ids: type: string description: > Publications describing the rationale and/or outcome of the study.
Wherever possible, a persistent identifier should be used, such as a DOI or a PubMed ID. title: Relevant publications example: "PMID:85642" x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Relevant publications keywords_study: type: array items: type: string enum: - contains_ig - contains_tr - contains_paired_chain - contains_schema_rearrangement - contains_schema_clone - contains_schema_cell - contains_schema_receptor description: > Keywords describing properties of one or more data sets in a study. "contains_schema" keywords indicate that the study contains data objects from the AIRR Schema of that type (Rearrangement, Clone, Cell, Receptor) while the other keywords indicate that the study design considers the type of data indicated (e.g. it is possible to have a study that "contains_paired_chain" but does not "contains_schema_cell"). title: Keywords for study example: - contains_ig - contains_schema_rearrangement - contains_schema_clone - contains_schema_cell x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: study name: Keywords for study format: controlled_vocabulary adc_publish_date: type: string format: date-time description: > Date the study was first published in the AIRR Data Commons. title: ADC Publish Date example: "2021-02-02" x-airr: nullable: true adc-query-support: true name: ADC Publish Date adc_update_date: type: string format: date-time description: > Date the study data was updated in the AIRR Data Commons. title: ADC Update Date example: "2021-02-02" x-airr: nullable: true adc-query-support: true name: ADC Update Date SubjectGenotype: type: object properties: receptor_genotype_set: $ref: '#/GenotypeSet' description: Immune receptor genotype set for this subject. mhc_genotype_set: $ref: '#/MHCGenotypeSet' description: MHC genotype set for this subject.
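# Illustrative sketch only (not part of the schema): a partial Study record
# filled in with the `example` values listed in the Study properties above.
# The flat key/value layout is an assumption made for illustration; only the
# field names and the example values are taken from this file, and the
# selection of fields shown here is arbitrary.
#
#   study_title: Effects of sun light exposure of the Treg repertoire
#   study_type:
#     id: NCIT:C15197
#     label: Case-Control Study
#   grants: NIH, award number R01GM987654
#   lab_name: Department for Planar Immunology
#   pub_ids: "PMID:85642"
#   keywords_study:
#     - contains_ig
#     - contains_schema_rearrangement
#
# Fields marked `nullable: true` in their x-airr block may presumably be
# given as null when the information is unavailable.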
# 1-to-n relationship between a study and its subjects # subject_id is unique within a study Subject: type: object required: - subject_id - synthetic - species - sex - age_min - age_max - age_unit - age_event - ancestry_population - ethnicity - race - strain_name - linked_subjects - link_type properties: subject_id: type: string description: > Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used. title: Subject ID example: SUB856413 x-airr: identifier: true miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Subject ID synthetic: type: boolean description: TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) title: Synthetic library x-airr: miairr: essential nullable: false adc-query-support: true set: 1 subset: subject name: Synthetic library species: $ref: '#/Ontology' description: Binomial designation of subject's species title: Organism example: id: NCBITAXON:9606 label: Homo sapiens x-airr: miairr: essential nullable: false adc-query-support: true set: 1 subset: subject name: Organism format: ontology ontology: draft: false top_node: id: NCBITAXON:7776 label: Gnathostomata organism: $ref: '#/Ontology' description: Binomial designation of subject's species x-airr: deprecated: true deprecated-description: Field was renamed to species for clarity. deprecated-replaced-by: - species sex: type: string enum: - male - female - pooled - hermaphrodite - intersex - null description: Biological sex of subject title: Sex example: female x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Sex format: controlled_vocabulary age_min: type: number description: Specific age or lower boundary of age range. 
title: Age minimum example: 60 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age minimum age_max: type: number description: > Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. title: Age maximum example: 80 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age maximum age_unit: $ref: '#/Ontology' description: Unit of age range title: Age unit example: id: UO:0000036 label: year x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age unit format: ontology ontology: draft: false top_node: id: UO:0000003 label: time unit age_event: type: string description: > Event in the study schedule to which `Age` refers. For NCBI BioSample this MUST be `sampling`. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between `Age event` and `Sample collection time`, hence the chosen events should be in temporal proximity. title: Age event example: enrollment x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Age event age: type: string x-airr: deprecated: true deprecated-description: Split into two fields to specify as an age range.
deprecated-replaced-by: - age_min - age_max - age_unit ancestry_population: type: string description: Broad geographic origin of ancestry (continent) title: Ancestry population example: list of continents, mixed or unknown x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Ancestry population ethnicity: type: string description: Ethnic group of subject (defined as cultural/language-based membership) title: Ethnicity example: English, Kurds, Manchu, Yakuts (and other fields from Wikipedia) x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Ethnicity race: type: string description: Racial group of subject (as defined by NIH) title: Race example: White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Race strain_name: type: string description: Non-human designation of the strain or breed of animal used title: Strain name example: C57BL/6J x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Strain name linked_subjects: type: string description: Subject ID to which `Relation type` refers title: Relation to other subjects example: SUB1355648 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Relation to other subjects link_type: type: string description: Relation between subject and `linked_subjects`, can be genetic or environmental (e.g. exposure) title: Relation type example: father, daughter, household x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: subject name: Relation type diagnosis: type: array description: Diagnosis information for subject items: $ref: '#/Diagnosis' x-airr: nullable: false adc-query-support: true genotype: $ref: '#/SubjectGenotype' title: SubjectGenotype # 1-to-n relationship between a subject and its
diagnoses Diagnosis: type: object required: - study_group_description - disease_diagnosis - disease_length - disease_stage - prior_therapies - immunogen - intervention - medical_history properties: study_group_description: type: string description: Designation of study arm to which the subject is assigned title: Study group description example: control x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Study group description disease_diagnosis: $ref: '#/Ontology' description: Diagnosis of subject title: Diagnosis example: id: DOID:9538 label: multiple myeloma x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Diagnosis format: ontology ontology: draft: false top_node: id: DOID:4 label: disease disease_length: type: string description: Time duration between initial diagnosis and current intervention title: Length of disease example: 23 months x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Length of disease format: physical_quantity disease_stage: type: string description: Stage of disease at current intervention title: Disease stage example: Stage II x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Disease stage prior_therapies: type: string description: List of all relevant previous therapies applied to subject for treatment of `Diagnosis` title: Prior therapies for primary disease under study example: melphalan/prednisone x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Prior therapies for primary disease under study immunogen: type: string description: Antigen, vaccine or drug applied to subject at this intervention title: Immunogen/agent example: bortezomib x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis
and intervention name: Immunogen/agent intervention: type: string description: Description of intervention title: Intervention definition example: systemic chemotherapy, 6 cycles, 1.25 mg/m2 x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Intervention definition medical_history: type: string description: Medical history of subject that is relevant to assess the course of disease and/or treatment title: Other relevant medical history example: MGUS, first diagnosed 5 years prior x-airr: miairr: important nullable: true adc-query-support: true set: 1 subset: diagnosis and intervention name: Other relevant medical history # 1-to-n relationship between a subject and its samples # sample_id is unique within a study Sample: type: object required: - sample_id - sample_type - tissue - anatomic_site - disease_state_sample - collection_time_point_relative - collection_time_point_relative_unit - collection_time_point_reference - biomaterial_provider properties: sample_id: type: string description: > Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used. title: Biological sample ID example: SUP52415 x-airr: identifier: true miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Biological sample ID sample_type: type: string description: The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture title: Sample type example: Biopsy x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Sample type tissue: $ref: '#/Ontology' description: The actual tissue sampled, e.g. 
lymph node, liver, peripheral blood title: Tissue example: id: UBERON:0002371 label: bone marrow x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Tissue format: ontology ontology: draft: false top_node: id: UBERON:0010000 label: multicellular anatomical structure anatomic_site: type: string description: The anatomic location of the tissue, e.g. Inguinal, femur title: Anatomic site example: Iliac crest x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Anatomic site disease_state_sample: type: string description: Histopathologic evaluation of the sample title: Disease state of sample example: Tumor infiltration x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Disease state of sample collection_time_point_relative: type: number description: Time point at which sample was taken, relative to `Collection time event` title: Sample collection time example: 14 x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Sample collection time collection_time_point_relative_unit: $ref: '#/Ontology' description: Unit of Sample collection time title: Sample collection time unit example: id: UO:0000033 label: day x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Sample collection time unit format: ontology ontology: draft: false top_node: id: UO:0000003 label: time unit collection_time_point_reference: type: string description: Event in the study schedule to which `Sample collection time` relates title: Collection time event example: Primary vaccination x-airr: miairr: important nullable: true adc-query-support: true set: 2 subset: sample name: Collection time event biomaterial_provider: type: string description: Name and address of the entity providing the sample title: Biomaterial provider example: Tissues-R-Us, Tampa, FL, USA x-airr: miairr: important nullable: true
adc-query-support: true set: 2 subset: sample name: Biomaterial provider # 1-to-n relationship between a sample and processing of its cells CellProcessing: type: object required: - tissue_processing - cell_subset - cell_phenotype - single_cell - cell_number - cells_per_reaction - cell_storage - cell_quality - cell_isolation - cell_processing_protocol properties: tissue_processing: type: string description: Enzymatic digestion and/or physical methods used to isolate cells from sample title: Tissue processing example: Collagenase A/Dnase I digested, followed by Percoll gradient x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Tissue processing cell_subset: $ref: '#/Ontology' description: Commonly-used designation of isolated cell population title: Cell subset example: id: CL:0000972 label: class switched memory B cell x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell subset format: ontology ontology: draft: false top_node: id: CL:0000542 label: lymphocyte cell_phenotype: type: string description: List of cellular markers and their expression levels used to isolate the cell population title: Cell subset phenotype example: CD19+ CD38+ CD27+ IgM- IgD- x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell subset phenotype cell_species: $ref: '#/Ontology' description: > Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to `species`, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the `species` information for all lower layers of the schema. 
title: Cell species example: id: NCBITAXON:9606 label: Homo sapiens x-airr: miairr: defined nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell species format: ontology ontology: draft: false top_node: id: NCBITAXON:7776 label: Gnathostomata single_cell: type: boolean description: TRUE if single cells were isolated into separate compartments title: Single-cell sort x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Single-cell sort cell_number: type: integer description: Total number of cells that went into the experiment title: Number of cells in experiment example: 1000000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Number of cells in experiment cells_per_reaction: type: integer description: Number of cells for each biological replicate title: Number of cells per sequencing reaction example: 50000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Number of cells per sequencing reaction cell_storage: type: boolean description: TRUE if cells were cryo-preserved between isolation and further processing title: Cell storage example: TRUE x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell storage cell_quality: type: string description: Relative amount of viable cells after preparation and (if applicable) thawing title: Cell quality example: 90% viability as determined by 7-AAD x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell quality cell_isolation: type: string description: Description of the procedure used for marker-based isolation or enrichment of cells title: Cell isolation / enrichment procedure example: > Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer.
x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Cell isolation / enrichment procedure cell_processing_protocol: type: string description: > Description of the methods applied to the sample including cell preparation/ isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript. title: Processing protocol example: Stimulated with anti-CD3/anti-CD28 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (cell) name: Processing protocol # object for PCR primer targets PCRTarget: type: object required: - pcr_target_locus - forward_pcr_primer_target_location - reverse_pcr_primer_target_location properties: pcr_target_locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRD - TRG - null description: > Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature.
title: Target locus for PCR example: IGK x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid [pcr]) name: Target locus for PCR format: controlled_vocabulary forward_pcr_primer_target_location: type: string description: Position of the most distal nucleotide templated by the forward primer or primer mix title: Forward PCR primer target location example: IGHV, +23 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid [pcr]) name: Forward PCR primer target location reverse_pcr_primer_target_location: type: string description: Position of the most proximal nucleotide templated by the reverse primer or primer mix title: Reverse PCR primer target location example: IGHG, +57 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid [pcr]) name: Reverse PCR primer target location # generally, a 1-to-1 relationship between a CellProcessing and processing of its nucleic acid # but may be 1-to-n for technical replicates. 
NucleicAcidProcessing: type: object required: - template_class - template_quality - template_amount - template_amount_unit - library_generation_method - library_generation_protocol - library_generation_kit_version - complete_sequences - physical_linkage properties: template_class: type: string enum: - DNA - RNA description: > The class of nucleic acid that was used as primary starting material for the following procedures title: Target substrate example: RNA x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Target substrate format: controlled_vocabulary template_quality: type: string description: Description and results of the quality control performed on the template material title: Target substrate quality example: RIN 9.2 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Target substrate quality template_amount: type: number description: Amount of template that went into the process title: Template amount example: 1000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Template amount template_amount_unit: $ref: '#/Ontology' description: Unit of template amount title: Template amount time unit example: id: UO:0000024 label: nanogram x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Template amount time unit format: ontology ontology: draft: false top_node: id: UO:0000002 label: physical quantity library_generation_method: type: string enum: - "PCR" - "RT(RHP)+PCR" - "RT(oligo-dT)+PCR" - "RT(oligo-dT)+TS+PCR" - "RT(oligo-dT)+TS(UMI)+PCR" - "RT(specific)+PCR" - "RT(specific)+TS+PCR" - "RT(specific)+TS(UMI)+PCR" - "RT(specific+UMI)+PCR" - "RT(specific+UMI)+TS+PCR" - "RT(specific)+TS" - "other" description: Generic type of library generation title: Library generation method example: RT(oligo-dT)+TS(UMI)+PCR x-airr: miairr: essential 
nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Library generation method format: controlled_vocabulary library_generation_protocol: type: string description: Description of processes applied to substrate to obtain a library that is ready for sequencing title: Library generation protocol example: cDNA was generated using x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Library generation protocol library_generation_kit_version: type: string description: When using a library generation protocol from a commercial provider, provide the protocol version number title: Protocol IDs example: v2.1 (2016-09-15) x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (nucleic acid) name: Protocol IDs pcr_target: type: array description: > If a PCR step was performed that specifically targets the IG/TR loci, the target and primer locations need to be provided here. This field holds an array of PCRTarget objects, so that multiplex PCR setups amplifying multiple loci at the same time can be annotated using one record per locus. PCR setups not targeting any specific locus must not annotate this field but select the appropriate library_generation_method instead. items: $ref: '#/PCRTarget' x-airr: nullable: false adc-query-support: true complete_sequences: type: string enum: - partial - complete - "complete+untemplated" - mixed description: > To be considered `complete`, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5' of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). 
To be considered `complete+untemplated`, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. `mixed` should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value. title: Complete sequences example: partial x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Complete sequences format: controlled_vocabulary physical_linkage: type: string enum: - none - hetero_head-head - hetero_tail-head - hetero_prelinked description: > In case an experimental setup is used that physically links nucleic acids derived from distinct `Rearrangements` before library preparation, this field describes the mode of that linkage. All `hetero_*` terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. `*_head-head` refers to techniques that link the 5' ends of transcripts in a single-cell context. `*_tail-head` refers to techniques that link the 3' end of one transcript to the 5' end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. `*_prelinked` refers to constructs in which the linkage was already present on the DNA level (e.g. scFv).
title: Physical linkage of different rearrangements example: hetero_head-head x-airr: miairr: essential nullable: false adc-query-support: true set: 3 subset: process (nucleic acid) name: Physical linkage of different rearrangements format: controlled_vocabulary # 1-to-n relationship between a NucleicAcidProcessing and SequencingRun with resultant raw sequence file(s) SequencingRun: type: object required: - sequencing_run_id - total_reads_passing_qc_filter - sequencing_platform - sequencing_facility - sequencing_run_date - sequencing_kit properties: sequencing_run_id: type: string description: ID of sequencing run assigned by the sequencing facility title: Batch number example: 160101_M01234 x-airr: identifier: true miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Batch number total_reads_passing_qc_filter: type: integer description: Number of usable reads for analysis title: Total reads passing QC filter example: 10365118 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Total reads passing QC filter sequencing_platform: type: string description: Designation of sequencing instrument used title: Sequencing platform example: Alumina LoSeq 1000 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Sequencing platform sequencing_facility: type: string description: Name and address of sequencing facility title: Sequencing facility example: Seqs-R-Us, Vancouver, BC, Canada x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Sequencing facility sequencing_run_date: type: string description: Date of sequencing run title: Date of sequencing run format: date example: 2016-12-16 x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Date of sequencing run sequencing_kit: type: string description: Name, 
manufacturer, order and lot numbers of sequencing kit title: Sequencing kit example: "FullSeq 600, Alumina, #M123456C0, 789G1HK" x-airr: miairr: important nullable: true adc-query-support: true set: 3 subset: process (sequencing) name: Sequencing kit sequencing_files: $ref: '#/SequencingData' description: Set of sequencing files produced by the sequencing run x-airr: nullable: false adc-query-support: true # Resultant raw sequencing files from a SequencingRun SequencingData: type: object required: - sequencing_data_id - file_type - filename - read_direction - read_length - paired_filename - paired_read_direction - paired_read_length properties: sequencing_data_id: type: string description: > Persistent identifier of raw data stored in an archive (e.g. INSDC run ID). Data archive should be identified in the CURIE prefix. title: Raw sequencing data persistent identifier example: "SRA:SRR11610494" x-airr: identifier: true miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) format: CURIE file_type: type: string description: File format for the raw reads or sequences title: Raw sequencing data file type enum: - fasta - fastq - null x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Raw sequencing data file type format: controlled_vocabulary filename: type: string description: File name for the raw reads or sequences. The first file in paired-read sequencing. title: Raw sequencing data file name example: MS10R-NMonson-C7JR9_S1_R1_001.fastq x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Raw sequencing data file name read_direction: type: string description: Read direction for the raw reads or sequences. The first file in paired-read sequencing. 
title: Read direction example: forward enum: - forward - reverse - mixed - null x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Read direction format: controlled_vocabulary read_length: type: integer description: Read length in bases for the first file in paired-read sequencing title: Forward read length example: 300 x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Forward read length paired_filename: type: string description: File name for the second file in paired-read sequencing title: Paired raw sequencing data file name example: MS10R-NMonson-C7JR9_S1_R2_001.fastq x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Paired raw sequencing data file name paired_read_direction: type: string description: Read direction for the second file in paired-read sequencing title: Paired read direction example: reverse enum: - forward - reverse - mixed - null x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Paired read direction format: controlled_vocabulary paired_read_length: type: integer description: Read length in bases for the second file in paired-read sequencing title: Paired read length example: 300 x-airr: miairr: important nullable: true adc-query-support: true set: 4 subset: data (raw reads) name: Paired read length index_filename: type: string description: File name for the index file title: Sequencing index file name example: MS10R-NMonson-C7JR9_S1_R3_001.fastq x-airr: nullable: true adc-query-support: true index_length: type: integer description: Read length in bases for the index file title: Index read length example: 8 x-airr: nullable: true adc-query-support: true # 1-to-n relationship between a repertoire and data processing # # Set of annotated rearrangement sequences produced by # data processing upon the raw sequence data for a repertoire. 
DataProcessing: type: object required: - software_versions - paired_reads_assembly - quality_thresholds - primer_match_cutoffs - collapsing_method - data_processing_protocols - germline_database properties: data_processing_id: type: string description: Identifier for the data processing object. title: Data processing ID x-airr: identifier: true nullable: true name: Data processing ID adc-query-support: true primary_annotation: type: boolean default: false description: > If true, indicates this is the primary or default data processing for the repertoire and its rearrangements. If false, indicates this is a secondary or additional data processing. title: Primary annotation x-airr: nullable: false adc-query-support: true identifier: true software_versions: type: string description: Version number and / or date, include company pipelines title: Software tools and version numbers example: IgBLAST 1.6 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Software tools and version numbers paired_reads_assembly: type: string description: How paired end reads were assembled into a single receptor sequence title: Paired read assembly example: PandaSeq (minimal overlap 50, threshold 0.8) x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Paired read assembly quality_thresholds: type: string description: How/if sequences were removed from (4) based on base quality scores title: Quality thresholds example: Average Phred score >=20 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Quality thresholds primer_match_cutoffs: type: string description: How primers were identified in the sequences, were they removed/masked/etc? 
title: Primer match cutoffs example: Hamming distance <= 2 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Primer match cutoffs collapsing_method: type: string description: The method used for combining multiple sequences from (4) into a single sequence in (5) title: Collapsing method example: MUSCLE 3.8.31 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Collapsing method data_processing_protocols: type: string description: General description of how QC is performed title: Data processing protocols example: Data was processed using [...] x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: process (computational) name: Data processing protocols data_processing_files: type: array items: type: string description: Array of file names for data produced by this data processing. title: Processed data file names example: - 'ERR1278153_aa.txz' - 'ERR1278153_ab.txz' - 'ERR1278153_ac.txz' x-airr: nullable: true adc-query-support: true name: Processed data file names germline_database: type: string description: Source of germline V(D)J genes with version number or date accessed. 
title: V(D)J germline reference database example: ENSEMBL, Homo sapiens build 90, 2017-10-01 x-airr: miairr: important nullable: true adc-query-support: true set: 5 subset: data (processed sequence) name: V(D)J germline reference database germline_set_ref: type: string description: Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) example: OGRDB:Human_IGH:2021.11 x-airr: nullable: true adc-query-support: true analysis_provenance_id: type: string description: Identifier for machine-readable PROV model of analysis provenance title: Analysis provenance ID x-airr: nullable: true adc-query-support: true SampleProcessing: allOf: - type: object properties: sample_processing_id: type: string description: > Identifier for the sample processing object. This field should be unique within the repertoire. This field can be used to uniquely identify the combination of sample, cell processing, nucleic acid processing and sequencing run information for the repertoire. title: Sample processing ID x-airr: identifier: true nullable: true name: Sample processing ID adc-query-support: true - $ref: '#/Sample' - $ref: '#/CellProcessing' - $ref: '#/NucleicAcidProcessing' - $ref: '#/SequencingRun' # The composite schema for the repertoire object # # This represents a sample repertoire as defined by the study # and experimentally observed by raw sequence data. A repertoire # can only be for one subject but may include multiple samples. Repertoire: type: object required: - study - subject - sample - data_processing properties: repertoire_id: type: string description: > Identifier for the repertoire object. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement. 
title: Repertoire ID x-airr: nullable: true adc-query-support: true identifier: true repertoire_name: type: string description: Short generic display name for the repertoire title: Repertoire name x-airr: nullable: true name: Repertoire name adc-query-support: true repertoire_description: type: string description: Generic repertoire description title: Repertoire description x-airr: nullable: true name: Repertoire description adc-query-support: true study: $ref: '#/Study' description: Study object x-airr: nullable: false adc-query-support: true subject: $ref: '#/Subject' description: Subject object x-airr: nullable: false adc-query-support: true sample: type: array description: List of Sample Processing objects items: $ref: '#/SampleProcessing' x-airr: nullable: false adc-query-support: true data_processing: type: array description: List of Data Processing objects items: $ref: '#/DataProcessing' x-airr: nullable: false adc-query-support: true # A collection of repertoires for analysis purposes, includes optional time course RepertoireGroup: type: object required: - repertoire_group_id - repertoires properties: repertoire_group_id: type: string description: Identifier for this repertoire collection x-airr: identifier: true repertoire_group_name: type: string description: Short display name for this repertoire collection repertoire_group_description: type: string description: Repertoire collection description repertoires: type: array description: > List of repertoires in this collection with an associated description and time point designation items: type: object properties: repertoire_id: type: string description: Identifier to the repertoire x-airr: nullable: false adc-query-support: true repertoire_description: type: string description: Description of this repertoire within the group x-airr: nullable: true adc-query-support: true time_point: $ref: '#/TimePoint' description: Time point designation for this repertoire within the group x-airr: nullable: true 
adc-query-support: true Alignment: type: object required: - sequence_id - segment - call - score - cigar properties: sequence_id: type: string description: > Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. x-airr: identifier: true segment: type: string description: > The segment for this alignment. One of V, D, J or C. rev_comp: type: boolean description: > Alignment result is from the reverse complement of the query sequence. call: type: string description: > Gene assignment with allele. score: type: number description: > Alignment score. identity: type: number description: > Alignment fractional identity. support: type: number description: > Alignment E-value, p-value, likelihood, probability or other similar measure of support for the gene assignment as defined by the alignment tool. cigar: type: string description: > Alignment CIGAR string. sequence_start: type: integer description: > Start position of the segment in the query sequence (1-based closed interval). sequence_end: type: integer description: > End position of the segment in the query sequence (1-based closed interval). germline_start: type: integer description: > Alignment start position in the reference sequence (1-based closed interval). germline_end: type: integer description: > Alignment end position in the reference sequence (1-based closed interval). rank: type: integer description: > Alignment rank. rearrangement_id: type: string description: > Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications. x-airr: deprecated: true deprecated-description: Field has been merged with sequence_id to avoid confusion. 
deprecated-replaced-by: - sequence_id data_processing_id: type: string description: > Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty then the primary data processing object is assumed. germline_database: type: string description: Source of germline V(D)J genes with version number or date accessed. example: ENSEMBL, Homo sapiens build 90, 2017-10-01 x-airr: deprecated: true deprecated-description: Field was moved up to the DataProcessing level to avoid data duplication. deprecated-replaced-by: - "DataProcessing:germline_database" # The extended rearrangement object Rearrangement: type: object required: - sequence_id - sequence - rev_comp - productive - v_call - d_call - j_call - sequence_alignment - germline_alignment - junction - junction_aa - v_cigar - d_cigar - j_cigar properties: sequence_id: type: string description: > Unique query sequence identifier for the Rearrangement. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. When downloaded from an AIRR Data Commons repository, this will usually be a universally unique record locator for linking with other objects in the AIRR Data Model. x-airr: adc-query-support: true identifier: true sequence: type: string description: > The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. quality: type: string description: > The Sanger/Phred quality scores for assessment of sequence quality. Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126 (Used by Illumina from v1.8.)
sequence_aa: type: string description: > Amino acid translation of the query nucleotide sequence. rev_comp: type: boolean description: > True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. productive: type: boolean description: > True if the V(D)J sequence is predicted to be productive. x-airr: adc-query-support: true vj_in_frame: type: boolean description: True if the V and J gene alignments are in-frame. stop_codon: type: boolean description: True if the aligned sequence contains a stop codon. complete_vdj: type: boolean description: > True if the sequence alignment spans the entire V(D)J region. Meaning, sequence_alignment includes both the first V gene codon that encodes the mature polypeptide chain (i.e., after the leader sequence) and the last complete codon of the J gene (i.e., before the J-C splice site). This does not require an absence of deletions within the internal FWR and CDR regions of the alignment. locus: type: string enum: - IGH - IGI - IGK - IGL - TRA - TRB - TRD - TRG - null description: > Gene locus (chain type). Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. title: Gene locus example: IGH x-airr: nullable: true adc-query-support: true name: Gene locus format: controlled_vocabulary v_call: type: string description: > V gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01 if using IMGT/GENE-DB). 
title: V gene with allele example: IGHV4-59*01 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: V gene with allele d_call: type: string description: > First or only D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB). title: D gene with allele example: IGHD3-10*01 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: D gene with allele d2_call: type: string description: > Second D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB). example: IGHD3-10*01 j_call: type: string description: > J gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02 if using IMGT/GENE-DB). title: J gene with allele example: IGHJ4*02 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: J gene with allele c_call: type: string description: > Constant region gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHG1*01 if using IMGT/GENE-DB). title: C region example: IGHG1*01 x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: C region sequence_alignment: type: string description: > Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. quality_alignment: type: string description: > Sanger/Phred quality scores for assessment of sequence_alignment quality. 
Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126 (Used by Illumina from v1.8.) sequence_alignment_aa: type: string description: > Amino acid translation of the aligned query sequence. germline_alignment: type: string description: > Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). germline_alignment_aa: type: string description: > Amino acid translation of the assembled germline sequence. junction: type: string description: > Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. title: IMGT-JUNCTION nucleotide sequence example: TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG x-airr: miairr: important nullable: true set: 6 subset: data (processed sequence) name: IMGT-JUNCTION nucleotide sequence junction_aa: type: string description: > Amino acid translation of the junction. title: IMGT-JUNCTION amino acid sequence example: CARAGVYDGYTMDYW x-airr: miairr: important nullable: true adc-query-support: true set: 6 subset: data (processed sequence) name: IMGT-JUNCTION amino acid sequence np1: type: string description: > Nucleotide sequence of the combined N/P region between the V gene and first D gene alignment or between the V gene and J gene alignments. np1_aa: type: string description: > Amino acid translation of the np1 field. np2: type: string description: > Nucleotide sequence of the combined N/P region between either the first D gene and J gene alignments or the first D gene and second D gene alignments. np2_aa: type: string description: > Amino acid translation of the np2 field. np3: type: string description: > Nucleotide sequence of the combined N/P region between the second D gene and J gene alignments. np3_aa: type: string description: > Amino acid translation of the np3 field. 
cdr1: type: string description: > Nucleotide sequence of the aligned CDR1 region. cdr1_aa: type: string description: > Amino acid translation of the cdr1 field. cdr2: type: string description: > Nucleotide sequence of the aligned CDR2 region. cdr2_aa: type: string description: > Amino acid translation of the cdr2 field. cdr3: type: string description: > Nucleotide sequence of the aligned CDR3 region. cdr3_aa: type: string description: > Amino acid translation of the cdr3 field. fwr1: type: string description: > Nucleotide sequence of the aligned FWR1 region. fwr1_aa: type: string description: > Amino acid translation of the fwr1 field. fwr2: type: string description: > Nucleotide sequence of the aligned FWR2 region. fwr2_aa: type: string description: > Amino acid translation of the fwr2 field. fwr3: type: string description: > Nucleotide sequence of the aligned FWR3 region. fwr3_aa: type: string description: > Amino acid translation of the fwr3 field. fwr4: type: string description: > Nucleotide sequence of the aligned FWR4 region. fwr4_aa: type: string description: > Amino acid translation of the fwr4 field. v_score: type: number description: Alignment score for the V gene. v_identity: type: number description: Fractional identity for the V gene alignment. v_support: type: number description: > V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool. v_cigar: type: string description: CIGAR string for the V gene alignment. d_score: type: number description: Alignment score for the first or only D gene alignment. d_identity: type: number description: Fractional identity for the first or only D gene alignment. d_support: type: number description: > D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the first or only D gene as defined by the alignment tool. 
d_cigar: type: string description: CIGAR string for the first or only D gene alignment. d2_score: type: number description: Alignment score for the second D gene alignment. d2_identity: type: number description: Fractional identity for the second D gene alignment. d2_support: type: number description: > D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the second D gene as defined by the alignment tool. d2_cigar: type: string description: CIGAR string for the second D gene alignment. j_score: type: number description: Alignment score for the J gene alignment. j_identity: type: number description: Fractional identity for the J gene alignment. j_support: type: number description: > J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool. j_cigar: type: string description: CIGAR string for the J gene alignment. c_score: type: number description: Alignment score for the C gene alignment. c_identity: type: number description: Fractional identity for the C gene alignment. c_support: type: number description: > C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool. c_cigar: type: string description: CIGAR string for the C gene alignment. v_sequence_start: type: integer description: > Start position of the V gene in the query sequence (1-based closed interval). v_sequence_end: type: integer description: > End position of the V gene in the query sequence (1-based closed interval). v_germline_start: type: integer description: > Alignment start position in the V gene reference sequence (1-based closed interval). v_germline_end: type: integer description: > Alignment end position in the V gene reference sequence (1-based closed interval). 
v_alignment_start: type: integer description: > Start position of the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). v_alignment_end: type: integer description: > End position of the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d_sequence_start: type: integer description: > Start position of the first or only D gene in the query sequence (1-based closed interval). d_sequence_end: type: integer description: > End position of the first or only D gene in the query sequence (1-based closed interval). d_germline_start: type: integer description: > Alignment start position in the D gene reference sequence for the first or only D gene (1-based closed interval). d_germline_end: type: integer description: > Alignment end position in the D gene reference sequence for the first or only D gene (1-based closed interval). d_alignment_start: type: integer description: > Start position of the first or only D gene in both the sequence_alignment and germline_alignment fields (1-based closed interval). d_alignment_end: type: integer description: > End position of the first or only D gene in both the sequence_alignment and germline_alignment fields (1-based closed interval). d2_sequence_start: type: integer description: > Start position of the second D gene in the query sequence (1-based closed interval). d2_sequence_end: type: integer description: > End position of the second D gene in the query sequence (1-based closed interval). d2_germline_start: type: integer description: > Alignment start position in the second D gene reference sequence (1-based closed interval). d2_germline_end: type: integer description: > Alignment end position in the second D gene reference sequence (1-based closed interval).
d2_alignment_start: type: integer description: > Start position of the second D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d2_alignment_end: type: integer description: > End position of the second D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_sequence_start: type: integer description: > Start position of the J gene in the query sequence (1-based closed interval). j_sequence_end: type: integer description: > End position of the J gene in the query sequence (1-based closed interval). j_germline_start: type: integer description: > Alignment start position in the J gene reference sequence (1-based closed interval). j_germline_end: type: integer description: > Alignment end position in the J gene reference sequence (1-based closed interval). j_alignment_start: type: integer description: > Start position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_alignment_end: type: integer description: > End position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). c_sequence_start: type: integer description: > Start position of the C gene in the query sequence (1-based closed interval). c_sequence_end: type: integer description: > End position of the C gene in the query sequence (1-based closed interval). c_germline_start: type: integer description: > Alignment start position in the C gene reference sequence (1-based closed interval). c_germline_end: type: integer description: > Alignment end position in the C gene reference sequence (1-based closed interval). c_alignment_start: type: integer description: > Start position of the C gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). 
c_alignment_end: type: integer description: > End position of the C gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). cdr1_start: type: integer description: CDR1 start position in the query sequence (1-based closed interval). cdr1_end: type: integer description: CDR1 end position in the query sequence (1-based closed interval). cdr2_start: type: integer description: CDR2 start position in the query sequence (1-based closed interval). cdr2_end: type: integer description: CDR2 end position in the query sequence (1-based closed interval). cdr3_start: type: integer description: CDR3 start position in the query sequence (1-based closed interval). cdr3_end: type: integer description: CDR3 end position in the query sequence (1-based closed interval). fwr1_start: type: integer description: FWR1 start position in the query sequence (1-based closed interval). fwr1_end: type: integer description: FWR1 end position in the query sequence (1-based closed interval). fwr2_start: type: integer description: FWR2 start position in the query sequence (1-based closed interval). fwr2_end: type: integer description: FWR2 end position in the query sequence (1-based closed interval). fwr3_start: type: integer description: FWR3 start position in the query sequence (1-based closed interval). fwr3_end: type: integer description: FWR3 end position in the query sequence (1-based closed interval). fwr4_start: type: integer description: FWR4 start position in the query sequence (1-based closed interval). fwr4_end: type: integer description: FWR4 end position in the query sequence (1-based closed interval). v_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the V gene, including any indel corrections or numbering spacers. v_sequence_alignment_aa: type: string description: > Amino acid translation of the v_sequence_alignment field. 
d_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the first or only D gene, including any indel corrections or numbering spacers. d_sequence_alignment_aa: type: string description: > Amino acid translation of the d_sequence_alignment field. d2_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the second D gene, including any indel corrections or numbering spacers. d2_sequence_alignment_aa: type: string description: > Amino acid translation of the d2_sequence_alignment field. j_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the J gene, including any indel corrections or numbering spacers. j_sequence_alignment_aa: type: string description: > Amino acid translation of the j_sequence_alignment field. c_sequence_alignment: type: string description: > Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers. c_sequence_alignment_aa: type: string description: > Amino acid translation of the c_sequence_alignment field. v_germline_alignment: type: string description: > Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any). v_germline_alignment_aa: type: string description: > Amino acid translation of the v_germline_alignment field. d_germline_alignment: type: string description: > Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any). d_germline_alignment_aa: type: string description: > Amino acid translation of the d_germline_alignment field. d2_germline_alignment: type: string description: > Aligned D gene germline sequence spanning the same region as the d2_sequence_alignment field and including the same set of corrections and spacers (if any). 
d2_germline_alignment_aa: type: string description: > Amino acid translation of the d2_germline_alignment field. j_germline_alignment: type: string description: > Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any). j_germline_alignment_aa: type: string description: > Amino acid translation of the j_germline_alignment field. c_germline_alignment: type: string description: > Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any). c_germline_alignment_aa: type: string description: > Amino acid translation of the c_germline_alignment field. junction_length: type: integer description: Number of nucleotides in the junction sequence. junction_aa_length: type: integer description: Number of amino acids in the junction sequence. x-airr: adc-query-support: true np1_length: type: integer description: > Number of nucleotides between the V gene and first D gene alignments or between the V gene and J gene alignments. np2_length: type: integer description: > Number of nucleotides between either the first D gene and J gene alignments or the first D gene and second D gene alignments. np3_length: type: integer description: > Number of nucleotides between the second D gene and J gene alignments. n1_length: type: integer description: Number of untemplated nucleotides 5' of the first or only D gene alignment. n2_length: type: integer description: Number of untemplated nucleotides 3' of the first or only D gene alignment. n3_length: type: integer description: Number of untemplated nucleotides 3' of the second D gene alignment. p3v_length: type: integer description: Number of palindromic nucleotides 3' of the V gene alignment. p5d_length: type: integer description: Number of palindromic nucleotides 5' of the first or only D gene alignment.
p3d_length: type: integer description: Number of palindromic nucleotides 3' of the first or only D gene alignment. p5d2_length: type: integer description: Number of palindromic nucleotides 5' of the second D gene alignment. p3d2_length: type: integer description: Number of palindromic nucleotides 3' of the second D gene alignment. p5j_length: type: integer description: Number of palindromic nucleotides 5' of the J gene alignment. v_frameshift: type: boolean description: > True if the V gene in the query nucleotide sequence contains a translational frameshift relative to the frame of the V gene reference sequence. j_frameshift: type: boolean description: > True if the J gene in the query nucleotide sequence contains a translational frameshift relative to the frame of the J gene reference sequence. d_frame: type: integer description: > Numerical reading frame (1, 2, 3) of the first or only D gene in the query nucleotide sequence, where frame 1 is relative to the first codon of the D gene reference sequence. d2_frame: type: integer description: > Numerical reading frame (1, 2, 3) of the second D gene in the query nucleotide sequence, where frame 1 is relative to the first codon of the D gene reference sequence. consensus_count: type: integer description: > Number of reads contributing to the UMI consensus or contig assembly for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. duplicate_count: type: integer description: > Copy number or number of duplicate observations for the query sequence. For example, the number of identical reads observed for this sequence. title: Read count example: 123 x-airr: miairr: important nullable: true set: 6 subset: data (processed sequence) name: Read count umi_count: type: integer description: > Number of distinct UMIs represented by this sequence. For example, the total number of UMIs that contribute to the contig assembly for the query sequence.
cell_id: type: string description: > Identifier defining the cell of origin for the query sequence. title: Cell index example: W06_046_091 x-airr: miairr: important nullable: true adc-query-support: true identifier: true set: 6 subset: data (processed sequence) name: Cell index clone_id: type: string description: Clonal cluster assignment for the query sequence. x-airr: nullable: true adc-query-support: true identifier: true repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. x-airr: nullable: true adc-query-support: true identifier: true sample_processing_id: type: string description: > Identifier to the sample processing object in the repertoire metadata for this rearrangement. If the repertoire has a single sample then this field may be empty or missing. If the repertoire has multiple samples then this field may be empty or missing if the sample cannot be differentiated or the relationship is not maintained by the data processing. x-airr: nullable: true adc-query-support: true identifier: true data_processing_id: type: string description: > Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty then the primary data processing object is assumed. x-airr: nullable: true adc-query-support: true identifier: true rearrangement_id: type: string description: > Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications. x-airr: deprecated: true deprecated-description: Field has been merged with sequence_id to avoid confusion. deprecated-replaced-by: - sequence_id rearrangement_set_id: type: string description: > Identifier for grouping Rearrangement objects. x-airr: deprecated: true deprecated-description: Field has been replaced by other specialized identifiers.
deprecated-replaced-by: - repertoire_id - sample_processing_id - data_processing_id germline_database: type: string description: Source of germline V(D)J genes with version number or date accessed. example: ENSEMBL, Homo sapiens build 90, 2017-10-01 x-airr: deprecated: true deprecated-description: Field was moved up to the DataProcessing level to avoid data duplication. deprecated-replaced-by: - "DataProcessing:germline_database" # A unique inferred clone object that has been constructed within a single data processing # for a single repertoire and a subset of its sequences and/or rearrangements. Clone: type: object required: - clone_id - germline_alignment properties: clone_id: type: string description: Identifier for the clone. x-airr: identifier: true repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. x-airr: nullable: true adc-query-support: true data_processing_id: type: string description: Identifier of the data processing object in the repertoire metadata for this clone. x-airr: nullable: true adc-query-support: true sequences: type: array items: type: string description: > List of sequence_id strings that act as keys to the Rearrangement records for members of the clone. v_call: type: string description: > V gene with allele of the inferred ancestor of the clone. For example, IGHV4-59*01. example: IGHV4-59*01 d_call: type: string description: > D gene with allele of the inferred ancestor of the clone. For example, IGHD3-10*01. example: IGHD3-10*01 j_call: type: string description: > J gene with allele of the inferred ancestor of the clone. For example, IGHJ4*02. example: IGHJ4*02 junction: type: string description: > Nucleotide sequence for the junction region of the inferred ancestor of the clone, where the junction is defined as the CDR3 plus the two flanking conserved codons. junction_aa: type: string description: > Amino acid translation of the junction.
junction_length: type: integer description: Number of nucleotides in the junction. junction_aa_length: type: integer description: Number of amino acids in junction_aa. germline_alignment: type: string description: > Assembled, aligned, full-length inferred ancestor of the clone spanning the same region as the sequence_alignment field of nodes (typically the V(D)J region) and including the same set of corrections and spacers (if any). germline_alignment_aa: type: string description: > Amino acid translation of germline_alignment. v_alignment_start: type: integer description: > Start position in the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). v_alignment_end: type: integer description: > End position in the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d_alignment_start: type: integer description: > Start position of the D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). d_alignment_end: type: integer description: > End position of the D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_alignment_start: type: integer description: > Start position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). j_alignment_end: type: integer description: > End position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval). junction_start: type: integer description: Junction region start position in the alignment (1-based closed interval). junction_end: type: integer description: Junction region end position in the alignment (1-based closed interval). umi_count: type: integer description: > Number of distinct UMIs observed across all sequences (Rearrangement records) in this clone. 
clone_count: type: integer description: > Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, the number of distinct cell barcodes (unique cell_id values), or a more sophisticated calculation appropriate to the experimental protocol. Absolute count is provided versus a frequency so that downstream analysis tools can perform their own normalization. seed_id: type: string description: sequence_id of the seed sequence. Empty string (or null) if there is no seed sequence. # 1-to-n relationship for a clone to its trees. Tree: type: object required: - tree_id - clone_id - newick properties: tree_id: type: string description: Identifier for the tree. x-airr: identifier: true clone_id: type: string description: Identifier for the clone. newick: type: string description: Newick string of the tree edges. nodes: type: object description: Dictionary of nodes in the tree, keyed by sequence_id string additionalProperties: $ref: '#/Node' # 1-to-n relationship between a tree and its nodes Node: type: object required: - sequence_id properties: sequence_id: type: string description: > Identifier for this node that matches the identifier in the newick string and, where possible, the sequence_id in the source repertoire. x-airr: identifier: true sequence_alignment: type: string description: > Nucleotide sequence of the node, aligned to the germline_alignment for this clone, including any indel corrections or spacers. junction: type: string description: > Junction region nucleotide sequence for the node, where the junction is defined as the CDR3 plus the two flanking conserved codons. junction_aa: type: string description: > Amino acid translation of the junction. # The cell object acts as point of reference for all data that can be related # to an individual cell, either by direct observation or inference.
Cell: type: object required: - cell_id - rearrangements - repertoire_id - virtual_pairing properties: cell_id: type: string description: > Identifier defining the cell of origin for the query sequence. title: Cell index example: W06_046_091 x-airr: identifier: true miairr: defined nullable: false adc-query-support: true name: Cell index rearrangements: type: array description: > Array of sequence identifiers defined for the Rearrangement object title: Cell-associated rearrangements items: type: string example: [id1, id2] #empty vs NULL? x-airr: miairr: defined nullable: true adc-query-support: true name: Cell-associated rearrangements receptors: type: array description: > Array of receptor identifiers defined for the Receptor object title: Cell-associated receptors items: type: string example: [id1, id2] #empty vs NULL? x-airr: miairr: defined nullable: true adc-query-support: true name: Cell-associated receptors repertoire_id: type: string description: Identifier to the associated repertoire in study metadata. title: Parental repertoire of cell x-airr: miairr: defined nullable: true adc-query-support: true name: Parental repertoire of cell data_processing_id: type: string description: Identifier of the data processing object in the repertoire metadata for this clone. title: Data processing for cell x-airr: miairr: defined nullable: true adc-query-support: true name: Data processing for cell expression_study_method: type: string enum: - flow_cytometry - single-cell_transcriptome - null description: > Keyword describing the methodology used to assess expression. The values for this field MUST come from a controlled vocabulary. x-airr: miairr: defined nullable: true adc-query-support: true expression_raw_doi: type: string description: > DOI of raw data set containing the current event x-airr: miairr: defined nullable: true adc-query-support: true expression_index: type: string description: > Index addressing the current event within the raw data set.
x-airr: miairr: defined nullable: true virtual_pairing: type: boolean description: > boolean to indicate if pairing was inferred. title: Virtual pairing x-airr: miairr: defined nullable: true adc-query-support: true name: Virtual pairing # The CellExpression object acts as a container to hold a single expression level measurement from # an experiment. Expression data is associated with a cell_id and the related repertoire_id and # data_processing_id as cell_id is not guaranteed to be unique outside the data processing for # a single repertoire. CellExpression: type: object required: - expression_id - repertoire_id - data_processing_id - cell_id - property - property_type - value properties: expression_id: type: string description: > Identifier of this expression property measurement. title: Expression property measurement identifier x-airr: identifier: true miairr: defined nullable: false adc-query-support: true name: Expression measurement identifier cell_id: type: string description: > Identifier of the cell to which this expression data is related. title: Cell identifier example: W06_046_091 x-airr: miairr: defined nullable: false adc-query-support: true name: Cell identifier repertoire_id: type: string description: Identifier for the associated repertoire in study metadata. title: Parental repertoire of cell x-airr: miairr: defined nullable: true adc-query-support: true name: Parental repertoire of cell data_processing_id: type: string description: Identifier of the data processing object in the repertoire metadata for this clone. title: Data processing for cell x-airr: miairr: defined nullable: true adc-query-support: true name: Data processing for cell property_type: type: string description: > Keyword describing the property type and detection method used to measure the property value. 
The following keywords are recommended, but custom property types are also valid: "mrna_expression_by_read_count", "protein_expression_by_fluorescence_intensity", "antigen_bait_binding_by_fluorescence_intensity", "protein_expression_by_dna_barcode_count" and "antigen_bait_binding_by_dna_barcode_count". title: Property type and detection method x-airr: miairr: defined nullable: false adc-query-support: true name: Property type and detection method property: $ref: '#/Ontology' title: Property information description: > Name of the property observed, typically a gene or antibody identifier (and label) from a canonical resource such as Ensembl (e.g. ENSG00000275747, IGHV3-79) or Antibody Registry (ABREG:1236456, Purified anti-mouse/rat/human CD27 antibody). example: id: ENSG:ENSG00000275747 label: IGHV3-79 x-airr: miairr: defined adc-query-support: true format: ontology name: Property information value: type: number description: Level at which the property was observed in the experiment (non-normalized). title: Property value example: 3 x-airr: miairr: defined nullable: true adc-query-support: true name: Property value # The Receptor object holds information about a receptor and its reactivity. # Receptor: type: object required: - receptor_id - receptor_hash - receptor_type - receptor_variable_domain_1_aa - receptor_variable_domain_1_locus - receptor_variable_domain_2_aa - receptor_variable_domain_2_locus properties: receptor_id: type: string description: ID of the current Receptor object, unique within the local repository. title: Receptor ID example: TCR-MM-012345 x-airr: identifier: true nullable: false adc-query-support: true receptor_hash: type: string description: > The SHA256 hash of the receptor amino acid sequence, calculated on the concatenated ``receptor_variable_domain_*_aa`` sequences and represented as base16-encoded string.
title: Receptor hash ID example: aa1c4b77a6f4927611ab39f5267415beaa0ba07a952c233d803b07e52261f026 x-airr: nullable: false adc-query-support: true receptor_type: type: string enum: - Ig - TCR description: The top-level receptor type, either Immunoglobulin (Ig) or T Cell Receptor (TCR). x-airr: nullable: false adc-query-support: true receptor_variable_domain_1_aa: type: string description: > Complete amino acid sequence of the mature variable domain of the Ig heavy, TCR beta or TCR delta chain. The mature variable domain is defined as encompassing all AA from and including first AA after the signal peptide to and including the last AA that is completely encoded by the J gene. example: > QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLEWIGRIDPNSGGTKYNEKFKSKATLTVDKPSSTAYMQLSSLTSEDSAVYYCARYDYYGSSYFDYWGQGTTLTVSS x-airr: nullable: false adc-query-support: true receptor_variable_domain_1_locus: type: string enum: - IGH - TRB - TRD description: Locus from which the variable domain in receptor_variable_domain_1_aa originates example: IGH x-airr: nullable: false adc-query-support: true receptor_variable_domain_2_aa: type: string description: > Complete amino acid sequence of the mature variable domain of the Ig light, TCR alpha or TCR gamma chain. The mature variable domain is defined as encompassing all AA from and including first AA after the signal peptide to and including the last AA that is completely encoded by the J gene.
example: > QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLFTGLIGGTNNRAPGVPARFSGSLIGDKAALTITGAQTEDEAIYFCALWYSNHWVFGGGTKLTVL x-airr: nullable: false adc-query-support: true receptor_variable_domain_2_locus: type: string enum: - IGI - IGK - IGL - TRA - TRG description: Locus from which the variable domain in receptor_variable_domain_2_aa originates example: IGL x-airr: nullable: false adc-query-support: true receptor_ref: type: array description: Array of receptor identifiers defined for the Receptor object title: Receptor cross-references items: type: string example: ["IEDB_RECEPTOR:10"] x-airr: nullable: true adc-query-support: true reactivity_measurements: type: array description: Records of reactivity measurement items: $ref: '#/ReceptorReactivity' x-airr: nullable: true ReceptorReactivity: type: object required: - ligand_type - antigen_type - antigen - reactivity_method - reactivity_readout - reactivity_value - reactivity_unit properties: ligand_type: type: string enum: - "MHC:peptide" - "MHC:non-peptide" - protein - peptide - non-peptidic description: Classification of ligand binding to receptor example: non-peptide x-airr: nullable: false antigen_type: type: string enum: - protein - peptide - non-peptidic description: > The type of antigen before processing by the immune system. example: protein x-airr: nullable: false antigen: $ref: '#/Ontology' description: > The substance against which the receptor was tested. This can be any substance that stimulates an adaptive immune response in the host, either through antibody production or by T cell activation after presentation via an MHC molecule. 
title: Antigen example: id: UNIPROT:P19597 label: Circumsporozoite protein x-airr: nullable: false adc-query-support: true format: ontology antigen_source_species: $ref: '#/Ontology' description: The species from which the antigen was isolated title: Source species of antigen example: id: NCBITAXON:5843 label: Plasmodium falciparum NF54 x-airr: nullable: true format: ontology ontology: draft: true top_node: id: NCBITAXON:1 label: root peptide_start: type: integer description: Start position of the peptide within the reference protein sequence x-airr: nullable: true peptide_end: type: integer description: End position of the peptide within the reference protein sequence x-airr: nullable: true mhc_class: type: string enum: - MHC-I - MHC-II - MHC-nonclassical - null description: Class of MHC molecule, only present for MHC:x ligand types example: MHC-II x-airr: nullable: true mhc_gene_1: $ref: '#/Ontology' description: The MHC gene to which the mhc_allele_1 belongs title: MHC gene 1 example: id: MRO:0000055 label: HLA-DRA x-airr: nullable: true format: ontology ontology: draft: true top_node: id: MRO:0000004 label: MHC gene mhc_allele_1: type: string description: Allele designation of the MHC alpha chain example: HLA-DRA x-airr: nullable: true mhc_gene_2: $ref: '#/Ontology' description: The MHC gene to which the mhc_allele_2 belongs title: MHC gene 2 example: id: MRO:0000057 label: HLA-DRB1 x-airr: nullable: true format: ontology ontology: draft: true top_node: id: MRO:0000004 label: MHC gene mhc_allele_2: type: string description: > Allele designation of the MHC class II beta chain or the invariant beta2-microglobulin chain example: HLA-DRB1*04:01 x-airr: nullable: true reactivity_method: type: string enum: - SPR - ITC - ELISA - cytometry - biological_activity description: The methodology used to assess expression (assay implemented in experiment) x-airr: nullable: false reactivity_readout: type: string enum: - binding_strength - cytokine_release -
dissociation_constant_kd - on_rate - off_rate - pathogen_inhibition description: Reactivity measurement read-out example: cytokine release x-airr: nullable: false reactivity_value: type: number description: The absolute (processed) value of the measurement example: 162.26 x-airr: nullable: false reactivity_unit: type: string description: The unit of the measurement example: pg/ml x-airr: nullable: false ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1661708083.0 airr-1.5.1/airr/tools.py0000644000076500000240000002504214302723463014474 0ustar00vandej27staff""" AIRR tools and utilities """ # Copyright (c) 2018 AIRR Community # # This file is part of the AIRR Community Standards. # # Author: Scott Christley # Author: Jason Anthony Vander Heiden # Date: March 29, 2018 # # This library is free software; you can redistribute it and/or modify # it under the terms of the Creative Commons Attribution 4.0 License. # # This library is distributed in the hope that it will be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # Creative Commons Attribution 4.0 License for more details. # System imports import argparse import sys from warnings import warn # Local imports from airr import __version__ import airr.interface # internal wrapper function before calling merge interface method def merge_cmd(out_file, airr_files, drop=False, debug=False): """ Merge one or more AIRR rearrangements files Arguments: out_file (str): output file name. airr_files (list): list of input files to merge. drop (bool): drop flag. If True then drop fields that do not exist in all input files, otherwise combine fields from all input files. debug (bool): debug flag. If True print debugging information to standard error. Returns: bool: True if files were successfully merged, otherwise False. 
""" return airr.interface.merge_rearrangement(out_file, airr_files, drop=drop, debug=debug) # internal wrapper function before calling validate interface method def validate_rearrangement_cmd(airr_files, debug=True): """ Validates one or more AIRR rearrangements files Arguments: airr_files (list): list of input files to validate. debug (bool): debug flag. If True print debugging information to standard error. Returns: boolean: True if all files passed validation, otherwise False """ valid = [] for f in airr_files: try: v = airr.interface.validate_rearrangement(f, debug=debug) valid.append(v) except Exception as e: sys.stderr.write('%s\n' % e) sys.stderr.write('Validation failed for file: %s\n\n' % f) valid.append(False) else: if not v: sys.stderr.write('Validation failed for file: %s\n\n' % f) return all(valid) def validate_airr_cmd(airr_files, debug=True): """ Validates one or more AIRR Data Model files Arguments: airr_files (list): list of input files to validate. debug (bool): debug flag. If True print debugging information to standard error. Returns: boolean: True if all files passed validation, otherwise False """ valid = [] for f in airr_files: if debug: sys.stderr.write('Validating: %s\n' % f) try: data = airr.interface.read_airr(f, validate=False, debug=debug) v = airr.interface.validate_airr(data, debug=debug) valid.append(v) except Exception as e: sys.stderr.write('%s\n' % e) sys.stderr.write('Validation failed for file: %s\n\n' % f) valid.append(False) return all(valid) #### Deprecated #### # internal wrapper function before calling validate interface method def validate_repertoire_cmd(airr_files, debug=True): """ Validates one or more AIRR repertoire metadata files Arguments: airr_files (list): list of input files to validate. debug (bool): debug flag. If True print debugging information to standard error. 
Returns: boolean: True if all files passed validation, otherwise False """ # Deprecation warn('validate_repertoire_cmd is deprecated and will be removed in a future release.\nUse validate_airr_cmd instead.\n', DeprecationWarning, stacklevel=2) valid = [] for f in airr_files: try: v = airr.interface.validate_repertoire(f, debug=debug) valid.append(v) except Exception as e: sys.stderr.write('%s\n' % e) sys.stderr.write('Validation failed for file: %s\n\n' % f) valid.append(False) return all(valid) def define_args(): """ Define commandline arguments Returns: argparse.ArgumentParser: argument parser. """ parser = argparse.ArgumentParser(add_help=False, description='AIRR Community Standards utility commands.') group_help = parser.add_argument_group('help') group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s' % __version__) # Setup subparsers subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='Database operation') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Define arguments common to all subcommands common_parser = argparse.ArgumentParser(add_help=False) common_help = common_parser.add_argument_group('help') common_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s' % __version__) common_help.add_argument('-h', '--help', action='help', help='show this help message and exit') # TODO: workflow provenance # group_prov = common_parser.add_argument_group('provenance') # group_prov.add_argument('-p', '--provenance', action='store', dest='prov_file', default=None, # help='''File name for storing workflow provenance.
If specified, airr-tools # will record provenance for all activities performed.''') # TODO: study metadata # group_meta = common_parser.add_argument_group('study metadata') # group_meta.add_argument('-m', '--metadata', action='store', dest='metadata_file', default=None, # help='''File name containing study metadata.''') # Subparser to merge files parser_merge = subparsers.add_parser('merge', parents=[common_parser], add_help=False, help='Merge AIRR rearrangement files.', description='Merge AIRR rearrangement files.') group_merge = parser_merge.add_argument_group('merge arguments') group_merge.add_argument('-o', action='store', dest='out_file', required=True, help='''Output file name.''') group_merge.add_argument('--drop', action='store_true', dest='drop', help='''If specified, drop fields that do not exist in all input files. Otherwise, include all columns in all files and fill missing data with empty strings.''') group_merge.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR rearrangement files.') parser_merge.set_defaults(func=merge_cmd) # Subparser to validate files parser_validate = subparsers.add_parser('validate', parents=[common_parser], add_help=False, help='Validate files for AIRR Standards compliance.', description='Validate files for AIRR Standards compliance.') validate_subparser = parser_validate.add_subparsers(title='subcommands', metavar='', help='Database operation') # Subparser to validate rearrangement files parser_validate = validate_subparser.add_parser('rearrangement', parents=[common_parser], add_help=False, help='Validate AIRR rearrangement files.', description='Validate AIRR rearrangement files.') group_validate = parser_validate.add_argument_group('validate arguments') group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR rearrangement files.') parser_validate.set_defaults(func=validate_rearrangement_cmd) # Subparser to validate 
AIRR Data Model files parser_validate = validate_subparser.add_parser('airr', parents=[common_parser], add_help=False, help='Validate AIRR Data Model files.', description='Validate AIRR Data Model files.') group_validate = parser_validate.add_argument_group('validate arguments') group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR Data Model files.') parser_validate.set_defaults(func=validate_airr_cmd) # Subparser to validate repertoire files parser_validate = validate_subparser.add_parser('repertoire', parents=[common_parser], add_help=False, help='Validate AIRR repertoire metadata files.', description='Validate AIRR repertoire metadata files.') group_validate = parser_validate.add_argument_group('validate arguments') group_validate.add_argument('-a', nargs='+', action='store', dest='airr_files', required=True, help='A list of AIRR repertoire metadata files.') parser_validate.set_defaults(func=validate_repertoire_cmd) return parser def main(): """ Utility commands for AIRR Community Standards files """ # Define argument parsers and print help if subcommand not specified parser = define_args() if len(sys.argv) == 1: parser.print_help() sys.exit(1) # Parse arguments args = parser.parse_args() args_dict = args.__dict__.copy() del args_dict['command'] del args_dict['func'] # Deprecation warnings if args.func is validate_repertoire_cmd: print('The "validate repertoire" subcommand is deprecated and will be removed in a future release.', '\nUse the "validate airr" subcommand instead.\n') # Call tool function result = args.func(**args_dict) # set return code to non-zero if error occurred if args.__dict__['command'] == 'validate' or args.__dict__['command'] == 'merge': if not result: sys.exit(1) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1619984621.0 airr-1.5.1/requirements.txt0000644000076500000240000000011114043600355015273 0ustar00vandej27staffpandas>=0.24.0 pyyaml>=3.12 
yamlordereddictloader>=0.4.0 setuptools>=2.0 ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1717370576.6360633 airr-1.5.1/setup.cfg0000644000076500000240000000030014627177321013642 0ustar00vandej27staff[versioneer] VCS = git style = pep440 versionfile_source = airr/_version.py versionfile_build = airr/_version.py tag_prefix = v parentdir_prefix = airr- [egg_info] tag_build = tag_date = 0 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1691442592.0 airr-1.5.1/setup.py0000644000076500000240000000266214464256640013551 0ustar00vandej27staff""" AIRR community formats for adaptive immune receptor data. """ import sys import versioneer try: from setuptools import setup, find_packages except ImportError: sys.exit('setuptools is required.') with open('README.rst', 'r') as ip: long_description = ip.read() # Parse requirements with open('requirements.txt') as req: install_requires = req.read().splitlines() # Setup setup(name='airr', version=versioneer.get_version(), cmdclass=versioneer.get_cmdclass(), author='AIRR Community', author_email='', description='AIRR Community Data Representation Standard reference library for antibody and TCR sequencing data.', long_description=long_description, zip_safe=False, license='CC BY 4.0', url='http://docs.airr-community.org', keywords=['AIRR', 'bioinformatics', 'sequencing', 'immunoglobulin', 'antibody', 'adaptive immunity', 'T cell', 'B cell', 'BCR', 'TCR'], install_requires=install_requires, packages=find_packages(), package_data={'airr': ['specs/*.yaml']}, entry_points={'console_scripts': ['airr-tools=airr.tools:main']}, classifiers=['Intended Audience :: Science/Research', 'Natural Language :: English', 'Operating System :: OS Independent', 'Programming Language :: Python :: 3', 'Topic :: Scientific/Engineering :: Bio-Informatics']) ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1717370576.6340427 
airr-1.5.1/tests/0000755000076500000240000000000014627177321013172 5ustar00vandej27staff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1616358979.0 airr-1.5.1/tests/__init__.py0000644000076500000240000000000014025727103015260 0ustar00vandej27staff././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1717370576.635599 airr-1.5.1/tests/data/0000755000076500000240000000000014627177321014103 5ustar00vandej27staff././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/bad_genotype_set.json0000644000076500000240000000274314627177010020312 0ustar00vandej27staff{ "GenotypeSet": [{ "receptor_genotype_set_id": "1", "genotype_class_list": [ { "receptor_genotype_id": "1", "locus": "IGH", "documented_alleles": [ { "label": "IGHV1-69*01", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 }, { "label": "IGHV1-69*02", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 2 }, { "label": "IGHV1-69*02", "name": "1234", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 2 } ], "undocumented_alleles": [ { "allele_name": "IGHD3-1*01_S1234", "sequence": "agtagtagtagt", "phasing": 1 } ], "deleted_genes": [ { "label": "IGHV3-30-3", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": "1" } ], "inference_process": "repertoire_sequencing" } ] }] }././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/bad_germline_set.json0000644000076500000240000003311714627177010020261 0ustar00vandej27staff{ "GermlineSet": [{ "germline_set_id": "OGRDB:G00007", "author": "William Lees", "lab_name": "", "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_description": "", "release_date": "2021-11-24", "germline_set_name": "CAST IGH", "germline_set_ref": "OGRDB:G00007.1", "pub_ids": "", "species": ["Mouse"], "species_subgroup": 
"CAST_EiJ", "species_subgroup_type": "strain", "locus": "IGH", "allele_descriptions": [ { "allele_description_id": "OGRDB:A00301", "maintainer": "William Lees", "acknowledgements": [], "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_date": "24-Nov-2021", "release_description": "First release", "label": "IGHV-2DBF", "sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGATGGTAGTGGCACCTACTATCTGGACTCCTTGAAGAGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "coding_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "aliases": [ "watson_et_al:CAST_EiJ_IGHV5-3" ], "locus": "IGH", "chromosome": null, "sequence_type": "V", "functional": true, "inference_type": "rearranged_only", "species": "Mouse", "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "status": "active", "gene_designation": null, "subgroup_designation": null, "allele_designation": null, "gene_start": null, "gene_end": null, "utr_5_prime_start": null, "utr_5_prime_end": null, "leader_1_start": null, "leader_1_end": null, "leader_2_start": null, "leader_2_end": null, "v_rs_start": null, "v_rs_end": null, "v_gene_delineations": [ { "sequence_delineation_id": "1", "delineation_scheme": "IMGT", "fwr1_start": 1, "fwr1_end": 78, "cdr1_start": 79, "cdr1_end": 114, "fwr2_start": 115, "fwr2_end": 165, "cdr2_start": 166, "cdr2_end": 195, "fwr3_start": 196, "fwr3_end": 312, "cdr3_start": 313, "alignment": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", 
"19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100", "101", "102", "103", "104" ] } ], "unrearranged_support": [], "rearranged_support": [], "paralogs": [], "notes": "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV5-3", "curational_tags": null }, { "allele_description_id": "OGRDB:A00314", "maintainer": "William Lees", "acknowledgements": [], "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_date": "24-Nov-2021", "release_description": "First release", "label": "IGHV-2ETO", "sequence": "CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC", "coding_sequence": "CAAGTTACTCTAAAAGAGTCTGGCCCT...GGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGC......ACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGAT.........GATGATAAGTACTATAACCCATCCCTGAAG...AGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC", "aliases": [ "watson_et_al:CAST_EiJ_IGHV8-2" ], "locus": "IGH", "chromosome": null, "sequence_type": "V", "functional": true, "inference_type": "rearranged_only", "species": "Mouse", "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "status": "active", "gene_designation": null, "subgroup_designation": null, "allele_designation": 
null, "gene_start": null, "gene_end": null, "utr_5_prime_start": null, "utr_5_prime_end": null, "leader_1_start": null, "leader_1_end": null, "leader_2_start": null, "leader_2_end": null, "v_rs_start": null, "v_rs_end": null, "v_gene_delineations": [ { "sequence_delineation_id": "1", "delineation_scheme": "IMGT", "fwr1_start": 1, "fwr1_end": 78, "cdr1_start": 79, "cdr1_end": 114, "fwr2_start": 115, "fwr2_end": 165, "cdr2_start": 166, "cdr2_end": 195, "fwr3_start": 196, "fwr3_end": 312, "cdr3_start": 313, "alignment": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100", "101", "102", "103", "104" ] } ], "unrearranged_support": [], "rearranged_support": [], "paralogs": [], "notes": "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV8-2", "curational_tags": null } ], "notes": "" }] } ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1661708083.0 airr-1.5.1/tests/data/bad_rearrangement.tsv0000644000076500000240000001346014302723463020277 0ustar00vandej27staffrearrangement_id rearrangement_set_id sequence_id wrong_name rev_comp productive sequence_alignment germline_alignment v_call d_call j_call c_call junction junction_length junction_aa v_score d_score j_score c_score v_cigar d_cigar j_cigar c_cigar v_identity v_evalue d_identity d_evalue j_identity j_evalue v_sequence_start v_sequence_end v_germline_start v_germline_end d_sequence_start d_sequence_end d_germline_start d_germline_end 
j_sequence_start j_sequence_end j_germline_start j_germline_end np1_length np2_length duplicate_count IVKNQEJ01BVGQ6 1 IVKNQEJ01BVGQ6 GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T T IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 430 16.4 75.8 22N1S275= 11N280S8= 6N292S32=1X9= 1 1E-122 1 2.7 0.9762 6E-18 0 275 0 317 279 287 10 18 291 333 5 47 4 4 1247 IVKNQEJ01AQVWS 1 IVKNQEJ01AQVWS GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T T IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 420 16.4 83.8 22N1S156=1X10=1X17=1X89= 11N280S8= 6N292S42= 0.9891 8E-120 1 2.7 1 2E-20 0 275 0 317 279 287 10 18 291 333 5 47 4 4 4 IVKNQEJ01AOYFZ 1 IVKNQEJ01AOYFZ GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T F IGHV4-31*03 IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGG 37 CASGVAGNF*LLX 430 20.4 83.8 22N1S275= 11N280S10= 6N293S42= 1 1E-122 1 0.17 1 2E-20 0 275 0 317 279 289 10 20 292 334 5 47 4 3 92 IVKNQEJ01EI5S4 1 IVKNQEJ01EI5S4 
GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T T IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 430 16.4 83.8 22N1S275= 11N280S8= 6N292S42= 1 1E-122 1 2.7 1 2E-20 0 275 0 317 279 287 10 18 291 333 5 47 4 4 2913 IVKNQEJ01DGRRI 1 IVKNQEJ01DGRRI GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGTCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T T IGHV4-34*09 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 389 16.4 83.8 22N1S23=2X85=1X15=1X1=1X3=1X2=1X1=1X5=1X6=1X118= 11N274S8= 6N286S42= 0.9628 2E-110 1 2.6 1 2E-20 0 269 0 317 273 281 10 18 285 327 5 47 4 4 1 IVKNQEJ01APN5N 1 IVKNQEJ01APN5N GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTAGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T F IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTAG 36 CASGVAGTFDY* 430 16.4 67.9 22N1S275= 11N280S8= 6N292S10=1X21=1X9= 1 1E-122 1 2.7 0.9524 1E-15 0 275 0 317 279 287 10 18 291 333 5 47 4 4 1 IVKNQEJ01B0TT2 1 IVKNQEJ01B0TT2 
GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T F IGHV4-31*03 IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGG 37 CASGVAGNF*LLX 430 20.4 75.8 22N1S275= 11N280S10= 6N293S32=1X9= 1 1E-122 1 0.17 0.9762 6E-18 0 275 0 317 279 289 10 20 292 334 5 47 4 3 30 IVKNQEJ01AIS74 1 IVKNQEJ01AIS74 GGCGCAGGACTGTTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGGCGGGGTGGCTGGTAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T F IGHV4-31*03 IGHD6-19*01 IGHJ4*02 TGTGCGAGGCGGGGTGGCTGGTAACTTTTGACTACTGG 38 CARRGGW*LLTTG 424 20.4 83.8 22N1S3=1X8=1X262= 11N281S10= 6N294S42= 0.9927 9E-121 1 0.17 1 2E-20 0 275 0 317 280 290 10 20 293 335 5 47 5 3 4 IVKNQEJ01AJ44V 1 IVKNQEJ01AJ44V GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T T IGHV4-59*06 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 386 16.4 75.8 22N1S45=1X5=2X6=1X3=1X5=1X22=1X4=1X1=1X1=1X165= 11N274S8= 6N286S32=1X9= 0.9625 2E-109 1 2.6 0.9762 5E-18 0 267 0 315 273 281 10 18 285 327 5 47 6 4 12 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/bad_repertoire.yaml0000644000076500000240000001715614627177010017762 0ustar00vandej27staff# # Example metadata # 
Repertoire: - repertoire_id: 1841923116114776551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: "Homo sapiens B and T cell repertoire - MZ twins" study_description: "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level." lab_name: "Mark M. 
Davis" lab_address: "Stanford University" submitted_by: "Florian Rubelt" pub_ids: "PMID:27005435" subject: subject_id: TW01A synthetic: false species: id: "NCBITaxon_9606" value: "Homo sapiens" sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 value: year linked_subjects: TW01B link_type: twin sample: - sample_id: TW01A_B_naive tissue: PBMC cell_subset: "Naive B cell" cell_phenotype: "expression of CD20 and the absence of CD27" cell_species: id: "NCBITaxon_9606" value: "Homo sapiens" single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: IGH sequencing_platform: "Illumina MiSeq" read_length: "300" sequencing_files: file_type: fastq filename: SRR2905656_R1.fastq.gz read_direction: forward paired_filename: SRR2905656_R2.fastq.gz paired_read_direction: reverse data_processing: - data_processing_id: 3059369183532618216-242ac11b-0001-007 analysis_provenance_id: 6623294219256599016-242ac11c-0001-012 - repertoire_id: 1602908186092376551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: "Homo sapiens B and T cell repertoire - MZ twins" study_description: "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. 
Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level." lab_name: "Mark M. Davis" lab_address: "Stanford University" submitted_by: "Florian Rubelt" pub_ids: "PMID:27005435" subject: subject_id: TW01A synthetic: false species: id: "NCBITaxon_9606" value: "Homo sapiens" sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 value: year linked_subjects: TW01B link_type: twin sample: - sample_id: TW01A_B_memory tissue: PBMC cell_subset: "Memory B cell" cell_phenotype: "expression of CD20 and CD27" cell_species: id: "NCBITaxon_9606" value: "Homo sapiens" single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: IGH sequencing_platform: "Illumina MiSeq" read_length: "300" sequencing_files: file_type: fastq filename: SRR2905655_R1.fastq.gz read_direction: forward paired_filename: SRR2905655_R2.fastq.gz paired_read_direction: reverse data_processing: - data_processing_id: 3059369183532618216-242ac11b-0001-007 analysis_provenance_id: 6623294219256599016-242ac11c-0001-012 - repertoire_id: 2366080924918616551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: "Homo sapiens B and T cell repertoire - MZ twins" study_description: "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. 
By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level." lab_name: "Mark M. Davis" lab_address: "Stanford University" submitted_by: "Florian Rubelt" pub_ids: "PMID:27005435" subject: subject_id: TW01A synthetic: false species: id: "NCBITaxon_9606" value: "Homo sapiens" sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 value: year linked_subjects: TW01B link_type: twin sample: - sample_id: TW01A_T_naive_CD4 tissue: PBMC cell_subset: "Naive CD4+ T cell" cell_phenotype: "expression of CD8 and absence of CD4 and CD45RO" cell_species: id: "NCBITaxon_9606" value: "Homo sapiens" single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: TRB sequencing_platform: "Illumina MiSeq" read_length: "300" sequencing_files: file_type: fastq filename: SRR2905659_R1.fastq.gz read_direction: forward paired_filename: SRR2905659_R2.fastq.gz paired_read_direction: reverse data_processing: - data_processing_id: 651223970338378216-242ac11b-0001-007 analysis_provenance_id: 4625424004665971176-242ac11c-0001-012 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1661708083.0 
airr-1.5.1/tests/data/extra_rearrangement.tsv0000644000076500000240000000032314302723463020666 0ustar00vandej27staffsequence_id sequence rev_comp productive v_call d_call j_call sequence_alignment germline_alignment junction junction junction_aa v_cigar d_cigar j_cigar 1 2 F F 5 6 7 8 9 10 11 12 13 14 15 not_in_header not_in ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/good_combined_airr.json0000644000076500000240000012157314627177010020607 0ustar00vandej27staff{ "Repertoire": [ { "repertoire_id": "1841923116114776551-242ac11c-0001-012", "study": { "study_id": "PRJNA300878", "study_title": "Homo sapiens B and T cell repertoire - MZ twins", "study_type": { "id": null, "label": null }, "study_description": "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level.", "study_contact": "Mark M. 
Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X", "inclusion_exclusion_criteria": null, "lab_name": "Mark M. Davis", "lab_address": "Stanford University", "submitted_by": "Florian Rubelt", "pub_ids": "PMID:27005435", "collected_by": null, "grants": null, "keywords_study": [ "contains_ig", "contains_tr" ] }, "subject": { "subject_id": "TW01A", "synthetic": false, "species": { "id": "NCBITaxon_9606", "label": "Homo sapiens" }, "sex": "female", "age_min": 27, "age_max": 27, "age_unit": { "id": "UO_0000036", "label": "year" }, "age_event": null, "ancestry_population": null, "ethnicity": null, "race": null, "strain_name": null, "linked_subjects": "TW01B", "link_type": "twin", "diagnosis": [ { "study_group_description": null, "disease_diagnosis": { "id": null, "label": null }, "disease_length": null, "disease_stage": null, "prior_therapies": null, "immunogen": null, "intervention": null, "medical_history": null } ], "genotype": { "receptor_genotype_set": { "receptor_genotype_set_id": "1", "genotype_class_list": [ { "receptor_genotype_id": "1", "locus": "IGH", "documented_alleles": [ { "label": "IGHV1-69*01", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 }, { "label": "IGHV1-69*02", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 2 } ], "undocumented_alleles": [ { "allele_name": "IGHD3-1*01_S1234", "sequence": "agtagtagtagt", "phasing": 1 } ], "deleted_genes": [ { "label": "IGHV3-30-3", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 } ], "inference_process": "repertoire_sequencing" } ] }, "mhc_genotype_set": { "mhc_genotype_set_id": "this is a unique identifier", "mhc_genotype_list": [ { "mhc_genotype_id": "unique", "mhc_class": "MHC-I", "mhc_genotyping_method": "pcr_low_resolution", "mhc_alleles": [ { "allele_designation": "01:01", "gene": { "id": "MRO-0000046", "label": "HLA-A" }, "reference_set_ref": "blah" } ] } ] } } }, "sample": [ { "sample_id": "TW01A_B_naive", "sample_processing_id": null, "sample_type": 
"peripheral venous puncture", "tissue": { "id": "UBERON_0000178", "label": "blood" }, "tissue_processing": "Ficoll gradient", "cell_subset": { "id": "CL_0000788", "label": "naive B cell" }, "cell_phenotype": "expression of CD20 and the absence of CD27", "cell_species": { "id": "NCBITaxon_9606", "label": "Homo sapiens" }, "single_cell": false, "cell_isolation": "FACS", "template_class": "RNA", "pcr_target": [ { "pcr_target_locus": "IGH", "forward_pcr_primer_target_location": null, "reverse_pcr_primer_target_location": null } ], "sequencing_platform": "Illumina MiSeq", "sequencing_files": { "sequencing_data_id": "SRR2905656", "file_type": "fastq", "filename": "SRR2905656_R1.fastq.gz", "read_direction": "forward", "read_length": 300, "paired_filename": "SRR2905656_R2.fastq.gz", "paired_read_direction": "reverse", "paired_read_length": 300 }, "anatomic_site": null, "disease_state_sample": null, "collection_time_point_relative": null, "collection_time_point_relative_unit": { "id": null, "label": null }, "collection_time_point_reference": null, "biomaterial_provider": null, "cell_number": null, "cells_per_reaction": null, "cell_storage": false, "cell_quality": null, "cell_processing_protocol": null, "template_quality": null, "template_amount": null, "template_amount_unit": { "id": null, "label": null }, "library_generation_method": "RT(oligo-dT)+PCR", "library_generation_protocol": null, "library_generation_kit_version": null, "complete_sequences": "partial", "physical_linkage": "none", "sequencing_run_id": null, "total_reads_passing_qc_filter": null, "sequencing_facility": null, "sequencing_run_date": null, "sequencing_kit": null } ], "data_processing": [ { "data_processing_id": "3059369183532618216-242ac11b-0001-007", "primary_annotation": true, "software_versions": null, "paired_reads_assembly": null, "quality_thresholds": null, "primer_match_cutoffs": null, "collapsing_method": null, "data_processing_protocols": null, "data_processing_files": null, 
"germline_database": null, "analysis_provenance_id": "6623294219256599016-242ac11c-0001-012" } ] }, { "repertoire_id": "1602908186092376551-242ac11c-0001-012", "study": { "study_id": "PRJNA300878", "study_title": "Homo sapiens B and T cell repertoire - MZ twins", "study_type": { "id": null, "label": null }, "study_description": "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level.", "study_contact": "Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X", "inclusion_exclusion_criteria": null, "lab_name": "Mark M. 
Davis", "lab_address": "Stanford University", "submitted_by": "Florian Rubelt", "pub_ids": "PMID:27005435", "collected_by": null, "grants": null, "keywords_study": [ "contains_ig", "contains_tr" ] }, "subject": { "subject_id": "TW01A", "synthetic": false, "species": { "id": "NCBITaxon_9606", "label": "Homo sapiens" }, "sex": "female", "age_min": 27, "age_max": 27, "age_unit": { "id": "UO_0000036", "label": "year" }, "age_event": null, "ancestry_population": null, "ethnicity": null, "race": null, "strain_name": null, "linked_subjects": "TW01B", "link_type": "twin", "diagnosis": [ { "study_group_description": null, "disease_diagnosis": { "id": null, "label": null }, "disease_length": null, "disease_stage": null, "prior_therapies": null, "immunogen": null, "intervention": null, "medical_history": null } ] }, "sample": [ { "sample_id": "TW01A_B_memory", "sample_processing_id": null, "sample_type": "peripheral venous puncture", "tissue": { "id": "UBERON_0000178", "label": "blood" }, "tissue_processing": "Ficoll gradient", "cell_subset": { "id": "CL_0000787", "label": "memory B cell" }, "cell_phenotype": "expression of CD20 and CD27", "cell_species": { "id": "NCBITaxon_9606", "label": "Homo sapiens" }, "single_cell": false, "cell_isolation": "FACS", "template_class": "RNA", "pcr_target": [ { "pcr_target_locus": "IGH", "forward_pcr_primer_target_location": null, "reverse_pcr_primer_target_location": null } ], "sequencing_platform": "Illumina MiSeq", "sequencing_files": { "sequencing_data_id": "SRR2905655", "file_type": "fastq", "filename": "SRR2905655_R1.fastq.gz", "read_direction": "forward", "read_length": 300, "paired_filename": "SRR2905655_R2.fastq.gz", "paired_read_direction": "reverse", "paired_read_length": 300 }, "anatomic_site": null, "disease_state_sample": null, "collection_time_point_relative": null, "collection_time_point_relative_unit": { "id": null, "label": null }, "collection_time_point_reference": null, "biomaterial_provider": null, "cell_number": null, 
"cells_per_reaction": null, "cell_storage": false, "cell_quality": null, "cell_processing_protocol": null, "template_quality": null, "template_amount": null, "template_amount_unit": { "id": null, "label": null }, "library_generation_method": "RT(oligo-dT)+PCR", "library_generation_protocol": null, "library_generation_kit_version": null, "complete_sequences": "partial", "physical_linkage": "none", "sequencing_run_id": null, "total_reads_passing_qc_filter": null, "sequencing_facility": null, "sequencing_run_date": null, "sequencing_kit": null } ], "data_processing": [ { "data_processing_id": "3059369183532618216-242ac11b-0001-007", "primary_annotation": true, "software_versions": null, "paired_reads_assembly": null, "quality_thresholds": null, "primer_match_cutoffs": null, "collapsing_method": null, "data_processing_protocols": null, "data_processing_files": null, "germline_database": null, "analysis_provenance_id": "6623294219256599016-242ac11c-0001-012" } ] }, { "repertoire_id": "2366080924918616551-242ac11c-0001-012", "study": { "study_id": "PRJNA300878", "study_title": "Homo sapiens B and T cell repertoire - MZ twins", "study_type": { "id": null, "label": null }, "study_description": "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. 
Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level.", "study_contact": "Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X", "inclusion_exclusion_criteria": null, "lab_name": "Mark M. Davis", "lab_address": "Stanford University", "submitted_by": "Florian Rubelt", "pub_ids": "PMID:27005435", "collected_by": null, "grants": null, "keywords_study": [ "contains_ig", "contains_tr" ] }, "subject": { "subject_id": "TW01A", "synthetic": false, "species": { "id": "NCBITaxon_9606", "label": "Homo sapiens" }, "sex": "female", "age_min": 27, "age_max": 27, "age_unit": { "id": "UO_0000036", "label": "year" }, "age_event": null, "ancestry_population": null, "ethnicity": null, "race": null, "strain_name": null, "linked_subjects": "TW01B", "link_type": "twin", "diagnosis": [ { "study_group_description": null, "disease_diagnosis": { "id": null, "label": null }, "disease_length": null, "disease_stage": null, "prior_therapies": null, "immunogen": null, "intervention": null, "medical_history": null } ] }, "sample": [ { "sample_id": "TW01A_T_naive_CD4", "sample_processing_id": null, "sample_type": "peripheral venous puncture", "tissue": { "id": "UBERON_0000178", "label": "blood" }, "tissue_processing": "Ficoll gradient", "cell_subset": { "id": "CL_0000895", "label": "naive thymus-derived CD4-positive, alpha-beta T cell" }, "cell_phenotype": "expression of CD8 and absence of CD4 and CD45RO", "cell_species": { "id": "NCBITaxon_9606", "label": "Homo sapiens" }, "single_cell": false, "cell_isolation": "FACS", "template_class": "RNA", "pcr_target": [ { "pcr_target_locus": "TRB", "forward_pcr_primer_target_location": null, "reverse_pcr_primer_target_location": null } ], "sequencing_platform": "Illumina 
MiSeq", "sequencing_files": { "sequencing_data_id": "SRR2905659", "file_type": "fastq", "filename": "SRR2905659_R1.fastq.gz", "read_direction": "forward", "read_length": 300, "paired_filename": "SRR2905659_R2.fastq.gz", "paired_read_direction": "reverse", "paired_read_length": 300 }, "anatomic_site": null, "disease_state_sample": null, "collection_time_point_relative": null, "collection_time_point_relative_unit": { "id": null, "label": null }, "collection_time_point_reference": null, "biomaterial_provider": null, "cell_number": null, "cells_per_reaction": null, "cell_storage": false, "cell_quality": null, "cell_processing_protocol": null, "template_quality": null, "template_amount": null, "template_amount_unit": { "id": null, "label": null }, "library_generation_method": "RT(oligo-dT)+PCR", "library_generation_protocol": null, "library_generation_kit_version": null, "complete_sequences": "partial", "physical_linkage": "none", "sequencing_run_id": null, "total_reads_passing_qc_filter": null, "sequencing_facility": null, "sequencing_run_date": null, "sequencing_kit": null } ], "data_processing": [ { "data_processing_id": "651223970338378216-242ac11b-0001-007", "primary_annotation": true, "software_versions": null, "paired_reads_assembly": null, "quality_thresholds": null, "primer_match_cutoffs": null, "collapsing_method": null, "data_processing_protocols": null, "data_processing_files": null, "germline_database": null, "analysis_provenance_id": "4625424004665971176-242ac11c-0001-012" } ] } ], "GermlineSet": [{ "germline_set_id": "OGRDB:G00007", "author": "William Lees", "lab_name": "", "lab_address": "Birkbeck College, University of London, Malet Street, London", "acknowledgements": [], "release_version": 1, "release_description": "", "release_date": "2021-11-24", "germline_set_name": "CAST IGH", "germline_set_ref": "OGRDB:G00007.1", "pub_ids": "", "species": { "id": "NCBITAXON:10090", "label": "Mus musculus" }, "species_subgroup": "CAST_EiJ", 
"species_subgroup_type": "strain", "locus": "IGH", "allele_descriptions": [ { "allele_description_id": "OGRDB:A00301", "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2DBF", "maintainer": "William Lees", "acknowledgements": [], "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_date": "24-Nov-2021", "release_description": "First release", "label": "IGHV-2DBF", "sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "coding_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "aliases": [ "watson_et_al:CAST_EiJ_IGHV5-3" ], "locus": "IGH", "chromosome": null, "sequence_type": "V", "functional": true, "inference_type": "rearranged_only", "species": { "id": "NCBITAXON:10090", "label": "Mus musculus" }, "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "status": "active", "gene_designation": null, "subgroup_designation": null, "allele_designation": null, "gene_start": null, "gene_end": null, "utr_5_prime_start": null, "utr_5_prime_end": null, "leader_1_start": null, "leader_1_end": null, "leader_2_start": null, "leader_2_end": null, "v_rs_start": null, "v_rs_end": null, "v_gene_delineations": [ { "sequence_delineation_id": "1", "delineation_scheme": "IMGT", "aligned_sequence": 
"GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "unaligned_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "fwr1_start": 1, "fwr1_end": 75, "cdr1_start": 76, "cdr1_end": 110, "fwr2_start": 111, "fwr2_end": 150, "cdr2_start": 151, "cdr2_end": 160, "fwr3_start": 161, "fwr3_end": 294, "cdr3_start": 295, "alignment": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100", "101", "102", "103", "104" ] } ], "unrearranged_support": [], "rearranged_support": [], "paralogs": [], "curation": "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV5-3", "curational_tags": null }, { "allele_description_id": "OGRDB:A00314", "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2ETO", "maintainer": "William Lees", "acknowledgements": [], "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_date": "24-Nov-2021", "release_description": "First release", 
"label": "IGHV-2ETO", "sequence": "CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC", "coding_sequence": "CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC", "aliases": [ "watson_et_al:CAST_EiJ_IGHV8-2" ], "locus": "IGH", "chromosome": null, "sequence_type": "V", "functional": true, "inference_type": "rearranged_only", "species": { "id": "NCBITAXON:10090", "label": "Mus musculus" }, "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "status": "active", "gene_designation": null, "subgroup_designation": null, "allele_designation": null, "gene_start": null, "gene_end": null, "utr_5_prime_start": null, "utr_5_prime_end": null, "leader_1_start": null, "leader_1_end": null, "leader_2_start": null, "leader_2_end": null, "v_rs_start": null, "v_rs_end": null, "v_gene_delineations": [ { "sequence_delineation_id": "1", "delineation_scheme": "IMGT", "aligned_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "unaligned_sequence": 
"GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "fwr1_start": 1, "fwr1_end": 75, "cdr1_start": 76, "cdr1_end": 110, "fwr2_start": 111, "fwr2_end": 150, "cdr2_start": 151, "cdr2_end": 160, "fwr3_start": 161, "fwr3_end": 294, "cdr3_start": 295, "alignment": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100", "101", "102", "103", "104" ] } ], "unrearranged_support": [], "rearranged_support": [], "paralogs": [], "curation": "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV8-2", "curational_tags": null } ], "curation": null }], "GenotypeSet": [{ "receptor_genotype_set_id": "1", "genotype_class_list": [ { "receptor_genotype_id": "1", "locus": "IGH", "documented_alleles": [ { "label": "IGHV1-69*01", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 }, { "label": "IGHV1-69*02", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 2 } ], "undocumented_alleles": [ { "allele_name": "IGHD3-1*01_S1234", "sequence": "agtagtagtagt", "phasing": 1 } ], "deleted_genes": [ { "label": "IGHV3-30-3", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 } ], "inference_process": "repertoire_sequencing" } ] }] } 
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/good_combined_airr.yaml0000644000076500000240000006110214627177010020567 0ustar00vandej27staffRepertoire: - repertoire_id: 1841923116114776551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: Homo sapiens B and T cell repertoire - MZ twins study_type: id: label: study_description: The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level. study_contact: Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X inclusion_exclusion_criteria: lab_name: Mark M. 
Davis lab_address: Stanford University submitted_by: Florian Rubelt pub_ids: PMID:27005435 collected_by: grants: keywords_study: - contains_ig - contains_tr subject: subject_id: TW01A synthetic: false species: id: NCBITaxon_9606 label: Homo sapiens sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 label: year age_event: ancestry_population: ethnicity: race: strain_name: linked_subjects: TW01B link_type: twin diagnosis: - study_group_description: disease_diagnosis: id: label: disease_length: disease_stage: prior_therapies: immunogen: intervention: medical_history: genotype: receptor_genotype_set: receptor_genotype_set_id: '1' genotype_class_list: - receptor_genotype_id: '1' locus: IGH documented_alleles: - label: IGHV1-69*01 germline_set_ref: IMGT:Homo sapiens:2022.1.31 phasing: 1 - label: IGHV1-69*02 germline_set_ref: IMGT:Homo sapiens:2022.1.31 phasing: 2 undocumented_alleles: - allele_name: IGHD3-1*01_S1234 sequence: agtagtagtagt phasing: 1 deleted_genes: - label: IGHV3-30-3 germline_set_ref: IMGT:Homo sapiens:2022.1.31 phasing: 1 inference_process: repertoire_sequencing mhc_genotype_set: mhc_genotype_set_id: "this is a unique identifier" mhc_genotype_list: - mhc_genotype_id: unique mhc_class: MHC-I mhc_genotyping_method: pcr_low_resolution mhc_alleles: - allele_designation: "01:01" gene: id: "MRO-0000046" label: "HLA-A" reference_set_ref: blah sample: - sample_id: TW01A_B_naive sample_processing_id: sample_type: peripheral venous puncture tissue: id: UBERON_0000178 label: blood tissue_processing: Ficoll gradient cell_subset: id: CL_0000788 label: naive B cell cell_phenotype: expression of CD20 and the absence of CD27 cell_species: id: NCBITaxon_9606 label: Homo sapiens single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: IGH forward_pcr_primer_target_location: reverse_pcr_primer_target_location: sequencing_platform: Illumina MiSeq sequencing_files: sequencing_data_id: SRR2905656 file_type: fastq filename: 
SRR2905656_R1.fastq.gz read_direction: forward read_length: 300 paired_filename: SRR2905656_R2.fastq.gz paired_read_direction: reverse paired_read_length: 300 anatomic_site: disease_state_sample: collection_time_point_relative: collection_time_point_relative_unit: id: label: collection_time_point_reference: biomaterial_provider: cell_number: cells_per_reaction: cell_storage: false cell_quality: cell_processing_protocol: template_quality: template_amount: template_amount_unit: id: label: library_generation_method: RT(oligo-dT)+PCR library_generation_protocol: library_generation_kit_version: complete_sequences: partial physical_linkage: none sequencing_run_id: total_reads_passing_qc_filter: sequencing_facility: sequencing_run_date: sequencing_kit: data_processing: - data_processing_id: 3059369183532618216-242ac11b-0001-007 primary_annotation: true software_versions: paired_reads_assembly: quality_thresholds: primer_match_cutoffs: collapsing_method: data_processing_protocols: data_processing_files: germline_database: analysis_provenance_id: 6623294219256599016-242ac11c-0001-012 - repertoire_id: 1602908186092376551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: Homo sapiens B and T cell repertoire - MZ twins study_type: id: label: study_description: The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. 
We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level. study_contact: Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X inclusion_exclusion_criteria: lab_name: Mark M. Davis lab_address: Stanford University submitted_by: Florian Rubelt pub_ids: PMID:27005435 collected_by: grants: keywords_study: - contains_ig - contains_tr subject: subject_id: TW01A synthetic: false species: id: NCBITaxon_9606 label: Homo sapiens sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 label: year age_event: ancestry_population: ethnicity: race: strain_name: linked_subjects: TW01B link_type: twin diagnosis: - study_group_description: disease_diagnosis: id: label: disease_length: disease_stage: prior_therapies: immunogen: intervention: medical_history: sample: - sample_id: TW01A_B_memory sample_processing_id: sample_type: peripheral venous puncture tissue: id: UBERON_0000178 label: blood tissue_processing: Ficoll gradient cell_subset: id: CL_0000787 label: memory B cell cell_phenotype: expression of CD20 and CD27 cell_species: id: NCBITaxon_9606 label: Homo sapiens single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: IGH forward_pcr_primer_target_location: reverse_pcr_primer_target_location: sequencing_platform: Illumina MiSeq sequencing_files: sequencing_data_id: SRR2905655 file_type: fastq filename: SRR2905655_R1.fastq.gz read_direction: forward read_length: 300 paired_filename: SRR2905655_R2.fastq.gz paired_read_direction: reverse paired_read_length: 300 anatomic_site: disease_state_sample: collection_time_point_relative: 
collection_time_point_relative_unit: id: label: collection_time_point_reference: biomaterial_provider: cell_number: cells_per_reaction: cell_storage: false cell_quality: cell_processing_protocol: template_quality: template_amount: template_amount_unit: id: label: library_generation_method: RT(oligo-dT)+PCR library_generation_protocol: library_generation_kit_version: complete_sequences: partial physical_linkage: none sequencing_run_id: total_reads_passing_qc_filter: sequencing_facility: sequencing_run_date: sequencing_kit: data_processing: - data_processing_id: 3059369183532618216-242ac11b-0001-007 primary_annotation: true software_versions: paired_reads_assembly: quality_thresholds: primer_match_cutoffs: collapsing_method: data_processing_protocols: data_processing_files: germline_database: analysis_provenance_id: 6623294219256599016-242ac11c-0001-012 - repertoire_id: 2366080924918616551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: Homo sapiens B and T cell repertoire - MZ twins study_type: id: label: study_description: The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. 
Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level. study_contact: Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X inclusion_exclusion_criteria: lab_name: Mark M. Davis lab_address: Stanford University submitted_by: Florian Rubelt pub_ids: PMID:27005435 collected_by: grants: keywords_study: - contains_ig - contains_tr subject: subject_id: TW01A synthetic: false species: id: NCBITaxon_9606 label: Homo sapiens sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 label: year age_event: ancestry_population: ethnicity: race: strain_name: linked_subjects: TW01B link_type: twin diagnosis: - study_group_description: disease_diagnosis: id: label: disease_length: disease_stage: prior_therapies: immunogen: intervention: medical_history: sample: - sample_id: TW01A_T_naive_CD4 sample_processing_id: sample_type: peripheral venous puncture tissue: id: UBERON_0000178 label: blood tissue_processing: Ficoll gradient cell_subset: id: CL_0000895 label: naive thymus-derived CD4-positive, alpha-beta T cell cell_phenotype: expression of CD8 and absence of CD4 and CD45RO cell_species: id: NCBITaxon_9606 label: Homo sapiens single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: TRB forward_pcr_primer_target_location: reverse_pcr_primer_target_location: sequencing_platform: Illumina MiSeq sequencing_files: sequencing_data_id: SRR2905659 file_type: fastq filename: SRR2905659_R1.fastq.gz read_direction: forward read_length: 300 paired_filename: SRR2905659_R2.fastq.gz paired_read_direction: reverse paired_read_length: 300 anatomic_site: disease_state_sample: collection_time_point_relative: collection_time_point_relative_unit: id: label: 
collection_time_point_reference: biomaterial_provider: cell_number: cells_per_reaction: cell_storage: false cell_quality: cell_processing_protocol: template_quality: template_amount: template_amount_unit: id: label: library_generation_method: RT(oligo-dT)+PCR library_generation_protocol: library_generation_kit_version: complete_sequences: partial physical_linkage: none sequencing_run_id: total_reads_passing_qc_filter: sequencing_facility: sequencing_run_date: sequencing_kit: data_processing: - data_processing_id: 651223970338378216-242ac11b-0001-007 primary_annotation: true software_versions: paired_reads_assembly: quality_thresholds: primer_match_cutoffs: collapsing_method: data_processing_protocols: data_processing_files: germline_database: analysis_provenance_id: 4625424004665971176-242ac11c-0001-012 GermlineSet: - acknowledgements: [] allele_descriptions: - acknowledgements: [] aliases: - watson_et_al:CAST_EiJ_IGHV5-3 allele_description_id: OGRDB:A00301 allele_description_ref: OGRDB:Mouse_IGH:IGHV-2DBF allele_designation: null chromosome: null coding_sequence: GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA curation: 'Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV5-3' curational_tags: null functional: true gene_designation: null gene_end: null gene_start: null inference_type: rearranged_only lab_address: Birkbeck College, University of London, Malet Street, London label: IGHV-2DBF leader_1_end: null leader_1_start: null leader_2_end: null leader_2_start: null locus: IGH maintainer: William Lees paralogs: [] rearranged_support: [] release_date: 24-Nov-2021 release_description: First release release_version: 1 sequence: 
GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA sequence_type: V species: id: NCBITAXON:10090 label: Mus musculus species_subgroup: CAST_EiJ species_subgroup_type: strain status: active subgroup_designation: null unrearranged_support: [] utr_5_prime_end: null utr_5_prime_start: null v_gene_delineations: - aligned_sequence: GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA alignment: - '1' - '2' - '3' - '4' - '5' - '6' - '7' - '8' - '9' - '10' - '11' - '12' - '13' - '14' - '15' - '16' - '17' - '18' - '19' - '20' - '21' - '22' - '23' - '24' - '25' - '26' - '27' - '28' - '29' - '30' - '31' - '32' - '33' - '34' - '35' - '36' - '37' - '38' - '39' - '40' - '41' - '42' - '43' - '44' - '45' - '46' - '47' - '48' - '49' - '50' - '51' - '52' - '53' - '54' - '55' - '56' - '57' - '58' - '59' - '60' - '61' - '62' - '63' - '64' - '65' - '66' - '67' - '68' - '69' - '70' - '71' - '72' - '73' - '74' - '75' - '76' - '77' - '78' - '79' - '80' - '81' - '82' - '83' - '84' - '85' - '86' - '87' - '88' - '89' - '90' - '91' - '92' - '93' - '94' - '95' - '96' - '97' - '98' - '99' - '100' - '101' - '102' - '103' - '104' cdr1_end: 110 cdr1_start: 76 cdr2_end: 160 cdr2_start: 151 cdr3_start: 295 delineation_scheme: IMGT fwr1_end: 75 fwr1_start: 1 fwr2_end: 150 fwr2_start: 111 fwr3_end: 294 fwr3_start: 161 sequence_delineation_id: '1' unaligned_sequence: 
GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA v_rs_end: null v_rs_start: null - acknowledgements: [] aliases: - watson_et_al:CAST_EiJ_IGHV8-2 allele_description_id: OGRDB:A00314 allele_description_ref: OGRDB:Mouse_IGH:IGHV-2ETO allele_designation: null chromosome: null coding_sequence: CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC curation: 'Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV8-2' curational_tags: null functional: true gene_designation: null gene_end: null gene_start: null inference_type: rearranged_only lab_address: Birkbeck College, University of London, Malet Street, London label: IGHV-2ETO leader_1_end: null leader_1_start: null leader_2_end: null leader_2_start: null locus: IGH maintainer: William Lees paralogs: [] rearranged_support: [] release_date: 24-Nov-2021 release_description: First release release_version: 1 sequence: CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC sequence_type: V species: id: NCBITAXON:10090 label: Mus musculus species_subgroup: CAST_EiJ species_subgroup_type: strain status: active subgroup_designation: null unrearranged_support: [] utr_5_prime_end: null utr_5_prime_start: null v_gene_delineations: - aligned_sequence: 
GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA alignment: - '1' - '2' - '3' - '4' - '5' - '6' - '7' - '8' - '9' - '10' - '11' - '12' - '13' - '14' - '15' - '16' - '17' - '18' - '19' - '20' - '21' - '22' - '23' - '24' - '25' - '26' - '27' - '28' - '29' - '30' - '31' - '32' - '33' - '34' - '35' - '36' - '37' - '38' - '39' - '40' - '41' - '42' - '43' - '44' - '45' - '46' - '47' - '48' - '49' - '50' - '51' - '52' - '53' - '54' - '55' - '56' - '57' - '58' - '59' - '60' - '61' - '62' - '63' - '64' - '65' - '66' - '67' - '68' - '69' - '70' - '71' - '72' - '73' - '74' - '75' - '76' - '77' - '78' - '79' - '80' - '81' - '82' - '83' - '84' - '85' - '86' - '87' - '88' - '89' - '90' - '91' - '92' - '93' - '94' - '95' - '96' - '97' - '98' - '99' - '100' - '101' - '102' - '103' - '104' cdr1_end: 110 cdr1_start: 76 cdr2_end: 160 cdr2_start: 151 cdr3_start: 295 delineation_scheme: IMGT fwr1_end: 75 fwr1_start: 1 fwr2_end: 150 fwr2_start: 111 fwr3_end: 294 fwr3_start: 161 sequence_delineation_id: '1' unaligned_sequence: GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA v_rs_end: null v_rs_start: null author: William Lees curation: null germline_set_id: OGRDB:G00007 germline_set_name: CAST IGH germline_set_ref: OGRDB:G00007.1 lab_address: Birkbeck College, University of London, Malet Street, London lab_name: '' locus: IGH pub_ids: '' release_date: '2021-11-24' release_description: '' release_version: 1 species: id: NCBITAXON:10090 label: Mus musculus species_subgroup: 
CAST_EiJ species_subgroup_type: strain GenotypeSet: - receptor_genotype_set_id: '1' genotype_class_list: - receptor_genotype_id: '1' locus: IGH documented_alleles: - label: IGHV1-69*01 germline_set_ref: IMGT:Homo sapiens:2022.1.31 phasing: 1 - label: IGHV1-69*02 germline_set_ref: IMGT:Homo sapiens:2022.1.31 phasing: 2 undocumented_alleles: - allele_name: IGHD3-1*01_S1234 sequence: agtagtagtagt phasing: 1 deleted_genes: - label: IGHV3-30-3 germline_set_ref: IMGT:Homo sapiens:2022.1.31 phasing: 1 inference_process: repertoire_sequencing ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/good_genotype_set.json0000644000076500000240000000235414627177010020512 0ustar00vandej27staff{ "GenotypeSet": [{ "receptor_genotype_set_id": "1", "genotype_class_list": [ { "receptor_genotype_id": "1", "locus": "IGH", "documented_alleles": [ { "label": "IGHV1-69*01", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 }, { "label": "IGHV1-69*02", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 2 } ], "undocumented_alleles": [ { "allele_name": "IGHD3-1*01_S1234", "sequence": "agtagtagtagt", "phasing": 1 } ], "deleted_genes": [ { "label": "IGHV3-30-3", "germline_set_ref": "IMGT:Homo sapiens:2022.1.31", "phasing": 1 } ], "inference_process": "repertoire_sequencing" } ] }] }././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/good_germline_set.json0000644000076500000240000003641014627177010020462 0ustar00vandej27staff{ "GermlineSet": [{ "germline_set_id": "OGRDB:G00007", "author": "William Lees", "lab_name": "", "lab_address": "Birkbeck College, University of London, Malet Street, London", "acknowledgements": [], "release_version": 1, "release_description": "", "release_date": "2021-11-24", "germline_set_name": "CAST IGH", "germline_set_ref": "OGRDB:G00007.1", "pub_ids": "", "species": { "id": "NCBITAXON:10090", "label": "Mus musculus" 
}, "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "locus": "IGH", "allele_descriptions": [ { "allele_description_id": "OGRDB:A00301", "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2DBF", "maintainer": "William Lees", "acknowledgements": [], "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_date": "24-Nov-2021", "release_description": "First release", "label": "IGHV-2DBF", "sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "coding_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "aliases": [ "watson_et_al:CAST_EiJ_IGHV5-3" ], "locus": "IGH", "chromosome": null, "sequence_type": "V", "functional": true, "inference_type": "rearranged_only", "species": { "id": "NCBITAXON:10090", "label": "Mus musculus" }, "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "status": "active", "gene_designation": null, "subgroup_designation": null, "allele_designation": null, "gene_start": null, "gene_end": null, "utr_5_prime_start": null, "utr_5_prime_end": null, "leader_1_start": null, "leader_1_end": null, "leader_2_start": null, "leader_2_end": null, "v_rs_start": null, "v_rs_end": null, "v_gene_delineations": [ { "sequence_delineation_id": "1", "delineation_scheme": "IMGT", "aligned_sequence": 
"GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "unaligned_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "fwr1_start": 1, "fwr1_end": 75, "cdr1_start": 76, "cdr1_end": 110, "fwr2_start": 111, "fwr2_end": 150, "cdr2_start": 151, "cdr2_end": 160, "fwr3_start": 161, "fwr3_end": 294, "cdr3_start": 295, "alignment": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100", "101", "102", "103", "104" ] } ], "unrearranged_support": [], "rearranged_support": [], "paralogs": [], "curation": "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV5-3", "curational_tags": null }, { "allele_description_id": "OGRDB:A00314", "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2ETO", "maintainer": "William Lees", "acknowledgements": [], "lab_address": "Birkbeck College, University of London, Malet Street, London", "release_version": 1, "release_date": "24-Nov-2021", "release_description": "First release", 
"label": "IGHV-2ETO", "sequence": "CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC", "coding_sequence": "CAAGTTACTCTAAAAGAGTCTGGCCCTGGGATATTGAAGCCCTCACAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTACTAATATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCATCCCTGAAGAGCCGGCTAACAATCTCCAAGGATACCTCCAGAAACCAGGTATTCCTCAAGATCACCAGTGTGGACACTGCAGATACTGCCACTTACTACTGTGCTC", "aliases": [ "watson_et_al:CAST_EiJ_IGHV8-2" ], "locus": "IGH", "chromosome": null, "sequence_type": "V", "functional": true, "inference_type": "rearranged_only", "species": { "id": "NCBITAXON:10090", "label": "Mus musculus" }, "species_subgroup": "CAST_EiJ", "species_subgroup_type": "strain", "status": "active", "gene_designation": null, "subgroup_designation": null, "allele_designation": null, "gene_start": null, "gene_end": null, "utr_5_prime_start": null, "utr_5_prime_end": null, "leader_1_start": null, "leader_1_end": null, "leader_2_start": null, "leader_2_end": null, "v_rs_start": null, "v_rs_end": null, "v_gene_delineations": [ { "sequence_delineation_id": "1", "delineation_scheme": "IMGT", "aligned_sequence": "GAAGTGAAGCTGGTGGAGTCTGAGGGA...GGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTC............AGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "unaligned_sequence": 
"GAAGTGAAGCTGGTGGAGTCTGAGGGAGGCTTAGTGCAGCCTGGAAGTTCCATGAAACTCTCCTGCACAGCCTCTGGATTCACTTTCAGTGACTATTACATGGCTTGGGTCCGCCAGGTTCCAGAAAAGGGTCTAGAATGGGTTGCAAACATTAATTATGAT......GGTAGTGGCACCTACTATCTGGACTCCTTGAAG...AGCCGTTTCATCATCTCGAGAGACAATGCAAAGAACATTCTATACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCACGTATTACTGTGCAA", "fwr1_start": 1, "fwr1_end": 75, "cdr1_start": 76, "cdr1_end": 110, "fwr2_start": 111, "fwr2_end": 150, "cdr2_start": 151, "cdr2_end": 160, "fwr3_start": 161, "fwr3_end": 294, "cdr3_start": 295, "alignment": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100", "101", "102", "103", "104" ] } ], "unrearranged_support": [], "rearranged_support": [], "paralogs": [], "curation": "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV8-2", "curational_tags": null } ], "curation": null }] } ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1661708083.0 airr-1.5.1/tests/data/good_rearrangement.tsv0000644000076500000240000001345614302723463020506 0ustar00vandej27staffrearrangement_id rearrangement_set_id sequence_id sequence rev_comp productive sequence_alignment germline_alignment v_call d_call j_call c_call junction junction_length junction_aa v_score d_score j_score c_score v_cigar d_cigar j_cigar c_cigar v_identity v_evalue d_identity d_evalue j_identity j_evalue v_sequence_start v_sequence_end v_germline_start v_germline_end d_sequence_start d_sequence_end d_germline_start d_germline_end 
j_sequence_start j_sequence_end j_germline_start j_germline_end np1_length np2_length duplicate_count IVKNQEJ01BVGQ6 1 IVKNQEJ01BVGQ6 GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T T IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 430 16.4 75.8 22N1S275= 11N280S8= 6N292S32=1X9= 1 1E-122 1 2.7 0.9762 6E-18 0 275 0 317 279 287 10 18 291 333 5 47 4 4 1247 IVKNQEJ01AQVWS 1 IVKNQEJ01AQVWS GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T T IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 420 16.4 83.8 22N1S156=1X10=1X17=1X89= 11N280S8= 6N292S42= 0.9891 8E-120 1 2.7 1 2E-20 0 275 0 317 279 287 10 18 291 333 5 47 4 4 4 IVKNQEJ01AOYFZ 1 IVKNQEJ01AOYFZ GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T F IGHV4-31*03 IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGG 37 CASGVAGNF*LLX 430 20.4 83.8 22N1S275= 11N280S10= 6N293S42= 1 1E-122 1 0.17 1 2E-20 0 275 0 317 279 289 10 20 292 334 5 47 4 3 92 IVKNQEJ01EI5S4 1 IVKNQEJ01EI5S4 
GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T T IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 430 16.4 83.8 22N1S275= 11N280S8= 6N292S42= 1 1E-122 1 2.7 1 2E-20 0 275 0 317 279 287 10 18 291 333 5 47 4 4 2913 IVKNQEJ01DGRRI 1 IVKNQEJ01DGRRI GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGTCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T T IGHV4-34*09 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 389 16.4 83.8 22N1S23=2X85=1X15=1X1=1X3=1X2=1X1=1X5=1X6=1X118= 11N274S8= 6N286S42= 0.9628 2E-110 1 2.6 1 2E-20 0 269 0 317 273 281 10 18 285 327 5 47 4 4 1 IVKNQEJ01APN5N 1 IVKNQEJ01APN5N GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTAGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T F IGHV4-31*03 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTAG 36 CASGVAGTFDY* 430 16.4 67.9 22N1S275= 11N280S8= 6N292S10=1X21=1X9= 1 1E-122 1 2.7 0.9524 1E-15 0 275 0 317 279 287 10 18 291 333 5 47 4 4 1 IVKNQEJ01B0TT2 1 IVKNQEJ01B0TT2 
GGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T F IGHV4-31*03 IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGTAACTTTTGACTACTGG 37 CASGVAGNF*LLX 430 20.4 75.8 22N1S275= 11N280S10= 6N293S32=1X9= 1 1E-122 1 0.17 0.9762 6E-18 0 275 0 317 279 289 10 20 292 334 5 47 4 3 30 IVKNQEJ01AIS74 1 IVKNQEJ01AIS74 GGCGCAGGACTGTTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGGCGGGGTGGCTGGTAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA T F IGHV4-31*03 IGHD6-19*01 IGHJ4*02 TGTGCGAGGCGGGGTGGCTGGTAACTTTTGACTACTGG 38 CARRGGW*LLTTG 424 20.4 83.8 22N1S3=1X8=1X262= 11N281S10= 6N294S42= 0.9927 9E-121 1 0.17 1 2E-20 0 275 0 317 280 290 10 20 293 335 5 47 5 3 4 IVKNQEJ01AJ44V 1 IVKNQEJ01AJ44V GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGGGGCCAGGGAACCCTGGTCACTGTCTCCTCA T T IGHV4-59*06 IGHD1-7*01,IGHD6-19*01 IGHJ4*02 TGTGCGAGCGGGGTGGCTGGAACTTTTGACTACTGG 36 CASGVAGTFDYW 386 16.4 75.8 22N1S45=1X5=2X6=1X3=1X5=1X22=1X4=1X1=1X1=1X165= 11N274S8= 6N286S32=1X9= 0.9625 2E-109 1 2.6 0.9762 5E-18 0 267 0 315 273 281 10 18 285 327 5 47 6 4 12 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1717370376.0 airr-1.5.1/tests/data/good_repertoire.yaml0000644000076500000240000003556514627177010020170 0ustar00vandej27staff# # Example metadata # 
Repertoire: - repertoire_id: 1841923116114776551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: "Homo sapiens B and T cell repertoire - MZ twins" study_type: id: null label: null study_description: "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level." study_contact: "Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X" inclusion_exclusion_criteria: null lab_name: "Mark M. 
Davis" lab_address: "Stanford University" submitted_by: "Florian Rubelt" pub_ids: "PMID:27005435" collected_by: null grants: null keywords_study: - "contains_ig" - "contains_tr" subject: subject_id: TW01A synthetic: false species: id: "NCBITaxon_9606" label: "Homo sapiens" sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 label: year age_event: null ancestry_population: null ethnicity: null race: null strain_name: null linked_subjects: TW01B link_type: twin diagnosis: - study_group_description: null disease_diagnosis: id: null label: null disease_length: null disease_stage: null prior_therapies: null immunogen: null intervention: null medical_history: null sample: - sample_id: TW01A_B_naive sample_processing_id: null sample_type: "peripheral venous puncture" tissue: id: "UBERON_0000178" label: "blood" tissue_processing: "Ficoll gradient" cell_subset: id: "CL_0000788" label: "naive B cell" cell_phenotype: "expression of CD20 and the absence of CD27" cell_species: id: "NCBITaxon_9606" label: "Homo sapiens" single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: IGH forward_pcr_primer_target_location: null reverse_pcr_primer_target_location: null sequencing_platform: "Illumina MiSeq" sequencing_files: sequencing_data_id: SRA:SRR2905656 file_type: fastq filename: SRR2905656_R1.fastq.gz read_direction: forward read_length: 300 paired_filename: SRR2905656_R2.fastq.gz paired_read_direction: reverse paired_read_length: 300 index_filename: SRR2905656_R3.fastq.gz index_length: 8 anatomic_site: null disease_state_sample: null collection_time_point_relative: null collection_time_point_relative_unit: id: null label: null collection_time_point_reference: null biomaterial_provider: null cell_number: null cells_per_reaction: null cell_storage: false cell_quality: null cell_processing_protocol: null template_quality: null template_amount: null template_amount_unit: id: null label: null library_generation_method: "RT(oligo-dT)+PCR" 
library_generation_protocol: null library_generation_kit_version: null complete_sequences: "partial" physical_linkage: "none" sequencing_run_id: null total_reads_passing_qc_filter: null sequencing_facility: null sequencing_run_date: null sequencing_kit: null data_processing: - data_processing_id: 3059369183532618216-242ac11b-0001-007 primary_annotation: true software_versions: null paired_reads_assembly: null quality_thresholds: null primer_match_cutoffs: null collapsing_method: null data_processing_protocols: null data_processing_files: null germline_database: null analysis_provenance_id: 6623294219256599016-242ac11c-0001-012 - repertoire_id: 1602908186092376551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: "Homo sapiens B and T cell repertoire - MZ twins" study_type: id: null label: null study_description: "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level." study_contact: "Mark M. 
Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X" inclusion_exclusion_criteria: null lab_name: "Mark M. Davis" lab_address: "Stanford University" submitted_by: "Florian Rubelt" pub_ids: "PMID:27005435" collected_by: null grants: null keywords_study: - "contains_ig" - "contains_tr" subject: subject_id: TW01A synthetic: false species: id: "NCBITaxon_9606" label: "Homo sapiens" sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 label: year age_event: null ancestry_population: null ethnicity: null race: null strain_name: null linked_subjects: TW01B link_type: twin diagnosis: - study_group_description: null disease_diagnosis: id: null label: null disease_length: null disease_stage: null prior_therapies: null immunogen: null intervention: null medical_history: null sample: - sample_id: TW01A_B_memory sample_processing_id: null sample_type: "peripheral venous puncture" tissue: id: "UBERON_0000178" label: "blood" tissue_processing: "Ficoll gradient" cell_subset: id: "CL_0000787" label: "memory B cell" cell_phenotype: "expression of CD20 and CD27" cell_species: id: "NCBITaxon_9606" label: "Homo sapiens" single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: IGH forward_pcr_primer_target_location: null reverse_pcr_primer_target_location: null sequencing_platform: "Illumina MiSeq" sequencing_files: sequencing_data_id: SRA:SRR2905655 file_type: fastq filename: SRR2905655_R1.fastq.gz read_direction: forward read_length: 300 paired_filename: SRR2905655_R2.fastq.gz paired_read_direction: reverse paired_read_length: 300 index_filename: SRR2905655_R3.fastq.gz index_length: 8 anatomic_site: null disease_state_sample: null collection_time_point_relative: null collection_time_point_relative_unit: id: null label: null collection_time_point_reference: null biomaterial_provider: null cell_number: null cells_per_reaction: null cell_storage: false cell_quality: null cell_processing_protocol: null template_quality: null template_amount: 
null template_amount_unit: id: null label: null library_generation_method: "RT(oligo-dT)+PCR" library_generation_protocol: null library_generation_kit_version: null complete_sequences: "partial" physical_linkage: "none" sequencing_run_id: null total_reads_passing_qc_filter: null sequencing_facility: null sequencing_run_date: null sequencing_kit: null data_processing: - data_processing_id: 3059369183532618216-242ac11b-0001-007 primary_annotation: true software_versions: null paired_reads_assembly: null quality_thresholds: null primer_match_cutoffs: null collapsing_method: null data_processing_protocols: null data_processing_files: null germline_database: null analysis_provenance_id: 6623294219256599016-242ac11c-0001-012 - repertoire_id: 2366080924918616551-242ac11c-0001-012 study: study_id: PRJNA300878 study_title: "Homo sapiens B and T cell repertoire - MZ twins" study_type: id: null label: null study_description: "The adaptive immune system's capability to protect the body requires a highly diverse lymphocyte antigen receptor repertoire. However, the influence of individual genetic and epigenetic differences on these repertoires is not typically measured. By leveraging the unique characteristics of B, CD4+ T, and CD8+ T lymphocyte subsets isolated from monozygotic twins, we have quantified the impact of heritable factors on both the V(D)J recombination process and thymic selection in the case of T cell receptors, and show that the repertoires of both naive and antigen experienced cells are subject to biases resulting from differences in recombination. We show that biases in V(D)J usage, as well as biased N/P additions, contribute to significant variation in the CDR3 region. Moreover, we show that the relative usage of V and J gene segments is chromosomally biased, with approximately 1.5 times as many rearrangements originating from a single chromosome. 
These data refine our understanding of the heritable mechanisms affecting the repertoire, and show that biases are evident on a chromosome-wide level." study_contact: "Mark M. Davis, mmdavis@stanford.edu, ORCID:0000-0001-6868-657X" inclusion_exclusion_criteria: null lab_name: "Mark M. Davis" lab_address: "Stanford University" submitted_by: "Florian Rubelt" pub_ids: "PMID:27005435" collected_by: null grants: null keywords_study: - "contains_ig" - "contains_tr" subject: subject_id: TW01A synthetic: false species: id: "NCBITaxon_9606" label: "Homo sapiens" sex: female age_min: 27 age_max: 27 age_unit: id: UO_0000036 label: year age_event: null ancestry_population: null ethnicity: null race: null strain_name: null linked_subjects: TW01B link_type: twin diagnosis: - study_group_description: null disease_diagnosis: id: null label: null disease_length: null disease_stage: null prior_therapies: null immunogen: null intervention: null medical_history: null sample: - sample_id: TW01A_T_naive_CD4 sample_processing_id: null sample_type: "peripheral venous puncture" tissue: id: "UBERON_0000178" label: "blood" tissue_processing: "Ficoll gradient" cell_subset: id: "CL_0000895" label: "naive thymus-derived CD4-positive, alpha-beta T cell" cell_phenotype: "expression of CD8 and absence of CD4 and CD45RO" cell_species: id: "NCBITaxon_9606" label: "Homo sapiens" single_cell: false cell_isolation: FACS template_class: RNA pcr_target: - pcr_target_locus: TRB forward_pcr_primer_target_location: null reverse_pcr_primer_target_location: null sequencing_platform: "Illumina MiSeq" sequencing_files: sequencing_data_id: SRA:SRR2905659 file_type: fastq filename: SRR2905659_R1.fastq.gz read_direction: forward read_length: 300 paired_filename: SRR2905659_R2.fastq.gz paired_read_direction: reverse paired_read_length: 300 index_filename: SRR2905659_R3.fastq.gz index_length: 8 anatomic_site: null disease_state_sample: null collection_time_point_relative: null collection_time_point_relative_unit: 
id: null label: null collection_time_point_reference: null biomaterial_provider: null cell_number: null cells_per_reaction: null cell_storage: false cell_quality: null cell_processing_protocol: null template_quality: null template_amount: null template_amount_unit: id: null label: null library_generation_method: "RT(oligo-dT)+PCR" library_generation_protocol: null library_generation_kit_version: null complete_sequences: "partial" physical_linkage: "none" sequencing_run_id: null total_reads_passing_qc_filter: null sequencing_facility: null sequencing_run_date: null sequencing_kit: null data_processing: - data_processing_id: 651223970338378216-242ac11b-0001-007 primary_annotation: true software_versions: null paired_reads_assembly: null quality_thresholds: null primer_match_cutoffs: null collapsing_method: null data_processing_protocols: null data_processing_files: null germline_database: null analysis_provenance_id: 4625424004665971176-242ac11c-0001-012 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1691444069.0 airr-1.5.1/tests/test_interface.py0000644000076500000240000003571214464261545016554 0ustar00vandej27staff""" Unit tests for interface """ # System imports import os import time import unittest import jsondiff import sys # airr imports import airr from airr.schema import ValidationError # Paths test_path = os.path.dirname(os.path.realpath(__file__)) data_path = os.path.join(test_path, 'data') class TestInferface(unittest.TestCase): def setUp(self): print('-------> %s()' % self.id()) # Test data self.rearrangement_good = os.path.join(data_path, 'good_rearrangement.tsv') self.rearrangement_bad = os.path.join(data_path, 'bad_rearrangement.tsv') self.rep_good = os.path.join(data_path, 'good_repertoire.yaml') self.rep_bad = os.path.join(data_path, 'bad_repertoire.yaml') self.germline_good = os.path.join(data_path, 'good_germline_set.json') self.germline_bad = os.path.join(data_path, 'bad_germline_set.json') self.genotype_good = 
os.path.join(data_path, 'good_genotype_set.json')
        self.genotype_bad = os.path.join(data_path, 'bad_genotype_set.json')
        self.combined_yaml = os.path.join(data_path, 'good_combined_airr.yaml')
        self.combined_json = os.path.join(data_path, 'good_combined_airr.json')

        # Output data
        self.output_rep = os.path.join(data_path, 'output_rep.json')
        self.output_good = os.path.join(data_path, 'output_data.json')
        self.output_blank = os.path.join(data_path, 'output_blank.json')

        # Expected output
        self.shape_good = (9, 44)
        self.shape_bad = (9, 44)

        # Start timer
        self.start = time.time()

    def tearDown(self):
        t = time.time() - self.start
        print('<- %.3f %s()' % (t, self.id()))

    # @unittest.skip('-> load(): skipped\n')
    def test_load_rearrangement(self):
        # Good data
        result = airr.load_rearrangement(self.rearrangement_good)
        self.assertTupleEqual(result.shape, self.shape_good, 'load(): good data failed')

        # Bad data
        result = airr.load_rearrangement(self.rearrangement_bad)
        self.assertTupleEqual(result.shape, self.shape_bad, 'load(): bad data failed')

    # @unittest.skip('-> repertoire_template(): skipped\n')
    def test_repertoire_template(self):
        try:
            with self.assertWarns(DeprecationWarning, msg='repertoire_template(): failed to issue DeprecationWarning'):
                rep = airr.repertoire_template()
            airr.write_airr(self.output_blank, {'Repertoire': rep}, validate=False, debug=True)
        except:
            pass

    # @unittest.skip('-> schema.template(): skipped\n')
    def test_schema_template(self):
        # Repertoire template
        try:
            data = airr.schema.RepertoireSchema.template()
            valid = airr.schema.RepertoireSchema.validate_object(data)
            self.assertTrue(valid, 'Schema.template("Repertoire"): repertoire template failed validation')
        except:
            self.assertTrue(False, 'Schema.template("Repertoire"): repertoire template failed validation')

        # GermlineSet template
        try:
            data = airr.schema.GermlineSetSchema.template()
            valid = airr.schema.GermlineSetSchema.validate_object(data)
            self.assertTrue(valid, 'Schema.template("GermlineSet"): germline set template failed validation')
        except:
            self.assertTrue(False, 'Schema.template("GermlineSet"): germline set template failed validation')

        # GenotypeSet template
        try:
            data = airr.schema.GenotypeSetSchema.template()
            valid = airr.schema.GenotypeSetSchema.validate_object(data)
            self.assertTrue(valid, 'Schema.template("GenotypeSet"): genotype set template failed validation')
        except:
            self.assertTrue(False, 'Schema.template("GenotypeSet"): genotype set template failed validation')

    # @unittest.skip('-> validate(): skipped\n')
    def test_validate_rearrangement(self):
        # Good data
        try:
            result = airr.validate_rearrangement(self.rearrangement_good)
            self.assertTrue(result, 'validate(): good data failed')
        except:
            self.assertTrue(False, 'validate(): good data failed')

        # Bad data
        try:
            result = airr.validate_rearrangement(self.rearrangement_bad)
            self.assertFalse(result, 'validate(): bad data failed')
        except Exception as inst:
            print(type(inst))
            raise inst

    # @unittest.skip('-> read_airr(): skipped\n')
    def test_read_airr(self):
        # Good data
        print('--> Good data')
        try:
            data = airr.read_airr(self.rep_good, validate=True, debug=True)
        except:
            self.fail('read_airr(): good data failed')

        # Bad data
        print('--> Bad data')
        with self.assertRaises(ValidationError, msg="read_airr(): bad data passed validation"):
            data = airr.read_airr(self.rep_bad, validate=True, debug=True)

        # Combined yaml
        print('--> Combined YAML')
        try:
            data_yaml = airr.read_airr(self.combined_yaml, validate=True, debug=True)
        except:
            self.fail('read_airr(): combined yaml failed')

        # Combined json
        print('--> Combined JSON')
        try:
            data_json = airr.read_airr(self.combined_json, validate=True, debug=True)
        except:
            self.fail('read_airr(): combined json failed')

        # Check equality of yaml and json
        self.assertDictEqual(data_yaml, data_json, msg="read_airr(): yaml and json imports are not equal")

    # @unittest.skip('-> validate_airr(): skipped\n')
    def test_validate_airr(self):
        # Good data
        print('--> Good data')
        # As array
        try:
            data = airr.read_airr(self.rep_good, validate=True, debug=True)
            valid = airr.validate_airr(data, debug=True)
            self.assertTrue(valid, 'validate_airr(): good data array failed')
        except:
            self.assertTrue(False, 'validate_airr(): good data array failed')
        # As dict
        try:
            array = airr.read_airr(self.rep_good, validate=False, debug=False)
            data = {'Repertoire': {x['repertoire_id']: x for x in array['Repertoire']}}
            valid = airr.validate_airr(data, debug=True)
            self.assertTrue(valid, 'validate_airr(): good data dict failed')
        except:
            self.assertTrue(False, 'validate_airr(): good data dict failed')

        # Bad data
        print('--> Bad data')
        # As array
        try:
            data = airr.read_airr(self.rep_bad, validate=True, debug=True)
            valid = airr.validate_airr(data, debug=True)
            self.assertFalse(valid, 'validate_airr(): bad data array failed')
        except ValidationError:
            pass
        except Exception as inst:
            print(type(inst))
            raise inst
        # As dict
        try:
            array = airr.read_airr(self.rep_bad, validate=False, debug=False)
            data = {'Repertoire': {x['repertoire_id']: x for x in array['Repertoire']}}
            valid = airr.validate_airr(data, debug=True)
            self.assertFalse(valid, 'validate_airr(): bad data dict failed')
        except ValidationError:
            pass
        except Exception as inst:
            print(type(inst))
            raise inst

    # @unittest.skip('-> load_repertoire(): skipped\n')
    def test_load_repertoire(self):
        # Good data
        try:
            with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'):
                data = airr.load_repertoire(self.rep_good, validate=True, debug=True)
        except:
            self.assertTrue(False, 'load_repertoire(): good data failed')

        # Bad data
        try:
            with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'):
                data = airr.load_repertoire(self.rep_bad, validate=True, debug=True)
            self.assertFalse(True, 'load_repertoire(): bad data passed')
        except ValidationError:
            pass
        except Exception as inst:
            print(type(inst))
            raise inst

    # @unittest.skip('-> write_repertoire(): skipped\n')
    def test_write_repertoire(self):
        # Good data
        try:
            with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'):
                data = airr.load_repertoire(self.rep_good, validate=True, debug=True)
            with self.assertWarns(DeprecationWarning, msg='write_repertoire(): failed to issue DeprecationWarning'):
                result = airr.write_repertoire(self.output_rep, data['Repertoire'], debug=True)
            with self.assertWarns(DeprecationWarning, msg='load_repertoire(): failed to issue DeprecationWarning'):
                # verify we can read it
                obj = airr.load_repertoire(self.output_rep, validate=True, debug=True)
            # is the data identical?
            if jsondiff.diff(obj['Repertoire'], data['Repertoire']) != {}:
                print('Output data does not match', file=sys.stderr)
                print(jsondiff.diff(obj, data), file=sys.stderr)
                self.assertTrue(False, 'write_repertoire(): Output data does not match')
        except:
            self.assertTrue(False, 'write_repertoire(): good data failed')

    # @unittest.skip('-> load_germline(): skipped\n')
    def test_read_germline(self):
        # Good data
        try:
            result = airr.read_airr(self.germline_good, validate=True, debug=True)
        except ValidationError:
            self.assertTrue(False, 'load_germline(): good data failed')

        # Bad data
        try:
            result = airr.read_airr(self.germline_bad, validate=True, debug=True)
            self.assertFalse(True, 'load_germline(): bad data succeeded')
        except ValidationError:
            pass

    # @unittest.skip('-> validate_germline(): skipped\n')
    def test_validate_germline(self):
        # Good data
        print('--> Good data')
        try:
            result = airr.read_airr(self.germline_good, validate=True, debug=True)
            valid = airr.validate_airr(result, debug=True)
            self.assertTrue(valid, 'validate_germline(): good data failed')
        except ValidationError:
            self.assertTrue(False, 'validate_germline(): good data failed')

        # Bad data
        print('--> Bad data')
        try:
            result = airr.read_airr(self.germline_bad, validate=True, debug=True)
            valid = airr.validate_airr(result, debug=True)
            self.assertFalse(valid, 'validate_germline(): bad data succeeded')
        except ValidationError:
            pass

    # @unittest.skip('-> load_genotype(): skipped\n')
    def test_read_genotype(self):
        # Good data
        print('--> Good data')
        try:
            result = airr.read_airr(self.genotype_good, validate=True, debug=True)
        except ValidationError:
            self.assertTrue(False, 'load_genotype(): good data failed')

        # Bad data
        print('--> Bad data')
        try:
            result = airr.read_airr(self.genotype_bad, validate=True, debug=True)
            self.assertFalse(True, 'load_genotype(): bad data succeeded')
        except ValidationError:
            pass

    # @unittest.skip('-> validate_genotype(): skipped\n')
    def test_validate_genotype(self):
        # Good data
        print('--> Good data')
        try:
            result = airr.read_airr(self.genotype_good, validate=True, debug=True)
            valid = airr.validate_airr(result, debug=True)
            self.assertTrue(valid, 'validate_genotype(): good data failed')
        except ValidationError:
            self.assertTrue(False, 'validate_genotype(): good data failed')

        # Bad data
        print('--> Bad data')
        try:
            result = airr.read_airr(self.genotype_bad, validate=True, debug=True)
            valid = airr.validate_airr(result, debug=True)
            self.assertFalse(valid, 'validate_genotype(): bad data succeeded')
        except ValidationError:
            pass

    # @unittest.skip('-> write_airr(): skipped\n')
    def test_write_airr(self):
        # Good data as array
        try:
            repertoire_data = airr.read_airr(self.rep_good, validate=True, debug=True)
            germline_data = airr.read_airr(self.germline_good, validate=True, debug=True)
            genotype_data = airr.read_airr(self.genotype_good, validate=True, debug=True)

            # combine together and write
            obj = {}
            obj['Repertoire'] = repertoire_data['Repertoire']
            obj['GermlineSet'] = germline_data['GermlineSet']
            obj['GenotypeSet'] = genotype_data['GenotypeSet']
            airr.write_airr(self.output_good, obj, validate=True, debug=True)

            # verify we can read it
            data = airr.read_airr(self.output_good, validate=True, debug=True)

            # is the data identical?
            del data['Info']
            if jsondiff.diff(obj, data) != {}:
                print('Output data does not match', file=sys.stderr)
                print(jsondiff.diff(obj, data), file=sys.stderr)
                self.assertTrue(False, 'write_airr_data(): Output data does not match')
        except Exception as inst:
            self.assertTrue(False, 'write_airr_data(): good data failed')
            print(type(inst))
            raise inst

        # Good data as dict
        try:
            # Load data
            repertoire_array = airr.read_airr(self.rep_good, validate=True, debug=True)
            germline_array = airr.read_airr(self.germline_good, validate=True, debug=True)
            genotype_array = airr.read_airr(self.genotype_good, validate=True, debug=True)

            # Build keyed representation
            repertoire_data = {'Repertoire': {x['repertoire_id']: x for x in repertoire_array['Repertoire']}}
            germline_data = {'GermlineSet': {x['germline_set_id']: x for x in germline_array['GermlineSet']}}
            genotype_data = {'GenotypeSet': {x['receptor_genotype_set_id']: x for x in genotype_array['GenotypeSet']}}

            # combine together and write
            obj = {}
            obj['Repertoire'] = repertoire_data['Repertoire']
            obj['GermlineSet'] = germline_data['GermlineSet']
            obj['GenotypeSet'] = genotype_data['GenotypeSet']
            airr.write_airr(self.output_good, obj, validate=True, debug=True)

            # verify we can read it
            data = airr.read_airr(self.output_good, validate=True, debug=True)

            # is the data identical?
            del data['Info']
            if jsondiff.diff(obj, data) != {}:
                print('Output data does not match', file=sys.stderr)
                print(jsondiff.diff(obj, data), file=sys.stderr)
                self.assertTrue(False, 'write_airr_data(): Output data does not match')
        except Exception as inst:
            self.assertTrue(False, 'write_airr_data(): good data failed')
            print(type(inst))
            raise inst


if __name__ == '__main__':
    unittest.main()
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1661708083.0 airr-1.5.1/tests/test_io.py0000644000076500000240000000374014302723463015210 0ustar00vandej27staff
"""
Unit tests for interface
"""
# System imports
import os
import time
import unittest

# Load imports
from airr.io import *

# Paths
test_path = os.path.dirname(os.path.realpath(__file__))
data_path = os.path.join(test_path, 'data')


class TestRearrangementReader(unittest.TestCase):
    def setUp(self):
        print('-------> %s()' % self.id())

        # Test data
        self.data_good = os.path.join(data_path, 'good_rearrangement.tsv')
        self.data_bad = os.path.join(data_path, 'bad_rearrangement.tsv')
        self.data_extra = os.path.join(data_path, 'extra_rearrangement.tsv')

        # Start timer
        self.start = time.time()

    def tearDown(self):
        t = time.time() - self.start
        print('<- %.3f %s()' % (t, self.id()))

    # @unittest.skip('-> validate(): skipped\n')
    def test_validate(self):
        # Good data
        try:
            with open(self.data_good, 'r') as handle:
                reader = RearrangementReader(handle, validate=True)
                for r in reader:
                    pass
        except:
            self.assertTrue(False, 'validate(): good data failed')

        # Bad data
        try:
            with open(self.data_bad, 'r') as handle:
                reader = RearrangementReader(handle, validate=True)
                for r in reader:
                    pass
            self.assertFalse(True, 'validate(): bad data failed')
        except ValidationError:
            pass
        except Exception as inst:
            print(type(inst))
            raise inst

        # Extra data
        try:
            with open(self.data_extra, 'r') as handle:
                reader = RearrangementReader(handle, validate=False)
                for r in reader:
                    pass
            self.assertFalse(True, 'validate(): extra data failed')
        except ValueError:
            pass
        except Exception as inst:
            print(type(inst))
            raise inst


if __name__ == '__main__':
    unittest.main()
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1716315349.0 airr-1.5.1/versioneer.py0000644000076500000240000025122514623162325014564 0ustar00vandej27staff
# Version: 0.29

"""The Versioneer - like a rocketeer, but for versions.

The Versioneer
==============

* like a rocketeer, but for versions!
* https://github.com/python-versioneer/python-versioneer
* Brian Warner
* License: Public Domain (Unlicense)
* Compatible with: Python 3.7, 3.8, 3.9, 3.10, 3.11 and pypy3
* [![Latest Version][pypi-image]][pypi-url]
* [![Build Status][travis-image]][travis-url]

This is a tool for managing a recorded version number in setuptools-based
python projects. The goal is to remove the tedious and error-prone "update
the embedded version string" step from your release process. Making a new
release should be as easy as recording a new tag in your version-control
system, and maybe making new tarballs.


## Quick Install

Versioneer provides two installation modes. The "classic" vendored mode installs
a copy of versioneer into your repository. The experimental build-time dependency mode
is intended to allow you to skip this step and simplify the process of upgrading.
### Vendored mode

* `pip install versioneer` to somewhere in your $PATH
   * A [conda-forge recipe](https://github.com/conda-forge/versioneer-feedstock) is
     available, so you can also use `conda install -c conda-forge versioneer`
* add a `[tool.versioneer]` section to your `pyproject.toml` or a
  `[versioneer]` section to your `setup.cfg` (see [Install](INSTALL.md))
   * Note that you will need to add `tomli; python_version < "3.11"` to your
     build-time dependencies if you use `pyproject.toml`
* run `versioneer install --vendor` in your source tree, commit the results
* verify version information with `python setup.py version`

### Build-time dependency mode

* `pip install versioneer` to somewhere in your $PATH
   * A [conda-forge recipe](https://github.com/conda-forge/versioneer-feedstock) is
     available, so you can also use `conda install -c conda-forge versioneer`
* add a `[tool.versioneer]` section to your `pyproject.toml` or a
  `[versioneer]` section to your `setup.cfg` (see [Install](INSTALL.md))
* add `versioneer` (with `[toml]` extra, if configuring in `pyproject.toml`)
  to the `requires` key of the `build-system` table in `pyproject.toml`:
  ```toml
  [build-system]
  requires = ["setuptools", "versioneer[toml]"]
  build-backend = "setuptools.build_meta"
  ```
* run `versioneer install --no-vendor` in your source tree, commit the results
* verify version information with `python setup.py version`

## Version Identifiers

Source trees come from a variety of places:

* a version-control system checkout (mostly used by developers)
* a nightly tarball, produced by build automation
* a snapshot tarball, produced by a web-based VCS browser, like github's
  "tarball from tag" feature
* a release tarball, produced by "setup.py sdist", distributed through PyPI

Within each source tree, the version identifier (either a string or a number,
this tool is format-agnostic) can come from a variety of places:

* ask the VCS tool itself, e.g.
"git describe" (for checkouts), which knows about recent "tags" and an absolute revision-id * the name of the directory into which the tarball was unpacked * an expanded VCS keyword ($Id$, etc) * a `_version.py` created by some earlier build step For released software, the version identifier is closely related to a VCS tag. Some projects use tag names that include more than just the version string (e.g. "myproject-1.2" instead of just "1.2"), in which case the tool needs to strip the tag prefix to extract the version identifier. For unreleased software (between tags), the version identifier should provide enough information to help developers recreate the same tree, while also giving them an idea of roughly how old the tree is (after version 1.2, before version 1.3). Many VCS systems can report a description that captures this, for example `git describe --tags --dirty --always` reports things like "0.7-1-g574ab98-dirty" to indicate that the checkout is one revision past the 0.7 tag, has a unique revision id of "574ab98", and is "dirty" (it has uncommitted changes). The version identifier is used for multiple purposes: * to allow the module to self-identify its version: `myproject.__version__` * to choose a name and prefix for a 'setup.py sdist' tarball ## Theory of Operation Versioneer works by adding a special `_version.py` file into your source tree, where your `__init__.py` can import it. This `_version.py` knows how to dynamically ask the VCS tool for version information at import time. `_version.py` also contains `$Revision$` markers, and the installation process marks `_version.py` to have this marker rewritten with a tag name during the `git archive` command. As a result, generated tarballs will contain enough information to get the proper version. To allow `setup.py` to compute a version too, a `versioneer.py` is added to the top level of your source tree, next to `setup.py` and the `setup.cfg` that configures it. 
This overrides several distutils/setuptools commands to compute the version
when invoked, and changes `setup.py build` and `setup.py sdist` to replace
`_version.py` with a small static file that contains just the generated
version data.

## Installation

See [INSTALL.md](./INSTALL.md) for detailed installation instructions.

## Version-String Flavors

Code which uses Versioneer can learn about its version string at runtime by
importing `_version` from your main `__init__.py` file and running the
`get_versions()` function. From the "outside" (e.g. in `setup.py`), you can
import the top-level `versioneer.py` and run `get_versions()`.

Both functions return a dictionary with different flavors of version
information:

* `['version']`: A condensed version string, rendered using the selected
  style. This is the most commonly used value for the project's version
  string. The default "pep440" style yields strings like `0.11`,
  `0.11+2.g1076c97`, or `0.11+2.g1076c97.dirty`. See the "Styles" section
  below for alternative styles.

* `['full-revisionid']`: detailed revision identifier. For Git, this is the
  full SHA1 commit id, e.g. "1076c978a8d3cfc70f408fe5974aa6c092c949ac".

* `['date']`: Date and time of the latest `HEAD` commit. For Git, it is the
  commit date in ISO 8601 format. This will be None if the date is not
  available.

* `['dirty']`: a boolean, True if the tree has uncommitted changes. Note that
  this is only accurate if run in a VCS checkout, otherwise it is likely to
  be False or None

* `['error']`: if the version string could not be computed, this will be set
  to a string describing the problem, otherwise it will be None. It may be
  useful to throw an exception in setup.py if this is set, to avoid e.g.
  creating tarballs with a version string of "unknown".

Some variants are more useful than others.
Including `full-revisionid` in a bug report should allow developers to
reconstruct the exact code being tested (or indicate the presence of local
changes that should be shared with the developers). `version` is suitable
for display in an "about" box or a CLI `--version` output: it can be easily
compared against release notes and lists of bugs fixed in various releases.

The installer adds the following text to your `__init__.py` to place a basic
version in `YOURPROJECT.__version__`:

    from ._version import get_versions
    __version__ = get_versions()['version']
    del get_versions

## Styles

The setup.cfg `style=` configuration controls how the VCS information is
rendered into a version string.

The default style, "pep440", produces a PEP440-compliant string, equal to the
un-prefixed tag name for actual releases, and containing an additional "local
version" section with more detail for in-between builds. For Git, this is
TAG[+DISTANCE.gHEX[.dirty]] , using information from `git describe --tags
--dirty --always`. For example "0.11+2.g1076c97.dirty" indicates that the
tree is like the "1076c97" commit but has uncommitted changes (".dirty"), and
that this commit is two revisions ("+2") beyond the "0.11" tag. For released
software (exactly equal to a known tag), the identifier will only contain the
stripped tag, e.g. "0.11".

Other styles are available. See [details.md](details.md) in the Versioneer
source tree for descriptions.

## Debugging

Versioneer tries to avoid fatal errors: if something goes wrong, it will tend
to return a version of "0+unknown". To investigate the problem, run `setup.py
version`, which will run the version-lookup code in a verbose mode, and will
display the full contents of `get_versions()` (including the `error` string,
which may help identify what went wrong).

## Known Limitations

Some situations are known to cause problems for Versioneer. This details the
most significant ones.
More can be found on the GitHub
[issues page](https://github.com/python-versioneer/python-versioneer/issues).

### Subprojects

Versioneer has limited support for source trees in which `setup.py` is not in
the root directory (e.g. `setup.py` and `.git/` are *not* siblings). There are
two common reasons why `setup.py` might not be in the root:

* Source trees which contain multiple subprojects, such as
  [Buildbot](https://github.com/buildbot/buildbot), which contains both
  "master" and "slave" subprojects, each with their own `setup.py`,
  `setup.cfg`, and `tox.ini`. Projects like these produce multiple PyPI
  distributions (and upload multiple independently-installable tarballs).
* Source trees whose main purpose is to contain a C library, but which also
  provide bindings to Python (and perhaps other languages) in subdirectories.

Versioneer will look for `.git` in parent directories, and most operations
should get the right version string. However `pip` and `setuptools` have bugs
and implementation details which frequently cause `pip install .` from a
subproject directory to fail to find a correct version string (so it usually
defaults to `0+unknown`).

`pip install --editable .` should work correctly. `setup.py install` might
work too.

Pip-8.1.1 is known to have this problem, but hopefully it will get fixed in
some later version.

[Bug #38](https://github.com/python-versioneer/python-versioneer/issues/38) is tracking
this issue. The discussion in
[PR #61](https://github.com/python-versioneer/python-versioneer/pull/61) describes the
issue from the Versioneer side in more detail.
[pip PR#3176](https://github.com/pypa/pip/pull/3176) and
[pip PR#3615](https://github.com/pypa/pip/pull/3615) contain work to improve
pip to let Versioneer work correctly.

Versioneer-0.16 and earlier only looked for a `.git` directory next to the
`setup.cfg`, so subprojects were completely unsupported with those releases.
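
The parent-directory lookup mentioned in this section can be pictured with a
short sketch. It is only an illustration and not Versioneer's own code: in
this file the search is left to git itself (via `git rev-parse --git-dir`
run from the project root). The helper name `find_vcs_root` and its
`max_levels` parameter are hypothetical.

```python
import os
from typing import Optional


def find_vcs_root(start: str, max_levels: int = 10) -> Optional[str]:
    # Hypothetical helper: walk upwards from `start`, checking at most
    # `max_levels` directories, and return the first one containing `.git`.
    current = os.path.abspath(start)
    for _ in range(max_levels):
        if os.path.exists(os.path.join(current, ".git")):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding `.git`.
            return None
        current = parent
    return None
```

Called from a subproject directory, a helper like this would return the
enclosing checkout rather than the directory holding `setup.py`, which is the
situation the subproject support above is meant to handle.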
### Editable installs with setuptools <= 18.5

`setup.py develop` and `pip install --editable .` allow you to install a
project into a virtualenv once, then continue editing the source code (and
test) without re-installing after every change.

"Entry-point scripts" (`setup(entry_points={"console_scripts": ..})`) are a
convenient way to specify executable scripts that should be installed along
with the python package.

These both work as expected when using modern setuptools. When using
setuptools-18.5 or earlier, however, certain operations will cause
`pkg_resources.DistributionNotFound` errors when running the entrypoint
script, which must be resolved by re-installing the package. This happens
when the install happens with one version, then the egg_info data is
regenerated while a different version is checked out. Many setup.py commands
cause egg_info to be rebuilt (including `sdist`, `wheel`, and installing into
a different virtualenv), so this can be surprising.

[Bug #83](https://github.com/python-versioneer/python-versioneer/issues/83) describes
this one, but upgrading to a newer version of setuptools should probably
resolve it.

## Updating Versioneer

To upgrade your project to a new release of Versioneer, do the following:

* install the new Versioneer (`pip install -U versioneer` or equivalent)
* edit `setup.cfg` and `pyproject.toml`, if necessary,
  to include any new configuration settings indicated by the release notes.
  See [UPGRADING](./UPGRADING.md) for details.
* re-run `versioneer install --[no-]vendor` in your source tree, to replace
  `SRC/_version.py`
* commit any changed files

## Future Directions

This tool is designed to be easily extended to other version-control
systems: all VCS-specific components are in separate directories like
src/git/ . The top-level `versioneer.py` script is assembled from these
components by running make-versioneer.py .
In the future, make-versioneer.py will take a VCS name as an argument, and
will construct a version of `versioneer.py` that is specific to the given
VCS. It might also take the configuration arguments that are currently
provided manually during installation by editing setup.py . Alternatively,
it might go the other direction and include code from all supported VCS
systems, reducing the number of intermediate scripts.

## Similar projects

* [setuptools_scm](https://github.com/pypa/setuptools_scm/) - a non-vendored build-time
  dependency
* [miniver](https://github.com/jbweston/miniver) - a lightweight reimplementation of
  versioneer
* [versioningit](https://github.com/jwodder/versioningit) - a PEP 518-based setuptools
  plugin

## License

To make Versioneer easier to embed, all its code is dedicated to the public
domain. The `_version.py` that it creates is also in the public domain.
Specifically, both are released under the "Unlicense", as described in
https://unlicense.org/.

[pypi-image]: https://img.shields.io/pypi/v/versioneer.svg
[pypi-url]: https://pypi.python.org/pypi/versioneer/
[travis-image]:
https://img.shields.io/travis/com/python-versioneer/python-versioneer.svg
[travis-url]: https://travis-ci.com/github/python-versioneer/python-versioneer

"""
# pylint:disable=invalid-name,import-outside-toplevel,missing-function-docstring
# pylint:disable=missing-class-docstring,too-many-branches,too-many-statements
# pylint:disable=raise-missing-from,too-many-lines,too-many-locals,import-error
# pylint:disable=too-few-public-methods,redefined-outer-name,consider-using-with
# pylint:disable=attribute-defined-outside-init,too-many-arguments

import configparser
import errno
import json
import os
import re
import subprocess
import sys
from pathlib import Path
from typing import Any, Callable, cast, Dict, List, Optional, Tuple, Union
from typing import NoReturn
import functools

have_tomllib = True
if sys.version_info >= (3, 11):
    import tomllib
else:
    try:
        import tomli as tomllib
except ImportError: have_tomllib = False class VersioneerConfig: """Container for Versioneer configuration parameters.""" VCS: str style: str tag_prefix: str versionfile_source: str versionfile_build: Optional[str] parentdir_prefix: Optional[str] verbose: Optional[bool] def get_root() -> str: """Get the project root directory. We require that all commands are run from the project root, i.e. the directory that contains setup.py, setup.cfg, and versioneer.py . """ root = os.path.realpath(os.path.abspath(os.getcwd())) setup_py = os.path.join(root, "setup.py") pyproject_toml = os.path.join(root, "pyproject.toml") versioneer_py = os.path.join(root, "versioneer.py") if not ( os.path.exists(setup_py) or os.path.exists(pyproject_toml) or os.path.exists(versioneer_py) ): # allow 'python path/to/setup.py COMMAND' root = os.path.dirname(os.path.realpath(os.path.abspath(sys.argv[0]))) setup_py = os.path.join(root, "setup.py") pyproject_toml = os.path.join(root, "pyproject.toml") versioneer_py = os.path.join(root, "versioneer.py") if not ( os.path.exists(setup_py) or os.path.exists(pyproject_toml) or os.path.exists(versioneer_py) ): err = ("Versioneer was unable to run the project root directory. " "Versioneer requires setup.py to be executed from " "its immediate directory (like 'python setup.py COMMAND'), " "or in a way that lets it use sys.argv[0] to find the root " "(like 'python path/to/setup.py COMMAND').") raise VersioneerBadRootError(err) try: # Certain runtime workflows (setup.py install/develop in a setuptools # tree) execute all dependencies in a single python process, so # "versioneer" may be imported multiple times, and python's shared # module-import table will cache the first one. So we can't use # os.path.dirname(__file__), as that will find whichever # versioneer.py was first imported, even in later projects. 
my_path = os.path.realpath(os.path.abspath(__file__)) me_dir = os.path.normcase(os.path.splitext(my_path)[0]) vsr_dir = os.path.normcase(os.path.splitext(versioneer_py)[0]) if me_dir != vsr_dir and "VERSIONEER_PEP518" not in globals(): print("Warning: build in %s is using versioneer.py from %s" % (os.path.dirname(my_path), versioneer_py)) except NameError: pass return root def get_config_from_root(root: str) -> VersioneerConfig: """Read the project setup.cfg file to determine Versioneer config.""" # This might raise OSError (if setup.cfg is missing), or # configparser.NoSectionError (if it lacks a [versioneer] section), or # configparser.NoOptionError (if it lacks "VCS="). See the docstring at # the top of versioneer.py for instructions on writing your setup.cfg . root_pth = Path(root) pyproject_toml = root_pth / "pyproject.toml" setup_cfg = root_pth / "setup.cfg" section: Union[Dict[str, Any], configparser.SectionProxy, None] = None if pyproject_toml.exists() and have_tomllib: try: with open(pyproject_toml, 'rb') as fobj: pp = tomllib.load(fobj) section = pp['tool']['versioneer'] except (tomllib.TOMLDecodeError, KeyError) as e: print(f"Failed to load config from {pyproject_toml}: {e}") print("Try to load it from setup.cfg") if not section: parser = configparser.ConfigParser() with open(setup_cfg) as cfg_file: parser.read_file(cfg_file) parser.get("versioneer", "VCS") # raise error if missing section = parser["versioneer"] # `cast`` really shouldn't be used, but its simplest for the # common VersioneerConfig users at the moment. 
We verify against # `None` values elsewhere where it matters cfg = VersioneerConfig() cfg.VCS = section['VCS'] cfg.style = section.get("style", "") cfg.versionfile_source = cast(str, section.get("versionfile_source")) cfg.versionfile_build = section.get("versionfile_build") cfg.tag_prefix = cast(str, section.get("tag_prefix")) if cfg.tag_prefix in ("''", '""', None): cfg.tag_prefix = "" cfg.parentdir_prefix = section.get("parentdir_prefix") if isinstance(section, configparser.SectionProxy): # Make sure configparser translates to bool cfg.verbose = section.getboolean("verbose") else: cfg.verbose = section.get("verbose") return cfg class NotThisMethod(Exception): """Exception raised if a method is not valid for the current scenario.""" # these dictionaries contain VCS-specific tools LONG_VERSION_PY: Dict[str, str] = {} HANDLERS: Dict[str, Dict[str, Callable]] = {} def register_vcs_handler(vcs: str, method: str) -> Callable: # decorator """Create decorator to mark a method as the handler of a VCS.""" def decorate(f: Callable) -> Callable: """Store f in HANDLERS[vcs][method].""" HANDLERS.setdefault(vcs, {})[method] = f return f return decorate def run_command( commands: List[str], args: List[str], cwd: Optional[str] = None, verbose: bool = False, hide_stderr: bool = False, env: Optional[Dict[str, str]] = None, ) -> Tuple[Optional[str], Optional[int]]: """Call the given command(s).""" assert isinstance(commands, list) process = None popen_kwargs: Dict[str, Any] = {} if sys.platform == "win32": # This hides the console window if pythonw.exe is used startupinfo = subprocess.STARTUPINFO() startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW popen_kwargs["startupinfo"] = startupinfo for command in commands: try: dispcmd = str([command] + args) # remember shell=False, so use git.cmd on windows, not just git process = subprocess.Popen([command] + args, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=(subprocess.PIPE if hide_stderr else None), **popen_kwargs) break except 
OSError as e: if e.errno == errno.ENOENT: continue if verbose: print("unable to run %s" % dispcmd) print(e) return None, None else: if verbose: print("unable to find command, tried %s" % (commands,)) return None, None stdout = process.communicate()[0].strip().decode() if process.returncode != 0: if verbose: print("unable to run %s (error)" % dispcmd) print("stdout was %s" % stdout) return None, process.returncode return stdout, process.returncode LONG_VERSION_PY['git'] = r''' # This file helps to compute a version number in source trees obtained from # git-archive tarball (such as those provided by githubs download-from-tag # feature). Distribution tarballs (built by setup.py sdist) and build # directories (produced by setup.py build) will contain a much shorter file # that just contains the computed version number. # This file is released into the public domain. # Generated by versioneer-0.29 # https://github.com/python-versioneer/python-versioneer """Git implementation of _version.py.""" import errno import os import re import subprocess import sys from typing import Any, Callable, Dict, List, Optional, Tuple import functools def get_keywords() -> Dict[str, str]: """Get the keywords needed to look up the version information.""" # these strings will be replaced by git during git-archive. # setup.py/versioneer.py will grep for the variable names, so they must # each be defined on a line of their own. _version.py will just call # get_keywords(). 
git_refnames = "%(DOLLAR)sFormat:%%d%(DOLLAR)s" git_full = "%(DOLLAR)sFormat:%%H%(DOLLAR)s" git_date = "%(DOLLAR)sFormat:%%ci%(DOLLAR)s" keywords = {"refnames": git_refnames, "full": git_full, "date": git_date} return keywords class VersioneerConfig: """Container for Versioneer configuration parameters.""" VCS: str style: str tag_prefix: str parentdir_prefix: str versionfile_source: str verbose: bool def get_config() -> VersioneerConfig: """Create, populate and return the VersioneerConfig() object.""" # these strings are filled in when 'setup.py versioneer' creates # _version.py cfg = VersioneerConfig() cfg.VCS = "git" cfg.style = "%(STYLE)s" cfg.tag_prefix = "%(TAG_PREFIX)s" cfg.parentdir_prefix = "%(PARENTDIR_PREFIX)s" cfg.versionfile_source = "%(VERSIONFILE_SOURCE)s" cfg.verbose = False return cfg class NotThisMethod(Exception): """Exception raised if a method is not valid for the current scenario.""" LONG_VERSION_PY: Dict[str, str] = {} HANDLERS: Dict[str, Dict[str, Callable]] = {} def register_vcs_handler(vcs: str, method: str) -> Callable: # decorator """Create decorator to mark a method as the handler of a VCS.""" def decorate(f: Callable) -> Callable: """Store f in HANDLERS[vcs][method].""" if vcs not in HANDLERS: HANDLERS[vcs] = {} HANDLERS[vcs][method] = f return f return decorate def run_command( commands: List[str], args: List[str], cwd: Optional[str] = None, verbose: bool = False, hide_stderr: bool = False, env: Optional[Dict[str, str]] = None, ) -> Tuple[Optional[str], Optional[int]]: """Call the given command(s).""" assert isinstance(commands, list) process = None popen_kwargs: Dict[str, Any] = {} if sys.platform == "win32": # This hides the console window if pythonw.exe is used startupinfo = subprocess.STARTUPINFO() startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW popen_kwargs["startupinfo"] = startupinfo for command in commands: try: dispcmd = str([command] + args) # remember shell=False, so use git.cmd on windows, not just git process = 
subprocess.Popen([command] + args, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=(subprocess.PIPE if hide_stderr else None), **popen_kwargs) break except OSError as e: if e.errno == errno.ENOENT: continue if verbose: print("unable to run %%s" %% dispcmd) print(e) return None, None else: if verbose: print("unable to find command, tried %%s" %% (commands,)) return None, None stdout = process.communicate()[0].strip().decode() if process.returncode != 0: if verbose: print("unable to run %%s (error)" %% dispcmd) print("stdout was %%s" %% stdout) return None, process.returncode return stdout, process.returncode def versions_from_parentdir( parentdir_prefix: str, root: str, verbose: bool, ) -> Dict[str, Any]: """Try to determine the version from the parent directory name. Source tarballs conventionally unpack into a directory that includes both the project name and a version string. We will also support searching up two directory levels for an appropriately named parent directory """ rootdirs = [] for _ in range(3): dirname = os.path.basename(root) if dirname.startswith(parentdir_prefix): return {"version": dirname[len(parentdir_prefix):], "full-revisionid": None, "dirty": False, "error": None, "date": None} rootdirs.append(root) root = os.path.dirname(root) # up a level if verbose: print("Tried directories %%s but none started with prefix %%s" %% (str(rootdirs), parentdir_prefix)) raise NotThisMethod("rootdir doesn't start with parentdir_prefix") @register_vcs_handler("git", "get_keywords") def git_get_keywords(versionfile_abs: str) -> Dict[str, str]: """Extract version information from the given file.""" # the code embedded in _version.py can just fetch the value of these # keywords. When used from setup.py, we don't want to import _version.py, # so we do it with a regexp instead. This function is not used from # _version.py. 
keywords: Dict[str, str] = {} try: with open(versionfile_abs, "r") as fobj: for line in fobj: if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["refnames"] = mo.group(1) if line.strip().startswith("git_full ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["full"] = mo.group(1) if line.strip().startswith("git_date ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["date"] = mo.group(1) except OSError: pass return keywords @register_vcs_handler("git", "keywords") def git_versions_from_keywords( keywords: Dict[str, str], tag_prefix: str, verbose: bool, ) -> Dict[str, Any]: """Get version information from git keywords.""" if "refnames" not in keywords: raise NotThisMethod("Short version file found") date = keywords.get("date") if date is not None: # Use only the last line. Previous lines may contain GPG signature # information. date = date.splitlines()[-1] # git-2.2.0 added "%%cI", which expands to an ISO-8601 -compliant # datestamp. However we prefer "%%ci" (which expands to an "ISO-8601 # -like" string, which we must then edit to make compliant), because # it's been around since git-1.5.3, and it's too difficult to # discover which version we're using, or to work around using an # older one. date = date.strip().replace(" ", "T", 1).replace(" ", "", 1) refnames = keywords["refnames"].strip() if refnames.startswith("$Format"): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " tags = {r[len(TAG):] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. 
The old git %%d # expansion behaves like git log --decorate=short and strips out the # refs/heads/ and refs/tags/ prefixes that would let us distinguish # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". tags = {r for r in refs if re.search(r'\d', r)} if verbose: print("discarding '%%s', no digits" %% ",".join(refs - tags)) if verbose: print("likely tags: %%s" %% ",".join(sorted(tags))) for ref in sorted(tags): # sorting will prefer e.g. "2.0" over "2.0rc1" if ref.startswith(tag_prefix): r = ref[len(tag_prefix):] # Filter out refs that exactly match prefix or that don't start # with a number once the prefix is stripped (mostly a concern # when prefix is '') if not re.match(r'\d', r): continue if verbose: print("picking %%s" %% r) return {"version": r, "full-revisionid": keywords["full"].strip(), "dirty": False, "error": None, "date": date} # no suitable tags, so version is "0+unknown", but full hex is still there if verbose: print("no suitable tags, using unknown + full revision id") return {"version": "0+unknown", "full-revisionid": keywords["full"].strip(), "dirty": False, "error": "no suitable tags", "date": None} @register_vcs_handler("git", "pieces_from_vcs") def git_pieces_from_vcs( tag_prefix: str, root: str, verbose: bool, runner: Callable = run_command ) -> Dict[str, Any]: """Get version from 'git describe' in the root of the source tree. This only gets called if the git-archive 'subst' keywords were *not* expanded, and _version.py hasn't already been rewritten with a short version string, meaning we're inside a checked out source tree. """ GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] # GIT_DIR can interfere with correct operation of Versioneer. # It may be intended to be passed to the Versioneer-versioned project, # but that should not change where we get our version from. 
env = os.environ.copy() env.pop("GIT_DIR", None) runner = functools.partial(runner, env=env) _, rc = runner(GITS, ["rev-parse", "--git-dir"], cwd=root, hide_stderr=not verbose) if rc != 0: if verbose: print("Directory %%s not under git control" %% root) raise NotThisMethod("'git rev-parse --git-dir' returned error") # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty] # if there isn't one, this yields HEX[-dirty] (no NUM) describe_out, rc = runner(GITS, [ "describe", "--tags", "--dirty", "--always", "--long", "--match", f"{tag_prefix}[[:digit:]]*" ], cwd=root) # --long was added in git-1.5.5 if describe_out is None: raise NotThisMethod("'git describe' failed") describe_out = describe_out.strip() full_out, rc = runner(GITS, ["rev-parse", "HEAD"], cwd=root) if full_out is None: raise NotThisMethod("'git rev-parse' failed") full_out = full_out.strip() pieces: Dict[str, Any] = {} pieces["long"] = full_out pieces["short"] = full_out[:7] # maybe improved later pieces["error"] = None branch_name, rc = runner(GITS, ["rev-parse", "--abbrev-ref", "HEAD"], cwd=root) # --abbrev-ref was added in git-1.6.3 if rc != 0 or branch_name is None: raise NotThisMethod("'git rev-parse --abbrev-ref' returned error") branch_name = branch_name.strip() if branch_name == "HEAD": # If we aren't exactly on a branch, pick a branch which represents # the current commit. If all else fails, we are on a branchless # commit. branches, rc = runner(GITS, ["branch", "--contains"], cwd=root) # --contains was added in git-1.5.4 if rc != 0 or branches is None: raise NotThisMethod("'git branch --contains' returned error") branches = branches.split("\n") # Remove the first line if we're running detached if "(" in branches[0]: branches.pop(0) # Strip off the leading "* " from the list of branches. branches = [branch[2:] for branch in branches] if "master" in branches: branch_name = "master" elif not branches: branch_name = None else: # Pick the first branch that is returned. Good or bad. 
branch_name = branches[0] pieces["branch"] = branch_name # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty] # TAG might have hyphens. git_describe = describe_out # look for -dirty suffix dirty = git_describe.endswith("-dirty") pieces["dirty"] = dirty if dirty: git_describe = git_describe[:git_describe.rindex("-dirty")] # now we have TAG-NUM-gHEX or HEX if "-" in git_describe: # TAG-NUM-gHEX mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe) if not mo: # unparsable. Maybe git-describe is misbehaving? pieces["error"] = ("unable to parse git-describe output: '%%s'" %% describe_out) return pieces # tag full_tag = mo.group(1) if not full_tag.startswith(tag_prefix): if verbose: fmt = "tag '%%s' doesn't start with prefix '%%s'" print(fmt %% (full_tag, tag_prefix)) pieces["error"] = ("tag '%%s' doesn't start with prefix '%%s'" %% (full_tag, tag_prefix)) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix):] # distance: number of commits since tag pieces["distance"] = int(mo.group(2)) # commit: short hex revision ID pieces["short"] = mo.group(3) else: # HEX: no tags pieces["closest-tag"] = None out, rc = runner(GITS, ["rev-list", "HEAD", "--left-right"], cwd=root) pieces["distance"] = len(out.split()) # total number of commits # commit date: see ISO-8601 comment in git_versions_from_keywords() date = runner(GITS, ["show", "-s", "--format=%%ci", "HEAD"], cwd=root)[0].strip() # Use only the last line. Previous lines may contain GPG signature # information. date = date.splitlines()[-1] pieces["date"] = date.strip().replace(" ", "T", 1).replace(" ", "", 1) return pieces def plus_or_dot(pieces: Dict[str, Any]) -> str: """Return a + if we don't already have one, else return a .""" if "+" in pieces.get("closest-tag", ""): return "." return "+" def render_pep440(pieces: Dict[str, Any]) -> str: """Build up version string, with post-release "local version identifier". Our goal: TAG[+DISTANCE.gHEX[.dirty]] . 
    Note that if you get a tagged build and then dirty it, you'll get
    TAG+0.gHEX.dirty

    Exceptions:
    1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += plus_or_dot(pieces)
            rendered += "%%d.g%%s" %% (pieces["distance"], pieces["short"])
            if pieces["dirty"]:
                rendered += ".dirty"
    else:
        # exception #1
        rendered = "0+untagged.%%d.g%%s" %% (pieces["distance"],
                                            pieces["short"])
        if pieces["dirty"]:
            rendered += ".dirty"
    return rendered


def render_pep440_branch(pieces: Dict[str, Any]) -> str:
    """TAG[[.dev0]+DISTANCE.gHEX[.dirty]] .

    The ".dev0" means not master branch. Note that .dev0 sorts backwards
    (a feature branch will appear "older" than the master branch).

    Exceptions:
    1: no tags. 0[.dev0]+untagged.DISTANCE.gHEX[.dirty]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            if pieces["branch"] != "master":
                rendered += ".dev0"
            rendered += plus_or_dot(pieces)
            rendered += "%%d.g%%s" %% (pieces["distance"], pieces["short"])
            if pieces["dirty"]:
                rendered += ".dirty"
    else:
        # exception #1
        rendered = "0"
        if pieces["branch"] != "master":
            rendered += ".dev0"
        rendered += "+untagged.%%d.g%%s" %% (pieces["distance"],
                                            pieces["short"])
        if pieces["dirty"]:
            rendered += ".dirty"
    return rendered


def pep440_split_post(ver: str) -> Tuple[str, Optional[int]]:
    """Split pep440 version string at the post-release segment.

    Returns the release segments before the post-release and the
    post-release version number (or None if no post-release segment is
    present).
    """
    vc = str.split(ver, ".post")
    return vc[0], int(vc[1] or 0) if len(vc) == 2 else None


def render_pep440_pre(pieces: Dict[str, Any]) -> str:
    """TAG[.postN.devDISTANCE] -- No -dirty.

    Exceptions:
    1: no tags.
0.post0.devDISTANCE """ if pieces["closest-tag"]: if pieces["distance"]: # update the post release segment tag_version, post_version = pep440_split_post(pieces["closest-tag"]) rendered = tag_version if post_version is not None: rendered += ".post%%d.dev%%d" %% (post_version + 1, pieces["distance"]) else: rendered += ".post0.dev%%d" %% (pieces["distance"]) else: # no commits, use the tag as the version rendered = pieces["closest-tag"] else: # exception #1 rendered = "0.post0.dev%%d" %% pieces["distance"] return rendered def render_pep440_post(pieces: Dict[str, Any]) -> str: """TAG[.postDISTANCE[.dev0]+gHEX] . The ".dev0" means dirty. Note that .dev0 sorts backwards (a dirty tree will appear "older" than the corresponding clean one), but you shouldn't be releasing software with -dirty anyways. Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%%d" %% pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%%s" %% pieces["short"] else: # exception #1 rendered = "0.post%%d" %% pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += "+g%%s" %% pieces["short"] return rendered def render_pep440_post_branch(pieces: Dict[str, Any]) -> str: """TAG[.postDISTANCE[.dev0]+gHEX[.dirty]] . The ".dev0" means not master branch. Exceptions: 1: no tags. 
    0.postDISTANCE[.dev0]+gHEX[.dirty]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += ".post%%d" %% pieces["distance"]
            if pieces["branch"] != "master":
                rendered += ".dev0"
            rendered += plus_or_dot(pieces)
            rendered += "g%%s" %% pieces["short"]
            if pieces["dirty"]:
                rendered += ".dirty"
    else:
        # exception #1
        rendered = "0.post%%d" %% pieces["distance"]
        if pieces["branch"] != "master":
            rendered += ".dev0"
        rendered += "+g%%s" %% pieces["short"]
        if pieces["dirty"]:
            rendered += ".dirty"
    return rendered


def render_pep440_old(pieces: Dict[str, Any]) -> str:
    """TAG[.postDISTANCE[.dev0]] .

    The ".dev0" means dirty.

    Exceptions:
    1: no tags. 0.postDISTANCE[.dev0]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += ".post%%d" %% pieces["distance"]
            if pieces["dirty"]:
                rendered += ".dev0"
    else:
        # exception #1
        rendered = "0.post%%d" %% pieces["distance"]
        if pieces["dirty"]:
            rendered += ".dev0"
    return rendered


def render_git_describe(pieces: Dict[str, Any]) -> str:
    """TAG[-DISTANCE-gHEX][-dirty].

    Like 'git describe --tags --dirty --always'.

    Exceptions:
    1: no tags. HEX[-dirty] (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"]:
            rendered += "-%%d-g%%s" %% (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render_git_describe_long(pieces: Dict[str, Any]) -> str:
    """TAG-DISTANCE-gHEX[-dirty].

    Like 'git describe --tags --dirty --always --long'.
    The distance/hash is unconditional.

    Exceptions:
    1: no tags.
HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] rendered += "-%%d-g%%s" %% (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render(pieces: Dict[str, Any], style: str) -> Dict[str, Any]: """Render the given version pieces into the requested style.""" if pieces["error"]: return {"version": "unknown", "full-revisionid": pieces.get("long"), "dirty": None, "error": pieces["error"], "date": None} if not style or style == "default": style = "pep440" # the default if style == "pep440": rendered = render_pep440(pieces) elif style == "pep440-branch": rendered = render_pep440_branch(pieces) elif style == "pep440-pre": rendered = render_pep440_pre(pieces) elif style == "pep440-post": rendered = render_pep440_post(pieces) elif style == "pep440-post-branch": rendered = render_pep440_post_branch(pieces) elif style == "pep440-old": rendered = render_pep440_old(pieces) elif style == "git-describe": rendered = render_git_describe(pieces) elif style == "git-describe-long": rendered = render_git_describe_long(pieces) else: raise ValueError("unknown style '%%s'" %% style) return {"version": rendered, "full-revisionid": pieces["long"], "dirty": pieces["dirty"], "error": None, "date": pieces.get("date")} def get_versions() -> Dict[str, Any]: """Get version information or return default if unable to do so.""" # I am in _version.py, which lives at ROOT/VERSIONFILE_SOURCE. If we have # __file__, we can work backwards from there to the root. Some # py2exe/bbfreeze/non-CPython implementations don't do __file__, in which # case we can only use expanded keywords. 
cfg = get_config() verbose = cfg.verbose try: return git_versions_from_keywords(get_keywords(), cfg.tag_prefix, verbose) except NotThisMethod: pass try: root = os.path.realpath(__file__) # versionfile_source is the relative path from the top of the source # tree (where the .git directory might live) to this file. Invert # this to find the root from __file__. for _ in cfg.versionfile_source.split('/'): root = os.path.dirname(root) except NameError: return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to find root of source tree", "date": None} try: pieces = git_pieces_from_vcs(cfg.tag_prefix, root, verbose) return render(pieces, cfg.style) except NotThisMethod: pass try: if cfg.parentdir_prefix: return versions_from_parentdir(cfg.parentdir_prefix, root, verbose) except NotThisMethod: pass return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to compute version", "date": None} ''' @register_vcs_handler("git", "get_keywords") def git_get_keywords(versionfile_abs: str) -> Dict[str, str]: """Extract version information from the given file.""" # the code embedded in _version.py can just fetch the value of these # keywords. When used from setup.py, we don't want to import _version.py, # so we do it with a regexp instead. This function is not used from # _version.py. 
keywords: Dict[str, str] = {} try: with open(versionfile_abs, "r") as fobj: for line in fobj: if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["refnames"] = mo.group(1) if line.strip().startswith("git_full ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["full"] = mo.group(1) if line.strip().startswith("git_date ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["date"] = mo.group(1) except OSError: pass return keywords @register_vcs_handler("git", "keywords") def git_versions_from_keywords( keywords: Dict[str, str], tag_prefix: str, verbose: bool, ) -> Dict[str, Any]: """Get version information from git keywords.""" if "refnames" not in keywords: raise NotThisMethod("Short version file found") date = keywords.get("date") if date is not None: # Use only the last line. Previous lines may contain GPG signature # information. date = date.splitlines()[-1] # git-2.2.0 added "%cI", which expands to an ISO-8601 -compliant # datestamp. However we prefer "%ci" (which expands to an "ISO-8601 # -like" string, which we must then edit to make compliant), because # it's been around since git-1.5.3, and it's too difficult to # discover which version we're using, or to work around using an # older one. date = date.strip().replace(" ", "T", 1).replace(" ", "", 1) refnames = keywords["refnames"].strip() if refnames.startswith("$Format"): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " tags = {r[len(TAG):] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. 
The old git %d # expansion behaves like git log --decorate=short and strips out the # refs/heads/ and refs/tags/ prefixes that would let us distinguish # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". tags = {r for r in refs if re.search(r'\d', r)} if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: print("likely tags: %s" % ",".join(sorted(tags))) for ref in sorted(tags): # sorting will prefer e.g. "2.0" over "2.0rc1" if ref.startswith(tag_prefix): r = ref[len(tag_prefix):] # Filter out refs that exactly match prefix or that don't start # with a number once the prefix is stripped (mostly a concern # when prefix is '') if not re.match(r'\d', r): continue if verbose: print("picking %s" % r) return {"version": r, "full-revisionid": keywords["full"].strip(), "dirty": False, "error": None, "date": date} # no suitable tags, so version is "0+unknown", but full hex is still there if verbose: print("no suitable tags, using unknown + full revision id") return {"version": "0+unknown", "full-revisionid": keywords["full"].strip(), "dirty": False, "error": "no suitable tags", "date": None} @register_vcs_handler("git", "pieces_from_vcs") def git_pieces_from_vcs( tag_prefix: str, root: str, verbose: bool, runner: Callable = run_command ) -> Dict[str, Any]: """Get version from 'git describe' in the root of the source tree. This only gets called if the git-archive 'subst' keywords were *not* expanded, and _version.py hasn't already been rewritten with a short version string, meaning we're inside a checked out source tree. """ GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] # GIT_DIR can interfere with correct operation of Versioneer. # It may be intended to be passed to the Versioneer-versioned project, # but that should not change where we get our version from. 
env = os.environ.copy() env.pop("GIT_DIR", None) runner = functools.partial(runner, env=env) _, rc = runner(GITS, ["rev-parse", "--git-dir"], cwd=root, hide_stderr=not verbose) if rc != 0: if verbose: print("Directory %s not under git control" % root) raise NotThisMethod("'git rev-parse --git-dir' returned error") # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty] # if there isn't one, this yields HEX[-dirty] (no NUM) describe_out, rc = runner(GITS, [ "describe", "--tags", "--dirty", "--always", "--long", "--match", f"{tag_prefix}[[:digit:]]*" ], cwd=root) # --long was added in git-1.5.5 if describe_out is None: raise NotThisMethod("'git describe' failed") describe_out = describe_out.strip() full_out, rc = runner(GITS, ["rev-parse", "HEAD"], cwd=root) if full_out is None: raise NotThisMethod("'git rev-parse' failed") full_out = full_out.strip() pieces: Dict[str, Any] = {} pieces["long"] = full_out pieces["short"] = full_out[:7] # maybe improved later pieces["error"] = None branch_name, rc = runner(GITS, ["rev-parse", "--abbrev-ref", "HEAD"], cwd=root) # --abbrev-ref was added in git-1.6.3 if rc != 0 or branch_name is None: raise NotThisMethod("'git rev-parse --abbrev-ref' returned error") branch_name = branch_name.strip() if branch_name == "HEAD": # If we aren't exactly on a branch, pick a branch which represents # the current commit. If all else fails, we are on a branchless # commit. branches, rc = runner(GITS, ["branch", "--contains"], cwd=root) # --contains was added in git-1.5.4 if rc != 0 or branches is None: raise NotThisMethod("'git branch --contains' returned error") branches = branches.split("\n") # Remove the first line if we're running detached if "(" in branches[0]: branches.pop(0) # Strip off the leading "* " from the list of branches. branches = [branch[2:] for branch in branches] if "master" in branches: branch_name = "master" elif not branches: branch_name = None else: # Pick the first branch that is returned. Good or bad. 
branch_name = branches[0] pieces["branch"] = branch_name # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty] # TAG might have hyphens. git_describe = describe_out # look for -dirty suffix dirty = git_describe.endswith("-dirty") pieces["dirty"] = dirty if dirty: git_describe = git_describe[:git_describe.rindex("-dirty")] # now we have TAG-NUM-gHEX or HEX if "-" in git_describe: # TAG-NUM-gHEX mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe) if not mo: # unparsable. Maybe git-describe is misbehaving? pieces["error"] = ("unable to parse git-describe output: '%s'" % describe_out) return pieces # tag full_tag = mo.group(1) if not full_tag.startswith(tag_prefix): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) pieces["error"] = ("tag '%s' doesn't start with prefix '%s'" % (full_tag, tag_prefix)) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix):] # distance: number of commits since tag pieces["distance"] = int(mo.group(2)) # commit: short hex revision ID pieces["short"] = mo.group(3) else: # HEX: no tags pieces["closest-tag"] = None out, rc = runner(GITS, ["rev-list", "HEAD", "--left-right"], cwd=root) pieces["distance"] = len(out.split()) # total number of commits # commit date: see ISO-8601 comment in git_versions_from_keywords() date = runner(GITS, ["show", "-s", "--format=%ci", "HEAD"], cwd=root)[0].strip() # Use only the last line. Previous lines may contain GPG signature # information. date = date.splitlines()[-1] pieces["date"] = date.strip().replace(" ", "T", 1).replace(" ", "", 1) return pieces def do_vcs_install(versionfile_source: str, ipy: Optional[str]) -> None: """Git-specific installation logic for Versioneer. For Git, this means creating/changing .gitattributes to mark _version.py for export-subst keyword substitution. 
""" GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] files = [versionfile_source] if ipy: files.append(ipy) if "VERSIONEER_PEP518" not in globals(): try: my_path = __file__ if my_path.endswith((".pyc", ".pyo")): my_path = os.path.splitext(my_path)[0] + ".py" versioneer_file = os.path.relpath(my_path) except NameError: versioneer_file = "versioneer.py" files.append(versioneer_file) present = False try: with open(".gitattributes", "r") as fobj: for line in fobj: if line.strip().startswith(versionfile_source): if "export-subst" in line.strip().split()[1:]: present = True break except OSError: pass if not present: with open(".gitattributes", "a+") as fobj: fobj.write(f"{versionfile_source} export-subst\n") files.append(".gitattributes") run_command(GITS, ["add", "--"] + files) def versions_from_parentdir( parentdir_prefix: str, root: str, verbose: bool, ) -> Dict[str, Any]: """Try to determine the version from the parent directory name. Source tarballs conventionally unpack into a directory that includes both the project name and a version string. We will also support searching up two directory levels for an appropriately named parent directory """ rootdirs = [] for _ in range(3): dirname = os.path.basename(root) if dirname.startswith(parentdir_prefix): return {"version": dirname[len(parentdir_prefix):], "full-revisionid": None, "dirty": False, "error": None, "date": None} rootdirs.append(root) root = os.path.dirname(root) # up a level if verbose: print("Tried directories %s but none started with prefix %s" % (str(rootdirs), parentdir_prefix)) raise NotThisMethod("rootdir doesn't start with parentdir_prefix") SHORT_VERSION_PY = """ # This file was generated by 'versioneer.py' (0.29) from # revision-control system data, or from the parent directory name of an # unpacked source archive. Distribution tarballs contain a pre-generated copy # of this file. 
import json version_json = ''' %s ''' # END VERSION_JSON def get_versions(): return json.loads(version_json) """ def versions_from_file(filename: str) -> Dict[str, Any]: """Try to determine the version from _version.py if present.""" try: with open(filename) as f: contents = f.read() except OSError: raise NotThisMethod("unable to read _version.py") mo = re.search(r"version_json = '''\n(.*)''' # END VERSION_JSON", contents, re.M | re.S) if not mo: mo = re.search(r"version_json = '''\r\n(.*)''' # END VERSION_JSON", contents, re.M | re.S) if not mo: raise NotThisMethod("no version_json in _version.py") return json.loads(mo.group(1)) def write_to_version_file(filename: str, versions: Dict[str, Any]) -> None: """Write the given version number to the given _version.py file.""" contents = json.dumps(versions, sort_keys=True, indent=1, separators=(",", ": ")) with open(filename, "w") as f: f.write(SHORT_VERSION_PY % contents) print("set %s to '%s'" % (filename, versions["version"])) def plus_or_dot(pieces: Dict[str, Any]) -> str: """Return a + if we don't already have one, else return a .""" if "+" in pieces.get("closest-tag", ""): return "." return "+" def render_pep440(pieces: Dict[str, Any]) -> str: """Build up version string, with post-release "local version identifier". Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you'll get TAG+0.gHEX.dirty Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += plus_or_dot(pieces) rendered += "%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" else: # exception #1 rendered = "0+untagged.%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" return rendered def render_pep440_branch(pieces: Dict[str, Any]) -> str: """TAG[[.dev0]+DISTANCE.gHEX[.dirty]] . 
    The ".dev0" means not master branch. Note that .dev0 sorts backwards
    (a feature branch will appear "older" than the master branch).

    Exceptions:
    1: no tags. 0[.dev0]+untagged.DISTANCE.gHEX[.dirty]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            if pieces["branch"] != "master":
                rendered += ".dev0"
            rendered += plus_or_dot(pieces)
            rendered += "%d.g%s" % (pieces["distance"], pieces["short"])
            if pieces["dirty"]:
                rendered += ".dirty"
    else:
        # exception #1
        rendered = "0"
        if pieces["branch"] != "master":
            rendered += ".dev0"
        rendered += "+untagged.%d.g%s" % (pieces["distance"],
                                          pieces["short"])
        if pieces["dirty"]:
            rendered += ".dirty"
    return rendered


def pep440_split_post(ver: str) -> Tuple[str, Optional[int]]:
    """Split pep440 version string at the post-release segment.

    Returns the release segments before the post-release and the
    post-release version number (or None if no post-release segment is
    present).
    """
    vc = str.split(ver, ".post")
    return vc[0], int(vc[1] or 0) if len(vc) == 2 else None


def render_pep440_pre(pieces: Dict[str, Any]) -> str:
    """TAG[.postN.devDISTANCE] -- No -dirty.

    Exceptions:
    1: no tags. 0.post0.devDISTANCE
    """
    if pieces["closest-tag"]:
        if pieces["distance"]:
            # update the post release segment
            tag_version, post_version = pep440_split_post(
                pieces["closest-tag"])
            rendered = tag_version
            if post_version is not None:
                rendered += ".post%d.dev%d" % (post_version + 1,
                                               pieces["distance"])
            else:
                rendered += ".post0.dev%d" % (pieces["distance"])
        else:
            # no commits, use the tag as the version
            rendered = pieces["closest-tag"]
    else:
        # exception #1
        rendered = "0.post0.dev%d" % pieces["distance"]
    return rendered


def render_pep440_post(pieces: Dict[str, Any]) -> str:
    """TAG[.postDISTANCE[.dev0]+gHEX] .

    The ".dev0" means dirty. Note that .dev0 sorts backwards
    (a dirty tree will appear "older" than the corresponding clean one),
    but you shouldn't be releasing software with -dirty anyways.

    Exceptions:
    1: no tags.
0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%s" % pieces["short"] else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += "+g%s" % pieces["short"] return rendered def render_pep440_post_branch(pieces: Dict[str, Any]) -> str: """TAG[.postDISTANCE[.dev0]+gHEX[.dirty]] . The ".dev0" means not master branch. Exceptions: 1: no tags. 0.postDISTANCE[.dev0]+gHEX[.dirty] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["branch"] != "master": rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%s" % pieces["short"] if pieces["dirty"]: rendered += ".dirty" else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["branch"] != "master": rendered += ".dev0" rendered += "+g%s" % pieces["short"] if pieces["dirty"]: rendered += ".dirty" return rendered def render_pep440_old(pieces: Dict[str, Any]) -> str: """TAG[.postDISTANCE[.dev0]] . The ".dev0" means dirty. Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" return rendered def render_git_describe(pieces: Dict[str, Any]) -> str: """TAG[-DISTANCE-gHEX][-dirty]. Like 'git describe --tags --dirty --always'. Exceptions: 1: no tags. 
    HEX[-dirty] (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"]:
            rendered += "-%d-g%s" % (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render_git_describe_long(pieces: Dict[str, Any]) -> str:
    """TAG-DISTANCE-gHEX[-dirty].

    Like 'git describe --tags --dirty --always --long'.
    The distance/hash is unconditional.

    Exceptions:
    1: no tags. HEX[-dirty] (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        rendered += "-%d-g%s" % (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render(pieces: Dict[str, Any], style: str) -> Dict[str, Any]:
    """Render the given version pieces into the requested style."""
    if pieces["error"]:
        return {"version": "unknown",
                "full-revisionid": pieces.get("long"),
                "dirty": None,
                "error": pieces["error"],
                "date": None}

    if not style or style == "default":
        style = "pep440"  # the default

    if style == "pep440":
        rendered = render_pep440(pieces)
    elif style == "pep440-branch":
        rendered = render_pep440_branch(pieces)
    elif style == "pep440-pre":
        rendered = render_pep440_pre(pieces)
    elif style == "pep440-post":
        rendered = render_pep440_post(pieces)
    elif style == "pep440-post-branch":
        rendered = render_pep440_post_branch(pieces)
    elif style == "pep440-old":
        rendered = render_pep440_old(pieces)
    elif style == "git-describe":
        rendered = render_git_describe(pieces)
    elif style == "git-describe-long":
        rendered = render_git_describe_long(pieces)
    else:
        raise ValueError("unknown style '%s'" % style)

    return {"version": rendered, "full-revisionid": pieces["long"],
            "dirty": pieces["dirty"], "error": None,
            "date": pieces.get("date")}


class VersioneerBadRootError(Exception):
    """The project root directory is unknown or missing key files."""


def get_versions(verbose: bool = False) -> Dict[str, Any]:
    """Get the project version from whatever source is available.

    Returns a dict with the keys 'version', 'full-revisionid', 'dirty',
    'error', and 'date'.
    """
    if "versioneer" in sys.modules:
        # see the discussion in cmdclass.py:get_cmdclass()
        del sys.modules["versioneer"]

    root = get_root()
    cfg = get_config_from_root(root)

    assert cfg.VCS is not None, "please set [versioneer]VCS= in setup.cfg"
    handlers = HANDLERS.get(cfg.VCS)
    assert handlers, "unrecognized VCS '%s'" % cfg.VCS
    verbose = verbose or bool(cfg.verbose)  # `bool()` used to avoid `None`
    assert cfg.versionfile_source is not None, \
        "please set versioneer.versionfile_source"
    assert cfg.tag_prefix is not None, "please set versioneer.tag_prefix"

    versionfile_abs = os.path.join(root, cfg.versionfile_source)

    # extract version from first of: _version.py, VCS command (e.g. 'git
    # describe'), parentdir. This is meant to work for developers using a
    # source checkout, for users of a tarball created by 'setup.py sdist',
    # and for users of a tarball/zipball created by 'git archive' or github's
    # download-from-tag feature or the equivalent in other VCSes.
    get_keywords_f = handlers.get("get_keywords")
    from_keywords_f = handlers.get("keywords")
    if get_keywords_f and from_keywords_f:
        try:
            keywords = get_keywords_f(versionfile_abs)
            ver = from_keywords_f(keywords, cfg.tag_prefix, verbose)
            if verbose:
                print("got version from expanded keyword %s" % ver)
            return ver
        except NotThisMethod:
            pass

    try:
        ver = versions_from_file(versionfile_abs)
        if verbose:
            print("got version from file %s %s" % (versionfile_abs, ver))
        return ver
    except NotThisMethod:
        pass

    from_vcs_f = handlers.get("pieces_from_vcs")
    if from_vcs_f:
        try:
            pieces = from_vcs_f(cfg.tag_prefix, root, verbose)
            ver = render(pieces, cfg.style)
            if verbose:
                print("got version from VCS %s" % ver)
            return ver
        except NotThisMethod:
            pass

    try:
        if cfg.parentdir_prefix:
            ver = versions_from_parentdir(cfg.parentdir_prefix, root, verbose)
            if verbose:
                print("got version from parentdir %s" % ver)
            return ver
    except NotThisMethod:
        pass

    if verbose:
        print("unable to compute version")

    return {"version": "0+unknown", "full-revisionid": None,
            "dirty": None, "error": "unable to compute version",
            "date": None}


def get_version() -> str:
    """Get the short version string for this project."""
    return get_versions()["version"]


def get_cmdclass(cmdclass: Optional[Dict[str, Any]] = None):
    """Get the custom setuptools subclasses used by Versioneer.

    If the package uses a different cmdclass (e.g. one from numpy), it
    should be provided as an argument.
    """
    if "versioneer" in sys.modules:
        del sys.modules["versioneer"]
        # this fixes the "python setup.py develop" case (also 'install' and
        # 'easy_install .'), in which subdependencies of the main project are
        # built (using setup.py bdist_egg) in the same python process. Assume
        # a main project A and a dependency B, which use different versions
        # of Versioneer. A's setup.py imports A's Versioneer, leaving it in
        # sys.modules by the time B's setup.py is executed, causing B to run
        # with the wrong versioneer.
        # Setuptools wraps the sub-dep builds in a
        # sandbox that restores sys.modules to its pre-build state, so the
        # parent is protected against the child's "import versioneer". By
        # removing ourselves from sys.modules here, before the child build
        # happens, we protect the child from the parent's versioneer too.
        # Also see https://github.com/python-versioneer/python-versioneer/issues/52

    cmds = {} if cmdclass is None else cmdclass.copy()

    # we add "version" to setuptools
    from setuptools import Command

    class cmd_version(Command):
        description = "report generated version string"
        user_options: List[Tuple[str, str, str]] = []
        boolean_options: List[str] = []

        def initialize_options(self) -> None:
            pass

        def finalize_options(self) -> None:
            pass

        def run(self) -> None:
            vers = get_versions(verbose=True)
            print("Version: %s" % vers["version"])
            print(" full-revisionid: %s" % vers.get("full-revisionid"))
            print(" dirty: %s" % vers.get("dirty"))
            print(" date: %s" % vers.get("date"))
            if vers["error"]:
                print(" error: %s" % vers["error"])
    cmds["version"] = cmd_version

    # we override "build_py" in setuptools
    #
    # most invocation pathways end up running build_py:
    #  distutils/build -> build_py
    #  distutils/install -> distutils/build ->..
    #  setuptools/bdist_wheel -> distutils/install ->..
    #  setuptools/bdist_egg -> distutils/install_lib -> build_py
    #  setuptools/install -> bdist_egg ->..
    #  setuptools/develop -> ?
    #  pip install:
    #   copies source tree to a tempdir before running egg_info/etc
    #   if .git isn't copied too, 'git describe' will fail
    #   then does setup.py bdist_wheel, or sometimes setup.py install
    #  setup.py egg_info -> ?

    # pip install -e . and setuptools/editable_wheel will invoke build_py
    # but the build_py command is not expected to copy any files.
# we override different "build_py" commands for both environments if 'build_py' in cmds: _build_py: Any = cmds['build_py'] else: from setuptools.command.build_py import build_py as _build_py class cmd_build_py(_build_py): def run(self) -> None: root = get_root() cfg = get_config_from_root(root) versions = get_versions() _build_py.run(self) if getattr(self, "editable_mode", False): # During editable installs `.py` and data files are # not copied to build_lib return # now locate _version.py in the new build/ directory and replace # it with an updated value if cfg.versionfile_build: target_versionfile = os.path.join(self.build_lib, cfg.versionfile_build) print("UPDATING %s" % target_versionfile) write_to_version_file(target_versionfile, versions) cmds["build_py"] = cmd_build_py if 'build_ext' in cmds: _build_ext: Any = cmds['build_ext'] else: from setuptools.command.build_ext import build_ext as _build_ext class cmd_build_ext(_build_ext): def run(self) -> None: root = get_root() cfg = get_config_from_root(root) versions = get_versions() _build_ext.run(self) if self.inplace: # build_ext --inplace will only build extensions in # build/lib<..> dir with no _version.py to write to. # As in place builds will already have a _version.py # in the module dir, we do not need to write one. return # now locate _version.py in the new build/ directory and replace # it with an updated value if not cfg.versionfile_build: return target_versionfile = os.path.join(self.build_lib, cfg.versionfile_build) if not os.path.exists(target_versionfile): print(f"Warning: {target_versionfile} does not exist, skipping " "version update. This can happen if you are running build_ext " "without first running build_py.") return print("UPDATING %s" % target_versionfile) write_to_version_file(target_versionfile, versions) cmds["build_ext"] = cmd_build_ext if "cx_Freeze" in sys.modules: # cx_freeze enabled? 
        from cx_Freeze.dist import build_exe as _build_exe  # type: ignore
        # nczeczulin reports that py2exe won't like the pep440-style string
        # as FILEVERSION, but it can be used for PRODUCTVERSION, e.g.
        # setup(console=[{
        #   "version": versioneer.get_version().split("+", 1)[0], # FILEVERSION
        #   "product_version": versioneer.get_version(),
        #   ...

        class cmd_build_exe(_build_exe):
            def run(self) -> None:
                root = get_root()
                cfg = get_config_from_root(root)
                versions = get_versions()
                target_versionfile = cfg.versionfile_source
                print("UPDATING %s" % target_versionfile)
                write_to_version_file(target_versionfile, versions)

                _build_exe.run(self)
                os.unlink(target_versionfile)
                with open(cfg.versionfile_source, "w") as f:
                    LONG = LONG_VERSION_PY[cfg.VCS]
                    f.write(LONG %
                            {"DOLLAR": "$",
                             "STYLE": cfg.style,
                             "TAG_PREFIX": cfg.tag_prefix,
                             "PARENTDIR_PREFIX": cfg.parentdir_prefix,
                             "VERSIONFILE_SOURCE": cfg.versionfile_source,
                             })
        cmds["build_exe"] = cmd_build_exe
        del cmds["build_py"]

    if 'py2exe' in sys.modules:  # py2exe enabled?
        try:
            from py2exe.setuptools_buildexe import py2exe as _py2exe  # type: ignore
        except ImportError:
            from py2exe.distutils_buildexe import py2exe as _py2exe  # type: ignore

        class cmd_py2exe(_py2exe):
            def run(self) -> None:
                root = get_root()
                cfg = get_config_from_root(root)
                versions = get_versions()
                target_versionfile = cfg.versionfile_source
                print("UPDATING %s" % target_versionfile)
                write_to_version_file(target_versionfile, versions)

                _py2exe.run(self)
                os.unlink(target_versionfile)
                with open(cfg.versionfile_source, "w") as f:
                    LONG = LONG_VERSION_PY[cfg.VCS]
                    f.write(LONG %
                            {"DOLLAR": "$",
                             "STYLE": cfg.style,
                             "TAG_PREFIX": cfg.tag_prefix,
                             "PARENTDIR_PREFIX": cfg.parentdir_prefix,
                             "VERSIONFILE_SOURCE": cfg.versionfile_source,
                             })
        cmds["py2exe"] = cmd_py2exe

    # sdist farms its file list building out to egg_info
    if 'egg_info' in cmds:
        _egg_info: Any = cmds['egg_info']
    else:
        from setuptools.command.egg_info import egg_info as _egg_info

    class cmd_egg_info(_egg_info):
        def find_sources(self) -> None:
            # egg_info.find_sources builds the manifest list and writes it
            # in one shot
            super().find_sources()

            # Modify the filelist and normalize it
            root = get_root()
            cfg = get_config_from_root(root)
            self.filelist.append('versioneer.py')
            if cfg.versionfile_source:
                # There are rare cases where versionfile_source might not be
                # included by default, so we must be explicit
                self.filelist.append(cfg.versionfile_source)
            self.filelist.sort()
            self.filelist.remove_duplicates()

            # The write method is hidden in the manifest_maker instance that
            # generated the filelist and was thrown away
            # We will instead replicate their final normalization (to unicode,
            # and POSIX-style paths)
            from setuptools import unicode_utils
            normalized = [unicode_utils.filesys_decode(f).replace(os.sep, '/')
                          for f in self.filelist.files]

            manifest_filename = os.path.join(self.egg_info, 'SOURCES.txt')
            with open(manifest_filename, 'w') as fobj:
                fobj.write('\n'.join(normalized))

    cmds['egg_info'] = cmd_egg_info

    # we override different "sdist" commands for both environments
    if 'sdist' in cmds:
        _sdist: Any = cmds['sdist']
    else:
        from setuptools.command.sdist import sdist as _sdist

    class cmd_sdist(_sdist):
        def run(self) -> None:
            versions = get_versions()
            self._versioneer_generated_versions = versions
            # unless we update this, the command will keep using the old
            # version
            self.distribution.metadata.version = versions["version"]
            return _sdist.run(self)

        def make_release_tree(self, base_dir: str, files: List[str]) -> None:
            root = get_root()
            cfg = get_config_from_root(root)
            _sdist.make_release_tree(self, base_dir, files)
            # now locate _version.py in the new base_dir directory
            # (remembering that it may be a hardlink) and replace it with an
            # updated value
            target_versionfile = os.path.join(base_dir, cfg.versionfile_source)
            print("UPDATING %s" % target_versionfile)
            write_to_version_file(target_versionfile,
                                  self._versioneer_generated_versions)
    cmds["sdist"] = cmd_sdist

    return cmds


CONFIG_ERROR = """
setup.cfg is missing the necessary Versioneer configuration. You need
a section like:

 [versioneer]
 VCS = git
 style = pep440
 versionfile_source = src/myproject/_version.py
 versionfile_build = myproject/_version.py
 tag_prefix =
 parentdir_prefix = myproject-

You will also need to edit your setup.py to use the results:

 import versioneer
 setup(version=versioneer.get_version(),
       cmdclass=versioneer.get_cmdclass(), ...)

Please read the docstring in ./versioneer.py for configuration instructions,
edit setup.cfg, and re-run the installer or 'python versioneer.py setup'.
"""

SAMPLE_CONFIG = """
# See the docstring in versioneer.py for instructions. Note that you must
# re-run 'versioneer.py setup' after changing this section, and commit the
# resulting files.

[versioneer]
#VCS = git
#style = pep440
#versionfile_source =
#versionfile_build =
#tag_prefix =
#parentdir_prefix =

"""

OLD_SNIPPET = """
from ._version import get_versions
__version__ = get_versions()['version']
del get_versions
"""

INIT_PY_SNIPPET = """
from . import {0}
__version__ = {0}.get_versions()['version']
"""


def do_setup() -> int:
    """Do main VCS-independent setup function for installing Versioneer."""
    root = get_root()
    try:
        cfg = get_config_from_root(root)
    except (OSError, configparser.NoSectionError,
            configparser.NoOptionError) as e:
        if isinstance(e, (OSError, configparser.NoSectionError)):
            print("Adding sample versioneer config to setup.cfg",
                  file=sys.stderr)
            with open(os.path.join(root, "setup.cfg"), "a") as f:
                f.write(SAMPLE_CONFIG)
        print(CONFIG_ERROR, file=sys.stderr)
        return 1

    print(" creating %s" % cfg.versionfile_source)
    with open(cfg.versionfile_source, "w") as f:
        LONG = LONG_VERSION_PY[cfg.VCS]
        f.write(LONG % {"DOLLAR": "$",
                        "STYLE": cfg.style,
                        "TAG_PREFIX": cfg.tag_prefix,
                        "PARENTDIR_PREFIX": cfg.parentdir_prefix,
                        "VERSIONFILE_SOURCE": cfg.versionfile_source,
                        })

    ipy = os.path.join(os.path.dirname(cfg.versionfile_source),
                       "__init__.py")
    maybe_ipy: Optional[str] = ipy
    if os.path.exists(ipy):
        try:
            with open(ipy, "r") as f:
                old = f.read()
        except OSError:
            old = ""
        module = os.path.splitext(os.path.basename(cfg.versionfile_source))[0]
        snippet = INIT_PY_SNIPPET.format(module)
        if OLD_SNIPPET in old:
            print(" replacing boilerplate in %s" % ipy)
            with open(ipy, "w") as f:
                f.write(old.replace(OLD_SNIPPET, snippet))
        elif snippet not in old:
            print(" appending to %s" % ipy)
            with open(ipy, "a") as f:
                f.write(snippet)
        else:
            print(" %s unmodified" % ipy)
    else:
        print(" %s doesn't exist, ok" % ipy)
        maybe_ipy = None

    # Make VCS-specific changes. For git, this means creating/changing
    # .gitattributes to mark _version.py for export-subst keyword
    # substitution.
    do_vcs_install(cfg.versionfile_source, maybe_ipy)
    return 0


def scan_setup_py() -> int:
    """Validate the contents of setup.py against Versioneer's expectations."""
    found = set()
    setters = False
    errors = 0
    with open("setup.py", "r") as f:
        for line in f.readlines():
            if "import versioneer" in line:
                found.add("import")
            if "versioneer.get_cmdclass()" in line:
                found.add("cmdclass")
            if "versioneer.get_version()" in line:
                found.add("get_version")
            if "versioneer.VCS" in line:
                setters = True
            if "versioneer.versionfile_source" in line:
                setters = True
    if len(found) != 3:
        print("")
        print("Your setup.py appears to be missing some important items")
        print("(but I might be wrong). Please make sure it has something")
        print("roughly like the following:")
        print("")
        print(" import versioneer")
        print(" setup( version=versioneer.get_version(),")
        print(" cmdclass=versioneer.get_cmdclass(), ...)")
        print("")
        errors += 1
    if setters:
        print("You should remove lines like 'versioneer.VCS = ' and")
        print("'versioneer.versionfile_source = ' . This configuration")
        print("now lives in setup.cfg, and should be removed from setup.py")
        print("")
        errors += 1
    return errors


def setup_command() -> NoReturn:
    """Set up Versioneer and exit with appropriate error code."""
    errors = do_setup()
    errors += scan_setup_py()
    sys.exit(1 if errors else 0)


if __name__ == "__main__":
    cmd = sys.argv[1]
    if cmd == "setup":
        setup_command()