==> liac-arff-2.4.0/.gitignore <==

*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg

==> liac-arff-2.4.0/.travis.yml <==

language: python
matrix:
  include:
    - python: 2.7
    - python: 3.3
    - python: 3.4
    - python: 3.5
    - python: 3.6
    - python: 3.7
      dist: xenial
      sudo: required
    - python: pypy
    - python: pypy3
install: python setup.py install
script: python setup.py test

==> liac-arff-2.4.0/CHANGES.rst <==

~~~~~~~~~~~~~~~~~~~~~~~
What's New in LIAC-ARFF
~~~~~~~~~~~~~~~~~~~~~~~

LIAC-ARFF 2.4

* enhancement: load data progressively with the generator ``return_type``.
* enhancement: standard Java escape sequences are now decoded in string
  attributes, and non-printable characters are now encoded with escaping.
* fix: match all possible separator spaces to add quotes when encoding into
  ARFF. These separator spaces will be preserved when decoding the ARFF
  files.

LIAC-ARFF 2.3.1

* maintenance: replace two bare ``raise`` by appropriate ``raise Exception``
  statements
* maintenance: avoid deprecation warning in Python >= 3.6

LIAC-ARFF 2.3

- enhancement: improvements to loading runtime (issue #76)
- fix: several bugs in decoding and encoding quoted and escaped values,
  particularly in loading sparse ARFF.
- fix #52: Circumvent a known bug when loading sparse data written by WEKA

LIAC-ARFF 2.2.3

- new: test for python3.7 and pypy3

LIAC-ARFF 2.2.2

- fix: better support for string and nominal features containing escape
  characters (issue #69).

LIAC-ARFF 2.2.1

- fix: better support for string features and nominals containing commas
  (issue #64)

LIAC-ARFF 2.2

- fix: do not treat quoted question marks as missing values (issue #50)
- fix: compatibility issue using zip with python2.7
- fix: categorical quoting if comma is present (issue #15)
- fix: remove trailing comment lines (issue #61)
- new: test for python3.5 and python3.6 as well
- new: drop python2.6 support

LIAC-ARFF 2.1.1

- fix: working for 2.6+
- fix: working for 3.3+
- new: encoder checks if data has all attributes
- new: sparse data support

LIAC-ARFF 2.1.0

- fix: working for 2.6+
- fix: working for 3.3+
- new: encoder checks if data has all attributes
- new: sparse data support

LIAC-ARFF 2.0.2

- fix: attribute and relation names now follow the new ARFF specification.
- new: encoded nominal values.

LIAC-ARFF 2.0.1

- fix: dump now escapes correctly special symbols, such as %, ', ", and \.

LIAC-ARFF 2.0

- new: ArffEncoder and ArffDecoder helpers which actually do the
  serialization and loading of ARFF files.
- new: UnitTest cases for all classes and functions.
- new: Detailed exceptions for many cases.
- fix: load, loads, dump, dumps are now simpler.
- rem: arfftools.py and the split function.

LIAC-ARFF 1.0

- First commit.
- new: load, loads, dump, dumps functions

==> liac-arff-2.4.0/LICENSE <==

Copyright (c) 2011 Renato de Pontes Pereira, renato.ppontes at gmail dot com

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

==> liac-arff-2.4.0/README.rst <==

=========
LIAC-ARFF
=========

.. image:: https://travis-ci.org/renatopp/liac-arff.svg
   :target: https://travis-ci.org/renatopp/liac-arff

The liac-arff module implements functions to read and write ARFF files in
Python. It was created in the Connectionist Artificial Intelligence
Laboratory (LIAC), which takes place at the Federal University of Rio Grande
do Sul (UFRGS), in Brazil.

ARFF (Attribute-Relation File Format) is a file format specially created for
describing datasets, which are commonly used in machine learning experiments
and software. This file format was created to be used in Weka, the most
representative software for automated machine learning experiments.

You can clone the `arff-datasets
<https://github.com/renatopp/arff-datasets>`_ repository for a large set of
ARFF files.
--------
Features
--------

- Read and write ARFF files using python built-in structures, such as
  dictionaries and lists;
- Supports `scipy.sparse.coo
  <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html>`_
  and lists of dictionaries as used by SVMLight;
- Supports the following attribute types: NUMERIC, REAL, INTEGER, STRING, and
  NOMINAL;
- Has an interface similar to other built-in modules such as ``json`` or
  ``zipfile``;
- Supports reading and writing the descriptions of files;
- Supports missing values and names with spaces;
- Supports unicode values and names;
- Fully compatible with Python 2.7+ and Python 3.3+;
- Under `MIT License <https://opensource.org/licenses/MIT>`_.

--------------
How To Install
--------------

Via pip::

    $ pip install liac-arff

Via easy_install::

    $ easy_install liac-arff

Manually::

    $ python setup.py install

-------------
Documentation
-------------

For a complete description of the module, consult the official documentation
at http://packages.python.org/liac-arff/ with a mirror at
http://inf.ufrgs.br/~rppereira/docs/liac-arff/index.html

-----
Usage
-----

You can read an ARFF file as follows::

    >>> import arff
    >>> data = arff.load(open('weather.arff', 'r'))

Which results in::

    >>> data
    {
        u'attributes': [
            (u'outlook', [u'sunny', u'overcast', u'rainy']),
            (u'temperature', u'REAL'),
            (u'humidity', u'REAL'),
            (u'windy', [u'TRUE', u'FALSE']),
            (u'play', [u'yes', u'no'])],
        u'data': [
            [u'sunny', 85.0, 85.0, u'FALSE', u'no'],
            [u'sunny', 80.0, 90.0, u'TRUE', u'no'],
            [u'overcast', 83.0, 86.0, u'FALSE', u'yes'],
            [u'rainy', 70.0, 96.0, u'FALSE', u'yes'],
            [u'rainy', 68.0, 80.0, u'FALSE', u'yes'],
            [u'rainy', 65.0, 70.0, u'TRUE', u'no'],
            [u'overcast', 64.0, 65.0, u'TRUE', u'yes'],
            [u'sunny', 72.0, 95.0, u'FALSE', u'no'],
            [u'sunny', 69.0, 70.0, u'FALSE', u'yes'],
            [u'rainy', 75.0, 80.0, u'FALSE', u'yes'],
            [u'sunny', 75.0, 70.0, u'TRUE', u'yes'],
            [u'overcast', 72.0, 90.0, u'TRUE', u'yes'],
            [u'overcast', 81.0, 75.0, u'FALSE', u'yes'],
            [u'rainy', 71.0, 91.0, u'TRUE', u'no']
        ],
        u'description': u'',
        u'relation': u'weather'
    }

You can write an ARFF file with this structure::

    >>> print arff.dumps(data)
    @RELATION weather

    @ATTRIBUTE outlook {sunny, overcast, rainy}
    @ATTRIBUTE temperature REAL
    @ATTRIBUTE humidity REAL
    @ATTRIBUTE windy {TRUE, FALSE}
    @ATTRIBUTE play {yes, no}

    @DATA
    sunny,85.0,85.0,FALSE,no
    sunny,80.0,90.0,TRUE,no
    overcast,83.0,86.0,FALSE,yes
    rainy,70.0,96.0,FALSE,yes
    rainy,68.0,80.0,FALSE,yes
    rainy,65.0,70.0,TRUE,no
    overcast,64.0,65.0,TRUE,yes
    sunny,72.0,95.0,FALSE,no
    sunny,69.0,70.0,FALSE,yes
    rainy,75.0,80.0,FALSE,yes
    sunny,75.0,70.0,TRUE,yes
    overcast,72.0,90.0,TRUE,yes
    overcast,81.0,75.0,FALSE,yes
    rainy,71.0,91.0,TRUE,no
    %
    %
    %
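You can also serialize straight to a file-like object with ``arff.dump``. A
minimal sketch (the ``weather_out.arff`` filename is only illustrative)::

    >>> import arff
    >>> with open('weather_out.arff', 'w') as fp:
    ...     arff.dump(data, fp)

``dump`` writes the same lines produced by ``dumps``, one row at a time, so
it is also suitable for large datasets.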
Contributors
------------

- `Nate Moseley (FinalDoom) <https://github.com/FinalDoom>`_
- `Tarek Amr (gr33ndata) <https://github.com/gr33ndata>`_
- `Simon (M3t0r) <https://github.com/M3t0r>`_
- `Gonzalo Almeida (flecox) <https://github.com/flecox>`_
- `André Nordbø (AndyNor) <https://github.com/AndyNor>`_
- `Niedakh <https://github.com/niedakh>`_
- `Zichen Wang (wangz10) <https://github.com/wangz10>`_
- `Matthias Feurer (mfeurer) <https://github.com/mfeurer>`_
- `Hongjoo Lee (midnightradio) <https://github.com/midnightradio>`_
- `Calvin Jeong (calvin) <https://github.com/calvin>`_
- `Joel Nothman (jnothman) <https://github.com/jnothman>`_
- `Guillaume Lemaitre (glemaitre) <https://github.com/glemaitre>`_

Project Page
------------

https://github.com/renatopp/liac-arff

==> liac-arff-2.4.0/arff.py <==

# -*- coding: utf-8 -*-
# =============================================================================
# Federal University of Rio Grande do Sul (UFRGS)
# Connectionist Artificial Intelligence Laboratory (LIAC)
# Renato de Pontes Pereira - rppereira@inf.ufrgs.br
# =============================================================================
# Copyright (c) 2011 Renato de Pontes Pereira, renato.ppontes at gmail dot com
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# =============================================================================

'''
The liac-arff module implements functions to read and write ARFF files in
Python. It was created in the Connectionist Artificial Intelligence Laboratory
(LIAC), which takes place at the Federal University of Rio Grande do Sul
(UFRGS), in Brazil.

ARFF (Attribute-Relation File Format) is a file format specially created for
describing datasets, which are commonly used in machine learning experiments
and software. This file format was created to be used in Weka, the most
representative software for automated machine learning experiments.

An ARFF file can be divided into two sections: header and data. The Header
describes the metadata of the dataset, including a general description of the
dataset, its name and its attributes. The source below is an example of a
header section in a XOR dataset::

    %
    % XOR Dataset
    %
    % Created by Renato Pereira
    %            rppereira@inf.ufrgs.br
    %            http://inf.ufrgs.br/~rppereira
    %
    %
    @RELATION XOR

    @ATTRIBUTE input1 REAL
    @ATTRIBUTE input2 REAL
    @ATTRIBUTE y REAL

The Data section of an ARFF file describes the observations of the dataset, in
the case of the XOR dataset::

    @DATA
    0.0,0.0,0.0
    0.0,1.0,1.0
    1.0,0.0,1.0
    1.0,1.0,0.0
    %
    %
    %

Notice that several lines start with a ``%`` symbol, denoting a comment; such
lines are ignored, except for the description part at the beginning of the
file. The declarations ``@RELATION``, ``@ATTRIBUTE``, and ``@DATA`` are all
case insensitive and obligatory.

For more information and details about the ARFF file description, consult
http://www.cs.waikato.ac.nz/~ml/weka/arff.html


ARFF Files in Python
~~~~~~~~~~~~~~~~~~~~

This module uses built-in Python objects to represent a deserialized ARFF
file. A dictionary is used as the container of the data and metadata of ARFF,
and it has the following keys:

- **description**: (OPTIONAL) a string with the description of the dataset.
- **relation**: (OBLIGATORY) a string with the name of the dataset.
- **attributes**: (OBLIGATORY) a list of attributes with the following
  template::

    (attribute_name, attribute_type)

  the attribute_name is a string, and attribute_type must be a string or a
  list of strings.
- **data**: (OBLIGATORY) a list of data instances. Each data instance must be
  a list with values, depending on the attributes.

The above keys must follow the case described above, i.e., the keys are case
sensitive.
The attribute type ``attribute_type`` must be one of these strings (they are
not case sensitive): ``NUMERIC``, ``INTEGER``, ``REAL`` or ``STRING``. For
nominal attributes, the ``attribute_type`` must be a list of strings.

In this format, the XOR dataset presented above can be represented as a python
object as::

    xor_dataset = {
        'description': 'XOR Dataset',
        'relation': 'XOR',
        'attributes': [
            ('input1', 'REAL'),
            ('input2', 'REAL'),
            ('y', 'REAL'),
        ],
        'data': [
            [0.0, 0.0, 0.0],
            [0.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0]
        ]
    }


Features
~~~~~~~~

This module provides several features, including:

- Read and write ARFF files using python built-in structures, such as
  dictionaries and lists;
- Supports `scipy.sparse.coo
  <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html>`_
  and lists of dictionaries as used by SVMLight;
- Supports the following attribute types: NUMERIC, REAL, INTEGER, STRING, and
  NOMINAL;
- Has an interface similar to other built-in modules such as ``json`` or
  ``zipfile``;
- Supports reading and writing the descriptions of files;
- Supports missing values and names with spaces;
- Supports unicode values and names;
- Fully compatible with Python 2.7+, Python 3.3+, pypy and pypy3;
- Under `MIT License <https://opensource.org/licenses/MIT>`_.
'''
__author__ = 'Renato de Pontes Pereira, Matthias Feurer, Joel Nothman'
__author_email__ = ('renato.ppontes@gmail.com, '
                    'feurerm@informatik.uni-freiburg.de, '
                    'joel.nothman@gmail.com')
__version__ = '2.4.0'

import re
import sys
import csv

# CONSTANTS ===================================================================
_SIMPLE_TYPES = ['NUMERIC', 'REAL', 'INTEGER', 'STRING']

_TK_DESCRIPTION = '%'
_TK_COMMENT     = '%'
_TK_RELATION    = '@RELATION'
_TK_ATTRIBUTE   = '@ATTRIBUTE'
_TK_DATA        = '@DATA'

_RE_RELATION = re.compile(r'^([^\{\}%,\s]*|\".*\"|\'.*\')$', re.UNICODE)
_RE_ATTRIBUTE = re.compile(r'^(\".*\"|\'.*\'|[^\{\}%,\s]*)\s+(.+)$', re.UNICODE)
_RE_TYPE_NOMINAL = re.compile(r'^\{\s*((\".*\"|\'.*\'|\S*)\s*,\s*)*(\".*\"|\'.*\'|\S*)\s*\}$', re.UNICODE)
_RE_QUOTE_CHARS = re.compile(r'["\'\\\s%,\000-\031]', re.UNICODE)
_RE_ESCAPE_CHARS = re.compile(r'(?=["\'\\%])|[\n\r\t\000-\031]')
_RE_SPARSE_LINE = re.compile(r'^\s*\{.*\}\s*$', re.UNICODE)
_RE_NONTRIVIAL_DATA = re.compile('["\'{}\\s]', re.UNICODE)


def _build_re_values():
    quoted_re = r'''
                  "      # open quote followed by zero or more of:
                  (?:
'''

# [Extraction gap: the remainder of this helper section was lost from the
#  source dump, including the rest of _build_re_values, the value-parsing
#  helpers (_parse_values, encode_string), the Python 2/3 compatibility
#  aliases (unicode, basestring, xrange), the conversor classes
#  (EncodedNominalConversor, NominalConversor), the ArffException hierarchy
#  (BadRelationFormat, BadAttributeFormat, BadDataFormat, BadAttributeType,
#  BadAttributeName, BadNominalValue, BadNumericalValue, BadLayout,
#  BadObject), and the DENSE/COO/LOD/DENSE_GEN/LOD_GEN return-type
#  constants referenced below. The head of DenseGeneratorData.decode_rows
#  is reconstructed from the surviving fragment.]


class DenseGeneratorData(object):
    '''Internal helper class to decode and encode dense rows.'''

    def decode_rows(self, stream, conversors):
        for row in stream:
            values = _parse_values(row)

            if isinstance(values, dict):
                if values and max(values) >= len(conversors):
                    raise BadDataFormat(row)
                # XXX: int 0 is used for implicit values, not '0'
                values = [values[i] if i in values else 0
                          for i in xrange(len(conversors))]
            else:
                if len(values) != len(conversors):
                    raise BadDataFormat(row)

            yield self._decode_values(values, conversors)

    @staticmethod
    def _decode_values(values, conversors):
        try:
            values = [None if value is None else conversor(value)
                      for conversor, value in zip(conversors, values)]
        except ValueError as exc:
            if 'float: ' in str(exc):
                raise BadNumericalValue()
        return values

    def encode_data(self, data, attributes):
        '''(INTERNAL) Encodes a line of data.

        Data instances follow the csv format, i.e., attribute values are
        delimited by commas.

        :param data: a list of values.
        :param attributes: a list of attributes. Used to check if data is
            valid.
        :return: a string with the encoded data line.
        '''
        current_row = 0

        for inst in data:
            if len(inst) != len(attributes):
                raise BadObject(
                    'Instance %d has %d attributes, expected %d' %
                    (current_row, len(inst), len(attributes))
                )

            new_data = []
            for value in inst:
                if value is None or value == u'' or value != value:
                    s = '?'
                else:
                    s = encode_string(unicode(value))
                new_data.append(s)

            current_row += 1
            yield u','.join(new_data)


class _DataListMixin(object):
    """Mixin to return a list from decode_rows instead of a generator"""
    def decode_rows(self, stream, conversors):
        return list(super(_DataListMixin, self).decode_rows(stream,
                                                            conversors))


class Data(_DataListMixin, DenseGeneratorData):
    pass


class COOData(object):
    def decode_rows(self, stream, conversors):
        data, rows, cols = [], [], []
        for i, row in enumerate(stream):
            values = _parse_values(row)

            if not isinstance(values, dict):
                raise BadLayout()
            if not values:
                continue
            row_cols, values = zip(*sorted(values.items()))
            try:
                values = [value if value is None else conversors[key](value)
                          for key, value in zip(row_cols, values)]
            except ValueError as exc:
                if 'float: ' in str(exc):
                    raise BadNumericalValue()
                raise
            except IndexError:
                # conversor out of range
                raise BadDataFormat(row)

            data.extend(values)
            rows.extend([i] * len(values))
            cols.extend(row_cols)

        return data, rows, cols

    def encode_data(self, data, attributes):
        num_attributes = len(attributes)
        new_data = []
        current_row = 0

        row = data.row
        col = data.col
        data = data.data

        # Check if the rows are sorted
        if not all(row[i] <= row[i + 1] for i in xrange(len(row) - 1)):
            raise ValueError("liac-arff can only output COO matrices with "
                             "sorted rows.")

        for v, col, row in zip(data, col, row):
            if row > current_row:
                # Add empty rows if necessary
                while current_row < row:
                    yield " ".join([u"{", u','.join(new_data), u"}"])
                    new_data = []
                    current_row += 1

            if col >= num_attributes:
                raise BadObject(
                    'Instance %d has at least %d attributes, expected %d' %
                    (current_row, col + 1, num_attributes)
                )

            if v is None or v == u'' or v != v:
                s = '?'
            else:
                s = encode_string(unicode(v))
            new_data.append("%d %s" % (col, s))

        yield " ".join([u"{", u','.join(new_data), u"}"])


class LODGeneratorData(object):
    def decode_rows(self, stream, conversors):
        for row in stream:
            values = _parse_values(row)

            if not isinstance(values, dict):
                raise BadLayout()
            try:
                yield {key: None if value is None else conversors[key](value)
                       for key, value in values.items()}
            except ValueError as exc:
                if 'float: ' in str(exc):
                    raise BadNumericalValue()
                raise
            except IndexError:
                # conversor out of range
                raise BadDataFormat(row)

    def encode_data(self, data, attributes):
        current_row = 0

        num_attributes = len(attributes)
        for row in data:
            new_data = []
            if len(row) > 0 and max(row) >= num_attributes:
                raise BadObject(
                    'Instance %d has %d attributes, expected %d' %
                    (current_row, max(row) + 1, num_attributes)
                )

            for col in sorted(row):
                v = row[col]
                if v is None or v == u'' or v != v:
                    s = '?'
                else:
                    s = encode_string(unicode(v))
                new_data.append("%d %s" % (col, s))

            current_row += 1
            yield " ".join([u"{", u','.join(new_data), u"}"])


class LODData(_DataListMixin, LODGeneratorData):
    pass


def _get_data_object_for_decoding(matrix_type):
    if matrix_type == DENSE:
        return Data()
    elif matrix_type == COO:
        return COOData()
    elif matrix_type == LOD:
        return LODData()
    elif matrix_type == DENSE_GEN:
        return DenseGeneratorData()
    elif matrix_type == LOD_GEN:
        return LODGeneratorData()
    else:
        raise ValueError("Matrix type %s not supported."
                         % str(matrix_type))


def _get_data_object_for_encoding(matrix):
    # Probably a scipy.sparse
    if hasattr(matrix, 'format'):
        if matrix.format == 'coo':
            return COOData()
        else:
            raise ValueError('Cannot guess matrix format!')
    elif isinstance(matrix[0], dict):
        return LODData()
    else:
        return Data()

# =============================================================================

# ADVANCED INTERFACE ==========================================================
class ArffDecoder(object):
    '''An ARFF decoder.'''

    def __init__(self):
        '''Constructor.'''
        self._conversors = []
        self._current_line = 0

    def _decode_comment(self, s):
        '''(INTERNAL) Decodes a comment line.

        Comments are single line strings starting, obligatorily, with the
        ``%`` character, and can have any symbol, including whitespaces or
        special characters.

        This method must receive a normalized string, i.e., a string without
        padding, including the "\r\n" characters.

        :param s: a normalized string.
        :return: a string with the decoded comment.
        '''
        res = re.sub(r'^\%( )?', '', s)
        return res

    def _decode_relation(self, s):
        '''(INTERNAL) Decodes a relation line.

        The relation declaration is a line with the format ``@RELATION
        <relation-name>``, where ``relation-name`` is a string. The string
        must start with an alphabetic character and must be quoted if the
        name includes spaces, otherwise this method will raise a
        `BadRelationFormat` exception.

        This method must receive a normalized string, i.e., a string without
        padding, including the "\r\n" characters.

        :param s: a normalized string.
        :return: a string with the decoded relation name.
        '''
        _, v = s.split(' ', 1)
        v = v.strip()

        if not _RE_RELATION.match(v):
            raise BadRelationFormat()

        res = unicode(v.strip('"\''))
        return res

    def _decode_attribute(self, s):
        '''(INTERNAL) Decodes an attribute line.

        The attribute is the most complex declaration in an arff file. All
        attributes must follow the template::

             @attribute <attribute-name> <datatype>

        where ``attribute-name`` is a string, quoted if the name contains any
        whitespace, and ``datatype`` can be:

        - Numerical attributes as ``NUMERIC``, ``INTEGER`` or ``REAL``.
        - Strings as ``STRING``.
        - Dates (NOT IMPLEMENTED).
        - Nominal attributes with format:

            {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

        The nominal names follow the rules for the attribute names, i.e.,
        they must be quoted if the name contains whitespaces.

        This method must receive a normalized string, i.e., a string without
        padding, including the "\r\n" characters.

        :param s: a normalized string.
        :return: a tuple (ATTRIBUTE_NAME, TYPE_OR_VALUES).
        '''
        _, v = s.split(' ', 1)
        v = v.strip()

        # Verify the general structure of declaration
        m = _RE_ATTRIBUTE.match(v)
        if not m:
            raise BadAttributeFormat()

        # Extracts the raw name and type
        name, type_ = m.groups()

        # Extracts the final name
        name = unicode(name.strip('"\''))

        # Extracts the final type
        if _RE_TYPE_NOMINAL.match(type_):
            try:
                type_ = _parse_values(type_.strip('{} '))
            except Exception:
                raise BadAttributeType()
            if isinstance(type_, dict):
                raise BadAttributeType()

        else:
            # If not nominal, verify the type name
            type_ = unicode(type_).upper()
            if type_ not in ['NUMERIC', 'REAL', 'INTEGER', 'STRING']:
                raise BadAttributeType()

        return (name, type_)

    def _decode(self, s, encode_nominal=False, matrix_type=DENSE):
        '''Do the job of ``decode``.'''

        # Make sure this method is idempotent
        self._current_line = 0

        # If string, convert to a list of lines
        if isinstance(s, basestring):
            s = s.strip('\r\n ').replace('\r\n', '\n').split('\n')

        # Create the return object
        obj = {
            u'description': u'',
            u'relation': u'',
            u'attributes': [],
            u'data': []
        }
        attribute_names = {}

        # Create the data helper object
        data = _get_data_object_for_decoding(matrix_type)

        # Read all lines
        STATE = _TK_DESCRIPTION
        s = iter(s)
        for row in s:
            self._current_line += 1
            # Ignore empty lines
            row = row.strip(' \r\n')
            if not row: continue

            u_row = row.upper()

            # DESCRIPTION -----------------------------------------------------
            if u_row.startswith(_TK_DESCRIPTION) and STATE == _TK_DESCRIPTION:
                obj['description'] += self._decode_comment(row) + '\n'
            # -----------------------------------------------------------------

            # RELATION --------------------------------------------------------
            elif u_row.startswith(_TK_RELATION):
                if STATE != _TK_DESCRIPTION:
                    raise BadLayout()

                STATE = _TK_RELATION
                obj['relation'] = self._decode_relation(row)
            # -----------------------------------------------------------------

            # ATTRIBUTE -------------------------------------------------------
            elif u_row.startswith(_TK_ATTRIBUTE):
                if STATE != _TK_RELATION and STATE != _TK_ATTRIBUTE:
                    raise BadLayout()

                STATE = _TK_ATTRIBUTE

                attr = self._decode_attribute(row)
                if attr[0] in attribute_names:
                    raise BadAttributeName(attr[0], attribute_names[attr[0]])
                else:
                    attribute_names[attr[0]] = self._current_line
                obj['attributes'].append(attr)

                if isinstance(attr[1], (list, tuple)):
                    if encode_nominal:
                        conversor = EncodedNominalConversor(attr[1])
                    else:
                        conversor = NominalConversor(attr[1])
                else:
                    CONVERSOR_MAP = {'STRING': unicode,
                                     'INTEGER': lambda x: int(float(x)),
                                     'NUMERIC': float,
                                     'REAL': float}
                    conversor = CONVERSOR_MAP[attr[1]]

                self._conversors.append(conversor)
            # -----------------------------------------------------------------

            # DATA ------------------------------------------------------------
            elif u_row.startswith(_TK_DATA):
                if STATE != _TK_ATTRIBUTE:
                    raise BadLayout()

                break
            # -----------------------------------------------------------------

            # COMMENT ---------------------------------------------------------
            elif u_row.startswith(_TK_COMMENT):
                pass
            # -----------------------------------------------------------------
        else:
            # Never found @DATA
            raise BadLayout()

        def stream():
            for row in s:
                self._current_line += 1
                row = row.strip()
                # Ignore empty lines and comment lines.
                if row and not row.startswith(_TK_COMMENT):
                    yield row

        # Alter the data object
        obj['data'] = data.decode_rows(stream(), self._conversors)
        if obj['description'].endswith('\n'):
            obj['description'] = obj['description'][:-1]

        return obj

    def decode(self, s, encode_nominal=False, return_type=DENSE):
        '''Returns the Python representation of a given ARFF file.

        When a file object is passed as an argument, this method reads lines
        iteratively, avoiding loading unnecessary information into memory.

        :param s: a string or file object with the ARFF file.
        :param encode_nominal: boolean, if True perform a label encoding
            while reading the .arff file.
        :param return_type: determines the data structure used to store the
            dataset. Can be one of `arff.DENSE`, `arff.COO`, `arff.LOD`,
            `arff.DENSE_GEN` or `arff.LOD_GEN`.
            Consult the sections on `working with sparse data`_ and `loading
            progressively`_.
        '''
        try:
            return self._decode(s, encode_nominal=encode_nominal,
                                matrix_type=return_type)
        except ArffException as e:
            e.line = self._current_line
            raise e


class ArffEncoder(object):
    '''An ARFF encoder.'''

    def _encode_comment(self, s=''):
        '''(INTERNAL) Encodes a comment line.

        Comments are single line strings starting, obligatorily, with the
        ``%`` character, and can have any symbol, including whitespaces or
        special characters.

        If ``s`` is None, this method will simply return an empty comment.

        :param s: (OPTIONAL) string.
        :return: a string with the encoded comment line.
        '''
        if s:
            return u'%s %s'%(_TK_COMMENT, s)
        else:
            return u'%s' % _TK_COMMENT

    def _encode_relation(self, name):
        '''(INTERNAL) Encodes a relation line.

        The relation declaration is a line with the format ``@RELATION
        <relation-name>``, where ``relation-name`` is a string.

        :param name: a string.
        :return: a string with the encoded relation declaration.
        '''
        for char in ' %{},':
            if char in name:
                name = '"%s"'%name
                break

        return u'%s %s'%(_TK_RELATION, name)

    def _encode_attribute(self, name, type_):
        '''(INTERNAL) Encodes an attribute line.

        The attribute follows the template::

             @attribute <attribute-name> <datatype>

        where ``attribute-name`` is a string, and ``datatype`` can be:

        - Numerical attributes as ``NUMERIC``, ``INTEGER`` or ``REAL``.
        - Strings as ``STRING``.
        - Dates (NOT IMPLEMENTED).
        - Nominal attributes with format:

            {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

        This method must receive the name of the attribute and its type; if
        the attribute type is nominal, ``type`` must be a list of values.

        :param name: a string.
        :param type_: a string or a list of string.
        :return: a string with the encoded attribute declaration.
        '''
        for char in ' %{},':
            if char in name:
                name = '"%s"'%name
                break

        if isinstance(type_, (tuple, list)):
            type_tmp = [u'%s' % encode_string(type_k) for type_k in type_]
            type_ = u'{%s}'%(u', '.join(type_tmp))

        return u'%s %s %s'%(_TK_ATTRIBUTE, name, type_)

    def encode(self, obj):
        '''Encodes a given object to an ARFF file.

        :param obj: the object containing the ARFF information.
        :return: the ARFF file as a unicode string.
        '''
        data = [row for row in self.iter_encode(obj)]

        return u'\n'.join(data)

    def iter_encode(self, obj):
        '''The iterative version of `arff.ArffEncoder.encode`.

        This encodes iteratively a given object and returns, one-by-one, the
        lines of the ARFF file.

        :param obj: the object containing the ARFF information.
        :return: (yields) the ARFF file as unicode strings.
        '''
        # DESCRIPTION
        if obj.get('description', None):
            for row in obj['description'].split('\n'):
                yield self._encode_comment(row)

        # RELATION
        if not obj.get('relation'):
            raise BadObject('Relation name not found or with invalid value.')

        yield self._encode_relation(obj['relation'])
        yield u''

        # ATTRIBUTES
        if not obj.get('attributes'):
            raise BadObject('Attributes not found.')

        attribute_names = set()
        for attr in obj['attributes']:
            # Verify for bad object format
            if not isinstance(attr, (tuple, list)) or \
               len(attr) != 2 or \
               not isinstance(attr[0], basestring):
                raise BadObject('Invalid attribute declaration "%s"'%str(attr))

            if isinstance(attr[1], basestring):
                # Verify for invalid types
                if attr[1] not in _SIMPLE_TYPES:
                    raise BadObject('Invalid attribute type "%s"'%str(attr))

            # Verify for bad object format
            elif not isinstance(attr[1], (tuple, list)):
                raise BadObject('Invalid attribute type "%s"'%str(attr))

            # Verify attribute name is not used twice
            if attr[0] in attribute_names:
                raise BadObject('Trying to use attribute name "%s" for the '
                                'second time.' % str(attr[0]))
            else:
                attribute_names.add(attr[0])

            yield self._encode_attribute(attr[0], attr[1])
        yield u''
        attributes = obj['attributes']

        # DATA
        yield _TK_DATA
        if 'data' in obj:
            data = _get_data_object_for_encoding(obj.get('data'))
            for line in data.encode_data(obj.get('data'), attributes):
                yield line

        yield u''

# =============================================================================

# BASIC INTERFACE =============================================================
def load(fp, encode_nominal=False, return_type=DENSE):
    '''Load a file-like object containing the ARFF document and convert it
    into a Python object.

    :param fp: a file-like object.
    :param encode_nominal: boolean, if True perform a label encoding while
        reading the .arff file.
    :param return_type: determines the data structure used to store the
        dataset. Can be one of `arff.DENSE`, `arff.COO`, `arff.LOD`,
        `arff.DENSE_GEN` or `arff.LOD_GEN`.
        Consult the sections on `working with sparse data`_ and `loading
        progressively`_.
    :return: a dictionary.
    '''
    decoder = ArffDecoder()
    return decoder.decode(fp, encode_nominal=encode_nominal,
                          return_type=return_type)

def loads(s, encode_nominal=False, return_type=DENSE):
    '''Convert a string instance containing the ARFF document into a Python
    object.

    :param s: a string object.
    :param encode_nominal: boolean, if True perform a label encoding while
        reading the .arff file.
    :param return_type: determines the data structure used to store the
        dataset. Can be one of `arff.DENSE`, `arff.COO`, `arff.LOD`,
        `arff.DENSE_GEN` or `arff.LOD_GEN`.
        Consult the sections on `working with sparse data`_ and `loading
        progressively`_.
    :return: a dictionary.
    '''
    decoder = ArffDecoder()
    return decoder.decode(s, encode_nominal=encode_nominal,
                          return_type=return_type)

def dump(obj, fp):
    '''Serialize an object representing the ARFF document to a given
    file-like object.

    :param obj: a dictionary.
    :param fp: a file-like object.
    '''
    encoder = ArffEncoder()
    generator = encoder.iter_encode(obj)

    last_row = next(generator)
    for row in generator:
        fp.write(last_row + u'\n')
        last_row = row
    fp.write(last_row)

    return fp

def dumps(obj):
    '''Serialize an object representing the ARFF document, returning a string.

    :param obj: a dictionary.
    :return: a string with the ARFF document.
    '''
    encoder = ArffEncoder()
    return encoder.encode(obj)
# =============================================================================

==> liac-arff-2.4.0/docs/Makefile <==

# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
PAPER         =
BUILDDIR      = build

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"

clean:
	-rm -rf $(BUILDDIR)/*

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/liac-arff.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/liac-arff.qhc"

devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
@echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/liac-arff" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/liac-arff" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." liac-arff-2.4.0/docs/make.bat000066400000000000000000000117671342660433200157150ustar00rootroot00000000000000@ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source set I18NSPHINXOPTS=%SPHINXOPTS% source if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^` where ^ is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. 
	echo.  linkcheck  to check all external links for integrity
	echo.  doctest    to run all doctests embedded in the documentation if enabled
	goto end
)

if "%1" == "clean" (
	for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
	del /q /s %BUILDDIR%\*
	goto end
)

if "%1" == "html" (
	%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/html.
	goto end
)

if "%1" == "dirhtml" (
	%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
	goto end
)

if "%1" == "singlehtml" (
	%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
	goto end
)

if "%1" == "pickle" (
	%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can process the pickle files.
	goto end
)

if "%1" == "json" (
	%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can process the JSON files.
	goto end
)

if "%1" == "htmlhelp" (
	%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
	goto end
)

if "%1" == "qthelp" (
	%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
	echo.^> qcollectiongenerator %BUILDDIR%\qthelp\liac-arff.qhcp
	echo.To view the help file:
	echo.^> assistant -collectionFile %BUILDDIR%\qthelp\liac-arff.qhc
	goto end
)

if "%1" == "devhelp" (
	%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished.
	goto end
)

if "%1" == "epub" (
	%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The epub file is in %BUILDDIR%/epub.
	goto end
)

if "%1" == "latex" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "text" (
	%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The text files are in %BUILDDIR%/text.
	goto end
)

if "%1" == "man" (
	%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The manual pages are in %BUILDDIR%/man.
	goto end
)

if "%1" == "texinfo" (
	%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
	goto end
)

if "%1" == "gettext" (
	%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
	goto end
)

if "%1" == "changes" (
	%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
	if errorlevel 1 exit /b 1
	echo.
	echo.The overview file is in %BUILDDIR%/changes.
	goto end
)

if "%1" == "linkcheck" (
	%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
	if errorlevel 1 exit /b 1
	echo.
	echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
	goto end
)

if "%1" == "doctest" (
	%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
	if errorlevel 1 exit /b 1
	echo.
	echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
	goto end
)

:end

==> liac-arff-2.4.0/docs/source/_static/empty <==

==> liac-arff-2.4.0/docs/source/conf.py <==

# -*- coding: utf-8 -*-
#
# liac-arff documentation build configuration file, created by
# sphinx-quickstart on Wed Feb 12 21:11:24 2014.
#
# This file is execfile()d with the current directory set to its containing
# dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys, os

# If extensions (or modules to document with autodoc) are in another
# directory, add these directories to sys.path here. If the directory is
# relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))

# -- General configuration -----------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = ['sphinx.ext.autodoc']#, 'sphinx.ext.viewcode']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'liac-arff'
copyright = u'2014, Renato de Pontes Pereira'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '2.0'
# The full version, including alpha/beta/rc tags.
release = '2.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []


# -- Options for HTML output ---------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'sphinxdoc'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is
# True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'liac-arffdoc'


# -- Options for LaTeX output --------------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #'preamble': '',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author,
#  documentclass [howto/manual]).
latex_documents = [
    ('index', 'liac-arff.tex', u'liac-arff Documentation',
     u'Renato de Pontes Pereira', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top
# of the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True


# -- Options for manual page output --------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'liac-arff', u'liac-arff Documentation',
     [u'Renato de Pontes Pereira'], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False


# -- Options for Texinfo output ------------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    ('index', 'liac-arff', u'liac-arff Documentation',
     u'Renato de Pontes Pereira', 'liac-arff',
     'One line description of project.', 'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

==> liac-arff-2.4.0/docs/source/index.rst <==

==============
LIAC-ARFF v2.1
==============

.. contents:: Table of Contents
   :depth: 2
   :local:

------------
Introduction
------------

.. automodule:: arff

~~~~~~~~~~~~~~~~~~~~
How to Get LIAC-ARFF
~~~~~~~~~~~~~~~~~~~~

See https://github.com/renatopp/liac-arff

~~~~~~~~~~~~~~
How To Install
~~~~~~~~~~~~~~

Via pip::

    $ pip install liac-arff

Via easy_install::

    $ easy_install liac-arff

Manually::

    $ python setup.py install

.. include:: ../../CHANGES.rst

-----------
Basic Usage
-----------

.. autofunction:: arff.load
.. autofunction:: arff.loads
.. autofunction:: arff.dump
.. autofunction:: arff.dumps

---------------------
Encoders and Decoders
---------------------

.. autoclass:: arff.ArffDecoder
   :members:

.. autoclass:: arff.ArffEncoder
   :members:

----------
Exceptions
----------

.. autoexception:: arff.BadRelationFormat
   :members:

.. autoexception:: arff.BadAttributeFormat
   :members:

.. autoexception:: arff.BadDataFormat
   :members:

.. autoexception:: arff.BadAttributeType
   :members:

.. autoexception:: arff.BadNominalValue
   :members:

.. autoexception:: arff.BadNumericalValue
   :members:

.. autoexception:: arff.BadLayout
   :members:

.. autoexception:: arff.BadObject
   :members:
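All of these exceptions derive from ``arff.ArffException``; when decoding
fails, the decoder stores the offending line number on the exception before
re-raising it, and that number appears in the exception message. A minimal
sketch of reporting a parse error (the malformed document below is only
illustrative)::

    import arff

    bad = '''@RELATION test
    @ATTRIBUTE value REAL
    @DATA
    1.0,2.0
    '''

    try:
        arff.loads(bad)
    except arff.ArffException as e:
        # e.line holds the input line at which decoding failed
        print('Parsing failed: %s' % e)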
-------
Unicode
-------

LIAC-ARFF works with unicode (for python 2.7+; in python 3.x this is the
default), and to take advantage of it, you need to load the arff file using
``codecs``, specifying its encoding::

    import codecs
    import arff

    file_ = codecs.open('/path/to/file.arff', 'rb', 'utf-8')
    arff.load(file_)
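The same applies when writing: ``arff.dump`` produces unicode strings, so a
``codecs`` writer can be used to control the output encoding. A sketch,
assuming ``obj`` is an ARFF dictionary such as the ones in the examples
below::

    import codecs
    import arff

    file_ = codecs.open('/path/to/output.arff', 'w', 'utf-8')
    arff.dump(obj, file_)
    file_.close()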
--------
Examples
--------

~~~~~~~~~~~~~~~~~
Dumping An Object
~~~~~~~~~~~~~~~~~

Converting an object to ARFF::

    import arff

    obj = {
       'description': u'',
       'relation': 'weather',
       'attributes': [
           ('outlook', ['sunny', 'overcast', 'rainy']),
           ('temperature', 'REAL'),
           ('humidity', 'REAL'),
           ('windy', ['TRUE', 'FALSE']),
           ('play', ['yes', 'no'])
       ],
       'data': [
           ['sunny', 85.0, 85.0, 'FALSE', 'no'],
           ['sunny', 80.0, 90.0, 'TRUE', 'no'],
           ['overcast', 83.0, 86.0, 'FALSE', 'yes'],
           ['rainy', 70.0, 96.0, 'FALSE', 'yes'],
           ['rainy', 68.0, 80.0, 'FALSE', 'yes'],
           ['rainy', 65.0, 70.0, 'TRUE', 'no'],
           ['overcast', 64.0, 65.0, 'TRUE', 'yes'],
           ['sunny', 72.0, 95.0, 'FALSE', 'no'],
           ['sunny', 69.0, 70.0, 'FALSE', 'yes'],
           ['rainy', 75.0, 80.0, 'FALSE', 'yes'],
           ['sunny', 75.0, 70.0, 'TRUE', 'yes'],
           ['overcast', 72.0, 90.0, 'TRUE', 'yes'],
           ['overcast', 81.0, 75.0, 'FALSE', 'yes'],
           ['rainy', 71.0, 91.0, 'TRUE', 'no']
       ],
    }

    print arff.dumps(obj)

resulting in::

    @RELATION weather

    @ATTRIBUTE outlook {sunny, overcast, rainy}
    @ATTRIBUTE temperature REAL
    @ATTRIBUTE humidity REAL
    @ATTRIBUTE windy {TRUE, FALSE}
    @ATTRIBUTE play {yes, no}

    @DATA
    sunny,85.0,85.0,FALSE,no
    sunny,80.0,90.0,TRUE,no
    overcast,83.0,86.0,FALSE,yes
    rainy,70.0,96.0,FALSE,yes
    rainy,68.0,80.0,FALSE,yes
    rainy,65.0,70.0,TRUE,no
    overcast,64.0,65.0,TRUE,yes
    sunny,72.0,95.0,FALSE,no
    sunny,69.0,70.0,FALSE,yes
    rainy,75.0,80.0,FALSE,yes
    sunny,75.0,70.0,TRUE,yes
    overcast,72.0,90.0,TRUE,yes
    overcast,81.0,75.0,FALSE,yes
    rainy,71.0,91.0,TRUE,no
    %
    %
    %

~~~~~~~~~~~~~~~~~
Loading An Object
~~~~~~~~~~~~~~~~~

Loading an ARFF file::

    import arff
    import pprint

    file_ = '''@RELATION weather

    @ATTRIBUTE outlook {sunny, overcast, rainy}
    @ATTRIBUTE temperature REAL
    @ATTRIBUTE humidity REAL
    @ATTRIBUTE windy {TRUE, FALSE}
    @ATTRIBUTE play {yes, no}

    @DATA
    sunny,85.0,85.0,FALSE,no
    sunny,80.0,90.0,TRUE,no
    overcast,83.0,86.0,FALSE,yes
    rainy,70.0,96.0,FALSE,yes
    rainy,68.0,80.0,FALSE,yes
    rainy,65.0,70.0,TRUE,no
    overcast,64.0,65.0,TRUE,yes
    sunny,72.0,95.0,FALSE,no
    sunny,69.0,70.0,FALSE,yes
    rainy,75.0,80.0,FALSE,yes
    sunny,75.0,70.0,TRUE,yes
    overcast,72.0,90.0,TRUE,yes
    overcast,81.0,75.0,FALSE,yes
    rainy,71.0,91.0,TRUE,no
    %
    %
    %
    '''
    d = arff.loads(file_)
    pprint.pprint(d)

resulting in::

    {u'attributes': [(u'outlook', [u'sunny', u'overcast', u'rainy']),
                     (u'temperature', u'REAL'),
                     (u'humidity', u'REAL'),
                     (u'windy', [u'TRUE', u'FALSE']),
                     (u'play', [u'yes', u'no'])],
     u'data': [[u'sunny', 85.0, 85.0, u'FALSE', u'no'],
               [u'sunny', 80.0, 90.0, u'TRUE', u'no'],
               [u'overcast', 83.0, 86.0, u'FALSE', u'yes'],
               [u'rainy', 70.0, 96.0, u'FALSE', u'yes'],
               [u'rainy', 68.0, 80.0, u'FALSE', u'yes'],
               [u'rainy', 65.0, 70.0, u'TRUE', u'no'],
               [u'overcast', 64.0, 65.0, u'TRUE', u'yes'],
               [u'sunny', 72.0, 95.0, u'FALSE', u'no'],
               [u'sunny', 69.0, 70.0, u'FALSE', u'yes'],
               [u'rainy', 75.0, 80.0, u'FALSE', u'yes'],
               [u'sunny', 75.0, 70.0, u'TRUE', u'yes'],
               [u'overcast', 72.0, 90.0, u'TRUE', u'yes'],
               [u'overcast', 81.0, 75.0, u'FALSE', u'yes'],
               [u'rainy', 71.0, 91.0, u'TRUE', u'no']],
     u'description': u'',
     u'relation': u'weather'}

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Loading An Object with encoded labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some cases it is practical to have categorical data represented by
integers, rather than strings. In `scikit-learn <http://scikit-learn.org>`__
for example, integer data can be directly converted into a continuous
representation with the `One-Hot Encoder
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`__,
which is necessary for most machine learning algorithms, e.g. `Support
Vector Machines <http://scikit-learn.org/stable/modules/svm.html>`__. The
values ``[u'sunny', u'overcast', u'rainy']`` of the attribute ``u'outlook'``
would be represented by ``[0, 1, 2]``. This representation can be directly
used with the `One-Hot Encoder
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`__.
Encoding categorical data while reading it from a file saves at least one
memory copy, and can be invoked like in this example::

    import arff
    import pprint

    file_ = '''@RELATION weather

    @ATTRIBUTE outlook {sunny, overcast, rainy}
    @ATTRIBUTE temperature REAL
    @ATTRIBUTE humidity REAL
    @ATTRIBUTE windy {TRUE, FALSE}
    @ATTRIBUTE play {yes, no}

    @DATA
    sunny,85.0,85.0,FALSE,no
    sunny,80.0,90.0,TRUE,no
    overcast,83.0,86.0,FALSE,yes
    rainy,70.0,96.0,FALSE,yes
    rainy,68.0,80.0,FALSE,yes
    rainy,65.0,70.0,TRUE,no
    overcast,64.0,65.0,TRUE,yes
    sunny,72.0,95.0,FALSE,no
    sunny,69.0,70.0,FALSE,yes
    rainy,75.0,80.0,FALSE,yes
    sunny,75.0,70.0,TRUE,yes
    overcast,72.0,90.0,TRUE,yes
    overcast,81.0,75.0,FALSE,yes
    rainy,71.0,91.0,TRUE,no
    %
    %
    %
    '''
    decoder = arff.ArffDecoder()
    d = decoder.decode(file_, encode_nominal=True)
    pprint.pprint(d)

resulting in::

    {u'attributes': [(u'outlook', [u'sunny', u'overcast', u'rainy']),
                     (u'temperature', u'REAL'),
                     (u'humidity', u'REAL'),
                     (u'windy', [u'TRUE', u'FALSE']),
                     (u'play', [u'yes', u'no'])],
     u'data': [[0, 85.0, 85.0, 1, 1],
               [0, 80.0, 90.0, 0, 1],
               [1, 83.0, 86.0, 1, 0],
               [2, 70.0, 96.0, 1, 0],
               [2, 68.0, 80.0, 1, 0],
               [2, 65.0, 70.0, 0, 1],
               [1, 64.0, 65.0, 0, 0],
               [0, 72.0, 95.0, 1, 1],
               [0, 69.0, 70.0, 1, 0],
               [2, 75.0, 80.0, 1, 0],
               [0, 75.0, 70.0, 0, 0],
               [1, 72.0, 90.0, 0, 0],
               [1, 81.0, 75.0, 1, 0],
               [2, 71.0, 91.0, 0, 1]],
     u'description': u'',
     u'relation': u'weather'}

Using this dataset in `scikit-learn <http://scikit-learn.org>`__::

    from sklearn import preprocessing, svm

    enc = preprocessing.OneHotEncoder(categorical_features=[0, 3, 4])
    enc.fit(d['data'])
    encoded_data = enc.transform(d['data']).toarray()

    clf = svm.SVC()
    clf.fit(encoded_data[:,0:4], encoded_data[:,4])

.. _sparse:

~~~~~~~~~~~~~~~~~~~~~~~~
Working with sparse data
~~~~~~~~~~~~~~~~~~~~~~~~

Sparse data is data in which most of the elements are zero. By saving only
non-zero elements, one can potentially save a lot of space on either the hard
drive or in RAM. liac-arff supports two sparse data structures:

* `scipy.sparse.coo
  <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html>`__
  is intended for easy construction of sparse matrices inside a python
  program.
* list of dictionaries in the form

  .. code:: python

      [{column: value, column: value},
       {column: value, column: value}]

Dumping sparse data
~~~~~~~~~~~~~~~~~~~

Both `scipy.sparse.coo
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html>`__
matrices and lists of dictionaries can be used as the value for `data` in the
arff object. Let's look again at the XOR example, this time with the data
encoded as a list of dictionaries:

.. code:: python

    xor_dataset = {
        'description': 'XOR Dataset',
        'relation': 'XOR',
        'attributes': [
            ('input1', 'REAL'),
            ('input2', 'REAL'),
            ('y', 'REAL'),
        ],
        'data': [
            {},
            {1: 1.0, 2: 1.0},
            {0: 1.0, 2: 1.0},
            {0: 1.0, 1: 1.0}
        ]
    }

    print arff.dumps(xor_dataset)

resulting in::

    % XOR Dataset
    @RELATION XOR

    @ATTRIBUTE input1 REAL
    @ATTRIBUTE input2 REAL
    @ATTRIBUTE y REAL

    @DATA
    {  }
    { 1 1.0,2 1.0 }
    { 0 1.0,2 1.0 }
    { 0 1.0,1 1.0 }
    %
    %
    %
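A ``scipy.sparse.coo_matrix`` works as the ``'data'`` value in the same way:
the encoder recognizes the COO format and writes sparse rows, provided the
rows of the matrix are sorted. A sketch, reusing ``xor_dataset`` from above
and assuming scipy and numpy are installed::

    import arff
    import numpy as np
    from scipy import sparse

    xor_dataset['data'] = sparse.coo_matrix(
        np.array([[0.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0]]))
    print arff.dumps(xor_dataset)

which produces the same sparse ``@DATA`` section as above.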
Loading sparse data
~~~~~~~~~~~~~~~~~~~

When reading a sparse dataset, the user can choose a target data structure.
These are represented by the constants `arff.DENSE`, `arff.COO` and
`arff.LOD`::

    decoder = arff.ArffDecoder()
    d = decoder.decode(file_, encode_nominal=True, return_type=arff.LOD)
    pprint.pprint(d)

resulting in::

    {
        'description': 'XOR Dataset',
        'relation': 'XOR',
        'attributes': [
            ('input1', 'REAL'),
            ('input2', 'REAL'),
            ('y', 'REAL'),
        ],
        'data': [
            {},
            {1: 1.0, 2: 1.0},
            {0: 1.0, 2: 1.0},
            {0: 1.0, 1: 1.0}
        ]
    }

When choosing `arff.COO`, the data can be directly passed to the scipy
constructor::

    from scipy import sparse

    decoder = arff.ArffDecoder()
    d = decoder.decode(file_, encode_nominal=True, return_type=arff.COO)

    data = d['data'][0]
    row = d['data'][1]
    col = d['data'][2]

    matrix = sparse.coo_matrix((data, (row, col)),
                               shape=(max(row)+1, max(col)+1))

.. _generator:

~~~~~~~~~~~~~~~~~~~~~
Loading progressively
~~~~~~~~~~~~~~~~~~~~~

To avoid storing all the data in memory at once, dense and LOD sparse
matrices can be loaded progressively. Setting `return_type` to
`arff.DENSE_GEN` or `arff.LOD_GEN` results in the returned 'data' key
containing a generator. Iterating through this generator will process each
line of input and yield its data::

    file_ = '''@RELATION weather

    @ATTRIBUTE outlook {sunny, overcast, rainy}
    @ATTRIBUTE temperature REAL
    @ATTRIBUTE humidity REAL
    @ATTRIBUTE windy {TRUE, FALSE}
    @ATTRIBUTE play {yes, no}

    @DATA
    sunny,85.0,85.0,FALSE,no
    sunny,80.0,90.0,TRUE,no
    overcast,83.0,86.0,FALSE,yes
    rainy,70.0,96.0,FALSE,yes
    rainy,68.0,80.0,FALSE,yes
    rainy,65.0,70.0,TRUE,no
    overcast,64.0,65.0,TRUE,yes
    sunny,72.0,95.0,FALSE,no
    sunny,69.0,70.0,FALSE,yes
    rainy,75.0,80.0,FALSE,yes
    sunny,75.0,70.0,TRUE,yes
    overcast,72.0,90.0,TRUE,yes
    overcast,81.0,75.0,FALSE,yes
    rainy,71.0,91.0,TRUE,no
    %
    %
    %
    '''
    decoder = arff.ArffDecoder()
    d = decoder.decode(file_, return_type=arff.DENSE_GEN)
    next(d['data'])  # process the first record

resulting in::

    [u'sunny', 85.0, 85.0, u'FALSE', u'no']

If you know the number of samples in your data, you can also use progressive
loading to preallocate an array and do any data conversion on the fly::

    import numpy as np

    decoder = arff.ArffDecoder()
    d = decoder.decode(file_, return_type=arff.DENSE_GEN, encode_nominal=True)
    arr = np.fromiter((tuple(x) for x in d['data']),
                      dtype=[('outlook', 'int'), ('temperature', 'float'),
                             ('humidity', 'float'), ('windy', 'bool'),
                             ('play', 'bool')],
                      count=14)

`arr[:2]` is then::

    array([(0, 85., 85.,  True,  True), (0, 80., 90., False,  True)],
          dtype=[('outlook', '<i8'), ('temperature', '<f8'),
                 ('humidity', '<f8'), ('windy', '?'), ('play', '?')])
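`arff.LOD_GEN` behaves the same way for sparse input, yielding one dictionary
per data row. A short sketch, where ``sparse_file_`` is assumed to hold the
sparse XOR document shown in the section on `working with sparse data`_::

    decoder = arff.ArffDecoder()
    d = decoder.decode(sparse_file_, return_type=arff.LOD_GEN)
    for row in d['data']:
        print row  # {} then {1: 1.0, 2: 1.0}, ...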