==> datacache-1.2.1/LICENSE <==
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

==> datacache-1.2.1/PKG-INFO <==
Metadata-Version: 2.1
Name: datacache
Version: 1.2.1
Summary: Helpers for transparently downloading datasets
Home-page: https://github.com/openvax/datacache
Author: Alex Rubinsteyn
Author-email: alex.rubinsteyn@unc.edu
License: http://www.apache.org/licenses/LICENSE-2.0.html
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Description-Content-Type: text/markdown
License-File: LICENSE
DataCache
=========
Helpers for transparently downloading datasets
## API
* **fetch_file**(download_url, filename = *None*, decompress = *False*, subdir = *None*)
* **fetch_and_transform**(transformed_filename, transformer, loader,
source_filename, source_url, subdir = *None*)
* **fetch_csv_dataframe**(download_url, filename = *None*, subdir = *None*,
\*\*pandas_kwargs)
* **fetch_csv_db**(table_name, download_url, csv_filename = *None*, db_filename = *None*,
subdir = *None*, version = *1*, \*\*pandas_kwargs)
* **db_from_dataframe**(db_filename, table_name, df, primary_key = *None*,
subdir = *None*, overwrite = *False*, indices = *()*, version = *1*)
* **db_from_dataframes**(db_filename, dataframes, primary_keys = *{}*,
indices = *{}*, subdir = *None*, overwrite = *False*, version = *1*)
* **Cache**(subdir = *"datacache"*): object wrapping fetch_file with per-URL bookkeeping

==> datacache-1.2.1/README.md <==
DataCache
=========
Helpers for transparently downloading datasets
## API
* **fetch_file**(download_url, filename = *None*, decompress = *False*, subdir = *None*)
* **fetch_and_transform**(transformed_filename, transformer, loader,
source_filename, source_url, subdir = *None*)
* **fetch_csv_dataframe**(download_url, filename = *None*, subdir = *None*,
\*\*pandas_kwargs)
* **fetch_csv_db**(table_name, download_url, csv_filename = *None*, db_filename = *None*,
subdir = *None*, version = *1*, \*\*pandas_kwargs)
* **db_from_dataframe**(db_filename, table_name, df, primary_key = *None*,
subdir = *None*, overwrite = *False*, indices = *()*, version = *1*)
* **db_from_dataframes**(db_filename, dataframes, primary_keys = *{}*,
indices = *{}*, subdir = *None*, overwrite = *False*, version = *1*)
* **Cache**(subdir = *"datacache"*): object wrapping fetch_file with per-URL bookkeeping
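
A minimal usage sketch (the URL, subdirectory, and filenames below are
placeholders, not part of the package):

```python
from datacache import Cache, fetch_file

# one-off download: returns the local path inside the cache directory,
# stripping the ".gz" suffix because decompress=True
path = fetch_file(
    "https://example.com/data/annotations.csv.gz",
    decompress=True,
    subdir="my_project")

# or keep a Cache object around to memoize URL -> local path lookups
cache = Cache("my_project")
local_path = cache.fetch(
    "https://example.com/data/annotations.csv.gz",
    decompress=True)
```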

==> datacache-1.2.1/datacache/__init__.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .download import fetch_file, fetch_and_transform, fetch_csv_dataframe
from .database_helpers import (
db_from_dataframe,
db_from_dataframes,
db_from_dataframes_with_absolute_path,
fetch_csv_db,
connect_if_correct_version
)
from .common import (
ensure_dir,
get_data_dir,
build_path,
clear_cache,
build_local_filename
)
from .cache import Cache
from .version import __version__
__all__ = [
'fetch_file',
'fetch_and_transform',
'fetch_csv_dataframe',
'db_from_dataframe',
'db_from_dataframes',
'db_from_dataframes_with_absolute_path',
'fetch_csv_db',
'connect_if_correct_version',
'ensure_dir',
'get_data_dir',
'build_path',
'clear_cache',
'build_local_filename',
'Cache',
]

==> datacache-1.2.1/datacache/cache.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from os.path import exists, join
from os import remove
from . import common
from . import download
from .database_helpers import db_from_dataframe
class Cache(object):
def __init__(self, subdir="datacache"):
assert subdir
self.subdir = subdir
self.cache_directory_path = common.get_data_dir(subdir)
# dictionary mapping from (URL, decompress) pair to local paths
# TODO: handle decompression separately from download,
# so we can use copies of compressed files we've already downloaded
self._local_paths = {}
def delete_url(self, url):
"""
Delete local files downloaded from given URL
"""
# file may exist locally in compressed and decompressed states
# delete both
for decompress in [False, True]:
key = (url, decompress)
if key in self._local_paths:
path = self._local_paths[key]
remove(path)
del self._local_paths[key]
            # The file may have been downloaded via the download module
            # without using this Cache object; it wouldn't appear in
            # _local_paths but should still be deleted.
path = self.local_path(
url, decompress=decompress, download=False)
if exists(path):
remove(path)
def delete_all(self):
self._local_paths.clear()
common.clear_cache(self.cache_directory_path)
common.ensure_dir(self.cache_directory_path)
def exists(self, url, filename=None, decompress=False):
"""
Return True if a local file corresponding to these arguments
exists.
"""
return download.file_exists(
url,
filename=filename,
decompress=decompress,
subdir=self.subdir)
def fetch(
self,
url,
filename=None,
decompress=False,
force=False,
timeout=None,
use_wget_if_available=True):
"""
Return the local path to the downloaded copy of a given URL.
Don't download the file again if it's already present,
unless `force` is True.
"""
key = (url, decompress)
if not force and key in self._local_paths:
path = self._local_paths[key]
if exists(path):
return path
else:
del self._local_paths[key]
path = download.fetch_file(
url,
filename=filename,
decompress=decompress,
subdir=self.subdir,
force=force,
timeout=timeout,
use_wget_if_available=use_wget_if_available)
self._local_paths[key] = path
return path
def local_filename(
self,
url=None,
filename=None,
decompress=False):
"""
What local filename will we use within the cache directory
for the given URL/filename/decompress options.
"""
return common.build_local_filename(url, filename, decompress)
def local_path(self, url, filename=None, decompress=False, download=False):
"""
What will the full local path be if we download the given file?
"""
if download:
return self.fetch(url=url, filename=filename, decompress=decompress)
else:
filename = self.local_filename(url, filename, decompress)
return join(self.cache_directory_path, filename)
def db_from_dataframe(
self,
db_filename,
table_name,
df,
key_column_name=None):
return db_from_dataframe(
db_filename=db_filename,
table_name=table_name,
df=df,
primary_key=key_column_name,
subdir=self.subdir)
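
# Usage sketch (illustrative, not part of the module; the URL below is a
# placeholder): Cache memoizes URL -> local path lookups on top of
# download.fetch_file.
#
#     cache = Cache("my_project")
#     path = cache.fetch("https://example.com/genes.csv.gz", decompress=True)
#     assert cache.exists("https://example.com/genes.csv.gz", decompress=True)
#     cache.delete_url("https://example.com/genes.csv.gz")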

==> datacache-1.2.1/datacache/common.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import hashlib
from os import makedirs, environ
from os.path import join, exists, split, splitext
import re
from shutil import rmtree
import appdirs
def ensure_dir(path):
if not exists(path):
makedirs(path)
def get_data_dir(subdir=None, envkey=None):
if envkey and environ.get(envkey):
envdir = environ[envkey]
if subdir:
return join(envdir, subdir)
else:
return envdir
return appdirs.user_cache_dir(subdir if subdir else "datacache")
def build_path(filename, subdir=None):
data_dir = get_data_dir(subdir)
ensure_dir(data_dir)
return join(data_dir, filename)
def clear_cache(subdir=None):
data_dir = get_data_dir(subdir)
rmtree(data_dir)
def normalize_filename(filename):
"""
Remove special characters and shorten if name is too long
"""
# if the url pointed to a directory then just replace all the special chars
    filename = re.sub(r"[/\\;:?=|]", "_", filename)
    if len(filename) > 150:
        # md5 requires bytes in Python 3
        prefix = hashlib.md5(filename.encode("utf-8")).hexdigest()
filename = prefix + filename[-140:]
return filename
def build_local_filename(download_url=None, filename=None, decompress=False):
"""
Determine which local filename to use based on the file's source URL,
an optional desired filename, and whether a compression suffix needs
to be removed
"""
assert download_url or filename, "Either filename or URL must be specified"
# if no filename provided, use the original filename on the server
if not filename:
digest = hashlib.md5(download_url.encode('utf-8')).hexdigest()
parts = split(download_url)
filename = digest + "." + "_".join(parts)
filename = normalize_filename(filename)
if decompress:
(base, ext) = splitext(filename)
if ext in (".gz", ".zip"):
filename = base
return filename
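
# Example of the naming scheme above (illustrative; exact output depends on
# the md5 digest of the URL): distinct URLs that share a basename map to
# distinct cache entries, and a trailing ".gz"/".zip" is stripped when
# decompress=True.
#
#     build_local_filename("http://example.com/data.csv.gz", decompress=True)
#     # -> '<md5-of-url>.http___example.com_data.csv'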

==> datacache-1.2.1/datacache/database.py <==
# Copyright (c) 2015-2018 Mount Sinai School of Medicine
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function, division, absolute_import
import logging
import sqlite3
from typechecks import require_integer, require_string, require_iterable_of
logger = logging.getLogger(__name__)
METADATA_TABLE_NAME = "_datacache_metadata"
class Database(object):
"""
Wrapper object for sqlite3 database which provides helpers for
querying and constructing the datacache metadata table, as well as
creating and checking for existence of particular table names.
Calls to methods other than Database.close() and Database.create()
will not commit their changes.
"""
def __init__(self, path):
self.path = path
self.connection = sqlite3.connect(path)
def _commit(self):
self.connection.commit()
def close(self):
"""Commit changes and close database connection"""
self._commit()
self.connection.close()
def table_names(self):
"""Returns names of all tables in the database"""
query = "SELECT name FROM sqlite_master WHERE type='table'"
cursor = self.connection.execute(query)
results = cursor.fetchall()
return [result_tuple[0] for result_tuple in results]
def has_table(self, table_name):
"""Does a table named `table_name` exist in the sqlite database?"""
table_names = self.table_names()
return table_name in table_names
def drop_all_tables(self):
"""Drop all tables in the database"""
for table_name in self.table_names():
self.execute_sql("DROP TABLE %s" % table_name)
self.connection.commit()
def execute_sql(self, sql, commit=False):
"""Log and then execute a SQL query"""
logger.info("Running sqlite query: \"%s\"", sql)
self.connection.execute(sql)
if commit:
self.connection.commit()
def has_tables(self, table_names):
"""Are all of the given table names present in the database?"""
return all(self.has_table(table_name) for table_name in table_names)
def has_version(self):
"""Does this database have version information?
The absence of version information indicates that this database was
either not created by datacache or is incomplete.
"""
return self.has_table(METADATA_TABLE_NAME)
def version(self):
"""What's the version of this database? Found in metadata attached
by datacache when creating this database."""
query = "SELECT version FROM %s" % METADATA_TABLE_NAME
cursor = self.connection.execute(query)
version = cursor.fetchone()
if not version:
return 0
else:
return int(version[0])
def _finalize_database(self, version):
"""
Create metadata table for database with version number.
Parameters
----------
version : int
Tag created database with user-specified version number
"""
require_integer(version, "version")
create_metadata_sql = \
"CREATE TABLE %s (version INT)" % METADATA_TABLE_NAME
self.execute_sql(create_metadata_sql)
insert_version_sql = \
"INSERT INTO %s VALUES (%s)" % (METADATA_TABLE_NAME, version)
self.execute_sql(insert_version_sql)
def _create_table(self, table_name, column_types, primary=None, nullable=()):
"""Creates a sqlite3 table from the given metadata.
Parameters
----------
column_types : list of (str, str) pairs
First element of each tuple is the column name, second element is the sqlite3 type
primary : str, optional
Which column is the primary key
nullable : iterable, optional
Names of columns which have null values
"""
require_string(table_name, "table name")
require_iterable_of(column_types, tuple, name="rows")
if primary is not None:
require_string(primary, "primary")
require_iterable_of(nullable, str, name="nullable")
column_decls = []
for column_name, column_type in column_types:
decl = "%s %s" % (column_name, column_type)
if column_name == primary:
decl += " UNIQUE PRIMARY KEY"
if column_name not in nullable:
decl += " NOT NULL"
column_decls.append(decl)
column_decl_str = ", ".join(column_decls)
create_table_sql = \
"CREATE TABLE %s (%s)" % (table_name, column_decl_str)
self.execute_sql(create_table_sql)
def _fill_table(self, table_name, rows):
require_string(table_name, "table_name")
require_iterable_of(rows, tuple, "rows")
if not self.has_table(table_name):
raise ValueError(
"Table '%s' does not exist in database" % (table_name,))
if len(rows) == 0:
raise ValueError("Rows must be non-empty sequence")
first_row = rows[0]
n_columns = len(first_row)
if not all(len(row) == n_columns for row in rows):
raise ValueError("Rows must all have %d values" % n_columns)
blank_slots = ", ".join("?" for _ in range(n_columns))
logger.info("Inserting %d rows into table %s", len(rows), table_name)
sql = "INSERT INTO %s VALUES (%s)" % (table_name, blank_slots)
self.connection.executemany(sql, rows)
def create(self, tables, version):
"""Do the actual work of creating the database, filling its tables with
values, creating indices, and setting the datacache version metadata.
Parameters
----------
tables : list
List of datacache.DatabaseTable objects
version : int
"""
for table in tables:
self._create_table(
table_name=table.name,
column_types=table.column_types,
primary=table.primary_key,
nullable=table.nullable)
self._fill_table(table.name, table.rows)
self._create_indices(table.name, table.indices)
self._finalize_database(version)
self._commit()
def _create_index(self, table_name, index_columns):
"""
Creates an index over multiple columns of a given table.
Parameters
----------
table_name : str
index_columns : iterable of str
Which columns should be indexed
"""
logger.info(
"Creating index on %s (%s)",
table_name,
", ".join(index_columns))
index_name = "%s_index_%s" % (
table_name,
"_".join(index_columns))
self.connection.execute(
"CREATE INDEX IF NOT EXISTS %s ON %s (%s)" % (
index_name,
table_name,
", ".join(index_columns)))
def _create_indices(self, table_name, indices):
"""
Create multiple indices (each over multiple columns) on a given table.
Parameters
----------
table_name : str
indices : iterable of tuples
Multiple groups of columns, each of which should be indexed.
"""
require_string(table_name, "table_name")
require_iterable_of(indices, (tuple, list))
for index_column_set in indices:
self._create_index(table_name, index_column_set)
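
# Lifecycle sketch (illustrative; the path and table object are placeholders):
# helper methods stage changes without committing, then create() fills the
# tables, stamps the metadata version, and commits once.
#
#     db = Database("/tmp/example.db")
#     db.create(tables=[some_database_table], version=1)
#     assert db.has_version() and db.version() == 1
#     db.close()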

==> datacache-1.2.1/datacache/database_helpers.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function, division, absolute_import
from os import remove
from os.path import splitext, exists
import logging
from typechecks import (
require_string,
require_integer,
require_iterable_of
)
from .common import build_path, build_local_filename
from .download import fetch_csv_dataframe
from .database import Database
from .database_table import DatabaseTable
from .database_types import db_type
logger = logging.getLogger(__name__)
def connect_if_correct_version(db_path, version):
"""Return a sqlite3 database connection if the version in the database's
metadata matches the version argument.
Also implicitly checks for whether the data in this database has
been completely filled, since we set the version last.
TODO: Make an explicit 'complete' flag to the metadata.
"""
db = Database(db_path)
if db.has_version() and db.version() == version:
return db.connection
return None
def _create_cached_db(
db_path,
tables,
version=1):
"""
Either create or retrieve sqlite database.
Parameters
--------
db_path : str
Path to sqlite3 database file
tables : dict
Dictionary mapping table names to datacache.DatabaseTable objects
version : int, optional
Version acceptable as cached data.
Returns sqlite3 connection
"""
require_string(db_path, "db_path")
require_iterable_of(tables, DatabaseTable)
require_integer(version, "version")
# if the database file doesn't already exist and we encounter an error
# later, delete the file before raising an exception
delete_on_error = not exists(db_path)
# if the database already exists, contains all the table
# names and has the right version, then just return it
db = Database(db_path)
# make sure to delete the database file in case anything goes wrong
# to avoid leaving behind an empty DB
table_names = [table.name for table in tables]
try:
if db.has_tables(table_names) and \
db.has_version() and \
db.version() == version:
logger.info("Found existing table in database %s", db_path)
else:
if len(db.table_names()) > 0:
logger.info(
"Dropping tables from database %s: %s",
db_path,
", ".join(db.table_names()))
db.drop_all_tables()
logger.info(
"Creating database %s containing: %s",
db_path,
", ".join(table_names))
db.create(tables, version)
except:
logger.warning(
"Failed to create tables %s in database %s",
table_names,
db_path)
db.close()
if delete_on_error:
remove(db_path)
raise
return db.connection
def build_tables(
table_names_to_dataframes,
table_names_to_primary_keys={},
table_names_to_indices={}):
"""
Parameters
----------
table_names_to_dataframes : dict
Dictionary mapping each table name to a DataFrame
table_names_to_primary_keys : dict
Dictionary mapping each table to its primary key
table_names_to_indices : dict
Dictionary mapping each table to a set of indices
Returns list of DatabaseTable objects
"""
tables = []
for table_name, df in table_names_to_dataframes.items():
table_indices = table_names_to_indices.get(table_name, [])
primary_key = table_names_to_primary_keys.get(table_name)
table = DatabaseTable.from_dataframe(
name=table_name,
df=df,
indices=table_indices,
primary_key=primary_key)
tables.append(table)
return tables
def db_from_dataframes_with_absolute_path(
db_path,
table_names_to_dataframes,
table_names_to_primary_keys={},
table_names_to_indices={},
overwrite=False,
version=1):
"""
Create a sqlite3 database from a collection of DataFrame objects
Parameters
----------
db_path : str
Path to database file to create
table_names_to_dataframes : dict
Dictionary from table names to DataFrame objects
table_names_to_primary_keys : dict, optional
Name of primary key column for each table
table_names_to_indices : dict, optional
Dictionary from table names to list of column name tuples
overwrite : bool, optional
If the database already exists, overwrite it?
version : int, optional
"""
if overwrite and exists(db_path):
remove(db_path)
tables = build_tables(
table_names_to_dataframes,
table_names_to_primary_keys,
table_names_to_indices)
return _create_cached_db(
db_path,
tables=tables,
version=version)
def db_from_dataframes(
db_filename,
dataframes,
primary_keys={},
indices={},
subdir=None,
overwrite=False,
version=1):
"""
Create a sqlite3 database from a collection of DataFrame objects
Parameters
----------
db_filename : str
Name of database file to create
dataframes : dict
Dictionary from table names to DataFrame objects
primary_keys : dict, optional
Name of primary key column for each table
indices : dict, optional
Dictionary from table names to list of column name tuples
subdir : str, optional
overwrite : bool, optional
If the database already exists, overwrite it?
version : int, optional
"""
if not (subdir is None or isinstance(subdir, str)):
raise TypeError("Expected subdir to be None or str, got %s : %s" % (
subdir, type(subdir)))
db_path = build_path(db_filename, subdir)
return db_from_dataframes_with_absolute_path(
db_path,
table_names_to_dataframes=dataframes,
table_names_to_primary_keys=primary_keys,
table_names_to_indices=indices,
overwrite=overwrite,
version=version)
def db_from_dataframe(
db_filename,
table_name,
df,
primary_key=None,
subdir=None,
overwrite=False,
indices=(),
version=1):
"""
Given a dataframe `df`, turn it into a sqlite3 database.
Store values in a table called `table_name`.
Returns full path to the sqlite database file.
"""
return db_from_dataframes(
db_filename=db_filename,
dataframes={table_name: df},
primary_keys={table_name: primary_key},
indices={table_name: indices},
subdir=subdir,
overwrite=overwrite,
version=version)
def _db_filename_from_dataframe(base_filename, df):
"""
Generate database filename for a sqlite3 database we're going to
fill with the contents of a DataFrame, using the DataFrame's
column names and types.
"""
db_filename = base_filename + ("_nrows%d" % len(df))
for column_name in df.columns:
column_db_type = db_type(df[column_name].dtype)
column_name = column_name.replace(" ", "_")
db_filename += ".%s_%s" % (column_name, column_db_type)
return db_filename + ".db"
def fetch_csv_db(
table_name,
download_url,
csv_filename=None,
db_filename=None,
subdir=None,
version=1,
**pandas_kwargs):
"""
Download a remote CSV file and create a local sqlite3 database
from its contents
"""
df = fetch_csv_dataframe(
download_url=download_url,
filename=csv_filename,
subdir=subdir,
**pandas_kwargs)
    if db_filename is None:
        # derive the database name from the CSV filename when given,
        # otherwise from the URL-based cache filename
        base_filename = csv_filename if csv_filename else \
            build_local_filename(download_url=download_url)
        db_filename = _db_filename_from_dataframe(
            splitext(base_filename)[0], df)
return db_from_dataframe(
db_filename,
table_name,
df,
subdir=subdir,
version=version)
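
# Sketch of fetch_csv_db (illustrative; the URL and names are placeholders):
# the CSV is downloaded once and the derived sqlite3 database is reused on
# later calls that request the same version.
#
#     connection = fetch_csv_db(
#         table_name="genes",
#         download_url="https://example.com/genes.csv",
#         csv_filename="genes.csv",
#         subdir="my_project")
#     rows = connection.execute("SELECT * FROM genes").fetchall()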

==> datacache-1.2.1/datacache/database_table.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .database_types import db_type
class DatabaseTable(object):
"""Converts between a DataFrame and a sqlite3 database table"""
def __init__(
self,
name,
column_types,
make_rows,
indices=[],
nullable=set(),
primary_key=None):
self.name = name
self.column_types = column_types
self.make_rows = make_rows
self.indices = indices
self.nullable = nullable
self.primary_key = primary_key
@property
def rows(self):
"""Delay constructing list of row tuples"""
return self.make_rows()
@classmethod
def from_dataframe(cls, name, df, indices, primary_key=None):
"""Infer table metadata from a DataFrame"""
# ordered list (column_name, column_type) pairs
column_types = []
# which columns have nullable values
nullable = set()
# tag cached database by dataframe's number of rows and columns
for column_name in df.columns:
values = df[column_name]
if values.isnull().any():
nullable.add(column_name)
column_db_type = db_type(values.dtype)
column_types.append((column_name.replace(" ", "_"), column_db_type))
def make_rows():
return list(tuple(row) for row in df.values)
return cls(
name=name,
column_types=column_types,
make_rows=make_rows,
indices=indices,
nullable=nullable,
primary_key=primary_key)
@classmethod
def from_fasta_dict(cls, name, fasta_dict, key_column, value_column):
key_list = list(fasta_dict.keys())
key_set = set(key_list)
assert len(key_set) == len(key_list), \
"FASTA file from contains %d non-unique sequence identifiers" % \
(len(key_list) - len(key_set))
column_types = [(key_column, "TEXT"), (value_column, "TEXT")]
def make_rows():
return [
(idx, str(record.seq))
for (idx, record)
in fasta_dict.items()
]
return cls(
name=name,
column_types=column_types,
make_rows=make_rows,
primary_key=key_column)
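
# Sketch of from_dataframe (illustrative): column types and nullability are
# inferred from the DataFrame and rows are built lazily via make_rows.
#
#     import pandas as pd
#     df = pd.DataFrame({"id": [1, 2], "name": ["a", None]})
#     table = DatabaseTable.from_dataframe(
#         name="example", df=df, indices=[], primary_key="id")
#     # table.nullable == {"name"}; table.rows == [(1, "a"), (2, None)]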

==> datacache-1.2.1/datacache/database_types.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert from Python type names to sqlite3 column types"""
from __future__ import print_function, division, absolute_import
import numpy as np
_dtype_to_db_type_dict = {
'int': 'INT',
'int8': 'INT',
'int16': 'INT',
'int32': 'INT',
'int64': 'INT',
'uint8': 'INT',
'uint16': 'INT',
'uint32': 'INT',
'uint64': 'INT',
'bool': 'INT',
'float': 'FLOAT',
'float32': 'FLOAT',
'float64': 'FLOAT',
'object': 'TEXT',
'object_': 'TEXT',
'string_': 'TEXT',
'str': 'TEXT',
}
def _lookup_type_name(type_name):
if type_name in _dtype_to_db_type_dict:
return _dtype_to_db_type_dict[type_name]
else:
return None
def _candidate_type_names(python_type_representation):
"""Generator which yields possible type names to look up in the conversion
dictionary.
Parameters
----------
python_type_representation : object
Any Python object which represents a type, such as `int`,
`dtype('int8')`, `np.int8`, or `"int8"`.
"""
# if we get a dtype object i.e. dtype('int16'), then pull out its name
if hasattr(python_type_representation, 'name'):
yield python_type_representation.name
# convert Python types by adding their type's name
if hasattr(python_type_representation, '__name__'):
yield python_type_representation.__name__
# for a dtype like dtype('S3') need to access dtype.type.__name__
# to get 'string_'
if hasattr(python_type_representation, 'type'):
if hasattr(python_type_representation.type, '__name__'):
yield python_type_representation.type.__name__
yield str(python_type_representation)
def db_type(python_type_representation):
"""
Converts from any of:
(1) Python type
(2) NumPy/Pandas dtypes
(3) string names of types
...to a sqlite3 type name
"""
for type_name in _candidate_type_names(python_type_representation):
db_type_name = _lookup_type_name(type_name)
if db_type_name:
return db_type_name
raise ValueError("Failed to find sqlite3 column type for %s" % (
python_type_representation))
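
# Examples (illustrative): each representation resolves through the same
# candidate-name lookup.
#
#     db_type(int)                  # via __name__        -> 'INT'
#     db_type(np.dtype('float32'))  # via dtype.name      -> 'FLOAT'
#     db_type("object")             # via str() fallback  -> 'TEXT'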

==> datacache-1.2.1/datacache/download.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import errno
import gzip
import logging
import os
import subprocess
from shutil import move
from tempfile import NamedTemporaryFile
import zipfile
import urllib.request
import requests
import pandas as pd
from .common import build_path, build_local_filename
logger = logging.getLogger(__name__)
def _download(download_url, timeout=None):
if download_url.startswith("http"):
response = requests.get(download_url, timeout=timeout)
response.raise_for_status()
return response.content
else:
req = urllib.request.Request(download_url)
response = urllib.request.urlopen(req, data=None, timeout=timeout)
return response.read()
def _download_to_temp_file(
download_url,
timeout=None,
base_name="download",
ext="tmp",
use_wget_if_available=False):
if not download_url:
raise ValueError("URL not provided")
with NamedTemporaryFile(
suffix='.' + ext,
prefix=base_name,
delete=False) as tmp:
tmp_path = tmp.name
def download_using_python():
with open(tmp_path, mode="w+b") as tmp_file:
tmp_file.write(
_download(download_url, timeout=timeout))
if not use_wget_if_available:
download_using_python()
else:
try:
# first try using wget to download since this works on Travis
# even when FTP otherwise fails
wget_command_list = [
"wget",
download_url,
"-O", tmp_path,
"--no-verbose",
]
if download_url.startswith("ftp"):
wget_command_list.extend(["--passive-ftp"])
if timeout:
wget_command_list.extend(["-T", "%s" % timeout])
logger.info("Running: %s" % (" ".join(wget_command_list)))
subprocess.call(wget_command_list)
except OSError as e:
            if e.errno == errno.ENOENT:
# wget not found
download_using_python()
else:
raise
return tmp_path
def _download_and_decompress_if_necessary(
full_path,
download_url,
timeout=None,
use_wget_if_available=False):
"""
Downloads remote file at `download_url` to local file at `full_path`
"""
logger.info("Downloading %s to %s", download_url, full_path)
filename = os.path.split(full_path)[1]
base_name, ext = os.path.splitext(filename)
tmp_path = _download_to_temp_file(
download_url=download_url,
timeout=timeout,
base_name=base_name,
ext=ext,
use_wget_if_available=use_wget_if_available)
if download_url.endswith("zip") and not filename.endswith("zip"):
logger.info("Decompressing zip into %s...", filename)
with zipfile.ZipFile(tmp_path) as z:
names = z.namelist()
assert len(names) > 0, "Empty zip archive"
if filename in names:
chosen_filename = filename
else:
# If zip archive contains multiple files, choose the biggest.
biggest_size = 0
chosen_filename = names[0]
for info in z.infolist():
if info.file_size > biggest_size:
chosen_filename = info.filename
biggest_size = info.file_size
extract_path = z.extract(chosen_filename)
move(extract_path, full_path)
os.remove(tmp_path)
elif download_url.endswith("gz") and not filename.endswith("gz"):
logger.info("Decompressing gzip into %s...", filename)
with gzip.GzipFile(tmp_path) as src:
contents = src.read()
os.remove(tmp_path)
with open(full_path, 'wb') as dst:
dst.write(contents)
elif download_url.endswith(("html", "htm")) and full_path.endswith(".csv"):
logger.info("Extracting HTML table into CSV %s...", filename)
df = pd.read_html(tmp_path, header=0)[0]
df.to_csv(full_path, sep=',', index=False, encoding='utf-8')
else:
move(tmp_path, full_path)
def file_exists(
download_url,
filename=None,
decompress=False,
subdir=None):
"""
Return True if a local file corresponding to these arguments
exists.
"""
filename = build_local_filename(download_url, filename, decompress)
full_path = build_path(filename, subdir)
return os.path.exists(full_path)
def fetch_file(
download_url,
filename=None,
decompress=False,
subdir=None,
force=False,
timeout=None,
use_wget_if_available=False):
"""
Download a remote file and store it locally in a cache directory. Don't
download it again if it's already present (unless `force` is True.)
Parameters
----------
download_url : str
Remote URL of file to download.
filename : str, optional
Local filename, used as cache key. If omitted, then determine the local
filename from the URL.
decompress : bool, optional
By default any file whose remote extension is one of (".zip", ".gzip")
and whose local filename lacks this suffix is decompressed. If a local
filename wasn't provided but you still want to decompress the stored
data then set this option to True.
subdir : str, optional
Group downloads in a single subdirectory.
force : bool, optional
By default, a remote file is not downloaded if it's already present.
However, with this argument set to True, it will be overwritten.
timeout : float, optional
Timeout for download in seconds, default is None which uses
global timeout.
    use_wget_if_available : bool, optional
        If the `wget` command is available, use it for the download instead
        of Python libraries (default False).
Returns the full path of the local file.
"""
filename = build_local_filename(download_url, filename, decompress)
full_path = build_path(filename, subdir)
if not os.path.exists(full_path) or force:
logger.info("Fetching %s from URL %s", filename, download_url)
_download_and_decompress_if_necessary(
full_path=full_path,
download_url=download_url,
timeout=timeout,
use_wget_if_available=use_wget_if_available)
else:
logger.info("Cached file %s from URL %s", filename, download_url)
return full_path
def fetch_and_transform(
transformed_filename,
transformer,
loader,
source_filename,
source_url,
subdir=None):
"""
Fetch a remote file from `source_url`, save it locally as `source_filename` and then use
the `loader` and `transformer` function arguments to turn this saved data into an in-memory
object.
"""
transformed_path = build_path(transformed_filename, subdir)
if not os.path.exists(transformed_path):
        source_path = fetch_file(source_url, source_filename, subdir=subdir)
logger.info("Generating data file %s from %s", transformed_path, source_path)
result = transformer(source_path, transformed_path)
else:
logger.info("Cached data file: %s", transformed_path)
result = loader(transformed_path)
assert os.path.exists(transformed_path)
return result
def fetch_csv_dataframe(
download_url,
filename=None,
subdir=None,
**pandas_kwargs):
"""
Download a remote file from `download_url` and save it locally as `filename`.
Load that local file as a CSV into Pandas using extra keyword arguments such as sep='\t'.
"""
path = fetch_file(
download_url=download_url,
filename=filename,
decompress=True,
subdir=subdir)
return pd.read_csv(path, **pandas_kwargs)
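
# Usage sketch (illustrative; the URL is a placeholder): fetch a remote
# tab-separated file once, then parse it with pandas keyword passthrough.
#
#     df = fetch_csv_dataframe(
#         "https://example.com/table.tsv.gz",
#         subdir="my_project",
#         sep="\t")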

==> datacache-1.2.1/datacache/version.py <==
__version__ = '1.2.1'

==> datacache-1.2.1/datacache.egg-info/PKG-INFO <==
Metadata-Version: 2.1
Name: datacache
Version: 1.2.1
Summary: Helpers for transparently downloading datasets
Home-page: https://github.com/openvax/datacache
Author: Alex Rubinsteyn
Author-email: alex.rubinsteyn@unc.edu
License: http://www.apache.org/licenses/LICENSE-2.0.html
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Description-Content-Type: text/markdown
License-File: LICENSE
DataCache
=========
Helpers for transparently downloading datasets
## API
* **fetch_file**(download_url, filename = *None*, decompress = *False*, subdir = *None*)
* **fetch_and_transform**(transformed_filename, transformer, loader,
source_filename, source_url, subdir = *None*)
* **fetch_csv_dataframe**(download_url, filename = *None*, subdir = *None*,
\*\*pandas_kwargs)
* **fetch_csv_db**(table_name, download_url, csv_filename = *None*, db_filename = *None*,
subdir = *None*, version = *1*, \*\*pandas_kwargs)
* **db_from_dataframe**(db_filename, table_name, df, primary_key = *None*,
subdir = *None*, overwrite = *False*, indices = *()*, version = *1*)
* **db_from_dataframes**(db_filename, dataframes, primary_keys = *{}*,
indices = *{}*, subdir = *None*, overwrite = *False*, version = *1*)
* **Cache**(subdir = *"datacache"*): object wrapping fetch_file with per-URL bookkeeping

==> datacache-1.2.1/datacache.egg-info/SOURCES.txt <==
LICENSE
README.md
setup.py
datacache/__init__.py
datacache/cache.py
datacache/common.py
datacache/database.py
datacache/database_helpers.py
datacache/database_table.py
datacache/database_types.py
datacache/download.py
datacache/version.py
datacache.egg-info/PKG-INFO
datacache.egg-info/SOURCES.txt
datacache.egg-info/dependency_links.txt
datacache.egg-info/requires.txt
datacache.egg-info/top_level.txt
test/test_cache_object.py
test/test_database_objects.py
test/test_database_types.py
test/test_db_from_dataframes.py
test/test_download.py
test/test_fasta_dict.py
test/test_names.py

==> datacache-1.2.1/datacache.egg-info/dependency_links.txt <==

==> datacache-1.2.1/datacache.egg-info/requires.txt <==
pandas>=0.15.2
appdirs>=1.4.0
progressbar33>=2.4
requests>=2.5.1
typechecks>=0.0.2
mock

==> datacache-1.2.1/datacache.egg-info/top_level.txt <==
datacache

==> datacache-1.2.1/setup.cfg <==
[egg_info]
tag_build =
tag_date = 0

==> datacache-1.2.1/setup.py <==
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import logging
import re
from setuptools import setup
readme_filename = "README.md"
current_directory = os.path.dirname(__file__)
readme_path = os.path.join(current_directory, readme_filename)
project_name = "datacache"
try:
with open(readme_path, "r") as f:
readme_markdown = f.read()
except IOError:
    logging.warning("Failed to load %s", readme_filename)
readme_markdown = ""
with open('%s/version.py' % project_name, 'r') as f:
version = re.search(
r'^__version__\s*=\s*[\'"]([^\'"]*)[\'"]',
f.read(),
re.MULTILINE).group(1)
if __name__ == "__main__":
setup(
name=project_name,
version=version,
description="Helpers for transparently downloading datasets",
author="Alex Rubinsteyn",
author_email="alex.rubinsteyn@unc.edu",
url="https://github.com/openvax/%s" % project_name,
license="http://www.apache.org/licenses/LICENSE-2.0.html",
classifiers=[
"Development Status :: 4 - Beta",
"Environment :: Console",
"Operating System :: OS Independent",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python",
"Topic :: Scientific/Engineering :: Bio-Informatics",
],
install_requires=[
"pandas>=0.15.2",
"appdirs>=1.4.0",
"progressbar33>=2.4",
"requests>=2.5.1",
"typechecks>=0.0.2",
"mock",
],
long_description=readme_markdown,
long_description_content_type='text/markdown',
packages=[project_name],
)

==> datacache-1.2.1/test/test_cache_object.py <==
from os import remove
from os.path import exists
from mock import patch
from datacache import Cache
CACHE_DIR = "datacache_test"
TEST_URL = "http://www.google.com"
TEST_FILENAME = "google"
def eq_(x, y):
assert x == y
def test_cache_object_path():
cache = Cache(CACHE_DIR)
assert cache.cache_directory_path.endswith(CACHE_DIR), \
"Expected directory path to end with %s but got %s" % (
CACHE_DIR, cache.cache_directory_path)
def test_cache_object_local_filename():
filename = Cache(CACHE_DIR).local_filename(filename="test")
assert filename.endswith("test")
def test_cache_fetch_google():
cache = Cache(CACHE_DIR)
path = cache.fetch(TEST_URL, filename=TEST_FILENAME)
assert path.endswith(TEST_FILENAME), \
"Expected local file to be named %s but got %s" % (
TEST_FILENAME, path)
assert exists(path), "File not found: %s" % path
eq_(path, cache.local_path(TEST_URL, filename=TEST_FILENAME))
@patch('datacache.cache.download._download_and_decompress_if_necessary')
def test_cache_fetch_force(mock_download):
cache = Cache("datacache_test")
cache.fetch("http://www.google.com", filename="google", force=True)
cache.fetch("http://www.google.com", filename="google", force=True)
assert len(mock_download.call_args_list) == 2, \
"Expected two separate calls to _download, given force=True"
def test_cache_delete_url():
cache = Cache(CACHE_DIR)
path = cache.fetch(TEST_URL, filename=TEST_FILENAME)
assert exists(path), "Expected %s to exist after download" % path
cache.delete_url(TEST_URL)
assert not exists(path), \
"Expected %s to be deleted after call to delete_url" % path
def test_cache_missing_file():
"""test_cache_missing_file : Files can be deleted from the file system,
Cache should be aware that these files no longer exist
"""
cache = Cache(CACHE_DIR)
path = cache.fetch(TEST_URL, filename=TEST_FILENAME)
# does the filename exist?
assert exists(path)
# does the cache know the URL has been downloaded?
assert cache.exists(TEST_URL, filename=TEST_FILENAME)
remove(path)
assert not cache.exists(TEST_URL, filename=TEST_FILENAME), \
"Local file for %s has been deleted from the file system" % TEST_URL

==> datacache-1.2.1/test/test_database_objects.py <==
"""
Test that datacache constructs databases correctly
(separately from downloading/caching them)
"""
import tempfile
import datacache
def eq_(x, y):
assert x == y
TABLE_NAME = "test"
INT_COL_NAME = "int_col"
STR_COL_NAME = "str_col"
COL_TYPES = [(INT_COL_NAME, "INT"), (STR_COL_NAME, "STR")]
KEY_COLUMN_NAME = "int_col"
NULLABLE = {STR_COL_NAME}
ROWS = [(1, "darkness"), (2, "light"), (3, None)]
INDICES = [["str_col"]]
VERSION = 2
def make_table_object():
return datacache.database_table.DatabaseTable(
name=TABLE_NAME,
column_types=COL_TYPES,
make_rows=lambda: ROWS,
indices=INDICES,
nullable=NULLABLE,
primary_key=INT_COL_NAME)
def test_database_table_object():
table = make_table_object()
eq_(table.name, TABLE_NAME)
eq_(table.indices, INDICES)
eq_(table.nullable, NULLABLE)
eq_(table.rows, ROWS)
eq_(table.indices, INDICES)
def test_create_db():
with tempfile.NamedTemporaryFile(suffix="test.db") as f:
db = datacache.database.Database(f.name)
table = make_table_object()
db.create(tables=[table], version=VERSION)
assert db.has_table(TABLE_NAME)
assert db.has_version()
assert db.version() == VERSION
sql = """
SELECT %s from %s WHERE %s = '%s'
""" % (INT_COL_NAME, TABLE_NAME, STR_COL_NAME, "light")
cursor = db.connection.execute(sql)
int_result_tuple = cursor.fetchone()
int_result = int_result_tuple[0]
eq_(int_result, 2)

==> datacache-1.2.1/test/test_database_types.py <==
import numpy as np
from datacache.database_types import db_type
def test_db_types():
for int_type in [
int,
np.int8, np.int16, np.int32, np.int64,
np.uint8, np.uint16, np.uint32, np.uint64]:
assert db_type(int_type) == "INT"
for float_type in [float, np.float32, np.float64]:
assert db_type(float) == "FLOAT"
assert db_type(str) == "TEXT"

==> datacache-1.2.1/test/test_db_from_dataframes.py <==
import pandas as pd
from tempfile import NamedTemporaryFile
from datacache import db_from_dataframes, db_from_dataframe
dfA = pd.DataFrame({"numbers": [1, 2, 3], "strings": ["a", "b", "c"]})
dfB = pd.DataFrame({"wuzzles": ["nuzzle", "ruzzle"]})
def eq_(x, y):
assert x == y
def test_database_from_dataframes():
with NamedTemporaryFile(suffix="test.db") as f:
db = db_from_dataframes(
db_filename=f.name,
dataframes={"A": dfA, "B": dfB},
primary_keys={"A": "numbers"},
indices={"A": [("numbers", "strings")]},
subdir="test_datacache")
cursor_A = db.execute("SELECT * FROM A")
results_A = cursor_A.fetchall()
eq_(results_A, [(1, "a"), (2, "b"), (3, "c")])
cursor_B = db.execute("SELECT * FROM B")
results_B = cursor_B.fetchall()
eq_(results_B, [("nuzzle",), ("ruzzle",)])
def test_database_from_single_dataframe():
with NamedTemporaryFile(suffix="test.db") as f:
db = db_from_dataframe(
db_filename=f.name,
table_name="A",
df=dfA,
primary_key="numbers",
indices=[("numbers", "strings")],
subdir="test_datacache")
cursor = db.execute("SELECT * FROM A")
results = cursor.fetchall()
eq_(results, [(1, "a"), (2, "b"), (3, "c")])

==> datacache-1.2.1/test/test_download.py <==
# Copyright (c) 2014. Mount Sinai School of Medicine
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from datacache import fetch_file
FASTA_FILENAME = 'Homo_sapiens.GRCh37.75.dna_rm.chromosome.MT.fa'
URL = \
'ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna_rm.chromosome.MT.fa.gz'
def test_fetch_decompress():
for use_wget_if_available in [True, False]:
for timeout in [None, 10**6]:
path1 = fetch_file(
URL,
decompress=True,
subdir="datacache",
use_wget_if_available=use_wget_if_available,
timeout=timeout)
assert path1.endswith(FASTA_FILENAME)
with open(path1, 'r') as f1:
s1 = f1.read()
assert "TCAATTTCGTGCCAG" in s1
def test_fetch_subdirs():
path = fetch_file(URL, decompress=True, subdir="datacache")
assert path.endswith(FASTA_FILENAME)
    # if we change the subdir then data should end up in
    # something like /Users/me/Library/Caches/datacache_test/
other_path = fetch_file(URL, decompress=True, subdir="datacache_test")
assert other_path.endswith(FASTA_FILENAME)
assert other_path != path, other_path

==> datacache-1.2.1/test/test_fasta_dict.py <==
# Copyright (c) 2014. Mount Sinai School of Medicine
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from datacache import fetch_file
import gzip
URL = "".join([
'ftp://ftp.ensembl.org/pub/release-75',
'/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75',
'.dna_rm.chromosome.MT.fa.gz',
])
def fetch_fasta_dict(path_or_url):
path = fetch_file(path_or_url)
d = {}
value_buffer = []
key = None
if path.endswith(".gz") or path.endswith(".gzip"):
f = gzip.open(path, "r")
else:
f = open(path, "r")
for line in f.readlines():
if type(line) is bytes:
line = line.decode("ascii")
if line.startswith(">"):
if key is not None:
d[key] = "".join(value_buffer)
value_buffer = []
key = line.split()[0][1:]
else:
value_buffer.append(line.strip())
if key and value_buffer:
d[key] = "".join(value_buffer)
f.close()
return d
def test_download_fasta_dict():
d = fetch_fasta_dict(URL)
assert len(d) > 0

==> datacache-1.2.1/test/test_names.py <==
from datacache import build_local_filename
def test_url_without_filename():
filename = build_local_filename(download_url="http://www.google.com/")
assert filename
assert "google" in filename
def test_multiple_domains_same_file():
filename_google = build_local_filename(
download_url="http://www.google.com/index.html")
filename_yahoo = build_local_filename(
download_url="http://www.yahoo.com/index.html")
assert "index" in filename_google
assert "index" in filename_yahoo
assert filename_yahoo != filename_google