Reference databases (modelarchive.databases)
The databases module helps with retrieving and verifying
data from reference databases such as UniProtKB and the NCBI databases.
NCBI (modelarchive.databases.ncbi)
Functions to retrieve data from NCBI databases in batch.
- modelarchive.databases.ncbi.get_and_check_ncbi_data(ncbi_acs, ncbi_metadata_file=None)
Fetch NCBI sequence data, enrich with species names, and run checks.
Deduplicates the given accessions, fetches sequence and metadata (from cache or NCBI), adds scientific species names from the taxonomy database, and runs sanity checks. Warnings about outdated entries will be printed out.
- Parameters:
- Returns:
Mapping from accession to a dict with keys "seq_name", "seq_str", and "info". The "info" sub-dict contains all summary key-value pairs plus "SpeciesName".
- Return type:
- Raises:
RuntimeError – If a returned accession or taxon ID cannot be matched to the requested set, if sequence and info keys diverge after a batch, if the final resolved set is incomplete, or if an entry fails the mmCIF character or accession consistency checks.
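
The documented return shape can be checked with a few lines of plain Python. The following is a minimal sketch that relies only on the key names listed under Returns; `check_ncbi_result_shape` is a hypothetical helper, not part of modelarchive, and it does not reproduce the library's own mmCIF character or accession consistency checks.

```python
def check_ncbi_result_shape(result):
    """Check that a mapping follows the documented return shape.

    Hypothetical helper for illustration, not part of modelarchive.
    """
    for accession, entry in result.items():
        # Every entry is expected to carry these three keys.
        for key in ("seq_name", "seq_str", "info"):
            if key not in entry:
                raise RuntimeError(f"{accession}: missing key {key!r}")
        # The "info" sub-dict is expected to include the species name.
        if "SpeciesName" not in entry["info"]:
            raise RuntimeError(f"{accession}: 'info' lacks 'SpeciesName'")
```

Such a check would be applied to the mapping returned by get_and_check_ncbi_data before the sequences are used further.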
- modelarchive.databases.ncbi.get_and_check_ncbi_species_names(ncbi_tax_ids)
Fetch and validate scientific species names for NCBI taxon IDs.
Retrieves taxonomy summaries for all given taxon IDs and checks that all entries carry the expected Status value. Unexpected status values will be printed out.
- Parameters:
- Returns:
Mapping from taxon ID string to scientific species name.
- Return type:
- Raises:
RuntimeError – If a returned taxon ID is not in the requested set, or if the final set of resolved taxon IDs does not match the requested set.
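
The two error conditions above amount to a set comparison between requested and resolved taxon IDs. A minimal sketch, assuming the resolved names arrive as a dict keyed by taxon ID strings (as described under Returns); `check_resolved_tax_ids` is a hypothetical name and the library's actual checks may differ in detail.

```python
def check_resolved_tax_ids(requested_tax_ids, resolved_names):
    """Compare requested taxon IDs with the keys of the resolved mapping.

    Hypothetical helper for illustration, not part of modelarchive.
    """
    # Normalise to strings, since the mapping is keyed by taxon ID strings.
    requested = {str(tax_id) for tax_id in requested_tax_ids}
    resolved = set(resolved_names)

    unexpected = resolved - requested
    if unexpected:
        raise RuntimeError(f"Unrequested taxon IDs returned: {sorted(unexpected)}")

    missing = requested - resolved
    if missing:
        raise RuntimeError(f"Taxon IDs not resolved: {sorted(missing)}")
```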
UniProtKB (modelarchive.databases.uniprotkb)
Functions to retrieve data for UniProtKB entries and sequences.
- class modelarchive.databases.uniprotkb.UniProtKBEntry(unp_ac, entry_version=None, json_data=None)
Bases: object
Represent a single UniProtKB entry and its metadata.
Fetches and parses a UniProtKB entry in TXT flat-file format from the UniProtKB or UniSave REST API on construction, or restores from a previously serialised JSON object.
- Parameters:
unp_ac (str) – Accession code of the UniProtKB entry to fetch.
entry_version (str | int | None) – Entry version to fetch. If None, the latest version is retrieved.
json_data (dict | None) – Restore the object from a serialised JSON dict instead of fetching from the API. Ignores unp_ac when provided.
- first_appearance
Date the entry was integrated into UniProtKB.
- Type:
datetime | None
- last_change
Date of the last annotation change.
- Type:
datetime | None
- last_seq_change
Date of the last sequence change.
- Type:
datetime | None
- class modelarchive.databases.uniprotkb.UniProtKBEntryCache(json_cache_file=None)
Bases: object
Cached retrieval of UniProtKB entries.
To avoid calling the UniProtKB API for the same accession code multiple times, use this cache. The cache is keyed by accession code and entry version.
Be aware that when no entry version is specified, the latest version is fetched, and that version may change at UniProtKB a couple of times a year. The cache has no size limit and is never swept.
- Parameters:
json_cache_file (str | None) – Path to a JSON file used to persist the cache across runs. If None, the cache is kept in memory only and is lost when the process exits.
- get(unp_ac, entry_version=None)
Return a UniProtKBEntry from the cache.
Fetches from the UniProtKB API on cache miss and persists the updated cache to disk if a cache file was configured.
- Parameters:
- Returns:
The requested entry.
- Return type:
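
The fetch-on-miss and persist-on-update behaviour of get can be illustrated with a small stand-alone cache. This is a sketch under stated assumptions, not the library's implementation: the class name `VersionedJsonCache`, the injected `fetch` callable, and the use of the string "latest" as the key for an unspecified version are all hypothetical.

```python
import json
import os


class VersionedJsonCache:
    """Cache keyed by accession code and entry version (hypothetical sketch)."""

    def __init__(self, fetch, json_cache_file=None):
        # fetch is a callable(ac, version) returning a JSON-compatible dict.
        self._fetch = fetch
        self._file = json_cache_file
        self._data = {}
        if json_cache_file and os.path.exists(json_cache_file):
            with open(json_cache_file) as fh:
                self._data = json.load(fh)

    def get(self, ac, version=None):
        # JSON object keys must be strings; "latest" is an assumed placeholder.
        key = "latest" if version is None else str(version)
        if key not in self._data.get(ac, {}):
            # Cache miss: fetch, store, and persist if a file was configured.
            entry = self._fetch(ac, version)
            self._data.setdefault(ac, {})[key] = entry
            if self._file:
                with open(self._file, "w") as fh:
                    json.dump(self._data, fh)
        return self._data[ac][key]
```

In this sketch an entry stored under "latest" is never refreshed, which mirrors the caveat above: pinning an explicit entry version is the way to get reproducible results from a persisted cache.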
- match_sequence(unp_ac, sequence, start=None, end=None)
Match a sequence against a UniProtKB entry, walking through various versions.
Aligns sequence against the canonical sequence of the UNP entry using a Needleman-Wunsch global alignment (parasail NW, BLOSUM62). If the alignment at the current entry version does not produce an exact match in the requested range (no gaps in the UNP sequence, boundaries match start and end), older entry versions are tried in descending order until a perfect match is found.
If start and end are None, the full length of sequence is used as the range (start=1, end=len(sequence)).
- Parameters:
- Returns:
A 3-tuple of the matching UniProtKBEntry, the aligned range in the UNP sequence as (start, end), and the aligned range in sequence as (start, end). All positions are 1-based inclusive.
- Return type:
- Raises:
RuntimeError – If no exact match can be found across all available entry versions.
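
The version walk can be sketched without any alignment library. The example below deliberately swaps the Needleman-Wunsch alignment for a plain exact substring search and leaves out the start and end handling, so it only illustrates the "try older versions in descending order" logic and the 1-based inclusive ranges. The function name and the `sequences_by_version` mapping (version number to canonical sequence) are hypothetical, and the toy sequences are made up.

```python
def match_across_versions(sequences_by_version, sequence):
    """Find the newest entry version containing `sequence` verbatim.

    Hypothetical sketch: exact substring search stands in for the
    Needleman-Wunsch alignment used by the library.
    """
    # Walk from the newest version down to the oldest.
    for version in sorted(sequences_by_version, reverse=True):
        offset = sequences_by_version[version].find(sequence)
        if offset >= 0:
            # Convert the 0-based offset to 1-based inclusive positions.
            unp_range = (offset + 1, offset + len(sequence))
            return version, unp_range, (1, len(sequence))
    raise RuntimeError("No exact match in any entry version")


# Toy data: the query only occurs in the oldest version.
versions = {1: "MKTAYIAK", 2: "MKTAYLAK", 3: "MKTGYLAK"}
print(match_across_versions(versions, "TAYIA"))  # (1, (3, 7), (1, 5))
```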
- to_json()
Serialise the cache contents to a JSON-compatible dict.
- Returns:
Nested dict mapping accession codes and entry versions to their serialised UniProtKBEntry representations.
- Return type:
- modelarchive.databases.uniprotkb.translate_upkb_date_string(date_string)
Convert a UniProtKB date string to a locale-independent format.
UniProtKB uses 3-letter English month abbreviations (e.g. 'MAY', 'NOV') which fail with datetime.strptime() in non-English locales. This function replaces the month abbreviation with its zero-padded numeric equivalent before parsing.
- Parameters:
date_string (str) – A UniProtKB date string containing a 3-letter month abbreviation, e.g. '15-MAY-2023'.
- Returns:
The date string with the month abbreviation replaced by its numeric equivalent, e.g. '15-05-2023'.
- Return type:
- Raises:
RuntimeError – If no known 3-letter month abbreviation is found in date_string.
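
The replacement described above can be sketched in a few lines. This is an illustrative re-implementation under a hypothetical name (`month_abbreviation_to_number`), not the library's code, and it assumes upper-case abbreviations as in the documented examples. Once the month is numeric, the string can be parsed with purely numeric format codes, which do not depend on the locale.

```python
from datetime import datetime

_MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def month_abbreviation_to_number(date_string):
    """Replace an upper-case 3-letter month abbreviation with '01'..'12'."""
    for number, abbreviation in enumerate(_MONTHS, start=1):
        if abbreviation in date_string:
            return date_string.replace(abbreviation, f"{number:02d}")
    raise RuntimeError(f"No month abbreviation found in {date_string!r}")


# %d, %m and %Y are numeric, so parsing works in any locale.
parsed = datetime.strptime(month_abbreviation_to_number("15-MAY-2023"), "%d-%m-%Y")
```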