Reference databases (modelarchive.databases)

The databases module retrieves and verifies data from reference databases such as UniProtKB and the NCBI databases.

NCBI (modelarchive.databases.ncbi)

Functions to retrieve data from NCBI databases in batches.

modelarchive.databases.ncbi.get_and_check_ncbi_data(ncbi_acs, ncbi_metadata_file=None)

Fetch NCBI sequence data, enrich with species names, and run checks.

Deduplicates the given accessions, fetches sequences and metadata (from the cache or from NCBI), adds scientific species names from the taxonomy database, and runs sanity checks. Warnings about outdated entries are printed.

Parameters:
  • ncbi_acs (list[str]) – NCBI protein accessions to process. Duplicates are silently removed.

  • ncbi_metadata_file (str | None) – Path to a JSON cache file for NCBI protein data. If None, data is always fetched from NCBI. Defaults to None.

Returns:

Mapping from accession to a dict with keys "seq_name", "seq_str", and "info". The "info" sub-dict contains all summary key-value pairs plus "SpeciesName".

Return type:

dict[str, dict]

Raises:

RuntimeError – If a returned accession or taxon ID cannot be matched to the requested set, if sequence and info keys diverge after a batch, if the final resolved set is incomplete, or if an entry fails the mmCIF character or accession consistency checks.
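The documented return structure can be consumed directly. The following is a minimal sketch and not part of modelarchive: ncbi_data_to_fasta is a hypothetical helper, and the sample mapping (accession, name, sequence) is made-up data that only mimics the documented shape.

```python
def ncbi_data_to_fasta(ncbi_data):
    """Render the mapping returned by get_and_check_ncbi_data() as FASTA text.

    Hypothetical helper; relies only on the documented keys "seq_name",
    "seq_str", and "info" (with its "SpeciesName" entry).
    """
    records = []
    for accession, entry in ncbi_data.items():
        species = entry["info"]["SpeciesName"]
        header = f">{accession} {entry['seq_name']} [{species}]"
        records.append(f"{header}\n{entry['seq_str']}")
    return "\n".join(records)


# Made-up example data in the documented shape (not a real NCBI record).
example_data = {
    "XX_000001.1": {
        "seq_name": "hypothetical protein A",
        "seq_str": "MKTAYIAK",
        "info": {"SpeciesName": "Escherichia coli"},
    },
}

fasta_text = ncbi_data_to_fasta(example_data)
print(fasta_text)
```

In real use, the mapping would come from get_and_check_ncbi_data() rather than from a literal dict.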

modelarchive.databases.ncbi.get_and_check_ncbi_species_names(ncbi_tax_ids)

Fetch and validate scientific species names for NCBI taxon IDs.

Retrieves taxonomy summaries for all given taxon IDs and checks that every entry carries the expected Status value. Unexpected status values are printed.

Parameters:

ncbi_tax_ids (list[str]) – NCBI taxon IDs as strings.

Returns:

Mapping from taxon ID string to scientific species name.

Return type:

dict[str, str]

Raises:

RuntimeError – If a returned taxon ID is not in the requested set, or if the final set of resolved taxon IDs does not match the requested set.
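The two error conditions above amount to a set comparison between requested and resolved taxon IDs. The sketch below illustrates that kind of check. It is not the library's implementation, and check_resolved_tax_ids is a hypothetical name.

```python
def check_resolved_tax_ids(requested_ids, resolved_names):
    """Raise RuntimeError unless exactly the requested taxon IDs were resolved.

    Hypothetical helper: requested_ids is a list of taxon ID strings and
    resolved_names maps taxon ID strings to species names.
    """
    requested = set(requested_ids)  # duplicates in the request are harmless
    resolved = set(resolved_names)
    unexpected = resolved - requested
    if unexpected:
        raise RuntimeError(f"Unrequested taxon IDs returned: {sorted(unexpected)}")
    missing = requested - resolved
    if missing:
        raise RuntimeError(f"Taxon IDs not resolved: {sorted(missing)}")
    return resolved_names


names = {"9606": "Homo sapiens", "562": "Escherichia coli"}
checked = check_resolved_tax_ids(["9606", "562", "9606"], names)
```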

UniProtKB (modelarchive.databases.uniprotkb)

Functions to retrieve data for UniProtKB entries and sequences.

class modelarchive.databases.uniprotkb.UniProtKBEntry(unp_ac, entry_version=None, json_data=None)

Bases: object

Represent a single UniProtKB entry and its metadata.

Fetches and parses a UniProtKB entry in TXT flat-file format from the UniProtKB or UniSave REST API on construction, or restores from a previously serialised JSON object.

Parameters:
  • unp_ac (str) – Accession code of the UniProtKB entry to fetch.

  • entry_version (str | int | None) – Entry version to fetch. If None, the latest version is retrieved.

  • json_data (dict | None) – Serialised JSON dict from which to restore the object instead of fetching from the API. When provided, unp_ac is ignored.

unp_ac

UniProtKB accession code.

Type:

str

entry_status

Entry status, e.g. 'REVIEWED'.

Type:

str | None

entry_version

Entry version number.

Type:

int | None

first_appearance

Date the entry was integrated into UniProtKB.

Type:

datetime | None

last_change

Date of the last annotation change.

Type:

datetime | None

last_seq_change

Date of the last sequence change.

Type:

datetime | None

ncbi_taxid

NCBI taxonomy ID.

Type:

str | None

organism_species

Organism species name.

Type:

str

seq_version

Sequence version number.

Type:

int | None

seqlen

Length of the canonical sequence.

Type:

int | None

unp_crc64

CRC64 checksum of the sequence.

Type:

str | None

unp_details_full

Full recommended protein name.

Type:

str | None

unp_id

UniProtKB entry name (mnemonic ID).

Type:

str | None

unp_seq

Canonical amino-acid sequence.

Type:

str

to_json()

Serialise the entry to a JSON-compatible dict.

The returned dict can be passed to __init__() via the json_data parameter to restore the entry without an API call.

Returns:

JSON-serialisable representation of the entry.

Return type:

dict

class modelarchive.databases.uniprotkb.UniProtKBEntryCache(json_cache_file=None)

Bases: object

Cached retrieval of UniProtKB entries.

To avoid calling the UniProtKB API for the same accession code multiple times, use this cache. The cache is keyed by accession code and entry version.

Be aware that when no entry version is specified, the latest version is fetched, and the latest version changes at UniProtKB a couple of times a year. The cache has no size limit, and entries are never evicted.

Parameters:

json_cache_file (str | None) – Path to a JSON file used to persist the cache across runs. If None, the cache is kept in memory only and is lost when the process exits.

get(unp_ac, entry_version=None)

Return a UniProtKBEntry from the cache.

Fetches from the UniProtKB API on cache miss and persists the updated cache to disk if a cache file was configured.

Parameters:
  • unp_ac (str) – UniProtKB accession code.

  • entry_version (int | None) – Entry version to retrieve. If None, the latest version is fetched.

Returns:

The requested entry.

Return type:

UniProtKBEntry
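The fetch-on-miss and persist-on-update behaviour described for get() can be illustrated with a toy cache. This is a sketch under stated assumptions, not the library's code: SketchEntryCache, the injected fetch callable, the "latest" placeholder key, and the on-disk layout are all hypothetical.

```python
import json
import os
import tempfile


class SketchEntryCache:
    """Toy cache keyed by accession and entry version (not the library class)."""

    def __init__(self, fetch, json_cache_file=None):
        self._fetch = fetch  # hypothetical callable: (unp_ac, version) -> dict
        self._file = json_cache_file
        self._data = {}
        if json_cache_file and os.path.exists(json_cache_file):
            with open(json_cache_file) as handle:
                self._data = json.load(handle)

    def get(self, unp_ac, entry_version=None):
        # JSON object keys are strings, so versions are stored as strings;
        # "latest" is an assumed placeholder for "no version requested".
        version_key = "latest" if entry_version is None else str(entry_version)
        versions = self._data.setdefault(unp_ac, {})
        if version_key not in versions:
            versions[version_key] = self._fetch(unp_ac, entry_version)
            if self._file:
                with open(self._file, "w") as handle:
                    json.dump(self._data, handle)
        return versions[version_key]


calls = []


def fake_fetch(unp_ac, entry_version):
    """Stand-in for an API call; records each invocation."""
    calls.append((unp_ac, entry_version))
    return {"unp_ac": unp_ac, "entry_version": entry_version}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "cache.json")
    cache = SketchEntryCache(fake_fetch, cache_path)
    cache.get("P12345", 10)
    cache.get("P12345", 10)  # served from memory, no second fetch
    reloaded = SketchEntryCache(fake_fetch, cache_path)
    reloaded.get("P12345", 10)  # served from the JSON file
    fetch_count = len(calls)
```

In this sketch a request without a version is cached under a single key, so a persisted "latest" entry can become stale. That is why requesting an explicit entry version gives reproducible results across runs.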

match_sequence(unp_ac, sequence, start=None, end=None)

Match a sequence against a UniProtKB entry, walking through various versions.

Aligns sequence against the canonical sequence of the UNP entry using a Needleman-Wunsch global alignment (parasail NW, BLOSUM62). If the alignment at the current entry version does not produce an exact match in the requested range (no gaps in the UNP sequence, boundaries match start and end), older entry versions are tried in descending order until a perfect match is found.

If start and end are None, the full length of sequence is used as the range (start=1, end=len(sequence)).

Parameters:
  • unp_ac (str) – UniProtKB accession code.

  • sequence (str) – Target sequence of the model.

  • start (int | None) – Start residue of the alignment, 1-based. Defaults to 1.

  • end (int | None) – End residue of the alignment, 1-based inclusive. Defaults to len(sequence).

Returns:

A 3-tuple of the matching UniProtKBEntry, the aligned range in the UNP sequence as (start, end), and the aligned range in sequence as (start, end). All positions are 1-based inclusive.

Return type:

tuple[UniProtKBEntry, tuple[int, int], tuple[int, int]]

Raises:

RuntimeError – If no exact match can be found across all available entry versions.
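The version walk and the shape of the returned tuple can be sketched as follows. This is a deliberate simplification: exact substring search stands in for the Needleman-Wunsch alignment, and both find_matching_version and the sequences_by_version mapping (entry version to canonical sequence) are hypothetical, with made-up sequences.

```python
def find_matching_version(sequences_by_version, sequence, start=None, end=None):
    """Return (version, unp_range, seq_range) for the newest exact match.

    sequences_by_version: hypothetical dict {entry_version: canonical sequence}.
    Exact substring search replaces the global alignment used by the library.
    """
    start = 1 if start is None else start
    end = len(sequence) if end is None else end
    fragment = sequence[start - 1:end]  # 1-based inclusive -> Python slice
    # Try the newest version first, then walk back through older ones.
    for version in sorted(sequences_by_version, reverse=True):
        offset = sequences_by_version[version].find(fragment)
        if offset >= 0:
            unp_range = (offset + 1, offset + len(fragment))
            return version, unp_range, (start, end)
    raise RuntimeError("No exact match in any available entry version")


# Made-up version history: version 3 differs from version 2 at one residue.
history = {3: "MKTAYLAKQR", 2: "MKTAYIAKQR", 1: "MKTAYIAK"}
result = find_matching_version(history, "TAYIAKQ")
print(result)  # (2, (3, 9), (1, 7))
```

All ranges are 1-based inclusive, matching the convention documented for match_sequence().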

to_json()

Serialise the cache contents to a JSON-compatible dict.

Returns:

Nested dict mapping accession codes and entry versions to their serialised UniProtKBEntry representations.

Return type:

dict

modelarchive.databases.uniprotkb.translate_upkb_date_string(date_string)

Convert a UniProtKB date string to a locale-independent format.

UniProtKB uses 3-letter English month abbreviations (e.g. 'MAY', 'NOV') which fail with datetime.strptime() in non-English locales. This function replaces the month abbreviation with its zero-padded numeric equivalent before parsing.

Parameters:

date_string (str) – A UniProtKB date string containing a 3-letter month abbreviation, e.g. '15-MAY-2023'.

Returns:

The date string with the month abbreviation replaced by its numeric equivalent, e.g. '15-05-2023'.

Return type:

str

Raises:

RuntimeError – If no known 3-letter month abbreviation is found in date_string.
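The month replacement described above can be sketched in a few lines. This is an illustration of the technique rather than the library's source: month_to_number and MONTHS are hypothetical names, and the "%d-%m-%Y" parse format is an assumption based on the documented example.

```python
from datetime import datetime

# English 3-letter month abbreviations in calendar order.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def month_to_number(date_string):
    """Replace a 3-letter English month abbreviation with 01..12."""
    for number, abbreviation in enumerate(MONTHS, start=1):
        if abbreviation in date_string:
            return date_string.replace(abbreviation, f"{number:02d}")
    raise RuntimeError(f"No month abbreviation found in {date_string!r}")


numeric = month_to_number("15-MAY-2023")
# Numeric months parse identically in every locale.
parsed = datetime.strptime(numeric, "%d-%m-%Y")
print(numeric, parsed.date())  # 15-05-2023 2023-05-15
```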