Handling ModelCIF (`modelarchive.modelcif`)

The modelcif module helps the ModelArchive team prepare deposition data before it is stored in the database. For the most part this involves converting files from the legacy PDB format into ModelCIF, as well as refining submitted mmCIF/ModelCIF files.

Functionality can be broadly divided into accessing and editing ModelCIF files.

One note on performance: by the nature of the task addressed here, code clarity is preferred over raw efficiency. Scripts that translate user data into ModelCIF for a deposition are run once and offline, so it does not matter whether execution takes one minute or five. Preparing the data usually takes far longer than running the code itself.

Keep this in mind when implementing ModelCIF support in your own tool: you may prefer to draw inspiration from this module rather than use it directly, or use the python-modelcif package straight away.

Accessing ModelCIF (`modelarchive.modelcif.access`)

Functionality to access data in a ModelCIF file.

class modelarchive.modelcif.access.MABlock(model_data)[source]

Bases: object

Wrapper around gemmi.cif.Block for mmCIF/ModelCIF structure files.

Reads a single mmCIF block from a file or string and exposes the full gemmi.cif.Block interface via attribute delegation, extended by convenience methods for common ModelCIF operations.

Parameters:: model_data (Path | str) – Path to the mmCIF input file or CIF data as text.

source

Path to the mmCIF input file, None if model_data provides CIF data as string.

Type:: str | None

doc

The parsed CIF document.

Type:: gemmi.cif.Document

block

The sole block of the CIF document.

Type:: gemmi.cif.Block

__getattr__(name)[source]

Delegate attribute lookup to the wrapped gemmi.cif.Block.

Called only when normal attribute resolution has already failed (default Python behaviour), so self.block itself is always found through the standard mechanism. Any attribute present on block is transparently forwarded; anything else raises AttributeError as usual.

Parameters:: name (str) – Name of the attribute to look up.
Returns:: The attribute value from block.
Return type:: object
Raises:: AttributeError – If name is not found on block either.
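The delegation pattern described above can be sketched without gemmi. In this minimal illustration, FakeBlock and BlockWrapper are hypothetical stand-ins and not the actual MABlock implementation:

```python
class FakeBlock:
    """Hypothetical stand-in for gemmi.cif.Block."""

    name = "test"

    def find_value(self, tag):
        return f"<value of {tag}>"


class BlockWrapper:
    """Hypothetical wrapper forwarding unknown attributes to ``block``."""

    def __init__(self, block):
        # Stored in the instance __dict__, so normal lookup finds it and
        # __getattr__ is never consulted for "block" itself.
        self.block = block

    def __getattr__(self, name):
        # Only called after normal attribute resolution has failed.
        # getattr() raises AttributeError if the block lacks the name too.
        return getattr(self.block, name)


wrapper = BlockWrapper(FakeBlock())
forwarded_attribute = wrapper.name
forwarded_call = wrapper.find_value("_entry.id")
```

Because `__getattr__` is a fallback only, attributes defined on the wrapper always take precedence over attributes of the wrapped object.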

__iter__()[source]

Iterate over the wrapped gemmi.cif.Block.

Delegates directly to gemmi.cif.Block.__iter__(), yielding whatever the underlying block exposes during iteration (typically its items).

Returns:: An iterator over the block’s contents.
Return type:: iterator

add_category(category, after=None, **kwargs)[source]

Add a new single-row category to the block.

If the category already exists it is overwritten. The new category can be positioned immediately after an existing one by passing its name as after.

Parameters:

category (str) – mmCIF category name to create.
after (str | None) – Name of an existing category after which the new one should be placed. If None, the category is appended to the end.
**kwargs – Item names mapped to lists of values (one value per row — pass single-element lists for a one-row category).

Returns:

None

add_to_category(category, match=None, silent=False, **kwargs)[source]

Update item values in an existing mmCIF category.

Locates a row in category and overwrites the values of the items named by kwargs. When match is given, the row is identified by a key–value pair; otherwise the category must contain exactly one row.

Existing non-placeholder values (i.e. not '.' or '?') are reported to stderr unless silent is True.

Parameters:

category (str) – mmCIF category name, e.g. '_entry'.
match (tuple[str, str] | None) – A (item_name, value) pair used to identify the target row. If None, the category must have exactly one row.
silent (bool) – Suppress replacement messages. Defaults to False.
**kwargs – Item names mapped to their new values.

Returns:

None

Raises:

RuntimeError – If category is absent or empty (via find_strict()), or if no row matches the match criterion.
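The row selection and replacement reporting can be illustrated on a plain dict of lists, the shape returned by get_category(). This is only a sketch: update_row is a hypothetical helper, and raising RuntimeError for a multi-row category without match, as well as creating missing items, are assumptions not confirmed by the description above.

```python
import sys

PLACEHOLDERS = {".", "?"}


def update_row(category, match=None, silent=False, **new_values):
    """Overwrite items in one row of a dict-of-lists category (sketch)."""
    if not category or not any(category.values()):
        raise RuntimeError("category is absent or empty")
    n_rows = len(next(iter(category.values())))
    if match is None:
        # Assumption: more than one row without a match is an error.
        if n_rows != 1:
            raise RuntimeError("category must have exactly one row")
        row = 0
    else:
        item, value = match
        try:
            row = category[item].index(value)
        except (KeyError, ValueError):
            raise RuntimeError(f"no row with {item}={value!r}") from None
    for item, value in new_values.items():
        # Assumption: items missing from the category are created as '?'.
        column = category.setdefault(item, ["?"] * n_rows)
        old = column[row]
        if old not in PLACEHOLDERS and not silent:
            print(f"replacing {item}: {old!r} -> {value!r}", file=sys.stderr)
        column[row] = value


entity = {"id": ["1", "2"], "details": ["?", "old"]}
update_row(entity, match=("id", "2"), silent=True, details="new")
```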

find_strict(name, columns)[source]

Return a table from the block, raising if it is absent or empty.

Parameters:

name (str) – Category name, e.g. '_entity.'.
columns (list[str]) – Column names to select.

Returns:

The requested table.

Return type:

gemmi.cif.Table

Raises:

RuntimeError – If the table is not found or contains no rows.

get_category(category)[source]

Return a category as a dict, compatible with add_category().

Wraps gemmi.cif.Block.get_mmcif_category(). Returns an empty dict when the category is not present in the block. The returned dict can be modified and passed directly to add_category() to round-trip a category:

entity_dict = block.get_category("entity")
# ... modify entity_dict ...
block.add_category("entity", **entity_dict)

Parameters:: category (str) – mmCIF category name, e.g. '_entity'.
Returns:: Mapping of item names to lists of values, or an empty dict if the category is absent.
Return type:: dict

get_sequence(entity)[source]

Return the one-letter sequence for a polymer entity.

Reads residue entries from _entity_poly_seq and converts each three-letter code to a one-letter code via gemmi.find_tabulated_residue(). Reading is done without caching, so calling this method repeatedly for many entities is inefficient.

Parameters:

entity (str) – Numeric entity ID as a string.

Returns:

One-letter amino-acid sequence for the entity.

Return type:

str

Raises:

RuntimeError – If _entity_poly_seq is absent or empty.
RuntimeError – If residue numbering in _entity_poly_seq is not strictly sequential.
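The conversion and the numbering check can be sketched in a few lines. The lookup table below is a small hand-written excerpt standing in for gemmi.find_tabulated_residue(); one_letter_sequence is a hypothetical helper, and both the start at 1 and the mapping of unknown residues to "X" are assumptions.

```python
# Excerpt of a three-letter to one-letter table (stand-in for gemmi's
# tabulated residues).
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "MET": "M", "THR": "T", "TYR": "Y"}


def one_letter_sequence(rows):
    """Build a sequence from (num, mon_id) pairs of one entity (sketch)."""
    sequence = []
    for expected, (num, mon_id) in enumerate(rows, start=1):
        # Assumption: "strictly sequential" means 1, 2, 3, ... without gaps.
        if int(num) != expected:
            raise RuntimeError(f"non-sequential residue number {num}")
        # Assumption: unknown residues map to "X".
        sequence.append(THREE_TO_ONE.get(mon_id, "X"))
    return "".join(sequence)


sequence = one_letter_sequence([("1", "MET"), ("2", "ALA"), ("3", "THR")])
```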

has_category(category)[source]

Check whether a category is present in the block.

Parameters:: category (str) – mmCIF category name to look up.
Returns:: True if the category exists in the block, False otherwise.
Return type:: bool

property polymer_targets

Subset of targets whose _entity.type is 'polymer'.

Lazily populated on first access by filtering targets.

Returns:: Dicts of target entity information for polymer entities.
Return type:: list[dict]

property targets

Mapping of target entities keyed by their entity ID.

Lazily populated on first access from the _ma_target_entity and _entity categories. Each value is a dict with the keys _ma_target_entity.entity_id, _ma_target_entity.sequence and _ma_target_entity.type.

Returns:: Target entity information keyed by entity ID string.
Return type:: dict
Raises:: RuntimeError – If an entity ID appears more than once in _ma_target_entity.

write_file(filename, compress=False, style=Style.Simple)[source]

Write the CIF document to disk, with optional gzip compression.

If compress is True, or if filename already ends with '.gz', the output is written as a gzip-compressed file. The '.gz' suffix is appended automatically when compress is True but the suffix is missing.

Parameters:

filename (str) – Destination file path.
compress (bool) – Whether to gzip-compress the output. Defaults to False.
style (gemmi.cif.Style) – Formatting style for the CIF output. Defaults to gemmi.cif.Style.Simple.

Returns:

None
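The suffix handling described for write_file() can be sketched with the standard library alone. write_text is a hypothetical helper that writes arbitrary text instead of a CIF document, and returning the final path is an addition for illustration (the documented method returns None).

```python
import gzip
from pathlib import Path


def write_text(text, filename, compress=False):
    """Write text, gzip-compressed if requested or implied by the suffix."""
    path = Path(filename)
    # Append '.gz' when compression is requested but the suffix is missing.
    if compress and path.suffix != ".gz":
        path = path.with_name(path.name + ".gz")
    if path.suffix == ".gz":
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        path.write_text(text, encoding="utf-8")
    return path
```

With this logic, `write_text(data, "model.cif", compress=True)` and `write_text(data, "model.cif.gz")` both produce a compressed `model.cif.gz`.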

modelarchive.modelcif.access.get_table(block, category, items=None)[source]

Get a gemmi.cif.Table from a gemmi.cif.Block for a category.

It is much more convenient to work with gemmi.cif.Table objects than with Gemmi's loops and pairs directly. Imagine a ModelCIF file in which a certain category is represented as a loop, while another ModelCIF file stores the same category as a list of pairs. Both representations may be valid ModelCIF, yet handling them directly would require two separate handlers for essentially the same data.

By using gemmi.cif.Table as a wrapper, loops and pairs can be treated uniformly, allowing you to handle both cases through a single code base.

Gemmi provides two functions to retrieve tables, find_mmcif_category() and find(). The former needs only a category name, while the latter requires a category name and a list of columns to fetch. get_table() hides this difference and returns a table whether or not you provide a list of items. If a list of items is given, the resulting table contains only those columns. If the category cannot be found in block, an empty list is returned, which feels more Pythonic than an empty table of length 0 and makes looping over the result easier.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import access
>>> # get sample CIF data
>>> cif_data = '''data_test
... _ma_qa_metric.id 1
... _ma_qa_metric.description test_score
... loop_
... _ma_qa_metric_local.ordinal_id
... _ma_qa_metric_local.metric_value
... _ma_qa_metric_local.metric_id
... 1 1.0 1
... 2 1.5 1
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> table = access.get_table(block, "_ma_qa_metric")
>>> len(table)
1
>>> table[-1]["description"]
'test_score'
>>> table = access.get_table(
...             block,
...             "_ma_qa_metric_local",
...             items=["metric_id", "metric_value"],
...         )
>>> # table should have 2 columns and 2 rows
>>> table
<gemmi.cif.Table 2 x 2>
>>> # columns are sorted as requested, not as stored
>>> table.tags[0]
'_ma_qa_metric_local.metric_id'
>>> table.tags[1]
'_ma_qa_metric_local.metric_value'

Parameters:

block (gemmi.cif.Block) – CIF data block holding the categories of the CIF document.
category (str) – Category to fetch from block; a single category only, no joins. Gemmi requires category names to end with ., so this function adds it if missing.
items (list[str]) – List of items to fetch as columns. Order of columns (items) follows the provided list. If None, the whole category with all its items as columns will be fetched. In case of None, items are fetched in the same order as they are found in the CIF document.

Returns:

The requested table if category can be found, otherwise empty list.

Return type:

gemmi.cif.Table | list
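The idea of treating pairs and loops uniformly, and of returning an empty list for a missing category, can be illustrated without gemmi. In this sketch, as_rows and its input shape (a dict mapping items to either a single value or a list of values, or None for an absent category) are hypothetical and do not reflect how get_table() is implemented.

```python
def as_rows(category_data):
    """Normalise pairs or a loop into a list of row dicts (sketch)."""
    if not category_data:
        # Absent category: an empty list keeps "for row in ..." loops simple.
        return []
    # Pairs carry single values, loops carry lists; wrap the former.
    columns = {
        item: value if isinstance(value, list) else [value]
        for item, value in category_data.items()
    }
    n_rows = len(next(iter(columns.values())))
    return [
        {item: values[i] for item, values in columns.items()}
        for i in range(n_rows)
    ]


pairs = {"id": "1", "description": "test_score"}
loop = {"ordinal_id": ["1", "2"], "metric_value": ["1.0", "1.5"]}

# One code path handles both representations and the missing category.
for data in (pairs, loop, None):
    for row in as_rows(data):
        print(row)
```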

Editing ModelCIF (`modelarchive.modelcif.edit`)

Functionality to extend and modify ModelCIF files.

exception modelarchive.modelcif.edit.MoveIdxToFarError(category, idx)[source]

Bases: RuntimeError

Exception raised when a target position lies beyond the end of the block's category list.

Primarily used by move_category() when attempting to move a category to a position that does not exist within the corresponding gemmi.cif.Block. For example, if the gemmi.cif.Block object contains 10 categories, trying to move a category to position 15 will fail and should raise this exception.

Parameters:

category (str) – Name of the category that could not be moved.
idx (int) – Target position to which the category was to be moved.

exception modelarchive.modelcif.edit.NotFoundCategoryError(category=None, msg=None)[source]

Bases: NotFoundError

Exception if a category can not be found.

This exception should be raised when a function expects a specific category to exist in the corresponding gemmi.cif.Block, but the category cannot be retrieved.

category

Tuple of category names that could not be found.

Type:: tuple

Parameters:

category (str|list) – Name of the category that could not be found. Using a list of categories writes the generated message in plural.
msg (str) – Optional alternative error message.

exception modelarchive.modelcif.edit.NotFoundError(subject, value, msg)[source]

Bases: RuntimeError

General exception for ‘things’ that can not be found.

If msg is omitted, generates a message “<SUBJECT> ‘<VALUE>’ does not exist”. If value is a list with more than one element, the message will be written in plural mode. If subject is a list or tuple, a second element will be used as plural of the subject.

This exception should not be raised directly, it exists to define other “NotFound” exceptions inheriting from it.

Parameters:

subject (str|list|tuple) – The ‘thing’ that can not be found, used in the generated message. If a list or tuple, a second element is used as plural.
value (str|list) – The name of what can not be found, used in the generated message. Provide a list of values to get a message in plural form.
msg (str) – Optional alternative error message.
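The message generation rules can be sketched as a plain function. not_found_message is a hypothetical helper; the singular wording follows the template above, while the plural wording ("do not exist", comma-separated values) is an assumption.

```python
def not_found_message(subject, value, msg=None):
    """Build a 'not found' message following the rules above (sketch)."""
    if msg is not None:
        return msg
    values = list(value) if isinstance(value, (list, tuple)) else [value]
    plural = len(values) > 1
    if isinstance(subject, (list, tuple)):
        # A second element, if present, is the plural form of the subject.
        noun = subject[1] if plural and len(subject) > 1 else subject[0]
    else:
        noun = subject
    quoted = ", ".join(f"'{v}'" for v in values)
    # Assumption: the verb follows the number of values.
    verb = "do" if plural else "does"
    return f"{noun} {quoted} {verb} not exist"


singular = not_found_message(("Category", "Categories"), "_entity")
plural = not_found_message(("Category", "Categories"), ["_entity", "_entry"])
```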

exception modelarchive.modelcif.edit.NotFoundItemError(item=None, msg=None)[source]

Bases: NotFoundError

Exception if an item can not be found.

This exception should be raised when a function expects a specific item to exist in the corresponding CIF category, but the item cannot be retrieved.

item

Tuple of item names that could not be found.

Type:: tuple

Parameters:

item (str|list) – Name of the item that could not be found. Use as “<CATEGORY>.<ITEM>” for clarity. Using a list of items writes the generated message in plural.
msg (str) – Optional alternative error message.

modelarchive.modelcif.edit.add_category(block, category, item_data, index=None, mod_cat_itms=None, raw=False)[source]

Introduce a new category to a gemmi.cif.Block and populate it.

Add category to block using data from item_data. item_data is a dictionary with the CIF item names as keys and the item values as values. For single values, named pairs are created; for lists with more than one value, a loop is created. index can be used to place the category at a certain position. Use an integer for a specific place in the category list or a string of form [after|before]:<CATEGORY> for relative positioning.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import edit
>>> # start with an empty CIF document
>>> cif_data = '''data_test
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> # let's add entities
>>> _ = edit.add_category(
...     block,
...     "_entity",
...     {
...         "id": [1, 2, 3],
...         "type": ["polymer", "non-polymer", "water"],
...     },
... )
>>> print(block.as_string())
data_test
loop_
_entity.id
_entity.type
1 polymer
2 non-polymer
3 water

>>> # let's add an "_entry" ID before the entities
>>> _ = edit.add_category(
...         block, "_entry", {"id": "1FOO"}, index="before:_entity"
...     )
>>> print(block.as_string())
data_test
_entry.id 1FOO

loop_
_entity.id
_entity.type
1 polymer
2 non-polymer
3 water

Parameters:

block (gemmi.cif.Block) – CIF data block holding the categories of the CIF document.
category (str) – Name of the new category to be created.
item_data (dict[str, list[Any]|Any]) – Attributes and values to be added to the new category. Dictionary with item names as keys. Values are either a list of values or a single value. If a single value is provided (or a list containing only one element), a named key-value pair is created instead of a loop.
index (int|str) – Placement of the new category within block. This can be an integer for exact positioning, or a string of form [after|before]:<CATEGORY> for relative positioning. In relative positioning, <CATEGORY> specifies the name of the category before or after which category will be placed.
mod_cat_itms (dict[str, set[str]] | None) – A record of what has been modified: a dictionary mapping each category to the set of items changed. Items that already have the value of the update are not recorded. This is meant for the revision history; most likely you can ignore it.
raw (bool, optional) – If True, do not force quoting strings containing whitespace.

Returns:

A record of what has been modified. To be used with a revision history, most likely you can ignore it.

Return type:

dict[str, set[str]]

Raises:

MoveIdxToFarError – If the target position is outside block. For example, if block contains 10 categories, trying to create a category at position 15 will raise this error.
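The choice between named pairs and a loop can be illustrated by rendering item_data as CIF text directly. render_category is a hypothetical helper that skips quoting and positioning entirely; it only mirrors the rule that one value per item yields pairs and several values yield a loop.

```python
def render_category(category, item_data):
    """Render item_data as CIF text: pairs for one row, a loop otherwise."""
    # Treat scalars as single-element lists.
    columns = {
        item: value if isinstance(value, list) else [value]
        for item, value in item_data.items()
    }
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all items need the same number of values")
    n_rows = lengths.pop()
    if n_rows == 1:
        return "\n".join(
            f"{category}.{item} {values[0]}"
            for item, values in columns.items()
        )
    lines = ["loop_"] + [f"{category}.{item}" for item in columns]
    for i in range(n_rows):
        lines.append(" ".join(str(values[i]) for values in columns.values()))
    return "\n".join(lines)


entry_text = render_category("_entry", {"id": "1FOO"})
entity_text = render_category(
    "_entity",
    {"id": [1, 2, 3], "type": ["polymer", "non-polymer", "water"]},
)
```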

modelarchive.modelcif.edit.add_column(block, category, item, callback, pos=-1, raw=False)[source]

Extend a category with a new item and populate it using a callback.

Thinking of ModelCIF categories as tables, this function adds a new column (item) to a table that already exists in block. A callback function, to be provided, is executed with each row to compute the value for the new column. This avoids having a static list to fetch the values from.

make_res_per_chain_counter() is an example of a stateful implementation of a working callback.

The callback has to be of form function(row) and return the value to be set for the item in the given row.

Examples

>>> # Add "ndb_seq_num" to "_pdbx_nonpoly_scheme" including values
>>> # Reminder: "ndb_seq_num" -> column, "_pdbx_nonpoly_scheme" -> table
>>> from gemmi import cif
>>> from modelarchive.modelcif import edit
>>> cif_data = '''data_test
... loop_
... _pdbx_nonpoly_scheme.asym_id
... _pdbx_nonpoly_scheme.entity_id
... _pdbx_nonpoly_scheme.mon_id
... _pdbx_nonpoly_scheme.pdb_seq_num
... C 1 ATP 1
... D 2 HEM 1
... E 3 HOH 1
... E 3 HOH 2
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> edit.add_column(
...     block,
...     "_pdbx_nonpoly_scheme",
...     "ndb_seq_num",
...     edit.make_res_per_chain_counter("asym_id"),
...     pos=-1,
... )
>>> print(block.as_string())
data_test
loop_
_pdbx_nonpoly_scheme.asym_id
_pdbx_nonpoly_scheme.entity_id
_pdbx_nonpoly_scheme.mon_id
_pdbx_nonpoly_scheme.pdb_seq_num
_pdbx_nonpoly_scheme.ndb_seq_num
C 1 ATP 1 1
D 2 HEM 1 1
E 3 HOH 1 1
E 3 HOH 2 2

>>> # "ndb_seq_num" was appended as last column according to pos=-1

Parameters:

block (gemmi.cif.Block) – block holding the categories of the CIF document.
category (str) – The CIF category (table) to add the item to.
item (str) – The item (column) to be added.
callback (Callable[[gemmi.cif.Table.Row], int]) – Function to be executed to compute values for each row of the new column.
pos (int) – Position to insert the column at. Default is at the end (-1). Inserting at the beginning requires pos=1.
raw (bool) – If True, do not force quoting strings containing whitespace.

Returns:

None

Raises:

NotFoundCategoryError – If category can not be found in block.
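To show how a callback drives the new column, here is a gemmi-free sketch on a table held as plain lists. add_column_sketch and copy_from are hypothetical names; rows are passed to the callback as dicts rather than gemmi.cif.Table.Row objects, and only pos=-1 and 1-based positive positions are handled.

```python
def add_column_sketch(tags, rows, item, callback, pos=-1):
    """Insert a computed column into a table of plain lists (sketch).

    tags: list of column names, rows: list of lists of values.
    pos is 1-based as in the documented function; -1 appends.
    """
    index = len(tags) if pos == -1 else pos - 1
    if not 0 <= index <= len(tags):
        raise IndexError(f"position {pos} is outside the table")
    # Compute every value from the unmodified row before inserting.
    values = [callback(dict(zip(tags, row))) for row in rows]
    tags.insert(index, item)
    for row, value in zip(rows, values):
        row.insert(index, value)


def copy_from(src_item):
    """Return a callback copying the value of src_item in the same row."""
    return lambda row: row[src_item]


tags = ["label_comp_id", "auth_seq_id"]
rows = [["MET", "1"], ["ALA", "2"]]
add_column_sketch(tags, rows, "auth_comp_id", copy_from("label_comp_id"), pos=1)
```

Passing a function instead of a list of values keeps the caller from having to build and align a value list by hand.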

modelarchive.modelcif.edit.add_rows(block, category, row_dict, ordinal_item='ordinal', mod_cat_itms=None, raw=False)[source]

Add rows to a category in block using an item-dictionary.

Thinking of ModelCIF categories as tables, this function adds new rows to a table (category) in block. If category does not yet exist, it will be created. If multiple rows are provided, the new category will be created as a loop, otherwise as pairs. When adding row(s) to an existing pairs-category, the function will convert the category into a loop.

Input data is provided via row_dict. It must be a dict of lists (for a single row, values may be single elements instead of lists). Item names are used as keys in row_dict. Missing items that exist in category will be added as . in new rows. The order of items in row_dict can be arbitrary; this function will align them with the existing order in category.

ordinal_item describes a unique numerical ID for each row. If provided, the function will automatically increment it for new rows. In ModelCIF, this column is often called ordinal though some categories use different names.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import edit
>>> # start with an empty CIF document
>>> cif_data = '''data_test
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> # Let's add an entity to create a category in block. ordinal_item
>>> # is set to None on purpose to show how it works later.
>>> _ = edit.add_rows(
...     block,
...     "_entity",
...     {"id": 1, "details": "Protein", "type": "polymer"},
...     ordinal_item=None,
... )
>>> # see how the _entity category is created as a set of pairs
>>> print(block.as_string())
data_test
_entity.id 1
_entity.details Protein
_entity.type polymer

>>> # Add a second row (pairs will turn into a loop). This time, include
>>> # ordinal_item to let the function take care of incrementing IDs.
>>> _ = edit.add_rows(
...     block,
...     "_entity",
...     {"details": ["H2O"], "type": ["water"]},
...     ordinal_item="id",
... )
>>> # Now _entity is a loop and _entity.id was incremented automatically
>>> print(block.as_string())
data_test
loop_
_entity.id
_entity.details
_entity.type
1 Protein polymer
2 H2O water

>>> # As a last example, add multiple new rows at once but skip the
>>> # 'details' column.
>>> _ = edit.add_rows(
...     block,
...     "_entity",
...     {"type": ["polymer", "polymer"]},
...     ordinal_item="id",
... )
>>> # Now there are two more polymer entities in the loop but since
>>> # the 'details' information was missing, the function added '.' in
>>> # those fields.
>>> print(block.as_string())
data_test
loop_
_entity.id
_entity.details
_entity.type
1 Protein polymer
2 H2O water
3 . polymer
4 . polymer

Parameters:

block (gemmi.cif.Block) – CIF data block holding the categories of the CIF document.
category (str) – Name of the category to which row(s) will be added.
row_dict (dict[str, list | Any]) – Row data to be added to category. Keys are item names of the category. Values must be lists when adding multiple rows. For a single row, values may be provided as scalars instead of lists. If an item is missing from row_dict but exists in the category, ‘.’ will be assigned for that item in the new row(s).
ordinal_item (str | None) – If the category includes an ordinal (in database terms a primary key), this identifies the item name of it. If ordinal_item is provided, the latest ordinal will be read from the category and automatically incremented for new rows. Use None in case the category does not have an ordinal or if the ordinal should be set explicitly. The ordinal does not need to be included in row_dict.
mod_cat_itms (dict[str, set[str]] | None) – A record of what has been modified: a dictionary mapping each category to the set of items changed. Items that already have the value of the update are not recorded. This is meant for the revision history; most likely you can ignore it.
raw (bool, optional) – If True, do not force quoting strings containing whitespace.

Returns:

A record of what has been modified. To be used with a revision history, most likely you can ignore it.

Return type:

dict[str, set[str]]

Raises:

ValueError – In case item lists in row_dict are not of equal length.
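The alignment of new rows with existing items, the '.' fill-in, and the ordinal increment can be sketched on a dict of lists. add_rows_sketch is a hypothetical helper; taking the maximum existing ordinal as the "latest" one, keeping an explicitly supplied ordinal, and filling '.' into old rows for items new to the category are assumptions.

```python
def add_rows_sketch(category, row_dict, ordinal_item=None):
    """Append rows to a dict-of-lists category in place (sketch)."""
    # Treat scalars as single-element lists.
    columns = {
        item: list(value) if isinstance(value, list) else [value]
        for item, value in row_dict.items()
    }
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("item lists in row_dict are not of equal length")
    n_new = lengths.pop()
    n_old = len(next(iter(category.values()))) if category else 0
    if ordinal_item is not None and ordinal_item not in columns:
        # Assumption: continue counting from the highest existing ordinal.
        last = max((int(v) for v in category.get(ordinal_item, [])), default=0)
        columns[ordinal_item] = [last + i for i in range(1, n_new + 1)]
    # Assumption: items new to the category get '.' in the existing rows.
    for item in columns:
        category.setdefault(item, ["."] * n_old)
    # Items missing from row_dict get '.' in the new rows.
    for item, values in category.items():
        values.extend(columns.get(item, ["."] * n_new))


entity = {"id": [1], "details": ["Protein"], "type": ["polymer"]}
add_rows_sketch(entity, {"type": ["polymer", "polymer"]}, ordinal_item="id")
```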

modelarchive.modelcif.edit.make_copy_value_in_row(src_item)[source]

Returns a callback that returns a value from the same row.

Supposed to be used in functions that require a callback, e.g. add_column().

Meant to copy values over from the same row. That is handy when a missing column needs to be populated with values, e.g. if author-defined values are missing but required, copy over the “label” fields.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import edit
>>> # Add _atom_site.auth_comp_id from _atom_site.label_comp_id
>>> # Note: category _atom_site is heavily cropped in this example to
>>> # keep it concise
>>> CIF_DATA = '''data_test
... #
... loop_
... _atom_site.group_PDB
... _atom_site.type_symbol
... _atom_site.label_atom_id
... _atom_site.label_comp_id
... _atom_site.label_asym_id
... _atom_site.auth_seq_id
... ATOM C CA MET A 1
... ATOM C CA ALA A 2
... ATOM C CA THR A 3
... ATOM C CA ALA A 4
... ATOM C CA ALA A 5
... ATOM C CA TYR A 6
... '''
>>> block = cif.read_string(CIF_DATA).sole_block()
>>> edit.add_column(
...     block,
...     "_atom_site",
...     "auth_comp_id",
...     edit.make_copy_value_in_row("label_comp_id"),
... )
>>> print(block.as_string())
data_test
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.auth_comp_id
ATOM C CA MET A 1 MET
ATOM C CA ALA A 2 ALA
ATOM C CA THR A 3 THR
ATOM C CA ALA A 4 ALA
ATOM C CA ALA A 5 ALA
ATOM C CA TYR A 6 TYR

Parameters:: src_item (str) – the column name to copy from (source).
Returns:: Callback function usable as callback in add_column().
Return type:: Callable[[gemmi.cif.Table.Row], str]

Note

This function may be outsourced to a supporting module, if edit gets too big.

modelarchive.modelcif.edit.make_res_per_chain_counter(asym_id_item)[source]

Returns a stateful callback function counting residues per chain.

make_res_per_chain_counter() returns a function that can be used as callback in add_column().

The returned callback assigns consecutive residue numbers within each chain of a table, starting at 1. When the chain identifier changes between two rows while iterating over the table, the counter is reset to 1.

Examples

>>> # Add item "ndb_seq_num" to category "_pdbx_nonpoly_scheme"
>>> # Reminder: "ndb_seq_num" -> column, "_pdbx_nonpoly_scheme" -> table
>>> from gemmi import cif
>>> from modelarchive.modelcif import edit
>>> cif_data = '''data_test
... loop_
... _pdbx_nonpoly_scheme.asym_id
... _pdbx_nonpoly_scheme.auth_seq_num
... _pdbx_nonpoly_scheme.entity_id
... _pdbx_nonpoly_scheme.mon_id
... _pdbx_nonpoly_scheme.pdb_seq_num
... C 1 3  ATP 1
... D 1 4  HEM 1
... E 1 5  HOH 1
... E 2 5  HOH 2
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> # Using make_res_per_chain_counter() in add_column() will add a
>>> # column to the loop_ and populate it with values:
>>> edit.add_column(
...     block,
...     "_pdbx_nonpoly_scheme",
...     "ndb_seq_num",
...     edit.make_res_per_chain_counter("asym_id"), # CALLBACK
...     pos=5,
... )
>>> print(block.as_string())
data_test
loop_
_pdbx_nonpoly_scheme.asym_id
_pdbx_nonpoly_scheme.auth_seq_num
_pdbx_nonpoly_scheme.entity_id
_pdbx_nonpoly_scheme.mon_id
_pdbx_nonpoly_scheme.ndb_seq_num
_pdbx_nonpoly_scheme.pdb_seq_num
C 1 3 ATP 1 1
D 1 4 HEM 1 1
E 1 5 HOH 1 1
E 2 5 HOH 2 2

>>> # "ndb_seq_num" is inserted as fifth column. The ATP in chain C
>>> # ("asym_id") gets "ndb_seq_num" 1 and the HEM in chain D also gets
>>> # "ndb_seq_num" 1. The two HOH, which share chain E, get
>>> # "ndb_seq_num" 1 and 2. So for each chain, counting starts at 1
>>> # and the counter increases by 1 per compound in a chain.

Parameters:: asym_id_item (str) – Item name hosting the chain name.
Returns:: Callback function usable as callback in add_column().
Return type:: Callable[[gemmi.cif.Table.Row], int]

Note

This function may be outsourced to a supporting module, if edit gets too big.
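A stateful callback of this kind is typically written as a closure. The following is a minimal sketch of the counting behaviour described above, not the module's actual code; make_counter is a hypothetical name and rows are plain dicts instead of gemmi.cif.Table.Row objects.

```python
def make_counter(asym_id_item):
    """Return a callback numbering rows per chain, starting at 1 (sketch)."""
    last_chain = None
    count = 0

    def callback(row):
        nonlocal last_chain, count
        chain = row[asym_id_item]
        # A change of chain between two consecutive rows resets the counter.
        if chain != last_chain:
            last_chain = chain
            count = 0
        count += 1
        return count

    return callback


counter = make_counter("asym_id")
numbers = [counter({"asym_id": chain}) for chain in ["C", "D", "E", "E"]]
```

Since the state only tracks the previous row, a chain that reappears after another chain starts again at 1. Each call to the factory yields an independent counter, so a fresh callback should be created per table.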

modelarchive.modelcif.edit.move_category(block, cat, idx)[source]

Move a category to a new position in a gemmi.cif.Block.

By design, ModelCIF files are not intended to be read or edited manually. Instead, dedicated applications should handle the format, providing functionality to view and modify the data. However, at ModelArchive we occasionally need to open ModelCIF files in an editor to inspect specific details. In such cases, it is helpful to have related categories grouped together, reducing the need to jump back and forth between different categories. This asks for a function to reposition categories within a ModelCIF file.

move_category() takes category cat and moves it to position idx in the CIF block block. The parameter idx is somewhat special: it can be a plain integer index, specifying the exact position to move cat to. That comes in handy when placing categories at the beginning (idx=0) or at the end (idx=-1) of block. However, an absolute index is often less useful in practice, as categories are typically organised relative to related categories. For this purpose, idx accepts a special syntax: [after|before]:<CATEGORY>. For example, to put category _ma_qa_metric in front of category _ma_qa_metric_local, use idx="before:_ma_qa_metric_local" with cat="_ma_qa_metric".

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import edit
>>> # get sample CIF data
>>> cif_data = '''data_test
... _ma_qa_metric.id 1
... _ma_qa_metric.description test_score
... loop_
... _ma_qa_metric_local.ordinal_id
... _ma_qa_metric_local.metric_value
... _ma_qa_metric_local.metric_id
... 1 1.0 1
... 2 1.5 1
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> # move _ma_qa_metric_local to BEFORE _ma_qa_metric
>>> edit.move_category(
...     block,
...     "_ma_qa_metric_local",
...     "before:_ma_qa_metric",
... )
>>> print(block.as_string())
data_test
loop_
_ma_qa_metric_local.ordinal_id
_ma_qa_metric_local.metric_value
_ma_qa_metric_local.metric_id
1 1.0 1
2 1.5 1

_ma_qa_metric.id 1
_ma_qa_metric.description test_score

>>> # move _ma_qa_metric to the front
>>> edit.move_category(block, "_ma_qa_metric", 0)
>>> print(block.as_string())
data_test
_ma_qa_metric.id 1
_ma_qa_metric.description test_score

loop_
_ma_qa_metric_local.ordinal_id
_ma_qa_metric_local.metric_value
_ma_qa_metric_local.metric_id
1 1.0 1
2 1.5 1

Parameters:

block (gemmi.cif.Block) – CIF block to operate on.
cat (str) – Name of the CIF category to be moved.
idx (int|str) – Position to move cat to. This can be an integer for exact positioning, or a string of form [after|before]:<CATEGORY> for relative positioning. In relative positioning, <CATEGORY> specifies the name of the category before or after which cat will be placed. If <CATEGORY> can not be found, cat will not be relocated.

Returns:

None

Raises:

NotFoundCategoryError – If cat can not be found in block.
MoveIdxToFarError – If the target position is outside block. For example, if block contains 10 categories, trying to move a category to position 15 will raise this error.
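The positioning rules can be illustrated on a plain list of category names. move_name is a hypothetical helper; LookupError and IndexError stand in for NotFoundCategoryError and MoveIdxToFarError, and the treatment of negative indices other than -1 is an assumption.

```python
def move_name(names, cat, idx):
    """Move cat within the list names, by index or relative spec (sketch)."""
    if cat not in names:
        raise LookupError(f"category '{cat}' does not exist")
    if isinstance(idx, str):
        where, _, anchor = idx.partition(":")
        if where not in ("after", "before"):
            raise ValueError(f"unknown position spec '{idx}'")
        if anchor not in names or anchor == cat:
            return  # unknown anchor: cat is not relocated
        names.remove(cat)
        target = names.index(anchor) + (1 if where == "after" else 0)
    else:
        if not -len(names) <= idx < len(names):
            raise IndexError(f"position {idx} is outside the block")
        names.remove(cat)
        # After removal, -1 maps to the end of the shortened list.
        target = idx if idx >= 0 else len(names) + idx + 1
    names.insert(target, cat)


names = ["_entry", "_ma_qa_metric", "_ma_qa_metric_local"]
move_name(names, "_ma_qa_metric_local", "before:_ma_qa_metric")
```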

modelarchive.modelcif.edit.sort(table_or_block, item, category=None, key=None)[source]

Sort a gemmi.cif.Table or gemmi.cif.Block in-place by the given item.

This may be useful after editing a table, to sort it by a selected column (e.g. the ordinal). Numerical values are sorted numerically, all others lexicographically. key can take a function to extract a comparison key from each row. This is helpful for cases like _citation.id, where special values (e.g. id=primary) might need to be placed first.

Works on an already loaded gemmi.cif.Table, or on a gemmi.cif.Block (requires category) to sort many categories one after another in less code.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import access, edit
>>> # get sample CIF data
>>> CIF_DATA = '''data_test
... loop_
... _citation.id
... _citation.journal_full
... _citation.title
... _citation.year
... _citation.journal_volume
... 3 "The Lord of the Rings" "Return of the King" 1955 3
... 1 "The Lord of the Rings" "The Fellowship of the Ring" 1954 2
... 2 "The Lord of the Rings" "The Two Towers" 1954 1
... primary . "The Hobbit or There and Back Again" 1937 .
... '''
>>> block = cif.read_string(CIF_DATA).sole_block()
>>> table = access.get_table(block, "_citation")
>>> # first sort without a key function
>>> edit.sort(table, "id")
>>> # This sorts the LOTR books properly, but the 'primary' book is at
>>> # the bottom
>>> print(block.as_string())
data_test
loop_
_citation.id
_citation.journal_full
_citation.title
_citation.year
_citation.journal_volume
1 "The Lord of the Rings" "The Fellowship of the Ring" 1954 2
2 "The Lord of the Rings" "The Two Towers" 1954 1
3 "The Lord of the Rings" "Return of the King" 1955 3
primary . "The Hobbit or There and Back Again" 1937 .

>>> # sort again (this time by block), with a lambda that puts
>>> # 'primary' first
>>> edit.sort(
...     block,
...     "id",
...     category="_citation",
...     key=lambda row: (
...         (0, "") if row["id"] == "primary" else (1, row["id"])
...     ),
... )
>>> print(block.as_string())
data_test
loop_
_citation.id
_citation.journal_full
_citation.title
_citation.year
_citation.journal_volume
primary . "The Hobbit or There and Back Again" 1937 .
1 "The Lord of the Rings" "The Fellowship of the Ring" 1954 2
2 "The Lord of the Rings" "The Two Towers" 1954 1
3 "The Lord of the Rings" "Return of the King" 1955 3

Parameters:

table_or_block (gemmi.cif.Table | gemmi.cif.Block) – Object to be sorted. On gemmi.cif.Block, the corresponding table will be loaded using category.
item (str) – Name of the column (item) in the table to sort by.
category (str, optional) – Name of the category when sorting a gemmi.cif.Block.
key (callable, optional) – Function taking a row and returning a sortable value. Defaults to lexicographic row[item] with a fix for numerical sorting.

Returns:

None

Raises:

ValueError – If table_or_block is a gemmi.cif.Block object but no category was provided.
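
The default key is described above as lexicographic with a fix for numerical sorting. The module's own implementation is not reproduced here; the following is only a minimal sketch of how such a key could behave. It works on plain strings instead of gemmi rows, and numeric_aware_key is a hypothetical helper, not part of modelarchive.

```python
def numeric_aware_key(value):
    """Hypothetical sort key: integer-like values sort numerically, text last."""
    try:
        # Integer-like strings are compared by their numeric value.
        return (0, int(value), "")
    except ValueError:
        # Everything else falls back to plain string comparison.
        return (1, 0, value)


ids = ["10", "primary", "2", "1"]
numeric_sorted = sorted(ids, key=numeric_aware_key)
plain_sorted = sorted(ids)
```

With a purely lexicographic sort, "10" lands before "2"; the numeric-aware key keeps the numbers in counting order and leaves non-numeric IDs such as "primary" at the end, matching the first sort() call in the example above.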

Fixing AlphaFold 2 ModelCIF files (`modelarchive.modelcif.fix_af2`)

ModelCIF files generated by AlphaFold 2 deviate from the official ModelCIF definition dictionary in specific cases. This module provides functions to correct these deviations.

class modelarchive.modelcif.fix_af2.GlobalConfRankMultimer(value)[source]

Bases: Global, NormalizedScore

Default ranking score used by AlphaFold-Multimer

name = 'ranking-confidence (ipTM*0.8+pTM*0.2)'

software = None
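
The metric name spells out how the score is composed. As a small worked example (a sketch only, the function name ranking_confidence is hypothetical and not part of fix_af2), the weighted sum can be computed like this:

```python
def ranking_confidence(iptm, ptm):
    """Weighted sum as given in the metric name: ipTM*0.8 + pTM*0.2."""
    return 0.8 * iptm + 0.2 * ptm


# Made-up input values, chosen only to illustrate the arithmetic.
score = ranking_confidence(iptm=0.5, ptm=0.75)
```

For these inputs the result is approximately 0.55 (0.40 from ipTM plus 0.15 from pTM), so the interface score dominates the ranking.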

class modelarchive.modelcif.fix_af2.GlobalIpTM(value)[source]

Bases: Global, IpTM

Predicted protein-protein interface score based on TM-score in [0,1]

name = 'ipTM'

software = None

class modelarchive.modelcif.fix_af2.GlobalPLDDT(value)[source]

Bases: Global, PLDDT

Predicted accuracy according to the CA-only lDDT in [0,100]

name = 'pLDDT'

software = None

class modelarchive.modelcif.fix_af2.GlobalPTM(value)[source]

Bases: Global, PTM

Predicted accuracy according to the TM-score in [0,1]

name = 'pTM'

software = None

class modelarchive.modelcif.fix_af2.LocalPLDDT(residue, value)[source]

Bases: Local, PLDDT

Predicted accuracy according to the CA-only lDDT in [0,100]

name = 'pLDDT'

software = None

class modelarchive.modelcif.fix_af2.LocalPairwisePAE(residue1, residue2, value)[source]

Bases: LocalPairwise, PAE

Predicted aligned error (in Angstroms)

name = 'PAE'

software = None

modelarchive.modelcif.fix_af2.assemble_modelcif_software(soft_dict, params_dict)[source]

Create a modelcif.SoftwareWithParameters instance from dictionaries.

Parameters:

soft_dict (dict) – Software metadata as returned by functions such as get_colabfold_software(). Must contain the keys name, classification, description, location, type, version, and citation.
params_dict (dict) – Software parameters, where each key is passed as the parameter name and each value as the parameter value to modelcif.SoftwareParameter.

Returns:

A ModelCIF software object with associated parameters.

Return type:

modelcif.SoftwareWithParameters
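
Since soft_dict must carry a fixed set of keys, it can help to check a dictionary before handing it over. The following is a minimal sketch of such a check; missing_software_keys is a hypothetical helper, and how assemble_modelcif_software() itself reacts to missing keys is not documented here.

```python
# Keys listed as mandatory for soft_dict in the parameter description above.
REQUIRED_SOFTWARE_KEYS = (
    "name",
    "classification",
    "description",
    "location",
    "type",
    "version",
    "citation",
)


def missing_software_keys(soft_dict):
    """Return the required keys that are absent from soft_dict, in order."""
    return [key for key in REQUIRED_SOFTWARE_KEYS if key not in soft_dict]


# Illustrative, incomplete dictionary; the values are placeholders.
incomplete = {"name": "ExampleTool", "version": None, "type": "package"}
missing = missing_software_keys(incomplete)
```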

modelarchive.modelcif.fix_af2.get_af2_config(af_version, af_params=None, custom_ranking=None, up_version=None, up_rel_date=None, pdb_rel_date=None)[source]

Get configuration data for an AlphaFold 2 modelling run.

Derives modelling settings from the provided AlphaFold 2 version and parameters, builds a human-readable description of the run, and returns a configuration dictionary for use by downstream functions.

Parameters:

af_version (str) – AlphaFold 2 version string (e.g. "2.3.2").
af_params (dict, optional) – Non-default AlphaFold 2 parameters. Recognised keys include model_preset, db_preset, num_multimer_predictions_per_model, models_to_relax, run_relax, max_template_date, and num_ensemble. Defaults to an empty dict if not provided.
custom_ranking (str, optional) – Custom model ranking expression. If not provided, defaults to "pLDDT" for monomer runs and "ipTM*0.8+pTM*0.2" for multimer runs.
up_version (str, optional) – UniProt release in "YYYY_MM" format, current at the time of AF2 installation (see https://www.uniprot.org/release-notes).
up_rel_date (datetime.date, optional) – Release date corresponding to up_version.
pdb_rel_date (datetime.date, optional) – PDB release date current at the time of AF2 installation. Relevant for multimer runs using templates.

Returns:

Configuration data for downstream functions. Keys:

af_params (dict): Parameters as passed (or empty dict).

af_version (str): AlphaFold 2 version string as passed.

description (str): Human-readable run description.

use_templates (bool): Whether templates were used.

use_small_bfd (bool): Whether the reduced BFD database setting was used.

use_multimer (bool): Whether multimer mode was used.

up_version (str or None): UniProt release as passed.

up_rel_date (datetime.date or None): UniProt release date as passed.

pdb_rel_date (datetime.date or None): PDB release date as passed.

seq_dbs (list[modelcif.ReferenceDatabase]): Sequence DB objects.

Return type:

dict

modelarchive.modelcif.fix_af2.get_af2_sequence_dbs(config_data)[source]

Get AF2 sequence databases and store them in config_data.

Builds a list of modelcif.ReferenceDatabase objects based on the AlphaFold 2 configuration and writes it to config_data["seq_dbs"]. The selection depends on the database preset, AF2 version, and whether multimer mode or templates are used.

Parameters:

config_data (dict) –

AF2 configuration data, as returned by get_af2_config(). Relevant keys:

af_version (str): AlphaFold 2 version string; determines which MGnify and UniRef variants are added.
use_small_bfd (bool): If True, uses Reduced BFD instead of full BFD.
use_multimer (bool): If True, adds TrEMBL, Swiss-Prot, and PDB seqres databases.
use_templates (bool): If True, adds a PDB sequence database (PDB seqres for multimer, PDB70 for monomer).
up_version (str or None): UniProt release version, passed to the version attribute of UniRef90, TrEMBL, and Swiss-Prot database objects.
up_rel_date (datetime.date or None): UniProt release date, passed to the release_date attribute of UniRef90, TrEMBL, and Swiss-Prot database objects.
pdb_rel_date (datetime.date or None): PDB release date, passed to the release_date attribute of the PDB seqres database object.

Returns:

Results are written to config_data["seq_dbs"] as a list of modelcif.ReferenceDatabase objects.

Return type:

None

modelarchive.modelcif.fix_af2.get_af2_software(version=None, is_multimer=False)[source]

Get AlphaFold 2 as a dict for creating a software object.

Parameters:

version (str) – Version of AlphaFold 2. Should only be None if the version is genuinely unavailable.
is_multimer (bool) – If True, return metadata for AlphaFold-Multimer instead of AlphaFold 2.

Returns:

A dictionary with software metadata suitable for creating a ModelCIF software object. The name and citation entries differ depending on is_multimer.

Return type:

dict

modelarchive.modelcif.fix_af2.get_cf_config(cf_config, ur30_db_version=None, tpl_db=None, tpl_db_version=None)[source]

Process a ColabFold configuration into a standardised data dictionary.

Parameters:

cf_config (dict) – Raw ColabFold configuration data, typically read from a ColabFold configuration file. Must contain the keys version, msa_mode, model_type, num_recycles, use_templates, and rank_by. Optional keys include commit, pair_mode, recycle_early_stop_tolerance, stop_at_score, num_seeds, num_models, use_amber, and num_relax.
ur30_db_version (str, optional) – Version of the UniRef30 database used. Should only be None if the database was not used.
tpl_db (str, optional) – Template database used. Accepted values are "PDB70", "PDB100", or None if no template database was used.
tpl_db_version (str, optional) – Version of the template database used. Should only be None if the database was not used.

Returns:

A dictionary with processed ColabFold configuration data for further use in model preparation.

Return type:

dict

Raises:

ValueError – If msa_mode is not one of the known values.
ValueError – If model_type is not one of the known values.
ValueError – If rank_by is not one of the known values.

modelarchive.modelcif.fix_af2.get_cf_db_versions(dt, num_days_unk=1)[source]

Get ColabFold database versions for a given date.

Returns the UniRef30, template database name, and template database version used by the ColabFold MSA server on a given date. Based on https://github.com/sokrypton/ColabFold/wiki/MSA-Server-Database-History.

Parameters:

dt (datetime.date) – Date for which to look up the database versions.
num_days_unk (int) – Number of days around a database switch date within which the result is considered unknown. Defaults to 1.

Returns:

A 3-tuple of (ur30_db_version, tpl_db, tpl_db_version), each a str. Values are set to "UNK" if dt falls within num_days_unk days of a switch date, if the template database version is unknown, or if no matching date range is found.

Return type:

tuple
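
The lookup described above can be pictured as a search over date ranges with an "unknown" window around each switch. The sketch below illustrates that idea for a single value. The history table holds placeholder dates and labels, NOT the real ColabFold database history, and lookup_version is a hypothetical helper. Treating the window as inclusive on both sides is an assumption of this sketch.

```python
import datetime

# Placeholder history: (first day a version is valid, label). Not real data.
HISTORY = [
    (datetime.date(2020, 1, 1), "db-A"),
    (datetime.date(2021, 6, 1), "db-B"),
]


def lookup_version(dt, history=HISTORY, num_days_unk=1):
    """Return the label valid on dt, or "UNK" near a switch or out of range."""
    # Every entry after the first marks a switch from one version to the next.
    for switch_date, _ in history[1:]:
        if abs((dt - switch_date).days) <= num_days_unk:
            return "UNK"
    label = "UNK"  # stays "UNK" if dt predates all known ranges
    for start, name in history:
        if dt >= start:
            label = name
    return label


before_switch = lookup_version(datetime.date(2021, 3, 1))
on_switch = lookup_version(datetime.date(2021, 6, 1))
after_switch = lookup_version(datetime.date(2021, 6, 3))
too_early = lookup_version(datetime.date(2019, 1, 1))
```

Returning "UNK" instead of guessing keeps uncertain provenance visible in the deposited file, which is usually preferable to recording a possibly wrong database version.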

modelarchive.modelcif.fix_af2.get_cf_sequence_dbs(config_data)[source]

Get ColabFold sequence databases and store them in config_data.

Looks up a hardcoded list of known ColabFold sequence databases and populates config_data["seq_dbs"] with modelcif.ReferenceDatabase instances corresponding to the databases requested via config_data["seq_db_keys"]. If a template database is specified via config_data["tpl_db"], it is appended as well. UniRef database entries require a version string in config_data["ur30_db_version"]; template database entries require a version string in config_data["tpl_db_version"].

Parameters:

config_data (dict) – Configuration data dictionary. Relevant keys: seq_db_keys (list of str) — sequence database identifiers to look up; ur30_db_version (str or None) — version string required when "UniRef" is in seq_db_keys; tpl_db (str or None) — optional template database identifier; tpl_db_version (str or None) — version string required when tpl_db is set. On return, seq_dbs is added as a list of modelcif.ReferenceDatabase instances.

Returns:

None

Raises:

ValueError – If "UniRef" is in seq_db_keys but ur30_db_version is None.
ValueError – If tpl_db is set but tpl_db_version is None.
ValueError – If a resolved database key is not found in the hardcoded set of known ColabFold databases.

modelarchive.modelcif.fix_af2.get_cf_sw_plus_params(config_data, use_localcolabfold=False)[source]

Create a list of software and parameters for a ColabFold protocol step.

Parameters:

config_data (dict) – ColabFold configuration data as returned by get_cf_config().
use_localcolabfold (bool) – If True, prepend LocalColabFold to the list of software entries.

Returns:

A list of (software, parameters) tuples suitable for use in a protocol.

Return type:

list[tuple[dict, dict]]

modelarchive.modelcif.fix_af2.get_colabfold_software(version=None)[source]

Get ColabFold as a dict for creating a software object.

Parameters:: version (str) – Version of ColabFold. Should only be None if the version is genuinely unavailable.
Returns:: A dictionary with software metadata suitable for creating a ModelCIF software object.
Return type:: dict

modelarchive.modelcif.fix_af2.get_galaxy_software(version)[source]

Get Galaxy as a software dictionary for a ModelCIF file.

Builds a dictionary suitable for creating a modelcif.Software object, with citation and download URL derived from the provided version string.

Parameters:

version (str) – Galaxy AlphaFold 2 version string in the format [AF2v]+galaxy[X], e.g. "2.3.2+galaxy1".

Returns:

Software descriptor with keys name, classification, description, citation, location, type, and version.

Return type:

dict

modelarchive.modelcif.fix_af2.get_localcolabfold_software(version=None)[source]

Get LocalColabFold as a dict for creating a software object.

Parameters:: version (str) – Version of LocalColabFold. Should only be None if the version is genuinely unavailable.
Returns:: A dictionary with software metadata suitable for creating a ModelCIF software object.
Return type:: dict

modelarchive.modelcif.fix_af2.get_mmseqs2_software(version=None)[source]

Get MMseqs2 as a dict for creating a software object.

Parameters:: version (str) – Version of MMseqs2. Should only be None if the version is genuinely unavailable.
Returns:: A dictionary with software metadata suitable for creating a ModelCIF software object.
Return type:: dict

modelarchive.modelcif.fix_af2.get_sequence(chn, use_auth=False)[source]

Get the sequence of an OpenStructure chain, inserting '-' for gaps.

Parameters:

chn (ost.mol.ChainHandle or ost.mol.ChainView) –
OST chain to extract the sequence from. Any object providing the following interface can be used as a drop-in replacement for the OST chain object:
- chn.residues: sequence of residue objects, each providing
- chn.residues[i].number.num (int): internal residue number
- chn.residues[i].one_letter_code (str): single-letter code
- chn.residues[i].GetStringProp("pdb_auth_resnum") (str): author residue number as an integer string, only required if use_auth=True
use_auth (bool) – If True, use PDB author residue numbers instead of internal residue numbers.

Returns:

One-letter code sequence with '-' characters inserted for gaps.

Return type:

str
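
To illustrate the duck-typed interface listed above, here is a minimal sketch that builds a gapped sequence from stand-in residue objects. It does not call get_sequence() and needs no OpenStructure installation. The helpers sequence_with_gaps and make_res are hypothetical, and the sketch assumes that only jumps between consecutive residue numbers are padded (no leading gap).

```python
from types import SimpleNamespace


def sequence_with_gaps(residues):
    """Join one-letter codes, padding jumps in residue numbers with '-'."""
    parts = []
    last_num = None
    for res in residues:
        num = res.number.num
        if last_num is not None and num > last_num + 1:
            # One '-' per residue number skipped since the previous residue.
            parts.append("-" * (num - last_num - 1))
        parts.append(res.one_letter_code)
        last_num = num
    return "".join(parts)


def make_res(num, one_letter_code):
    """Build a stand-in residue exposing number.num and one_letter_code."""
    return SimpleNamespace(
        number=SimpleNamespace(num=num), one_letter_code=one_letter_code
    )


# Residues 3 and 4 are missing, so two gap characters are expected.
residues = [make_res(1, "M"), make_res(2, "A"), make_res(5, "T")]
gapped = sequence_with_gaps(residues)
```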

modelarchive.modelcif.fix_af2.store_as_modelcif(mdl_data, out_dir, mdl_fle_stem, compress)[source]

Assemble model data into a ModelCIF file and write it to disk.

Creates a modelcif.System from the provided data, attaches entities, models, QA scores, associated files, and a modelling protocol, then writes the result as a ModelCIF file. Optionally compresses the output and packages associated files into a ZIP archive.

Parameters:

mdl_data (dict) –
Dictionary with model data. Expected keys:
- title (str): Title of the modelling system.
- mdl_id (str): Model identifier; converted to upper case.
- model_details (str): Free-text description of the model.
- audit_authors (list[str]): Author names for the audit record.
- ranked_mdls (list[dict]): Per-model atom data.
- target_entities (list[dict]): Target sequence data used to build asymmetric units and entities.
- config_data (dict): Configuration data; must contain the key seq_dbs with reference database entries for the protocol.
- protocol (dict): Modelling protocol description passed to _get_modelcif_protocol().
- acc_files (dict, optional): Mapping of labels to associated file descriptors, each containing details, destination_file_name, source_file_path, file_format, and file_content.
- af2_protocol_name (str, optional): If present, used to assign software metadata to AlphaFold 2 QA metric classes.
out_dir (str | Path) – Directory to write the output file(s) to.
mdl_fle_stem (str) – Base name for the output file, without extension.
compress (bool) – If True, the mmCIF file is gzip-compressed after writing.

Returns:

File name of the written mmCIF file, relative to out_dir. Ends with .cif or .cif.gz depending on compress.

Return type:

str
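
The relation between compress and the returned file name can be sketched with the standard library alone. The helper write_cif below is hypothetical and only mimics the naming behaviour described above; removing the uncompressed file after compression is an assumption of this sketch, not documented behaviour of store_as_modelcif().

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def write_cif(text, out_dir, stem, compress):
    """Write text to <stem>.cif, optionally gzip it, return the file name."""
    out_dir = Path(out_dir)
    cif_path = out_dir / f"{stem}.cif"
    cif_path.write_text(text)
    if not compress:
        return cif_path.name
    gz_path = out_dir / f"{stem}.cif.gz"
    with open(cif_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    cif_path.unlink()  # assumption: only the compressed copy is kept
    return gz_path.name


with tempfile.TemporaryDirectory() as tmp:
    plain_name = write_cif("data_test\n", tmp, "model", compress=False)
    gz_name = write_cif("data_test\n", tmp, "model", compress=True)
    with gzip.open(Path(tmp) / gz_name, "rt") as handle:
        round_trip = handle.read()
```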

Fixing AlphaFold 3 ModelCIF files (`modelarchive.modelcif.fix_af3`)

ModelCIF files generated by AlphaFold 3 deviate from the official ModelCIF definition dictionary in specific cases. In particular, for homomeric assemblies, each molecular entity copy is written as a separate entity in the CIF document, instead of defining a single entity referenced multiple times. This module provides functionality to correct the deviations.

For ModelCIF conversions done for ModelArchive, we first fix entity descriptions and then call the following functions in order: fix_modelcif_issues(), fix_citation(), fix_software_location(), fix_model_name(), fix_protocol(), add_per_residue_plddt(), and add_data_from_json_files() (if AF3 JSON files are available, with pairwise_in_zip=True and use_local_pairwise_if_possible=True).

exception modelarchive.modelcif.fix_af3.NotIdentifiedContextRecordError(category, item=None, context=None)[source]

Bases: NotIdentifiedRecordError

Exception if a record for a specific context can not be identified.

category

Category for which the exception is raised.

Type:: str

item

Involved item, if any.

Type:: str|None

Parameters:

category (str) – Affected category.
item (str, optional) – Affected item.
context (str, optional) – Context, part of the message.

exception modelarchive.modelcif.fix_af3.NotIdentifiedDuplicatedRecordError(category, record_id)[source]

Bases: NotIdentifiedRecordError

Exception if a duplicated record is found in a table.

category

Category with the non-unique records.

Type:: str

Parameters:

category (str) – Category with the non-unique records.
record_id (str) – Identifier for the duplicated record. Not bound to a specific item on purpose.

exception modelarchive.modelcif.fix_af3.NotIdentifiedRecordError(msg)[source]

Bases: RuntimeError

General exception for records that can not be identified in a table.

This exception should not be raised directly, it exists to define other “NotIdentified” exceptions inheriting from it.

Parameters:: msg (str) – Exception message.

exception modelarchive.modelcif.fix_af3.NotIdentifiedSingleRecordError(category, item=None, value=None)[source]

Bases: NotIdentifiedRecordError

Exception if a specific record can not be identified in a table.

category

Affected category.

Type:: str

item

Affected item.

Type:: str|None

Parameters:

category (str) – Affected category.
item (str, optional) – Missing item, extends the exception message.
value (str, optional) – Value, in case a record is found but with mismatching value. Extends the exception message.

modelarchive.modelcif.fix_af3.add_data_from_json_files(block, input_path, full_qe_path, summary_qe_path, out_zip_path, pairwise_in_zip=True, use_local_pairwise_if_possible=False)[source]

Add QA metrics and metadata from AF3 JSON files to a ModelCIF block.

Reads the AF3 input JSON, full confidence JSON, and summary confidence JSON to populate QA metric categories in block. Packs the input JSON and, optionally, pairwise QA scores into a ZIP archive at out_zip_path. Updates _ma_qa_metric, _ma_qa_metric_global, _ma_qa_metric_feature, _ma_qa_metric_local_pairwise, _ma_qa_metric_feature_pairwise, _ma_entry_associated_files, _ma_associated_archive_file_details, _audit_conform, and _ma_software_parameter (the latter only if model seeds or recycle counts are present in the input JSON).

The function derives feature lists from _atom_site. Per-residue pairwise scores are written to _ma_qa_metric_local_pairwise when use_local_pairwise_if_possible is True and all tokens are polymer residues (no HETATM); in all other cases _ma_qa_metric_feature_pairwise is used.

Parameters:

block (gemmi.cif.Block) – mmCIF block to be updated in place. Must already contain _atom_site, _entity, _ma_qa_metric, _ma_qa_metric_global, _ma_software_group, and _entry categories.
input_path (Path | str) – Path to the AF3 input JSON file. Server output: <JOBNAME>_job_request.json.
full_qe_path (Path | str) – Path to the JSON file containing per-atom and per-token confidence arrays (pLDDT, PAE, contact probabilities). Server output: <JOBNAME>_full_data_<N>.json; code output: <JOBNAME>_confidences.json.
summary_qe_path (Path | str) – Path to the JSON file containing summary confidence values (pTM, ipTM, ranking score, etc.). Server output: <JOBNAME>_summary_confidences_<N>.json; code output: <JOBNAME>_summary_confidences.json.
out_zip_path (Path | str) – Path for the output ZIP archive. The archive always contains input.json (a copy of input_path). If pairwise_in_zip is True, a pairwise_qa.cif file with the pairwise QA metrics is also included. The value of _ma_entry_associated_files.file_url is set to the bare filename (i.e. without any directory component), so the main CIF file and the ZIP must reside in the same directory.
pairwise_in_zip (bool) – If True, pairwise QA scores are written to a separate pairwise_qa.cif file and packaged inside the ZIP archive rather than embedded directly in block. Defaults to True.
use_local_pairwise_if_possible (bool) – If True and every token in the structure is a polymer residue (no HETATM records), _ma_qa_metric_local_pairwise is used for pairwise token scores instead of _ma_qa_metric_feature_pairwise. Defaults to False.

Returns:

None

Raises:

RuntimeError – If full_qe_path or summary_qe_path contain score keys that are not listed in known_scores.
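
The choice between the two pairwise categories follows a simple rule. The sketch below restates it as a standalone function; pairwise_category is a hypothetical helper, and judging "polymer only" by the values of _atom_site.group_PDB is an assumption based on the "no HETATM" wording above.

```python
def pairwise_category(group_pdb_values, use_local_pairwise_if_possible=False):
    """Pick the category that would hold the pairwise token scores."""
    # Assumption: a structure is polymer-only if no record is HETATM.
    all_polymer = all(group == "ATOM" for group in group_pdb_values)
    if use_local_pairwise_if_possible and all_polymer:
        return "_ma_qa_metric_local_pairwise"
    return "_ma_qa_metric_feature_pairwise"


protein_only = pairwise_category(["ATOM", "ATOM"], True)
with_ligand = pairwise_category(["ATOM", "HETATM"], True)
flag_unset = pairwise_category(["ATOM", "ATOM"])
```

Only the combination of an explicit opt-in and a polymer-only structure selects the per-residue category; every other case falls back to the feature-based one.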

modelarchive.modelcif.fix_af3.add_json_files_in_archive_file(block, input_path, full_qe_path, summary_qe_path, out_zip_path)[source]

Package AF3 JSON files as accompanying data without processing them.

Writes ModelCIF categories _ma_entry_associated_files and _ma_associated_archive_file_details to block and packages the three AF3 JSON files into a ZIP archive at out_zip_path. Alternative to add_data_from_json_files() with the same ..._path parameters; use this function when the JSON files should be stored as-is rather than parsed for QA metrics.

Warning

Existing _ma_entry_associated_files and _ma_associated_archive_file_details categories in block are overwritten without checking their prior contents.

Parameters:

block (gemmi.cif.Block) – mmCIF block to be updated in place. Must already contain an _entry category with exactly one row.
input_path (Path | str) – Path to the AF3 input JSON file. Server output: <JOBNAME>_job_request.json.
full_qe_path (Path | str) – Path to the JSON file containing per-atom and per-token confidence arrays (pLDDT, PAE, contact probabilities). Server output: <JOBNAME>_full_data_<N>.json; code output: <JOBNAME>_confidences.json.
summary_qe_path (Path | str) – Path to the JSON file containing summary confidence values (pTM, ipTM, ranking score, etc.). Server output: <JOBNAME>_summary_confidences_<N>.json; code output: <JOBNAME>_summary_confidences.json.
out_zip_path (Path | str) – Path for the output ZIP archive. The archive contains input.json, summary_confidences.json, and confidences.json. The value of _ma_entry_associated_files.file_url is set to the bare filename (i.e. without any directory component), so the main CIF file and the ZIP must reside in the same directory.

Returns:

None
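
The packaging step can be pictured with the standard zipfile module. This is a sketch under assumptions, not the module's code: pack_json_files is a hypothetical helper, the mapping of full_qe_path to confidences.json and summary_qe_path to summary_confidences.json is inferred from the file names, and no CIF categories are written.

```python
import tempfile
import zipfile
from pathlib import Path


def pack_json_files(input_path, full_qe_path, summary_qe_path, out_zip_path):
    """Store three JSON files under fixed names; return the bare ZIP name."""
    members = {
        "input.json": input_path,
        "summary_confidences.json": summary_qe_path,
        "confidences.json": full_qe_path,
    }
    with zipfile.ZipFile(out_zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for arcname, src in members.items():
            archive.write(src, arcname=arcname)
    # Only the file name is returned, mirroring the bare file_url value.
    return Path(out_zip_path).name


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    sources = []
    # Empty JSON documents stand in for real AF3 output files.
    for stem in ("job_request", "full_data_0", "summary_confidences_0"):
        path = tmp / f"demo_{stem}.json"
        path.write_text("{}")
        sources.append(path)
    file_url = pack_json_files(
        sources[0], sources[1], sources[2], tmp / "demo.zip"
    )
    with zipfile.ZipFile(tmp / "demo.zip") as archive:
        packed_names = sorted(archive.namelist())
```

Because only the bare file name is recorded, moving the CIF file without the ZIP archive breaks the link between the two.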

modelarchive.modelcif.fix_af3.add_per_residue_plddt(block)[source]

Add average per-residue pLDDT scores to an AF3 ModelCIF file.

Adds _ma_qa_metric_local data derived from B-factor values in _atom_site. The per-residue pLDDT is computed as the mean over all atoms of a residue. Non-polymer residues (missing value in _atom_site.label_seq_id) are excluded.

If _ma_qa_metric_local is already present in the block, the function exits early with a warning. If no local pLDDT entry exists in _ma_qa_metric, it will be added; if more than one local pLDDT entry is found, an exception is raised as this is most likely an error in the ModelCIF file.

This fix targets AF3 files predating version 3.0.1, which lack _ma_qa_metric_local.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import access, fix_af3
>>> # Please note: the example CIF document for this case has the
>>> # _atom_site category reduced to the bare minimum to make the
>>> # mechanics of add_per_residue_plddt() work, to keep the example
>>> # shorter.
>>> CIF_DATA = '''data_test
... #
... loop_
... _ma_software_group.group_id
... _ma_software_group.ordinal_id
... _ma_software_group.software_id
... 1 1 1
... #
... loop_
... _software.classification
... _software.date
... _software.description
... _software.name
... _software.pdbx_ordinal
... _software.type
... _software.version
... other ? "Structure prediction" AlphaFold 1 package AlphaFold-beta
... #
... loop_
... _atom_site.group_PDB
... _atom_site.label_comp_id
... _atom_site.label_asym_id
... _atom_site.label_seq_id
... _atom_site.B_iso_or_equiv
... _atom_site.pdbx_PDB_model_num
... ATOM MET A 1 35.00 1
... ATOM ALA A 2 50.30 1
... ATOM THR A 3 65.75 1
... '''
>>> block = cif.read_string(CIF_DATA).sole_block()
>>> fix_af3.add_per_residue_plddt(block)
>>> # After execution, the CIF document has categories _ma_qa_metric
>>> # and _ma_qa_metric_local added
>>> # There should be only 1 record in _ma_qa_metric
>>> qa_dict = block.get_mmcif_category("_ma_qa_metric.")
>>> print(qa_dict)
{'id': ['1'], 'mode': ['local'], 'name': ['pLDDT'], 'software_group_id': ['1'], 'type': ['pLDDT']}
>>> # There should be 3 records of local scores
>>> table = access.get_table(block, "_ma_qa_metric_local.")
>>> print("# chain res seqID pLDDT")
# chain res seqID pLDDT
>>> for r in table:
...     print(
...         f"{r['ordinal_id']}   {r['label_asym_id']}   "
...         + f"{r['label_comp_id']}   {r['label_seq_id']}   "
...         + f"{r['metric_value']}"
...     )
1   A   MET   1   35.0
2   A   ALA   2   50.3
3   A   THR   3   65.75

Parameters:: block (gemmi.cif.Block) – CIF block to operate on.
Returns:: None
Raises:: RuntimeError – If _ma_qa_metric contains more than one local pLDDT entry.
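
The example above has one atom per residue, so the averaging step is not visible. The following sketch shows the per-residue mean on plain tuples instead of a gemmi block; per_residue_plddt is a hypothetical helper, and grouping by asym ID, sequence ID and component ID is an assumption about how residues are keyed.

```python
from collections import defaultdict
from statistics import mean


def per_residue_plddt(atoms):
    """Average B-factors per residue, skipping non-polymer atoms.

    atoms: iterable of (label_asym_id, label_seq_id, label_comp_id, b_factor)
    """
    grouped = defaultdict(list)
    for asym_id, seq_id, comp_id, b_factor in atoms:
        if seq_id in (".", "?"):
            # Missing label_seq_id marks a non-polymer residue: excluded.
            continue
        grouped[(asym_id, seq_id, comp_id)].append(float(b_factor))
    return {key: mean(values) for key, values in grouped.items()}


atoms = [
    ("A", "1", "MET", "30.0"),
    ("A", "1", "MET", "40.0"),
    ("A", "2", "ALA", "50.5"),
    ("B", ".", "ATP", "10.0"),  # ligand atom, no sequence ID
]
scores = per_residue_plddt(atoms)
```

Here the two MET atoms average to 35.0, ALA keeps its single value, and the ligand atom does not contribute a record.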

modelarchive.modelcif.fix_af3.fix_citation(block)[source]

Normalise the AlphaFold 3 citation in a ModelCIF block.

Ensures that the AlphaFold 3 publication (PMID 38718835) is not marked as the “primary” citation by assigning it a numeric citation ID instead. Completes an incomplete AlphaFold 3 citation, replaces the author list with the full curated list of names, and updates the citation ID of the author records accordingly. Reorders citations so that the primary entry appears first and links the citation to the corresponding software record.

This adjustment is not required for valid ModelCIF files, but follows ModelArchive conventions where the primary citation must refer to the deposited model rather than the software used to generate it.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import access, fix_af3
>>> # start with a CIF document that marks the AF3 paper as 'primary'
>>> CIF_DATA = '''data_test
... _citation.id primary
... _citation.country UK
... _citation.journal_full Nature
... _citation.journal_id_ASTM NATUAS
... _citation.journal_id_CSD 0006
... _citation.journal_id_ISSN 0028-0836
... _citation.journal_volume 630
... _citation.page_first 493
... _citation.page_last 500
... _citation.pdbx_database_id_DOI 10.1038/s41586-024-07487-w
... _citation.pdbx_database_id_PubMed 38718835
... _citation.title 'Accurate structure prediction of biomolecular ...'
... _citation.year 2024
... #
... loop_
... _citation_author.citation_id
... _citation_author.name
... _citation_author.ordinal
... primary "Google DeepMind AlphaFold Team" 1
... primary "Isomorphic Labs Team" 2
... #
... loop_
... _software.classification
... _software.date
... _software.description
... _software.name
... _software.pdbx_ordinal
... _software.type
... _software.version
... other ? "Structure prediction" AlphaFold 1 package AlphaFold-beta
... '''
>>> block = cif.read_string(CIF_DATA).sole_block()
>>> fix_af3.fix_citation(block)
>>> # The usual block.as_string() output would be too much for a
>>> # docstring, just check some important values.
>>> table = access.get_table(block, "_citation")
>>> assert table[0]["id"] == "1"
>>> table = access.get_table(block, "_citation_author")
>>> assert table[0]["name"] != "Google DeepMind AlphaFold Team"
>>> table = access.get_table(block, "_software")
>>> assert table[0]["citation_id"] == "1"

Parameters:

block (gemmi.cif.Block) – CIF block to operate on.

Returns:

None

Raises:

edit.NotFoundCategoryError – If _software category can not be found.
NotIdentifiedSingleRecordError – If a required item is missing from the _citation category, or if item values in _citation are not as expected.
NotIdentifiedDuplicatedRecordError – If multiple entries for AlphaFold are found in _software category. In that case, the “right” record can not be identified.

modelarchive.modelcif.fix_af3.fix_model_name(block, mdl_rank)[source]

Normalise _ma_model_list.model_name for given rank.

AlphaFold 3 sets _ma_model_list.model_name to “Top ranked model” for all models, regardless of their rank. This function rewrites the value such that only mdl_rank == 1 is labelled “Top ranked model”. All other ranks are renamed to “#<mdl_rank> ranked model”.

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import fix_af3
>>> # get sample CIF data
>>> cif_data = '''data_test
... _ma_model_list.data_id    1
... _ma_model_list.model_name "Top ranked model"
... _ma_model_list.model_type "Ab initio model"
... _ma_model_list.ordinal_id 1
... '''
>>> block = cif.read_string(cif_data).sole_block()
>>> fix_af3.fix_model_name(block, 2)
>>> print(block.as_string())
data_test
_ma_model_list.data_id 1
_ma_model_list.model_name "#2 ranked model"
_ma_model_list.model_type "Ab initio model"
_ma_model_list.ordinal_id 1

>>> fix_af3.fix_model_name(block, 1)
>>> print(block.as_string())
data_test
_ma_model_list.data_id 1
_ma_model_list.model_name "Top ranked model"
_ma_model_list.model_type "Ab initio model"
_ma_model_list.ordinal_id 1

Parameters:

block (gemmi.cif.Block) – CIF block to operate on.
mdl_rank (int) – Rank of the AlphaFold 3 model. If mdl_rank == 1, the name is set to “Top ranked model”.

Returns:

None

Raises:

RuntimeError – If the _ma_model_list category contains more than one row.
edit.NotFoundCategoryError – If no software entry is found for AF3.
edit.NotFoundItemError – If _ma_model_list.model_name can not be found in block.
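
The naming rule itself is small enough to restate as a standalone function. This is a sketch only; ranked_model_name is a hypothetical helper and does not touch a CIF block.

```python
def ranked_model_name(mdl_rank):
    """Return the model name that corresponds to a given rank."""
    if mdl_rank == 1:
        return "Top ranked model"
    return f"#{mdl_rank} ranked model"


names = [ranked_model_name(rank) for rank in (1, 2, 5)]
```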

modelarchive.modelcif.fix_af3.fix_modelcif_issues(block, compdict_cache='.compdict_cache')[source]

Fix multiple small issues in AF3 ModelCIF files.

Things corrected:

_atom_site.auth_comp_id gets added if not present
_pdbx_poly_seq_scheme.pdb_mon_id gets added if not present
_pdbx_branch_scheme.pdb_mon_id gets added if not present
_pdbx_entity_branch_list gets added when _pdbx_branch_scheme exists
_pdbx_nonpoly_scheme.ndb_seq_num gets added if not present
single sugars mistakenly marked as ‘branched’ entity will be relabelled to ‘non-polymer’ in the _entity category
duplicated molecular entities are reduced to a single molecular entity (AF3 adds a molecular entity per copy of a molecule)
atom names are changed to comply with IUPAC
ligand naming scheme LIG_<CHARACTER> is replaced with proper molecule names (if possible, via RCSB)
if _ma_data_ref_db is set, the wrong “id” item is replaced with “data_id”, duplicate entries are removed and the necessary data is set in _ma_data

Parameters:

block (gemmi.cif.Block) – CIF block to operate on.
compdict_cache (str | Path) – Path to the cache file for RCSB API calls for chemical compounds. Defaults to .compdict_cache.

Returns:

None

Raises:

NotIdentifiedSingleRecordError – If a record to be deleted cannot be found.
RuntimeError – If entities to be merged have differing data, if an empty entity is found, if no _entity_poly record is found for an entity ID, if polymer types of duplicated entities mismatch, or if a SMILES string cannot be identified via the RCSB API.

modelarchive.modelcif.fix_af3.fix_protocol(block)[source]

Fix the MA protocol to a single well-formed step.

Rewrites _ma_data, _ma_data_group, and _ma_protocol_step from scratch based on the existing _ma_target_entity, _ma_model_list and _ma_software_group categories.

Warning

Existing _ma_data, _ma_data_group, and _ma_protocol_step categories in block are overwritten without checking their prior contents.
It is important that fix_modelcif_issues() and fix_model_name() are called before this function, since it uses the data they set in _ma_model_list and _ma_data_ref_db.

Data layout after the call:

_ma_data:
One record per target entity (content_type “target”) followed by one record per model (content_type “model coordinates”). If _ma_data_ref_db is available, records for them are added as well. Names for the data items are taken from _entity.pdbx_description, _ma_model_list.model_name and _ma_data_ref_db.name, respectively. IDs are assigned sequentially starting at 1.

_ma_data_group:
Group 1 - all target data IDs (input side).

Group 2 - all model data IDs (output side).

_ma_protocol_step:
A single step referencing the AF3 software group, group 1 as input, and group 2 as output.
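
The layout above can be sketched on plain Python data to make the ID assignment explicit. layout_protocol_data is a hypothetical helper, and the sketch leaves out the optional _ma_data_ref_db records because their group membership is not stated here.

```python
def layout_protocol_data(target_names, model_names):
    """Assign sequential data IDs and split them into two data groups."""
    ma_data = []
    # Targets come first, models second; IDs count up from 1.
    for name in target_names:
        ma_data.append(
            {"id": len(ma_data) + 1, "name": name, "content_type": "target"}
        )
    for name in model_names:
        ma_data.append(
            {
                "id": len(ma_data) + 1,
                "name": name,
                "content_type": "model coordinates",
            }
        )
    groups = {
        1: [d["id"] for d in ma_data if d["content_type"] == "target"],
        2: [
            d["id"]
            for d in ma_data
            if d["content_type"] == "model coordinates"
        ],
    }
    # The single protocol step reads group 1 and produces group 2.
    step = {"step_id": 1, "input_data_group_id": 1, "output_data_group_id": 2}
    return ma_data, groups, step


ma_data, groups, step = layout_protocol_data(
    ["bestest polymer in universe", "second best polythingi in universe"],
    ["Top ranked model"],
)
```

For two target entities and one model, the targets receive data IDs 1 and 2 and the model receives ID 3, which gives group 1 the members [1, 2] and group 2 the member [3].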

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import access, fix_af3
>>> # start with a minimal CIF document providing the source categories
>>> CIF_DATA = '''data_test
... #
... loop_
... _entity.id
... _entity.pdbx_description
... _entity.type
... 1 "bestest polymer in universe" polymer
... 2 "second best polythingi in universe" polymer
... #
... loop_
... _ma_target_entity.data_id
... _ma_target_entity.entity_id
... _ma_target_entity.origin
... 1 1 .
... 1 2 .
... #
... _ma_model_list.data_id          1
... _ma_model_list.model_group_id   1
... _ma_model_list.model_group_name "AlphaFold-beta-20231127 (...)"
... _ma_model_list.model_id         1
... _ma_model_list.model_name       "Top ranked model"
... _ma_model_list.model_type       "Ab initio model"
... _ma_model_list.ordinal_id       1
... #
... loop_
... _ma_software_group.group_id
... _ma_software_group.ordinal_id
... _ma_software_group.software_id
... 1 1 1
... #
... loop_
... _software.classification
... _software.date
... _software.description
... _software.name
... _software.pdbx_ordinal
... _software.type
... _software.version
... other ? "Structure prediction" AlphaFold 1 package AlphaFold-beta
... '''
>>> block = cif.read_string(CIF_DATA).sole_block()
>>> fix_af3.fix_protocol(block)
>>> access.get_table(block, "_entity").erase()
>>> access.get_table(block, "_ma_data").erase()
>>> access.get_table(block, "_ma_data_group").erase()
>>> access.get_table(block, "_ma_model_list").erase()
>>> access.get_table(block, "_ma_software_group").erase()
>>> access.get_table(block, "_ma_target_entity").erase()
>>> access.get_table(block, "_software").erase()
>>> print(block.as_string())
data_test
loop_
_ma_protocol_step.ordinal_id
_ma_protocol_step.protocol_id
_ma_protocol_step.step_id
_ma_protocol_step.method_type
_ma_protocol_step.details
_ma_protocol_step.software_group_id
_ma_protocol_step.input_data_group_id
_ma_protocol_step.output_data_group_id
1 1 1 modeling 'Model generated with AlphaFold 3.' 1 1 2

Parameters:

block (gemmi.cif.Block) – CIF block to operate on.

Returns:

None

Raises:

edit.NotFoundCategoryError – If any required source category is absent: _entity, _ma_target_entity, _ma_model_list, or _ma_software_group.
edit.NotFoundItemError – If _ma_target_entity.data_id, _ma_model_list.data_id or _ma_model_list.model_name are missing.
NotIdentifiedDuplicatedRecordError – If multiple _ma_software_group records exist and the AF3 entry cannot be unambiguously identified in _software.
NotIdentifiedContextRecordError – If multiple _ma_software_group records exist but no AF3 entry can be found in _software at all.

modelarchive.modelcif.fix_af3.fix_software_location(block)[source]

Ensures the AlphaFold 3 _software entry has a correct location URL.

Determines whether the ModelCIF block originates from the AlphaFold 3 server or a local installation and sets the corresponding URL in _software.location. If the column does not yet exist it is created; otherwise only the row for AlphaFold 3 is updated.
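
The origin check can be pictured as a lookup on the license text. The following is a minimal sketch under assumptions, not the implementation of fix_software_location(): the helper name guess_af3_location is hypothetical, and matching on the two substrings is inferred from the example in this section, where _pdbx_data_usage.details mentions either alphafoldserver.com or the alphafold3 GitHub repository.

```python
SERVER_URL = "https://alphafoldserver.com/"
LOCAL_URL = "https://github.com/google-deepmind/alphafold3"


def guess_af3_location(usage_details):
    """Hypothetical helper: map license details text to a software URL."""
    lowered = usage_details.lower()
    # Assumption: the AlphaFold server license text mentions its own domain.
    if "alphafoldserver.com" in lowered:
        return SERVER_URL
    # Assumption: a local installation refers to the GitHub repository.
    if "github.com/google-deepmind/alphafold3" in lowered:
        return LOCAL_URL
    # The real function raises NotIdentifiedContextRecordError in this case;
    # a built-in exception keeps the sketch self-contained.
    raise LookupError("origin of the AlphaFold 3 license not identified")


print(guess_af3_location("... alphafoldserver.com/output-terms."))
print(guess_af3_location("...github.com/google-deepmind/alphafold3..."))
```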

Examples

>>> from gemmi import cif
>>> from modelarchive.modelcif import access, fix_af3
>>> # start with a minimal CIF document as produced by the AlphaFold 3 server
>>> CIF_DATA = '''data_test
... _pdbx_data_usage.details "... alphafoldserver.com/output-terms."
... _pdbx_data_usage.id 1
... _pdbx_data_usage.type license
... _pdbx_data_usage.url ?
... #
... loop_
... _software.classification
... _software.date
... _software.description
... _software.name
... _software.pdbx_ordinal
... _software.type
... _software.version
... other ? "Structure prediction" AlphaFold 1 package AlphaFold-beta
... '''
>>> block = cif.read_string(CIF_DATA).sole_block()
>>> fix_af3.fix_software_location(block)
>>> # Just check that _software.location exists and has the right value
>>> table = access.get_table(block, "_software")
>>> assert "_software.location" in table.tags
>>> assert table[0]["location"] == "https://alphafoldserver.com/"
>>> # Change block to look like ModelCIF file from local installation
>>> table = access.get_table(block, "_pdbx_data_usage")
>>> table[0]["details"] = "...github.com/google-deepmind/alphafold3..."
>>> fix_af3.fix_software_location(block)
>>> # _software.location should now point to GitHub
>>> table = access.get_table(block, "_software")
>>> assert table[0]["location"] == "https://github.com/google-deepmind/alphafold3"

Parameters:

block (gemmi.cif.Block) – CIF block to operate on.

Returns:

None

Raises:

NotIdentifiedContextRecordError – If no AlphaFold 3 entry is found in the _software table.
NotIdentifiedContextRecordError – If the origin of the AlphaFold 3 license could not be identified in the _pdbx_data_usage table.
NotIdentifiedDuplicatedRecordError – If multiple entries for AlphaFold 3 are found in the _software table.

ModelCIF as `dict` (`modelarchive.modelcif.ma_dict`)

Functionality for handling ModelCIF data and categories represented as dict.

modelarchive.modelcif.ma_dict.add_row_to_category_dict(cat_dict, row_dict, ordinal_item=None, null_value=False)[source]

Add a row to a category dict or create it if empty.

Appends a new row to cat_dict, which holds mmCIF category data as a column-oriented dict of lists. Such a dict is either obtained from gemmi.cif.Block.get_mmcif_category(raw=False) or starts out empty. If cat_dict is empty, it is populated from row_dict directly.

Parameters:

cat_dict (dict[str]) – Column-oriented category dict to update. Each key maps to a list of values, one entry per row.
row_dict (dict[str]) – Dict with keys and single values representing the new row. Keys not present in cat_dict are added as new columns, back-filled with null_value. Keys only in cat_dict are filled with null_value.
ordinal_item (str or None) – Key whose value is auto-incremented on each call. Set to "1" when cat_dict is empty. If set but not present in a non-empty cat_dict, it is silently ignored. Defaults to None.
null_value – Value used to fill missing keys. False for inapplicable ('.'), None for unknown ('?'). Defaults to False.

Returns:

The new ordinal value as a string, or None if ordinal_item is not set.

Return type:

str or None

Raises:

ValueError – If column lengths in cat_dict are inconsistent after appending.
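
The documented behaviour can be sketched as a small stand-alone function. This is a hypothetical re-implementation for illustration only (add_row_sketch is not part of the module). Two details are assumptions not stated above: the next ordinal is taken as the largest existing value plus one, and None is returned when ordinal_item is silently ignored.

```python
def add_row_sketch(cat_dict, row_dict, ordinal_item=None, null_value=False):
    """Hypothetical sketch of the documented row-appending behaviour."""
    n_rows = len(next(iter(cat_dict.values()))) if cat_dict else 0
    row = dict(row_dict)
    new_ordinal = None

    if ordinal_item is not None:
        if not cat_dict:
            new_ordinal = "1"
        elif ordinal_item in cat_dict:
            # Assumption: next ordinal is the largest existing value plus one.
            used = [int(v) for v in cat_dict[ordinal_item] if isinstance(v, str)]
            new_ordinal = str(max(used, default=0) + 1)
        # Otherwise the ordinal item is silently ignored.
        if new_ordinal is not None:
            row[ordinal_item] = new_ordinal

    # New columns are back-filled with null_value for all existing rows.
    for key in row:
        if key not in cat_dict:
            cat_dict[key] = [null_value] * n_rows

    # Columns missing from the new row receive null_value.
    for key, column in cat_dict.items():
        column.append(row.get(key, null_value))

    if len({len(column) for column in cat_dict.values()}) > 1:
        raise ValueError("inconsistent column lengths in category dict")
    return new_ordinal


category = {}
first = add_row_sketch(category, {"name": "a"}, ordinal_item="id")
second = add_row_sketch(category, {"name": "b", "details": "x"}, ordinal_item="id")
third = add_row_sketch(category, {"details": "y"})
print(first, second, third)
print(category)
```

In this sketch the third call omits ordinal_item, so the "id" column of the new row is filled with null_value like any other missing key.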
