structscope
Work in progress — APIs, CLI flags, and output schemas may change between releases. Pin a version tag for reproducible workflows.
structscope is a Rust-native structural bioinformatics toolkit for canonical
protein structure parsing, graph-native representations, reproducible feature
extraction, and analytical outputs.
It parses PDB, mmCIF, and BinaryCIF (with gzip support) into a canonical model, builds residue/atom/interface graphs, and computes raw structural primitives: solvent accessible surface area, relative accessibility, DSSP-style secondary structure, backbone dihedrals, optimal superposition/RMSD, and typed interactions. All primitives emit raw quantities; downstream interpretation is left to the user.
See CLI Usage to get started, Architecture for the crate layout, and the Changelog for recent additions.
CLI Usage
Invoke the tool as structscope <command> [options]. All structure-reading
commands accept PDB, mmCIF, and BinaryCIF inputs, including gzip-compressed
variants (.pdb.gz, .cif.gz, .bcif.gz).
parse
Summarise a structure (chains, residues, atoms, heteroatoms, ligands).
structscope parse 1nkd.cif.gz
structscope parse 1nkd.bcif --format json
featurize
Compute structure-level features and write them to an output directory (JSONL + Parquet). Accepts a single file or a directory to batch-process.
structscope featurize 1nkd.cif.gz --out ./out
structscope featurize ./structures --out ./out --provenance
structscope featurize ./structures --out ./out --provenance -j 4
structscope featurize ./structures --out ./out --ligand-exclude SO4,PO4
structscope featurize dimer.pdb --out ./out --interface-distance 8.0
Optional ligand flags (also on ligands):
- --ligand-exclude RES[,RES...]: add residue names to the default denylist
- --ligand-include RES[,RES...]: allowlist mode; only these hetero residues count
- --binding-distance <Å>: binding-site cutoff (default 5.0)
Optional interface flags (also on interfaces):
- --interface-distance <Å>: chain-pair contact cutoff (default 8.0)
- --interface-area-distance <Å>: interface patch area cutoff (default 5.0)
- --interface-sc-distance <Å>: shape complementarity surface cutoff (default 5.0)
Optional quality flags (also on quality):
- --clash-overlap <Å>: VdW overlap threshold for steric clashes (default 0.4)
- --all-residues: emit every evaluated residue, not just problems (default: problems only)
Emitted features include counts (atoms, residues, chains, ligands), graph metrics (contacts, density, clustering), geometry (radius of gyration, SASA), secondary-structure composition, typed-interaction counts, protein–ligand interaction counts, binding-site residue count, ligand SASA, buried/exposed residue counts, protein–protein interface summaries (pair count, total and max BSA/area/SC, largest-interface chain IDs), and structure quality summaries (Ramachandran favored/allowed/outlier counts, clash pairs, missing backbone residues).
ligands
Emit one JSON record per filtered ligand (SASA, binding-site residues, interaction counts).
structscope ligands 1nkd.cif.gz
structscope ligands complex.cif.gz --out ligands.jsonl
structscope ligands complex.cif.gz --ligand-include HEM,NAG --binding-distance 4.0
interfaces
Emit one JSON record per contacting chain pair (BSA, interface patch area, shape complementarity, contact and residue counts).
structscope interfaces 1nkd.cif.gz
structscope interfaces dimer.pdb --out interfaces.jsonl
structscope interfaces dimer.pdb --interface-distance 8.0 --interface-area-distance 5.0
quality
Emit per-residue structure quality (Ramachandran classification, steric
clashes, missing backbone atoms) as JSONL. By default only problem residues
are emitted; use --all-residues for the full table.
structscope quality 1nkd.cif.gz
structscope quality model.pdb --out quality.jsonl
structscope quality model.pdb --clash-overlap 0.4 --all-residues
residues
Emit one JSON record per residue (SASA, RSA, secondary structure, phi/psi/omega) as JSONL, to stdout or a file.
structscope residues 1nkd.cif.gz
structscope residues 1nkd.cif.gz --out residues.jsonl
compare
Compare two or more structures: pairwise RMSD matrix (CA atoms with sequence alignment by default) and numeric feature deltas against a chosen reference. Accepts a single file or a directory of structures (minimum two inputs).
structscope compare ./models
structscope compare ./models --reference ref.pdb
structscope compare ./models --auto-reference
structscope compare ./models --reference-by min:ramachandran_outlier_count
structscope compare ./models --delta-fields sasa_total,interface_bsa_total --out ./compare-out
structscope compare ./models --out ./compare-out --format csv
structscope compare ./models --atoms backbone --local --reference '#0'
Without --out, prints a single JSON object to stdout (matrix, deltas, reference
metadata, and any parse failures). With --out:
- --format json (default): matrix.json and deltas.jsonl
- --format csv: matrix.csv and deltas.csv
Reference selection (first match wins):
- --reference PATH|#INDEX: explicit path, basename, structure ID, or input index
- --reference-by min:field or max:field: pick by a numeric featurize field
- --auto-reference: lowest Ramachandran outliers, then clashes, then missing backbone
- else the first successfully parsed input
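The "first match wins" order can be read as a chain of fallbacks. The sketch below is illustrative only: `Candidate`, `ReferenceOptions`, `Pick`, `clash_pairs_field`, and `select_reference` are hypothetical names, not structscope's API, and only the index form of --reference is modelled.

```rust
/// Hypothetical per-structure summary; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub name: String,
    pub ramachandran_outliers: u32,
    pub clash_pairs: u32,
    pub missing_backbone: u32,
}

/// Direction for a `--reference-by min:field` / `max:field` style choice.
pub enum Pick {
    Min,
    Max,
}

/// Hypothetical stand-in for the parsed reference-selection flags.
pub struct ReferenceOptions {
    /// Index form of `--reference` (path and ID lookup are omitted here).
    pub explicit_index: Option<usize>,
    /// Direction plus an accessor for the numeric field to rank by.
    pub reference_by: Option<(Pick, fn(&Candidate) -> f64)>,
    pub auto_reference: bool,
}

/// Example field accessor usable with `reference_by`.
pub fn clash_pairs_field(c: &Candidate) -> f64 {
    c.clash_pairs as f64
}

/// Returns the index of the chosen reference, or None when there are no
/// inputs or the explicit index is out of range.
pub fn select_reference(inputs: &[Candidate], opts: &ReferenceOptions) -> Option<usize> {
    if inputs.is_empty() {
        return None;
    }
    // 1. An explicit reference always wins.
    if let Some(i) = opts.explicit_index {
        return (i < inputs.len()).then_some(i);
    }
    // 2. Otherwise rank by the requested numeric field (first input wins ties).
    if let Some((pick, field)) = &opts.reference_by {
        let mut best = 0usize;
        for i in 1..inputs.len() {
            let (cur, top) = (field(&inputs[i]), field(&inputs[best]));
            let better = match pick {
                Pick::Min => cur < top,
                Pick::Max => cur > top,
            };
            if better {
                best = i;
            }
        }
        return Some(best);
    }
    // 3. Otherwise order by outliers, then clashes, then missing backbone.
    if opts.auto_reference {
        return (0..inputs.len()).min_by_key(|&i| {
            let c = &inputs[i];
            (c.ramachandran_outliers, c.clash_pairs, c.missing_backbone)
        });
    }
    // 4. Fall back to the first input.
    Some(0)
}
```

Comparing a tuple in step 3 gives the "then ... then ..." tie-breaking without nested conditionals.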
RMSD correspondence flags (same semantics as rmsd):
- --atoms ca|backbone|all: atom selection (default ca)
- --align: sequence-based residue correspondence (default on)
- --local: Smith-Waterman local alignment for partial overlaps
Optional flags shared with featurize:
- --delta-fields FIELD[,FIELD...]: limit delta columns (default: all numeric features)
- ligand: --ligand-exclude, --ligand-include, --binding-distance
- interface: --interface-distance, --interface-area-distance, --interface-sc-distance
- quality: --clash-overlap
rmsd
Optimal-superposition RMSD between two structures.
# Equal-length structures, matched by atom order:
structscope rmsd ref.pdb mobile.pdb --atoms ca # or: backbone, all
# Different-length but related structures (sequence-aligned CA atoms):
structscope rmsd ref.pdb mobile.pdb --align
# Partial or domain-level overlap (local Smith-Waterman alignment):
structscope rmsd ref.pdb fragment.pdb --local
Without --align, the two selections must have equal atom counts; the error
message hints at --align when they differ.
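The equal-count requirement follows from how RMSD is defined over matched atom pairs. As a minimal sketch, assuming coordinates that are already superposed (the optimal-superposition step is deliberately left out), the check and the final RMSD could look like this. The function name `matched_rmsd` and the error wording are hypothetical, not taken from structscope.

```rust
/// RMSD over atoms matched by order. This sketch does NOT superpose the
/// structures; it only shows the pairing precondition and the final formula.
pub fn matched_rmsd(reference: &[[f64; 3]], mobile: &[[f64; 3]]) -> Result<f64, String> {
    // Pairing by atom order is only meaningful when both selections have
    // the same number of atoms.
    if reference.len() != mobile.len() {
        return Err(format!(
            "atom counts differ ({} vs {}); consider --align",
            reference.len(),
            mobile.len()
        ));
    }
    if reference.is_empty() {
        return Err("no atoms selected".to_string());
    }
    // Sum of squared distances between matched atoms.
    let sum_sq: f64 = reference
        .iter()
        .zip(mobile)
        .map(|(a, b)| (0..3).map(|k| (a[k] - b[k]).powi(2)).sum::<f64>())
        .sum();
    Ok((sum_sq / reference.len() as f64).sqrt())
}
```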
graph
Export a residue, atom, or interface contact graph. Supported formats are graphml (default), gml, and json. Chemical and geometric interactions (disulfides, salt bridges, hydrogen bonds, cation-pi, pi-pi, and hydrophobic contacts) are automatically resolved and embedded as prioritized edges in residue and interface graphs.
structscope graph 1nkd.cif.gz --graph-type residue --format gml
structscope graph complex.cif.gz --graph-type interface --format json
structscope graph 1nkd.cif.gz --graph-type residue --out graph.graphml
query
Run SQL over a feature Parquet file or featurize output directory. Requires a
build with the duckdb feature.
cargo build -p structscope-cli --features duckdb
structscope query ./out --sql "SELECT structure_id, sasa_total FROM features"
provenance
Inspect a provenance SQLite database produced by featurize --provenance.
structscope provenance ./out/run.sqlite
Architecture
The first implementation slice is CLI-first and keeps crate boundaries aligned with the long-term design:
- structscope-core: data model and parsers
- structscope-graphs: graph builders and export
- structscope-features: scientific features
- structscope-store: persisted outputs and query adapter boundary
- structscope-events: structured events
- structscope-provenance: optional lineage capture
- structscope-agent: optional eBPF integration boundary
- structscope-cli: orchestration
The scientific path is independent of provenance and eBPF so batch processing can remain portable.
Interactions and Contact Graphs
structscope detects chemical and geometric interactions directly from coordinates and integrates them into its contact graph representations.
1. Interaction Detection Logic
All interaction detectors reside in structscope-features under interactions.rs and are computed based on atom names and distance/angle rules:
- Disulfide Bonds: Formed between two cysteine sidechain sulfur atoms (CYS SG) within $2.5\text{ Å}$.
- Salt Bridges: Formed between sidechain acidic oxygens (on ASP/GLU) and sidechain basic nitrogens (on LYS/ARG/HIS) within $4.0\text{ Å}$.
- Hydrogen Bonds (Polar Contacts): Formed between a nitrogen/oxygen donor-acceptor pair on different residues within $2.4\text{ to }3.5\text{ Å}$.
- Cation-Pi Interactions: Formed between the centroid of an aromatic ring (PHE/TYR/TRP) and a basic sidechain nitrogen within $6.0\text{ Å}$.
- Pi-Pi Stacking: Formed between the centroids of two aromatic rings within $5.5\text{ Å}$:
  - Parallel Stacking: Ring normal angle $< 30^\circ$ or $> 150^\circ$.
  - Perpendicular Stacking: Ring normal angle between $60^\circ$ and $90^\circ$.
- Hydrophobic Contacts: Formed between sidechain aliphatic/aromatic carbons on different residues within $4.5\text{ Å}$.
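To make the distance/angle rules concrete, here is a minimal sketch of the pi-pi stacking test, applying the thresholds from the list above literally. It assumes ring centroids and ring normals have already been computed. `PiPi`, `Vec3`, `normal_angle_deg`, and `classify_pi_pi` are hypothetical names and do not come from interactions.rs.

```rust
pub type Vec3 = [f64; 3];

#[derive(Debug, PartialEq)]
pub enum PiPi {
    Parallel,
    Perpendicular,
}

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn dot(a: Vec3, b: Vec3) -> f64 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn norm(a: Vec3) -> f64 {
    dot(a, a).sqrt()
}

/// Angle between two ring normals in degrees, in the range [0, 180].
pub fn normal_angle_deg(n1: Vec3, n2: Vec3) -> f64 {
    // Clamp guards against rounding pushing the cosine outside [-1, 1].
    let cos = (dot(n1, n2) / (norm(n1) * norm(n2))).clamp(-1.0, 1.0);
    cos.acos().to_degrees()
}

/// Classifies a ring pair from centroids (c1, c2) and normals (n1, n2).
/// Returns None when the pair is too far apart or the angle matches
/// neither documented range.
pub fn classify_pi_pi(c1: Vec3, n1: Vec3, c2: Vec3, n2: Vec3) -> Option<PiPi> {
    // Centroid-centroid cutoff of 5.5 Å.
    if norm(sub(c1, c2)) > 5.5 {
        return None;
    }
    let angle = normal_angle_deg(n1, n2);
    if angle < 30.0 || angle > 150.0 {
        Some(PiPi::Parallel)
    } else if (60.0..=90.0).contains(&angle) {
        Some(PiPi::Perpendicular)
    } else {
        None
    }
}
```

The other detectors in the list reduce to the same pattern with a single distance cutoff between the named atom classes.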
2. Contact Graph Representation & Prioritization
Contact graphs are built in structscope-graphs using petgraph. When generating residue or interface contact graphs, the builder can ingest chemical interactions and merge them as prioritized edges.
Decoupled Data Flow
To avoid a circular dependency between structscope-features (which needs graph definitions for feature metrics) and structscope-graphs, structscope-graphs defines a simple intermediate struct:
pub struct ChemicalInteraction {
    pub kind: String,
    pub res_id_a: String,
    pub res_id_b: String,
    pub distance: f64,
}
The CLI orchestrator parses the structures, computes features/interactions, maps them to ChemicalInteraction using residue ID lookups, and passes them to the graph builders.
Prioritization Rules
When multiple potential interaction types overlap between two residues, they are merged into a single edge according to strict precedence rules.
- Backbone Covalent Adjacency (covalent_adjacency): Always preserved; never overwritten by a chemical contact.
- Other Contacts: Overwritten if a higher-priority interaction type is found, or if an interaction of the same type has a shorter distance:
| Priority | Edge Kind |
|---|---|
| 7 (Highest) | disulfide |
| 6 | salt_bridge |
| 5 | hydrogen_bond |
| 4 | cation_pi |
| 3 | pi_pi_parallel / pi_pi_perpendicular |
| 2 | hydrophobic |
| 1 (Lowest) | distance_contact / interface_contact |
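The precedence rules can be sketched as a small merge function over a single edge. This is an illustration under assumptions, not the structscope-graphs implementation: `Edge`, `priority`, and `merge_edge` are hypothetical names, unknown kinds are given priority 0, and "same type" is taken to mean an identical kind string.

```rust
/// Hypothetical edge payload; mirrors the kind/distance attributes only.
#[derive(Debug, Clone, PartialEq)]
pub struct Edge {
    pub kind: String,
    pub distance: f64,
}

/// Priority from the table above; unknown kinds rank lowest (assumption).
pub fn priority(kind: &str) -> u8 {
    match kind {
        "disulfide" => 7,
        "salt_bridge" => 6,
        "hydrogen_bond" => 5,
        "cation_pi" => 4,
        "pi_pi_parallel" | "pi_pi_perpendicular" => 3,
        "hydrophobic" => 2,
        "distance_contact" | "interface_contact" => 1,
        _ => 0,
    }
}

/// Merges `candidate` into `existing`; returns true if the edge was replaced.
pub fn merge_edge(existing: &mut Edge, candidate: &Edge) -> bool {
    // Backbone covalent adjacency is never overwritten.
    if existing.kind == "covalent_adjacency" {
        return false;
    }
    let higher = priority(&candidate.kind) > priority(&existing.kind);
    // Same kind: keep whichever observation has the shorter distance.
    let closer = candidate.kind == existing.kind && candidate.distance < existing.distance;
    if higher || closer {
        *existing = candidate.clone();
        return true;
    }
    false
}
```

For example, a distance_contact edge at 6.0 Å is replaced by a hydrophobic contact at 4.2 Å, which is in turn replaced by a salt_bridge, while a later hydrogen_bond candidate leaves the salt_bridge in place.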
3. Export Formats
structscope supports three serialization formats for contact graphs:
GraphML
Standard XML representation of graphs.
- Nodes: Contain node metadata (e.g., residue_name, seq_number).
- Edges: Contain edge details (kind, distance).
GML (Graph Modeling Language)
An easy-to-parse ASCII representation of graphs (excerpt):
graph [
  directed 0
  node [
    id 0
    label "1nkd:A:1:_"
    residue_name "MET"
    seq_number 1
  ]
  edge [
    source 0
    target 1
    distance 5.291
    kind "covalent_adjacency"
  ]
]
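Because GML is plain nested key/value text, an exporter can be written with string formatting alone. The following is a sketch of a writer that reproduces the layout shown above; `Node`, `Link`, and `to_gml` are hypothetical names rather than structscope's exporter, and string values are not escaped.

```rust
use std::fmt::Write;

/// Hypothetical node record carrying the attributes shown in the example.
pub struct Node {
    pub id: usize,
    pub label: String,
    pub residue_name: String,
    pub seq_number: i64,
}

/// Hypothetical edge record.
pub struct Link {
    pub source: usize,
    pub target: usize,
    pub distance: f64,
    pub kind: String,
}

/// Serialises an undirected graph to GML text.
/// Note: quotes inside string values are not escaped in this sketch.
pub fn to_gml(nodes: &[Node], links: &[Link]) -> String {
    let mut out = String::from("graph [\n  directed 0\n");
    for n in nodes {
        // Writing to a String cannot fail, so the Result is ignored.
        let _ = writeln!(
            out,
            "  node [\n    id {}\n    label \"{}\"\n    residue_name \"{}\"\n    seq_number {}\n  ]",
            n.id, n.label, n.residue_name, n.seq_number
        );
    }
    for l in links {
        // Distances are written with three decimals, as in the example.
        let _ = writeln!(
            out,
            "  edge [\n    source {}\n    target {}\n    distance {:.3}\n    kind \"{}\"\n  ]",
            l.source, l.target, l.distance, l.kind
        );
    }
    out.push_str("]\n");
    out
}
```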
JSON (Node-Link Format)
A JSON format compatible with web-based visualization tools (e.g., D3.js, Cytoscape.js):
{
  "nodes": [
    {
      "id": 0,
      "residue_id": "1nkd:A:1:_",
      "chain_id": "1nkd:A",
      "residue_name": "MET",
      "seq_number": 1
    }
  ],
  "links": [
    {
      "source": 0,
      "target": 1,
      "distance": 5.291,
      "kind": "covalent_adjacency"
    }
  ]
}
Citation
If you use structscope in academic work, please cite the repository. The software
is still a work in progress — update the version field to match the tag
you used.
@software{structscope2026,
author = {Amirabadi, Danial Gharaie},
title = {{structscope}: Rust-native structural bioinformatics toolkit},
year = {2026},
publisher = {GitHub},
url = {https://github.com/Danialgharaie/structscope},
version = {0.4.1},
note = {Work in progress. APIs and output schemas may change between releases.}
}
The same entry is checked in at the repository root as CITATION.bib.
Changelog
All notable changes to this project are documented here. This project follows Keep a Changelog conventions.
[Unreleased]
[0.4.1] - 2026-06-11
Added
- GitHub Release Linux x86_64 binary (structscope-*-x86_64-unknown-linux-gnu.tar.gz) on version tags.
- crates.io publishing for workspace crates (cargo install structscope-cli).
[0.4.0] - 2026-06-11
Added
- Multi-structure compare: pairwise RMSD matrix (CA atoms with sequence alignment by default) and numeric feature deltas against a chosen reference.
- CLI command:
  - compare <input>: compare two or more structures from a file or directory; prints JSON to stdout or writes matrix.json + deltas.jsonl (or matrix.csv + deltas.csv with --format csv) to --out.
- Reference selection (first match wins): --reference, --reference-by min:field|max:field, --auto-reference (lowest Ramachandran outliers, clashes, missing backbone), else first input.
- --delta-fields to restrict feature delta columns; --atoms, --align, and --local for RMSD correspondence (same semantics as rmsd).
- Ligand, interface, and quality flags shared with featurize.
[0.3.1] - 2026-06-11
Added
- Golden fixtures and regression tests for structure quality metrics (tests/fixtures/quality/, quality_golden.rs).
[0.3.0] - 2026-06-11
Added
- Structure quality metrics: MolProbity-style Ramachandran classification (favored / allowed / outlier, with Gly/Pro regions), steric clash detection (heavy atoms, configurable VdW overlap), and missing backbone atom checks (N, CA, C, O) over canonical and common variant residues.
- CLI command:
  - quality <input>: per-residue quality records as JSONL (problems only by default; --all-residues for full output).
- Structure-level quality aggregates in featurize: quality_residue_count, Ramachandran counts, clash_pair_count, and missing_backbone_residue_count.
- --clash-overlap flag on featurize and quality (default 0.4 Å).
[0.2.0] - 2026-06-11
Added
- Parallel execution support for structscope featurize using Rayon, allowing high-performance concurrent parsing and feature extraction controlled via a new --jobs/-j CLI flag.
- B-factor (temperature factor) support: extended core Atom model, parsed B-factors from PDB, mmCIF, and BinaryCIF, and computed structure-level (bfactor_mean, bfactor_std, bfactor_min, bfactor_max) and residue-level statistics.
- Advanced geometric interaction detectors: cation-pi, parallel and perpendicular pi-pi stacking, and hydrophobic carbon-carbon contacts.
- Enhanced contact graphs: integrated chemical/geometric interactions as prioritized edges overlaying standard distance contacts.
- New contact graph formats: added custom exporters for GML and node-link JSON formats, with automatic file extension matching.
- Thread-safe background event and provenance logging architecture using a message-passing channel (mpsc) to safely write events to SQLite and JSONL from worker threads.
- BinaryCIF (.bcif/.bcif.gz) parsing: a hand-written MessagePack-based decoder covering all seven column encodings (ByteArray, IntegerPacking, RunLength, Delta, FixedPoint, IntervalQuantization, StringArray) and value masks. structscope now ingests PDB, mmCIF, and BinaryCIF.
- Structural primitives computed directly from coordinates:
  - Solvent accessible surface area (Shrake-Rupley), per-atom and total.
  - Relative solvent accessibility (RSA), per residue, normalised by residue-type maxima (Tien et al. 2013).
  - DSSP-style secondary structure (Kabsch-Sander hydrogen bonds).
  - Backbone dihedrals (phi/psi/omega).
  - Optimal superposition and RMSD (quaternion/Kabsch).
  - Typed interactions: disulfides, salt bridges, hydrogen bonds.
- Sequence alignment primitive (Needleman-Wunsch) for residue correspondence.
- CLI commands:
  - rmsd <reference> <mobile>: optimal-superposition RMSD over matched atoms (--atoms ca|backbone|all), with --align for sequence-based correspondence between structures of different lengths.
  - residues <input>: per-residue features (SASA, RSA, secondary structure, dihedrals) as JSONL.
  - ligands <input>: per-ligand features (SASA, binding-site residues, protein–ligand interaction counts) as JSONL.
- Protein–ligand features: configurable ligand filter (default excludes water and common ions), structure-level protein–ligand interaction counts, binding-site residue count, and ligand SASA.
- Protein–protein interface metrics: buried surface area (BSA), interface patch area, and Lawrence–Colman shape complementarity per contacting chain pair.
- CLI command:
  - interfaces <input>: per chain-pair interface features (BSA, area, shape complementarity, contact and residue counts) as JSONL.
- Structure-level interface aggregates and largest-interface fields in featurize: interface_pair_count, total and max BSA/area/SC, and largest-interface chain IDs.
- Structure-level features: sasa_total, helix/strand/coil_residue_count, disulfide/salt_bridge/hydrogen_bond_count, buried/exposed_residue_count, ligand_sasa_total, ligand_sasa_mean, binding_site_residue_count, protein_ligand_hbond_count, protein_ligand_salt_bridge_count, protein_ligand_hydrophobic_count, protein_ligand_contact_count.
Changed
- Breaking: ligand_count in featurize now uses the filtered ligand definition (hetero residues minus the default denylist and CLI overrides), not the raw hetero residue count.
Notes
- All primitives emit raw quantities; downstream interpretation is left to the user. Per-residue and per-atom detail is exposed as library functions.