structscope

Work in progress — APIs, CLI flags, and output schemas may change between releases. Pin a version tag for reproducible workflows.

structscope is a Rust-native structural bioinformatics toolkit for canonical protein structure parsing, graph-native representations, reproducible feature extraction, and analytical outputs.

It parses PDB, mmCIF, and BinaryCIF (with gzip support) into a canonical model, builds residue/atom/interface graphs, and computes raw structural primitives: solvent accessible surface area, relative accessibility, DSSP-style secondary structure, backbone dihedrals, optimal superposition/RMSD, and typed interactions. All primitives emit raw quantities; downstream interpretation is left to the user.

See CLI Usage to get started, Architecture for the crate layout, and the Changelog for recent additions.

CLI Usage

Invoke the CLI as structscope <command> [options]. Commands that read structures accept PDB, mmCIF, and BinaryCIF inputs, including gzip-compressed variants (.pdb.gz, .cif.gz, .bcif.gz).

parse

Summarise a structure (chains, residues, atoms, heteroatoms, ligands).

structscope parse 1nkd.cif.gz
structscope parse 1nkd.bcif --format json

featurize

Compute structure-level features and write them to an output directory (JSONL + Parquet). Accepts a single file or a directory to batch-process.

structscope featurize 1nkd.cif.gz --out ./out
structscope featurize ./structures --out ./out --provenance
structscope featurize ./structures --out ./out --provenance -j 4
structscope featurize ./structures --out ./out --ligand-exclude SO4,PO4
structscope featurize dimer.pdb --out ./out --interface-distance 8.0

Optional ligand flags (also accepted by the ligands command):

  • --ligand-exclude RES[,RES...] — add residue names to the default denylist
  • --ligand-include RES[,RES...] — allowlist mode; only these hetero residues count
  • --binding-distance <Å> — binding-site cutoff (default 5.0)
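The interplay of the two ligand filter flags can be sketched as below. This is a minimal illustration, not structscope's actual code: the `LigandFilter` type, the `counts_as_ligand` method, and the contents of `DEFAULT_DENYLIST` are assumptions (the documentation only says the default excludes water and common ions), and the sketch assumes that allowlist mode ignores the denylist entirely.

```rust
use std::collections::HashSet;

/// Illustrative default denylist (assumption): water plus a few common ions.
/// The list actually shipped with structscope may differ.
const DEFAULT_DENYLIST: &[&str] = &["HOH", "WAT", "NA", "CL", "K", "MG", "CA", "ZN"];

/// Hypothetical filter configuration mirroring the two CLI flags.
pub struct LigandFilter {
    /// Names from --ligand-include; a non-empty set switches to allowlist mode.
    pub include: HashSet<String>,
    /// Names from --ligand-exclude, applied on top of the default denylist.
    pub exclude: HashSet<String>,
}

/// Trim and upper-case residue names so comparisons are case-insensitive.
fn normalize(names: &[&str]) -> HashSet<String> {
    names.iter().map(|n| n.trim().to_ascii_uppercase()).collect()
}

impl LigandFilter {
    pub fn new(include: &[&str], exclude: &[&str]) -> Self {
        Self {
            include: normalize(include),
            exclude: normalize(exclude),
        }
    }

    /// Returns true if a hetero residue with this name counts as a ligand.
    pub fn counts_as_ligand(&self, res_name: &str) -> bool {
        let name = res_name.trim().to_ascii_uppercase();
        if !self.include.is_empty() {
            // Allowlist mode: only the listed hetero residues count.
            return self.include.contains(&name);
        }
        // Denylist mode: reject defaults and user-supplied exclusions.
        !DEFAULT_DENYLIST.contains(&name.as_str()) && !self.exclude.contains(&name)
    }
}
```

With this sketch, `LigandFilter::new(&[], &["SO4", "PO4"])` corresponds to `--ligand-exclude SO4,PO4`, and `LigandFilter::new(&["HEM", "NAG"], &[])` to `--ligand-include HEM,NAG`.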

Optional interface flags (also accepted by the interfaces command):

  • --interface-distance <Å> — chain-pair contact cutoff (default 8.0)
  • --interface-area-distance <Å> — interface patch area cutoff (default 5.0)
  • --interface-sc-distance <Å> — shape complementarity surface cutoff (default 5.0)

Optional quality flags (also accepted by the quality command):

  • --clash-overlap <Å> — VdW overlap threshold for steric clashes (default 0.4)
  • --all-residues — emit every evaluated residue, not just problems (default: problems only)
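For orientation, a VdW overlap threshold is commonly applied as shown below: two atoms clash when the sum of their van der Waals radii exceeds their distance by at least the threshold. This is a sketch under that assumption. The function names are hypothetical, the inclusive comparison is a guess, and structscope may apply further rules (for example, skipping bonded atoms) that are not shown.

```rust
/// Overlap in Å between two atoms' VdW spheres.
/// Positive values mean the spheres interpenetrate.
pub fn vdw_overlap(distance: f64, radius_a: f64, radius_b: f64) -> f64 {
    radius_a + radius_b - distance
}

/// Returns true when the overlap reaches the threshold passed via
/// --clash-overlap (0.4 Å by default according to the flag description).
/// Whether the real comparison is inclusive is an assumption.
pub fn is_clash(distance: f64, radius_a: f64, radius_b: f64, clash_overlap: f64) -> bool {
    vdw_overlap(distance, radius_a, radius_b) >= clash_overlap
}
```

For two atoms with a 1.7 Å radius each, a separation of 2.9 Å gives an overlap of about 0.5 Å and counts as a clash at the default threshold, while 3.2 Å (overlap about 0.2 Å) does not.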

Emitted features include counts (atoms, residues, chains, ligands), graph metrics (contacts, density, clustering), geometry (radius of gyration, SASA), secondary-structure composition, typed-interaction counts, protein–ligand interaction counts, binding-site residue count, ligand SASA, buried/exposed residue counts, protein–protein interface summaries (pair count, total and max BSA/area/SC, largest-interface chain IDs), and structure quality summaries (Ramachandran favored/allowed/outlier counts, clash pairs, missing backbone residues).

ligands

Emit one JSON record per filtered ligand (SASA, binding-site residues, interaction counts).

structscope ligands 1nkd.cif.gz
structscope ligands complex.cif.gz --out ligands.jsonl
structscope ligands complex.cif.gz --ligand-include HEM,NAG --binding-distance 4.0

interfaces

Emit one JSON record per contacting chain pair (BSA, interface patch area, shape complementarity, contact and residue counts).

structscope interfaces 1nkd.cif.gz
structscope interfaces dimer.pdb --out interfaces.jsonl
structscope interfaces dimer.pdb --interface-distance 8.0 --interface-area-distance 5.0

quality

Emit per-residue structure quality (Ramachandran classification, steric clashes, missing backbone atoms) as JSONL. By default only problem residues are emitted; use --all-residues for the full table.

structscope quality 1nkd.cif.gz
structscope quality model.pdb --out quality.jsonl
structscope quality model.pdb --clash-overlap 0.4 --all-residues

residues

Emit one JSON record per residue (SASA, RSA, secondary structure, phi/psi/omega) as JSONL, to stdout or a file.

structscope residues 1nkd.cif.gz
structscope residues 1nkd.cif.gz --out residues.jsonl

compare

Compare two or more structures: pairwise RMSD matrix (CA atoms with sequence alignment by default) and numeric feature deltas against a chosen reference. Accepts a single file or a directory of structures (minimum two inputs).

structscope compare ./models
structscope compare ./models --reference ref.pdb
structscope compare ./models --auto-reference
structscope compare ./models --reference-by min:ramachandran_outlier_count
structscope compare ./models --delta-fields sasa_total,interface_bsa_total --out ./compare-out
structscope compare ./models --out ./compare-out --format csv
structscope compare ./models --atoms backbone --local --reference '#0'

Without --out, prints a single JSON object to stdout (matrix, deltas, reference metadata, and any parse failures). With --out:

  • --format json (default): matrix.json and deltas.jsonl
  • --format csv: matrix.csv and deltas.csv

Reference selection (first match wins):

  • --reference PATH|#INDEX — explicit path, basename, structure ID, or input index
  • --reference-by min:field or max:field — pick by a numeric featurize field
  • --auto-reference — lowest Ramachandran outliers, then clashes, then missing backbone
  • else the first successfully parsed input
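The "first match wins" order above can be sketched as a single function. Everything here is illustrative: `Candidate`, `field_value`, and `select_reference` are hypothetical names, only three featurize fields are wired up, and the explicit `--reference` form is simplified to an ID match or a `#INDEX` lookup (path and basename matching are omitted).

```rust
/// Hypothetical per-structure summary used for reference selection.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub id: String,
    pub ramachandran_outliers: u32,
    pub clash_pairs: u32,
    pub missing_backbone: u32,
}

/// Look up a numeric field by its featurize name (subset only, for illustration).
fn field_value(c: &Candidate, field: &str) -> Option<f64> {
    match field {
        "ramachandran_outlier_count" => Some(c.ramachandran_outliers as f64),
        "clash_pair_count" => Some(c.clash_pairs as f64),
        "missing_backbone_residue_count" => Some(c.missing_backbone as f64),
        _ => None,
    }
}

/// Returns the index of the chosen reference, or None if nothing matches.
pub fn select_reference(
    candidates: &[Candidate],
    explicit: Option<&str>,     // value of --reference
    reference_by: Option<&str>, // value of --reference-by, e.g. "min:clash_pair_count"
    auto_reference: bool,       // --auto-reference
) -> Option<usize> {
    // 1. --reference: "#N" is an input index, anything else is matched by ID.
    if let Some(spec) = explicit {
        return match spec.strip_prefix('#') {
            Some(n) => n.parse::<usize>().ok().filter(|&i| i < candidates.len()),
            None => candidates.iter().position(|c| c.id == spec),
        };
    }
    // 2. --reference-by min:field or max:field.
    if let Some(spec) = reference_by {
        let (mode, field) = spec.split_once(':')?;
        let mut scored: Vec<(usize, f64)> = Vec::new();
        for (i, c) in candidates.iter().enumerate() {
            scored.push((i, field_value(c, field)?));
        }
        let best = match mode {
            "min" => scored.iter().min_by(|a, b| a.1.total_cmp(&b.1)),
            "max" => scored.iter().max_by(|a, b| a.1.total_cmp(&b.1)),
            _ => None,
        };
        return best.map(|&(i, _)| i);
    }
    // 3. --auto-reference: fewest outliers, then clashes, then missing backbone.
    if auto_reference {
        return candidates
            .iter()
            .enumerate()
            .min_by_key(|(_, c)| (c.ramachandran_outliers, c.clash_pairs, c.missing_backbone))
            .map(|(i, _)| i);
    }
    // 4. Fallback: the first successfully parsed input.
    if candidates.is_empty() { None } else { Some(0) }
}
```

Comparing the three quality counts as a tuple gives the lexicographic tie-breaking that the `--auto-reference` description implies.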

RMSD correspondence flags (same semantics as rmsd):

  • --atoms ca|backbone|all — atom selection (default ca)
  • --align — sequence-based residue correspondence (default on)
  • --local — Smith-Waterman local alignment for partial overlaps

Optional flags shared with featurize:

  • --delta-fields FIELD[,FIELD...] — limit delta columns (default: all numeric features)
  • ligand: --ligand-exclude, --ligand-include, --binding-distance
  • interface: --interface-distance, --interface-area-distance, --interface-sc-distance
  • quality: --clash-overlap

rmsd

Optimal-superposition RMSD between two structures.

# Equal-length structures, matched by atom order:
structscope rmsd ref.pdb mobile.pdb --atoms ca       # or: backbone, all

# Different-length but related structures (sequence-aligned CA atoms):
structscope rmsd ref.pdb mobile.pdb --align

# Partial or domain-level overlap (local Smith-Waterman alignment):
structscope rmsd ref.pdb fragment.pdb --local

Without --align, the two selections must have equal atom counts; the error message hints at --align when they differ.
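The equal-count requirement follows from how RMSD is defined over paired atoms. The sketch below computes RMSD for two coordinate lists that are assumed to be already superposed; the optimal-superposition step itself is omitted. The `Point` alias, the `rmsd` function, and the error text are illustrative and not taken from structscope.

```rust
/// A 3D coordinate in Å.
pub type Point = [f64; 3];

/// RMSD over paired coordinates. Assumes `mobile` has already been
/// superposed onto `reference`; the fitting step is not shown here.
pub fn rmsd(reference: &[Point], mobile: &[Point]) -> Result<f64, String> {
    if reference.len() != mobile.len() {
        // Illustrative message; the real CLI wording may differ.
        return Err(format!(
            "selections differ in size ({} vs {} atoms); consider --align",
            reference.len(),
            mobile.len()
        ));
    }
    if reference.is_empty() {
        return Err("empty selection".to_string());
    }
    // Sum of squared distances between atoms paired by position.
    let sum_sq: f64 = reference
        .iter()
        .zip(mobile)
        .map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>())
        .sum();
    Ok((sum_sq / reference.len() as f64).sqrt())
}
```

Because atoms are paired purely by position in the slice, unequal lengths leave some atoms without a partner, which is why a sequence-based correspondence is needed for structures of different lengths.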

graph

Export a residue, atom, or interface contact graph. Supported formats are graphml (default), gml, and json. Chemical and geometric interactions (disulfides, salt bridges, hydrogen bonds, cation-pi, pi-pi, and hydrophobic contacts) are automatically resolved and embedded as prioritized edges in residue and interface graphs.

structscope graph 1nkd.cif.gz --graph-type residue --format gml
structscope graph complex.cif.gz --graph-type interface --format json
structscope graph 1nkd.cif.gz --graph-type residue --out graph.graphml

query

Run SQL over a feature Parquet file or featurize output directory. Requires a build with the duckdb feature.

cargo build -p structscope-cli --features duckdb
structscope query ./out --sql "SELECT structure_id, sasa_total FROM features"

provenance

Inspect a provenance SQLite database produced by featurize --provenance.

structscope provenance ./out/run.sqlite

Architecture

The first implementation slice is CLI-first and keeps crate boundaries aligned with the long-term design:

  • structscope-core: data model and parsers
  • structscope-graphs: graph builders and export
  • structscope-features: scientific features
  • structscope-store: persisted outputs and query adapter boundary
  • structscope-events: structured events
  • structscope-provenance: optional lineage capture
  • structscope-agent: optional eBPF integration boundary
  • structscope-cli: orchestration

The scientific path is independent of provenance and eBPF so batch processing can remain portable.

Interactions and Contact Graphs

structscope detects typed chemical and geometric interactions (disulfides, salt bridges, hydrogen bonds, cation-pi, pi-pi stacking, and hydrophobic contacts) directly from coordinates and integrates them into its contact graph representations.

1. Interaction Detection Logic

All interaction detectors reside in structscope-features under interactions.rs and are computed based on atom names and distance/angle rules:

  • Disulfide Bonds: Formed between two cysteine sidechain sulfur atoms (CYS SG) within 2.5 Å.
  • Salt Bridges: Formed between sidechain acidic oxygens (on ASP/GLU) and sidechain basic nitrogens (on LYS/ARG/HIS) within 4.0 Å.
  • Hydrogen Bonds (Polar Contacts): Formed between a nitrogen/oxygen donor-acceptor pair on different residues at a distance of 2.4 to 3.5 Å.
  • Cation-Pi Interactions: Formed between the centroid of an aromatic ring (PHE/TYR/TRP) and a basic sidechain nitrogen within 6.0 Å.
  • Pi-Pi Stacking: Formed between the centroids of two aromatic rings within 5.5 Å:
    • Parallel Stacking: Ring normal angle < 30° or > 150°.
    • Perpendicular Stacking: Ring normal angle between 60° and 90°.
  • Hydrophobic Contacts: Formed between sidechain aliphatic/aromatic carbons on different residues within 4.5 Å.
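Two of the rules above can be sketched in a few lines: the disulfide distance check and the pi-pi classification by ring-normal angle. This is an illustration of the stated cutoffs only. The function names are hypothetical, the inclusive boundaries are assumptions, and ring centroids and normals are taken as given rather than derived from atom coordinates.

```rust
/// A 3D coordinate or vector in Å.
pub type Point = [f64; 3];

/// Euclidean distance between two points.
pub fn distance(a: Point, b: Point) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Angle between two ring normals in degrees, in the range [0, 180].
pub fn normal_angle_deg(n1: Point, n2: Point) -> f64 {
    let dot: f64 = n1.iter().zip(n2.iter()).map(|(x, y)| x * y).sum();
    let len1 = distance(n1, [0.0; 3]);
    let len2 = distance(n2, [0.0; 3]);
    // Clamp guards against rounding slightly outside acos's domain.
    (dot / (len1 * len2)).clamp(-1.0, 1.0).acos().to_degrees()
}

/// Disulfide rule: two CYS SG atoms within 2.5 Å.
pub fn is_disulfide(sg_a: Point, sg_b: Point) -> bool {
    distance(sg_a, sg_b) <= 2.5
}

/// Pi-pi rule: centroids within 5.5 Å, then classify by normal angle.
/// Returns the edge kind name, or None when no stacking rule applies.
pub fn classify_pi_pi(
    centroid_a: Point,
    normal_a: Point,
    centroid_b: Point,
    normal_b: Point,
) -> Option<&'static str> {
    if distance(centroid_a, centroid_b) > 5.5 {
        return None;
    }
    let angle = normal_angle_deg(normal_a, normal_b);
    if angle < 30.0 || angle > 150.0 {
        Some("pi_pi_parallel")
    } else if (60.0..=90.0).contains(&angle) {
        Some("pi_pi_perpendicular")
    } else {
        None
    }
}
```

Treating angles above 150° as parallel accounts for ring normals that point in opposite directions, since the sign of a ring normal is arbitrary.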

2. Contact Graph Representation & Prioritization

Contact graphs are built in structscope-graphs using petgraph. When generating residue or interface contact graphs, the builder can ingest chemical interactions and merge them as prioritized edges.

Decoupled Data Flow

To avoid a circular dependency between structscope-features (which needs graph definitions for feature metrics) and structscope-graphs, structscope-graphs defines a simple intermediate struct:

pub struct ChemicalInteraction {
    pub kind: String,
    pub res_id_a: String,
    pub res_id_b: String,
    pub distance: f64,
}

The CLI orchestrator parses the structures, computes features/interactions, maps them to ChemicalInteraction using residue ID lookups, and passes them to the graph builders.
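The mapping step might look like the sketch below. `ChemicalInteraction` is copied from the definition above; `DetectedInteraction` and `to_chemical_interactions` are hypothetical stand-ins for whatever structscope-features and the CLI actually use, and the choice to silently drop interactions with unknown residue indices is an assumption.

```rust
use std::collections::HashMap;

/// Mirror of the intermediate struct defined in structscope-graphs.
#[derive(Debug, Clone, PartialEq)]
pub struct ChemicalInteraction {
    pub kind: String,
    pub res_id_a: String,
    pub res_id_b: String,
    pub distance: f64,
}

/// Hypothetical features-side record that refers to residues by index.
pub struct DetectedInteraction {
    pub kind: &'static str,
    pub residue_a: usize,
    pub residue_b: usize,
    pub distance: f64,
}

/// Translate index-based records into ID-based ones via a lookup table.
/// Records whose residues are missing from the table are skipped (assumption).
pub fn to_chemical_interactions(
    detected: &[DetectedInteraction],
    residue_ids: &HashMap<usize, String>,
) -> Vec<ChemicalInteraction> {
    detected
        .iter()
        .filter_map(|d| {
            let a = residue_ids.get(&d.residue_a)?;
            let b = residue_ids.get(&d.residue_b)?;
            Some(ChemicalInteraction {
                kind: d.kind.to_string(),
                res_id_a: a.clone(),
                res_id_b: b.clone(),
                distance: d.distance,
            })
        })
        .collect()
}
```

Because the graph crate only sees strings and a distance, it needs no knowledge of the feature crate's types, which is what keeps the dependency one-directional.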

Prioritization Rules

When multiple potential interaction types overlap between two residues, they are merged into a single edge according to strict precedence rules.

  • Backbone Covalent Adjacency (covalent_adjacency): Always preserved; never overwritten by a chemical contact.
  • Other Contacts: Overwritten if a higher-priority interaction type is found, or if an interaction of the same type has a shorter distance:
Priority      Edge Kind
7 (Highest)   disulfide
6             salt_bridge
5             hydrogen_bond
4             cation_pi
3             pi_pi_parallel / pi_pi_perpendicular
2             hydrophobic
1 (Lowest)    distance_contact / interface_contact
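The precedence rules can be sketched as a merge function over an edge map. This is an illustration, not the builder's real code: `Edge`, `priority`, and `merge_edge` are hypothetical names, a plain `HashMap` stands in for the petgraph structure, and giving `covalent_adjacency` a rank above the table is an assumption used to express "never overwritten".

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Priority of an edge kind, following the table above.
pub fn priority(kind: &str) -> u8 {
    match kind {
        // Assumption: ranked above every chemical contact so it always wins.
        "covalent_adjacency" => 8,
        "disulfide" => 7,
        "salt_bridge" => 6,
        "hydrogen_bond" => 5,
        "cation_pi" => 4,
        "pi_pi_parallel" | "pi_pi_perpendicular" => 3,
        "hydrophobic" => 2,
        "distance_contact" | "interface_contact" => 1,
        _ => 0,
    }
}

/// Hypothetical edge payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Edge {
    pub kind: String,
    pub distance: f64,
}

/// Insert `candidate` between nodes `a` and `b`, keeping one edge per pair.
pub fn merge_edge(edges: &mut HashMap<(usize, usize), Edge>, a: usize, b: usize, candidate: Edge) {
    // Undirected graph: normalise the key so (a, b) and (b, a) coincide.
    let key = if a <= b { (a, b) } else { (b, a) };
    match edges.entry(key) {
        Entry::Vacant(slot) => {
            slot.insert(candidate);
        }
        Entry::Occupied(mut slot) => {
            let existing = slot.get();
            if existing.kind == "covalent_adjacency" {
                return; // backbone adjacency is always preserved
            }
            let higher = priority(&candidate.kind) > priority(&existing.kind);
            let same_but_closer =
                candidate.kind == existing.kind && candidate.distance < existing.distance;
            if higher || same_but_closer {
                slot.insert(candidate);
            }
        }
    }
}
```

Calling `merge_edge` first with a `distance_contact` and then with a `hydrogen_bond` for the same residue pair leaves a single `hydrogen_bond` edge; a later `hydrophobic` candidate for that pair is ignored.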

3. Export Formats

structscope supports three serialization formats for contact graphs:

GraphML

Standard XML representation of graphs.

  • Nodes: Contain node metadata (e.g., residue_name, seq_number).
  • Edges: Contain edge details (kind, distance).

GML (Graph Modeling Language)

An easy-to-parse ASCII representation of graphs:

graph [
  directed 0
  node [
    id 0
    label "1nkd:A:1:_"
    residue_name "MET"
    seq_number 1
  ]
  edge [
    source 0
    target 1
    distance 5.291
    kind "covalent_adjacency"
  ]
]
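To show how little is needed to produce this layout, here is a minimal writer that emits the same keys as the example above. It is a sketch, not structscope's exporter: `GmlNode`, `GmlEdge`, and `to_gml` are hypothetical names, the three-decimal distance format is inferred from the example, and quote escaping inside string values is not handled.

```rust
use std::fmt::Write;

/// Hypothetical node record; fields follow the GML example above.
pub struct GmlNode {
    pub id: usize,
    pub label: String,
    pub residue_name: String,
    pub seq_number: i64,
}

/// Hypothetical edge record; fields follow the GML example above.
pub struct GmlEdge {
    pub source: usize,
    pub target: usize,
    pub distance: f64,
    pub kind: String,
}

/// Serialise nodes and edges as an undirected GML graph.
pub fn to_gml(nodes: &[GmlNode], edges: &[GmlEdge]) -> String {
    let mut out = String::from("graph [\n  directed 0\n");
    for n in nodes {
        // Writing into a String cannot fail, so the Result is discarded.
        let _ = write!(
            out,
            "  node [\n    id {}\n    label \"{}\"\n    residue_name \"{}\"\n    seq_number {}\n  ]\n",
            n.id, n.label, n.residue_name, n.seq_number
        );
    }
    for e in edges {
        let _ = write!(
            out,
            "  edge [\n    source {}\n    target {}\n    distance {:.3}\n    kind \"{}\"\n  ]\n",
            e.source, e.target, e.distance, e.kind
        );
    }
    out.push_str("]\n");
    out
}
```

Passing the single MET node and the covalent_adjacency edge from the example reproduces the listing shown above.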

JSON

A node-link JSON format compatible with web-based visualization tools (e.g., D3.js, cytoscape.js):

{
  "nodes": [
    {
      "id": 0,
      "residue_id": "1nkd:A:1:_",
      "chain_id": "1nkd:A",
      "residue_name": "MET",
      "seq_number": 1
    }
  ],
  "links": [
    {
      "source": 0,
      "target": 1,
      "distance": 5.291,
      "kind": "covalent_adjacency"
    }
  ]
}

Citation

If you use structscope in academic work, please cite the repository. Because the software is still a work in progress and releases may differ, update the version field to match the tag you used.

@software{structscope2026,
  author    = {Amirabadi, Danial Gharaie},
  title     = {{structscope}: Rust-native structural bioinformatics toolkit},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/Danialgharaie/structscope},
  version   = {0.4.1},
  note      = {Work in progress. APIs and output schemas may change between releases.}
}

The same entry is checked in at the repository root as CITATION.bib.

Changelog

All notable changes to this project are documented here. This project follows Keep a Changelog conventions.

[Unreleased]

[0.4.1] - 2026-06-11

Added

  • GitHub Release Linux x86_64 binary (structscope-*-x86_64-unknown-linux-gnu.tar.gz) on version tags.
  • crates.io publishing for workspace crates (cargo install structscope-cli).

[0.4.0] - 2026-06-11

Added

  • Multi-structure compare: pairwise RMSD matrix (CA atoms with sequence alignment by default) and numeric feature deltas against a chosen reference.
  • CLI command:
    • compare <input> — compare two or more structures from a file or directory; prints JSON to stdout or writes matrix.json + deltas.jsonl (or matrix.csv + deltas.csv with --format csv) to --out.
  • Reference selection (first match wins): --reference, --reference-by min:field|max:field, --auto-reference (lowest Ramachandran outliers, clashes, missing backbone), else first input.
  • --delta-fields to restrict feature delta columns; --atoms, --align, and --local for RMSD correspondence (same semantics as rmsd).
  • Ligand, interface, and quality flags shared with featurize.

[0.3.1] - 2026-06-11

Added

  • Golden fixtures and regression tests for structure quality metrics (tests/fixtures/quality/, quality_golden.rs).

[0.3.0] - 2026-06-11

Added

  • Structure quality metrics: MolProbity-style Ramachandran classification (favored / allowed / outlier, with Gly/Pro regions), steric clash detection (heavy atoms, configurable VdW overlap), and missing backbone atom checks (N, CA, C, O) over canonical and common variant residues.
  • CLI command:
    • quality <input> — per-residue quality records as JSONL (problems only by default; --all-residues for full output).
  • Structure-level quality aggregates in featurize: quality_residue_count, Ramachandran counts, clash_pair_count, and missing_backbone_residue_count.
  • --clash-overlap flag on featurize and quality (default 0.4 Å).

[0.2.0] - 2026-06-11

Added

  • Parallel execution support for structscope featurize using Rayon, allowing high-performance concurrent parsing and feature extraction controlled via a new --jobs / -j CLI flag.
  • B-factor (temperature factor) support: extended core Atom model, parsed B-factors from PDB, mmCIF, and BinaryCIF, and computed structure-level (bfactor_mean, bfactor_std, bfactor_min, bfactor_max) and residue-level statistics.
  • Advanced geometric interaction detectors: cation-pi, parallel and perpendicular pi-pi stacking, and hydrophobic carbon-carbon contacts.
  • Enhanced contact graphs: integrated chemical/geometric interactions as prioritized edges overlaying standard distance contacts.
  • New contact graph formats: added custom exporters for GML and node-link JSON formats, with automatic file extension matching.
  • Thread-safe background event and provenance logging architecture using a message-passing channel (mpsc) to safely write events to SQLite and JSONL from worker threads.
  • BinaryCIF (.bcif / .bcif.gz) parsing: a hand-written MessagePack-based decoder covering all seven column encodings (ByteArray, IntegerPacking, RunLength, Delta, FixedPoint, IntervalQuantization, StringArray) and value masks. structscope now ingests PDB, mmCIF, and BinaryCIF.
  • Structural primitives computed directly from coordinates:
    • Solvent accessible surface area (Shrake-Rupley), per-atom and total.
    • Relative solvent accessibility (RSA), per residue, normalised by residue-type maxima (Tien et al. 2013).
    • DSSP-style secondary structure (Kabsch-Sander hydrogen bonds).
    • Backbone dihedrals (phi/psi/omega).
    • Optimal superposition and RMSD (quaternion/Kabsch).
    • Typed interactions: disulfides, salt bridges, hydrogen bonds.
  • Sequence alignment primitive (Needleman-Wunsch) for residue correspondence.
  • CLI commands:
    • rmsd <reference> <mobile> — optimal-superposition RMSD over matched atoms (--atoms ca|backbone|all), with --align for sequence-based correspondence between structures of different lengths.
    • residues <input> — per-residue features (SASA, RSA, secondary structure, dihedrals) as JSONL.
    • ligands <input> — per-ligand features (SASA, binding-site residues, protein–ligand interaction counts) as JSONL.
  • Protein–ligand features: configurable ligand filter (default excludes water and common ions), structure-level protein–ligand interaction counts, binding-site residue count, and ligand SASA.
  • Protein–protein interface metrics: buried surface area (BSA), interface patch area, and Lawrence–Colman shape complementarity per contacting chain pair.
  • CLI command:
    • interfaces <input> — per chain-pair interface features (BSA, area, shape complementarity, contact and residue counts) as JSONL.
  • Structure-level interface aggregates and largest-interface fields in featurize: interface_pair_count, total and max BSA/area/SC, and largest-interface chain IDs.
  • Structure-level features: sasa_total, helix/strand/coil_residue_count, disulfide/salt_bridge/hydrogen_bond_count, buried/exposed_residue_count, ligand_sasa_total, ligand_sasa_mean, binding_site_residue_count, protein_ligand_hbond_count, protein_ligand_salt_bridge_count, protein_ligand_hydrophobic_count, protein_ligand_contact_count.

Changed

  • Breaking: ligand_count in featurize now uses the filtered ligand definition (hetero residues minus the default denylist and CLI overrides), not the raw hetero residue count.

Notes

  • All primitives emit raw quantities; downstream interpretation is left to the user. Per-residue and per-atom detail is exposed as library functions.