stopfree

A high-performance coding-potential baseline in Rust & Python

Bioinformatics
Rust
Python
Coding Potential
Open Source
Author

Andrew Green

Published

June 16, 2026

How do we tell if a transcript actually codes for a protein? Or rather, how do we prove it doesn’t?

There are loads of complex tools (using SVMs, deep neural networks, and sequence alignments) to predict the coding potential of transcripts. In 2025, the paper “Evaluating computational tools for protein-coding sequence detection: Are they up to the task?” (Champion et al., RNA) evaluated nine of the most widely used tools, both for how well they predict protein-coding potential and for how easy they are to set up.

Only three out of the nine tools significantly outperformed stopFree, a simple baseline metric the authors invented. Because it is so simple, this baseline is also exceptionally robust to common sequence errors like non-AUG starts, truncations, or mis-splicing.

This was the motivation behind my package, stopfree—a Rust implementation of this coding potential baseline, exposed as a Python package using maturin.

You can check out the source code on GitHub at afg1/stopfree or install from PyPI:

pip install stopfree

TL;DR
  • What is it? A Python library (stopfree) written as a Rust extension that measures the longest stop-codon-free region across all six reading frames of a nucleotide sequence.
  • The Inspiration: The 2025 benchmarking study by Champion et al. demonstrating that complex machine-learning-based coding potential tools struggle to beat a simple “longest stop-free run” baseline.
  • Why Rust? Speed! Scanning millions of transcripts (e.g., within the RNAcentral release pipeline) needs to run as fast as possible. Writing the core in Rust and exposing it via PyO3/maturin makes it blazingly fast, and hopefully safe.
  • Practical Application: RNAcentral uses it as a quality assurance (QA) filter to flag or exclude protein-coding contamination in non-coding RNA datasets.
  • Installation: Simply run pip install stopfree.

The Concept: Longest Stop-Free Runs

The core idea of the stopFree algorithm is simple:

  1. Take a nucleotide sequence.
  2. Translate or scan the sequence across all six possible reading frames (three forward, three reverse).
  3. For each frame, find the longest contiguous block of codons that does not contain any in-frame stop codons (UAA, UAG, or UGA).
  4. Return the maximum of these lengths as the baseline coding potential.
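
The four steps above can be sketched in a few lines of pure Python. This is an illustration only, not the Rust code inside stopfree: the function names are hypothetical, and it assumes that run lengths are counted in whole codons, that trailing partial codons are ignored, and that U is treated as T.

```python
# Illustrative sketch; function names are hypothetical, not the stopfree API.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_run_in_frame(seq: str, offset: int) -> int:
    """Longest run of consecutive non-stop codons in one reading frame."""
    best = current = 0
    # Step through whole codons only; a trailing partial codon is ignored.
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0
        else:
            current += 1
            best = max(best, current)
    return best


def longest_stop_free_run(seq: str) -> int:
    """Maximum stop-free run (in codons) over all six reading frames."""
    dna = seq.upper().replace("U", "T")  # assumption: RNA input is accepted
    strands = (dna, reverse_complement(dna))
    return max(longest_run_in_frame(s, off) for s in strands for off in range(3))


print(longest_stop_free_run("ATGAAATAGCCC"))  # 4 (reverse strand, frame 0)
```

In the toy sequence the forward frame 0 hits TAG after two codons, but the reverse strand reads GGG CTA TTT CAT with no stop at all, which is why all six frames have to be checked.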

A true protein-coding transcript should have a long stretch without stop codons (since it needs to be translated into a functional protein). In contrast, non-coding RNAs and random sequences should hit stop codons quite frequently by chance. Evolution should select strongly against stop codons inside a real coding region, so this signal ought to be quite reliable.

Figure 1: Visualizing the six reading frames and the longest stop-codon-free translation region. stopfree calculates these runs and their statistical probabilities.

The Mathematics: Probability Calculation

In the benchmarking paper, the authors just use the stop-free run length as their feature to determine protein coding vs not, and set a threshold based on that to achieve the best AUC.

However, I decided to try and be clever, and calculate the probability of observing a stop-codon-free run of length \(k\) codons purely by chance.

To do this, assume an independent and identically distributed (i.i.d.) nucleotide model. Because stop codons (UAA, UAG, UGA; written TAA, TAG, TGA in the DNA alphabet used below) are AT-rich, we must correct for the GC content of the individual sequence.

Let \(p_{\text{GC}}\) be the GC fraction of the sequence. Assuming the remaining nucleotides are split evenly between A and T, the probability of drawing each nucleotide is:

\[P(G) = P(C) = \frac{p_{\text{GC}}}{2}\] \[P(A) = P(T) = \frac{1 - p_{\text{GC}}}{2}\]

Under this model, the probability of encountering a stop codon (\(p_{\text{stop}}\)) in any given codon position is the sum of the probabilities of drawing TAA, TAG, or TGA:

\[p_{\text{stop}} = P(\text{TAA}) + P(\text{TAG}) + P(\text{TGA})\]

\[p_{\text{stop}} = P(T) P(A) P(A) + P(T) P(A) P(G) + P(T) P(G) P(A)\]

\[p_{\text{stop}} = \left(\frac{1 - p_{\text{GC}}}{2}\right)^3 + 2 \cdot \left(\frac{1 - p_{\text{GC}}}{2}\right)^2 \left(\frac{p_{\text{GC}}}{2}\right)\]

Factoring this out yields:

\[p_{\text{stop}} = \frac{(1 - p_{\text{GC}})^2}{8} \cdot \left[ (1 - p_{\text{GC}}) + 2 p_{\text{GC}} \right] = \frac{(1 - p_{\text{GC}})^2 (1 + p_{\text{GC}})}{8}\]

So the probability of observing a stop-codon-free run of \(k\) codons (\(p_{\text{nostop}}\)) is:

\[p_{\text{nostop}} = (1 - p_{\text{stop}})^k\]

If a sequence contains a run length \(k\) that yields a very low \(p_{\text{nostop}}\) (e.g., \(p < 0.01\)), we can reject the null hypothesis that this sequence is random/non-coding, indicating high coding potential.
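
As a quick sanity check of the formulas above, here is a minimal Python sketch. The function names are hypothetical and are not part of the stopfree API; they simply evaluate the closed-form expressions for \(p_{\text{stop}}\) and \(p_{\text{nostop}}\).

```python
import math


def stop_probability(gc: float) -> float:
    """p_stop = (1 - gc)^2 * (1 + gc) / 8 under the i.i.d. model."""
    at = 1.0 - gc
    return at * at * (1.0 + gc) / 8.0


def run_probability(k: int, gc: float) -> float:
    """p_nostop = (1 - p_stop)^k for a run of k codons."""
    return (1.0 - stop_probability(gc)) ** k


def min_significant_run(gc: float, alpha: float = 0.01) -> int:
    """Smallest k for which p_nostop drops to alpha or below."""
    p_stop = stop_probability(gc)
    if p_stop == 0.0:
        # A pure-GC sequence can never contain a stop under this model.
        raise ValueError("p_stop is zero; no run length is ever significant")
    return math.ceil(math.log(alpha) / math.log(1.0 - p_stop))


print(stop_probability(0.5))     # 0.046875, i.e. 3/64
print(min_significant_run(0.5))  # 96
```

With \(p_{\text{GC}} = 0.5\) the formula reduces to 3/64, the familiar chance that a uniformly random codon is a stop, and a run has to reach 96 codons before \(p_{\text{nostop}}\) falls below 0.01. GC-rich sequences have a smaller \(p_{\text{stop}}\), so they need longer runs to reach the same threshold.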


Rust

The authors of the benchmark paper provided a (well hidden) implementation of this algorithm in Python, so I could have used that. You could argue I should have, since technically it has been published, but the idea is so simple that I felt confident I could implement it myself.

Since joining RNAcentral ~4 years ago, I’ve had to get familiar with Rust because it is part of the release pipeline. I enjoyed this, because I really enjoy programming anyway, and I love an excuse to learn a new language/tool. Rust sits between what I used to do (C++) and what I do now (Python). Once you get used to the borrow checker and things, it is quite fun to write.

I decided to implement my stopfree package in Rust because I like Rust. But also, the maturin package makes it stupid easy to wrap a Rust library into Python, which makes it vastly simpler to interoperate with everything else (e.g. loading a FASTA file, or pulling data from a database). The kicker was the speed, which is extreme: partly due to Rust being compiled, and partly due to being able to swap a single .iter() for .par_iter() in the code and instantly parallelize across all cores without worrying about synchronisation. As someone who previously worked with OpenMP, and has done stuff with raw C++11 std::thread, that was so nice.

This makes it sound like I’m good at Rust. I don’t think I am particularly, and while I did most of the work on this library I did have to let Claude take the wheel a bit when the borrow checker didn’t see eye-to-eye with me. The nice thing about this kind of development though is that you can create a test suite that is pretty unambiguous, and then develop against that, so I’m confident the LLM didn’t screw anything up. I’ll write elsewhere about my philosophy on getting the LLM to produce useful code while also helping me to understand WTF is going on.


Installation & Usage

You can install stopfree via pip:

pip install stopfree

To run the package, format your FASTA sequences into a list of (id, sequence) tuples:

import stopfree
from fasta_reader import read_fasta # Or any other FASTA parser

# 1. Read sequences from a FASTA file
all_fasta = []
for item in read_fasta("my_transcripts.fasta"):
    all_fasta.append((item.defline, item.sequence))

# 2. Calculate the longest stop-free run lengths (in codons)
run_lengths = stopfree.calculate_stop_free_runs_with_ids(all_fasta)

# 3. Calculate individual sequence GC contents
gc_contents = stopfree.calculate_gc_content(all_fasta)

# 4. Compute the probability of observing these runs by chance
run_probabilities = stopfree.calculate_run_probability(run_lengths, gc_contents)

# 5. Review results
for seq_id, prob in run_probabilities.items():
    if prob < 0.01:
        print(f"Sequence {seq_id} has significant coding potential (p = {prob:.4e})")

Future Work & Limitations

There’s one massive limitation in the current implementation: it is length-confounded. Right now the tool takes the maximum over N overlapping windows, and the probability that at least one of them is long is larger than the probability that any specific window is long. That probability grows with N, so longer RNAs get lower p-values just because they are long.
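
A small simulation makes the confound visible. This is a simplified sketch, not the package's behaviour: it assumes a single reading frame, i.i.d. codons with a stop probability of 3/64 (uniform base composition), and hypothetical function names. It asks how often a purely random sequence contains a run of at least 96 codons, the length at which the uncorrected p-value dips below 0.01 for 50% GC.

```python
import random

P_STOP = 3 / 64  # assumption: uniform base composition (GC = 0.5)


def longest_run(n_codons: int, rng: random.Random) -> int:
    """Longest stop-free run among n_codons random i.i.d. codons."""
    best = current = 0
    for _ in range(n_codons):
        if rng.random() < P_STOP:
            current = 0  # hit a stop codon, the run ends
        else:
            current += 1
            best = max(best, current)
    return best


def fraction_flagged(n_codons: int, k: int = 96, trials: int = 1000,
                     seed: int = 0) -> float:
    """Fraction of random sequences whose longest run is at least k codons."""
    rng = random.Random(seed)
    hits = sum(longest_run(n_codons, rng) >= k for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (100, 1000):
        print(f"{n} codons: {fraction_flagged(n):.3f} flagged at p < 0.01")
```

Under these assumptions the 1000-codon random sequences are flagged far more often than the 100-codon ones, even though neither set contains any coding signal at all.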

The fix for this is to properly model the probability of observing the stop-free gap with a geometric distribution, but I’m still getting my head around this, so for now it isn’t included. I basically need to go and write a page of algebra and look at some statistics books to make sure I know what I’m doing with this! It will be added Soon TM.
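
One way to make the length dependence explicit, offered as a sketch and not as the fix the package will eventually ship, is to compute the chance that n i.i.d. codons contain at least one stop-free run of k or more codons. The version below does this with a small dynamic programme over the current run length. It covers a single reading frame only, ignores the correlation between frames, and the function name is hypothetical.

```python
def prob_longest_run_at_least(k: int, n: int, p_stop: float) -> float:
    """P(some stop-free run of >= k codons among n i.i.d. codons).

    state[j] holds the probability that no run of length k has been seen
    yet and the current stop-free run is exactly j codons long (0 <= j < k).
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    q = 1.0 - p_stop
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        # A stop codon resets the run, whatever its current length.
        new[0] = p_stop * sum(state)
        # A non-stop codon extends the run by one. Mass that would move
        # from j = k - 1 to k is dropped: that sequence has reached a run of k.
        for j in range(k - 1):
            new[j + 1] = q * state[j]
        state = new
    return 1.0 - sum(state)


if __name__ == "__main__":
    for n in (96, 300, 1000):
        print(n, prob_longest_run_at_least(96, n, 3 / 64))
```

For n = k this reduces to \((1 - p_{\text{stop}})^k\), the uncorrected value used above, and it increases as n grows, which is exactly the length effect the current p-value ignores.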

The other problem is the i.i.d. assumption about the nucleotides. I corrected for GC content, but this has limitations:

  • Dinucleotide Biases: The distribution of nucleotides is not completely independent. For example, CG dinucleotides are depleted in many genomes.

  • Cross-Frame Correlation: The six reading frames are highly correlated because they share the same physical double-stranded DNA sequence.

Some of this could probably be modeled better with an HMM or something, but that is starting to get way more complicated than the stopfree tool was meant to be, so I prefer to accept that the i.i.d. assumption is just flawed and hope it doesn’t make too much of a difference.

Despite these limitations, the stopfree tool has been integrated with the RNAcentral pipeline and is being used to flag potentially protein-coding transcripts. Since RNAcentral is meant to be a non-coding RNA database, flagged transcripts are marked as needing careful interpretation. stopfree is used alongside more established tools like tcode and CPAT, so we should get a comprehensive picture of protein coding potential!