
Publications
Projects, articles, and protocols from ChordexBio affiliates and the research community.
Abstract: The ability to accurately encode and represent genetic sequences in machine learning processes is critical for advancements in biotechnology, specifically in genetic engineering and synthetic biology. Traditional sequence encoding methods face significant limitations in handling sequence variability, maintaining reading frame integrity, and preserving biologically relevant features. This preliminary study presents TIPs-VF (Translator-Interpreter Pre-seeding for Variable-length Fragments), a simple and efficient encoding framework designed to address some of the key challenges in representing genetic sequences for machine learning. The results showed that TIPs-VF enables a variable-length sequence representation that retains biological context while ensuring the alignment of encodings with codon boundaries, making it particularly suited for modular genetic construction. TIPs-VF demonstrated superior performance in truncation and fragmentation analysis, sequence homology detection, domain assessment, and splice junction identification. Unlike conventional methods that require fixed-length inputs, TIPs-VF dynamically adapts to sequence length variations, preserving essential features such as domain similarities and sequence motifs. Additionally, TIPs-VF improves open reading frame recognition and enhances the identification of vector parts and plasmid elements by unifying sequence embeddings with the three possible open reading frames. Overall, TIPs-VF offers a robust, biologically meaningful encoding framework that overcomes the constraints of traditional sequence representations by incorporating sequence, length, and positional awareness. The TIPs-VF encoding infrastructure is available at https://tips.logiacommunications.com.
Article
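As a rough illustration of what the TIPs-VF abstract above means by codon boundaries and the three possible forward reading frames, the sketch below splits a DNA string into complete codons at offsets 0, 1, and 2. It is not the TIPs-VF encoding itself; the function names `codons_in_frame` and `three_frames` and the toy sequence are assumptions made for this example.

```python
def codons_in_frame(seq: str, frame: int) -> list[str]:
    """Split seq into complete codons, starting at offset `frame` (0, 1, or 2)."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    seq = seq.upper()
    # Stop early enough that every slice is a full 3-base codon;
    # trailing bases that do not fill a codon are dropped.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def three_frames(seq: str) -> dict[int, list[str]]:
    """Return the codon lists for the three forward reading frames."""
    return {frame: codons_in_frame(seq, frame) for frame in (0, 1, 2)}


# Hypothetical toy sequence, for illustration only.
frames = three_frames("ATGGCCTAAGT")
for frame, codons in frames.items():
    print(frame, codons)
```

Because the length of each codon list follows the input length, a representation built this way stays variable-length and aligned to codon boundaries, which is the property the abstract emphasizes.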
The Add-to-Cart Revolution in Biologics Development: From Laboratory Modeling to Digital Ordering
Abstract: The biologics development industry faces a fundamental disconnect between advances in technological capabilities and persistent operational inefficiencies, characterized by extended timelines, low success rates, and substantial cost barriers. This review examines the potential for an "add-to-cart" model that could transform biologics development from custom, institution-specific processes to standardized, accessible approaches analogous to e-commerce. The add-to-cart model is proposed as a paradigm that enables researchers to specify therapeutic requirements through standardized interfaces and access pre-optimized biological candidates from integrated databases or platforms. This model could potentially improve utility, accessibility, cost-effectiveness, speed, exploration capabilities, and applicability compared to traditional methods. This paper reports areas and inefficiencies in the current development landscape that could be addressed, while historical precedents from plasmid registries, semiconductor manufacturing, software development, and automotive industries illustrate transformation patterns that could inform this add-to-cart future of biologics development. Key enablers include large-scale biological databases, AI-powered design engines, automated synthesis platforms, standardized interfaces, and appropriate regulatory frameworks. Implementation could fundamentally alter biologics development from bench to bedside, democratizing therapeutic innovation while maintaining scientific rigor. The convergence of computational capabilities, biological databases, and infrastructure development creates conditions for the emergence of the model. Collectively, systematic approaches to biologics development may represent an inevitable evolution toward an add-to-cart approach, providing more efficient and accessible therapeutic innovation.
Article
DOI: 10.2139/ssrn.5388929
Covary: A translation-aware framework for alignment-free phylogenetics using machine learning
Abstract: In large-scale phylogenetic analysis, incorporating translation awareness is critical to account for the genotypic and phenotypic dimensions underlying biological diversification. Covary is a machine learning-based framework that analyzes, clusters, and compares genetic sequences through alignment-free, translation-aware embeddings. By integrating codon-boundary and intra-sequence positional information into a unified vector representation, Covary encodes mutational patterns alongside translation-level constraints. This design enables discrimination of frameshift-inducing mutations, substitutions, and other biologically meaningful sequence variations relevant to evolutionary relationships. Despite inherent sensitivity to k-mer-based distortions, Covary accurately clustered sequences, identified species, and reconstructed phylogenetic trees across diverse datasets, including human TP53 variants, ribosomal gene markers (18S and 16S), and complete genomes from viral, bacterial, and archaeal taxa. The resulting topologies were comparable to those produced by multiple sequence alignment (MSA)-based implementations like ETE3, with near-linear scalability demonstrated by the successful analysis of nearly a thousand SARS-CoV-2 genomes within minutes. The versatility and interpretability of Covary across mutation-, gene-, and genome-level analyses underscore its potential as a biologically informed, data-driven tool for bioinformatics, comparative genomics, taxonomy, ecology, and evolutionary studies. Covary is available online at https://github.com/mahvin92/Covary or at https://covary.chordexbio.com.
Article
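To make the idea of frame-aware, alignment-free comparison in the Covary abstract above more concrete, here is a minimal sketch that compares sequences by their in-frame codon counts using cosine similarity. This is a simplified stand-in, not Covary's embedding or neural network; the helper names and the toy sequences are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def codon_counts(seq: str, frame: int = 0) -> Counter:
    """Count complete codons read from the given frame offset."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy sequences, for illustration only.
reference = "ATGGCTGCTAAAGGTGAACTGTAA"
substituted = "ATGGCTGCTAAAGGTGATCTGTAA"    # one base substituted (GAA -> GAT)
frameshifted = "ATGGGCTGCTAAAGGTGAACTGTAA"  # one base inserted after ATG

ref_profile = codon_counts(reference)
sim_sub = cosine_similarity(ref_profile, codon_counts(substituted))
sim_shift = cosine_similarity(ref_profile, codon_counts(frameshifted))
print(f"substitution: {sim_sub:.3f}, frameshift: {sim_shift:.3f}")
```

In this toy case the substitution changes a single codon, while the insertion changes almost every downstream codon, so the frameshifted sequence scores far lower. No alignment step is involved at any point.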
Machine learning-based phylogenetic analysis using Covary
Abstract: This protocol describes the operational workflow of Covary, a machine learning-based framework for large-scale phylogenetic analysis and species identification. This protocol enables users to perform phylogenetic inference directly from sequence data without requiring coding experience or local software installation. The workflow is applicable to v2.1 of Covary and accepts multi-FASTA sequence files as input (or training data) and provides configurable parameters for encoding, neural network inference, and downstream analysis. Covary is designed to scale to thousands of sequences and produces interoperable outputs compatible with Matplotlib and other Python-based libraries, R, and other downstream visualization and statistical tools. The protocol is optimized for execution in a Google Colab environment, eliminating software maintenance and platform dependency. This protocol focuses on the methodical operations of Covary for tree reconstruction and species identification. Covary is available at https://github.com/mahvin92/Covary.
Protocol
Rapid, large-scale and multi-species phylogenomic analysis using Covary
Abstract: This protocol describes the use of Covary for fast, large-scale, multi-species phylogenomic analysis using complete genome sequences. The workflow is optimized for comparative phylogenomic inference across diverse taxa and is demonstrated using datasets associated with genomes of outbreak-causing viruses (SARS-CoV-2, dengue virus, measles virus, and alphainfluenza virus). Covary enables alignment-free phylogenomic analysis by encoding genomic sequences into translation-aware vector representations and applying machine learning–based similarity inference. The protocol supports thousands-scale datasets, requires no coding experience, and is designed to run in a Google Colab environment without local software installation or maintenance. This protocol focuses on phylogenomic-scale analysis across multiple viral species. Covary is accessible at https://covary.chordexbio.com/ or on GitHub at https://github.com/mahvin92/Covary.
Protocol
Machine learning approach to multi-locus Y-chromosome STR sequence profiling using Covary
Abstract: This protocol describes the use of Covary for rapid, alignment-free analysis of short tandem repeat (STR) sequence variation using Y-chromosomal STR loci from the NCBI STRSeq BioProject (PRJNA380347). The workflow demonstrates the ability of Covary to (i) compare STR sequence similarity directly from raw sequence data, (ii) perform multi-locus STR analysis in a single run without locus-wise concatenation, and (iii) resolve locus-specific and inter-locus relationships using machine learning-derived vector representations. Traditional forensic STR analysis relies on length-based allele designation and locus-by-locus interpretation. In contrast, this protocol illustrates a sequence-level, machine learning approach that captures internal repeat structure, flanking variation, and compositional features across multiple STR loci simultaneously. Covary enables scalable STR sequence comparison without manual alignment, custom scripting, or local software installation. This protocol is optimized for execution in Google Colab and may be adapted for applications in forensic genomics, population genetics, and STR database exploration.
Protocol
Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary
Abstract: Rapid phylogenomic analysis is essential for outbreak surveillance and large-scale viral comparative genomics, yet conventional alignment-based workflows remain computationally intensive and difficult to deploy at scale. Covary is a computational framework designed for large-scale biological sequence analysis. It is a translation-aware, alignment-free machine learning framework that encodes genomic information into biologically informed vector representations, enabling efficient genome-scale comparison without multiple sequence alignment (MSA). Here, Covary was applied to thousands-scale analysis of outbreak-causing viral genomes to assess its scalability and biological resolution. A total of 4,000 complete genomes of SARS-CoV-2, dengue virus, measles virus, and alphainfluenza virus were retrieved from the NCBI Viral Genomes Resource, of which 3,831 passed quality filtering and were analyzed using Covary. Results showed that Covary rapidly processed all genomes and consistently grouped sequences according to expected taxonomic assignments and known ingroup structure, including SARS-CoV-2 Pango lineages, dengue virus subtypes, measles virus geographic origin, and alphainfluenza virus clades. Covary completed the analysis in 45 minutes on free-tier Google Colab, inferring genome-wide relationships using modest computational resources. These results demonstrate that Covary enables rapid, alignment-free phylogenomic analysis of thousands of outbreak-causing viral genomes without requiring advanced computational infrastructure. In conclusion, Covary represents a scalable, deploy-ready machine learning pipeline for genome-informed outbreak surveillance and monitoring systems.
Article
High-throughput isoform-wide miRNome sequence reconstruction in the TCGA-LUAD cohort using FAS2rDNA
Abstract: Large-scale miRNome studies frequently rely on coordinate-based annotations or raw sequencing datasets that are computationally expensive to reprocess and difficult to integrate into sequence-centric analytical workflows. This protocol presents an isoform-wide reconstruction of miRNA sequences from the TCGA-LUAD cohort using FAS2rDNA, enabling direct derivation of strand-aware nucleotide sequences without reanalyzing bulk sequencing data. By reconstructing sequences directly from genomic coordinates, the workflow provides a faster, more scalable, and reproducible alternative for generating miRNA isoform–resolved FASTA datasets. The reconstructed miRNome sequences generated through this protocol are directly applicable to machine learning–based modeling, isoform-level molecular discovery, and integrative miRNA landscape analysis. Applied to the TCGA-LUAD cohort, this workflow facilitates high-resolution exploration of miRNA isoform diversity with the broader objective of improving molecular understanding of lung adenocarcinoma and supporting data-driven strategies aimed at reducing cancer-related mortality.
Protocol
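The FAS2rDNA abstract above describes deriving strand-aware sequences directly from genomic coordinates. A minimal sketch of that general idea follows, assuming 1-based inclusive coordinates and an in-memory reference string. It is not FAS2rDNA's code; the functions `extract_region` and `to_fasta`, the coordinate convention, and the toy reference are all assumptions for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(reference: str, start: int, end: int, strand: str) -> str:
    """Extract reference[start..end] (1-based, inclusive; an assumed convention).

    Minus-strand features are returned as the reverse complement.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= start <= end <= len(reference):
        raise ValueError("coordinates fall outside the reference")
    region = reference[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one record as FASTA text with wrapped sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Hypothetical toy reference and coordinates, for illustration only.
toy_reference = "AACCGGTTACGT"
plus = extract_region(toy_reference, 3, 8, "+")
minus = extract_region(toy_reference, 3, 8, "-")
print(to_fasta("toy:3-8(-)", minus), end="")
```

A real workflow would read the reference from a genome file and the coordinates from an annotation table, and would need to confirm whether those coordinates are 0-based or 1-based before slicing.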
Abstract: MicroRNA (miRNA) sequence composition and isoform diversity play important roles in post-transcriptional regulation and contribute to biological variability across cancer types. Large-scale resources such as The Cancer Genome Atlas (TCGA) provide a standardized foundation for exploratory miRNome research; however, TCGA miRNA datasets are typically distributed as expression matrices without direct access to reconstructed, isoform-resolved sequence outputs. This limits the application of sequence-based analyses, including pan-cancer comparisons and machine learning workflows that require explicit nucleotide representations. FAS2rDNA-Colab is a cloud-based workflow that reconstructs FASTA-formatted DNA/cDNA sequences using genomic coordinates/annotations. This protocol extends FAS2rDNA-Colab for the reconstitution of isoform-wide miRNome sequences from TCGA-derived miRNA expression data. By reconstituting FASTA-formatted miRNA sequences across multiple cancer cohorts, the protocol enables pan-cancer and isoform-level comparisons without reliance on predefined probe sets or raw sequencing reprocessing. The resulting reconstructed miRNomes can be used for sequence validation, exploratory comparative analyses, and downstream computational modeling.
Protocol


