Covary Simulate | Interactive Pipeline Simulation

About this simulation

This simulation demonstrates the logical flow of a Covary run on 16S rRNA sequences from Thermus species, based on version 2.0. Parts of the workflow may have been upgraded or improved in newer versions. It walks through all 8 steps of the Colab notebook: parameter setting, sequence input, QC, encoding, dimensionality reduction, embedding visualization, deep learning, and results. The outputs are programmatically generated to match real Covary behavior; this is not a live ML run. To run a real analysis, open Covary on Google Colab →


Step 1 — Set parameters

Covary's run behavior is controlled by a set of parameters in Step 1 of the Colab notebook. The simulation uses the following predefined configuration (⚠️ fields that require user attention in the real notebook are highlighted):

Parameter	Value
k-mer size	6
include_N	"no"
perplexity (t-SNE)	30
n_neighbors (UMAP)	15
linkage methods	ward, average, complete, single
reduction methods	PCA, t-SNE, UMAP
⚠️ input file	thermus_16s.fasta
fig_size	(25, 20)

# Step 1. Set parameters (Other parameters can be modified to user preferences)
kmer_size     = 6
include_N     = "no"
perplexity    = 30
n_neighbors   = 15
fig_size      = (25, 20)
linkages      = ["ward", "average", "complete", "single"]
reductions    = ["pca", "tsne", "umap"]

Step 2 — FASTA input

In the Colab notebook, you upload your multi-FASTA file in Step 2. The simulation uses 6 representative Thermus 16S rRNA sequences as its reference input.

>NR_181868.1 Thermus brevis strain SYSU G05001 16S rRNA, partial
GACATGCAAGTCGAGCGGGGCGGGTTTATACCTGCCCAGCGGCGGACGGGTGAGTAACGC
GTGGGTGACCTACCTGGAAGAGGCGGACAACCTGGGGAAACCCAGGCTAATCCGCCATGT
>NR_181790.1 Thermus brevis strain SYSU G02001 16S rRNA, partial
GACATGCAAGTCGAGCGGGGCGGGTTTATACCTGCCCAGCGGCGGACGGGTGAGTAACGC
GTGGGTGACCTACCTGGAAGAGGCGGACAACCTGGGGAAACCCAGGCTAATCCGCCATGT
>NR_180714.1 Thermus sediminis strain L198 16S rRNA, partial
GACATGCAAGTCGTGCGGGCCGTGGGGTTTCTCACGGCTAGCGGCGGACGGGTGAGTAAC
GCGTGGGTGACCTACCCGGAAGAGGGGGACAACCTGGGGAAACCCAGGCTAATCCCCCAT
>NR_043469.1 Thermus thermophilus strain HB27 16S rRNA, partial
GACATGCAAGTCGAGCGGGGCAGCTTAAGCTTGCTTCTTGATGCAAGTCGAGCGGGGCAG
CTTGGCTTGCTTCTTGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>NR_074599.1 Thermus aquaticus strain YT-1 16S rRNA, partial
GACATGCAAGTCGAGCGGTGCACTTAAGCTTGCTTCTTAATCGATCGATCGATCGATCGA
TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
>NR_025900.1 Thermus scotoductus strain SA-01 16S rRNA, partial
GACATGCAAGTCGAACGGGGCGGGTTTATACCTGCCCAGCGGCGGACGGGTGAGTAACGC
GTGGGTAACCTACCCGGAAGAGGGGGACAACCTGGGGAAACCCAGGCTAATACCCCATGT

6 sequences loaded · thermus_16s.fasta · ~1.2 kb total
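For readers who want to see what loading a multi-FASTA file involves, here is a minimal sketch in plain Python. The function name `parse_fasta` and the toy records are hypothetical and used only for illustration; this is not Covary's actual loader, which may behave differently.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence lines may wrap; concatenate them
    return records


# Toy input (hypothetical records, not real Thermus data).
toy_fasta = """>seq1 demo record
ACGT
ACGT
>seq2
GGCC
"""
records = parse_fasta(toy_fasta.splitlines())
print(len(records), "sequences loaded")
```

Keeping the full header line as the key preserves the accession and species name for later labeling of plots and dendrograms.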

Step 3 — Quality control & preprocessing

Covary automatically performs QC on the input sequences: removing whitespace, filtering sequences with invalid (non-ATCGN) characters, and reporting sequence metrics. Since include_N = "no", entries with ambiguous bases would be excluded.

Covary v2.1 — QC Module ─────────────────────────────────
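The QC rules described in this step (strip whitespace, reject sequences with invalid characters, and exclude ambiguous bases when include_N = "no") can be sketched as follows. The function name `qc_filter`, the upper-casing of input, and the treatment of any value other than "no" as "allow N" are assumptions made for this illustration, not confirmed Covary behavior.

```python
from typing import Dict, List, Tuple


def qc_filter(records: Dict[str, str],
              include_N: str = "no") -> Tuple[Dict[str, str], List[str]]:
    """Return (kept, dropped_names) after whitespace removal and alphabet checks."""
    # Assumption: any value other than "no" permits the ambiguous base N.
    allowed = set("ATCG") if include_N == "no" else set("ATCGN")
    kept: Dict[str, str] = {}
    dropped: List[str] = []
    for name, seq in records.items():
        # Remove all whitespace; upper-casing is an assumption of this sketch.
        cleaned = "".join(seq.split()).upper()
        if cleaned and set(cleaned) <= allowed:
            kept[name] = cleaned
        else:
            dropped.append(name)
    return kept, dropped


# Toy records (hypothetical): one clean, one with N, one with an RNA base.
demo = {"clean": "ACG T\nacgt", "ambiguous": "ACGNNT", "rna": "ACGU"}
kept, dropped = qc_filter(demo, include_N="no")
kept_with_n, dropped_with_n = qc_filter(demo, include_N="yes")

# Simple per-sequence metric: length after cleaning.
for name, seq in kept.items():
    print(name, len(seq))
```

The "rna" record is rejected in both modes because U is outside the accepted alphabet, which is why RNA input needs U converted to T beforehand.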

Step 4 — k-mer encoding (Covary-encoder)

Each sequence is encoded into a numeric vector using Covary's translation-aware, k-mer-based encoding logic. With the default k = 6, there are 4⁶ = 4,096 possible k-mers. The encoder captures relative proximity, directional alignment, and translation awareness, not just frequency counts.

Covary-encoder — k=6 ─────────────────────────────────
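To make the size of the k-mer space concrete, the sketch below enumerates every k-mer over the alphabet ACGT and builds a plain count vector for a sequence. This is deliberately the simple frequency baseline, not Covary's translation-aware encoder, whose internals are not described here. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_vocabulary(k: int) -> List[str]:
    """All 4**k k-mers over ACGT, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_count_vector(seq: str, k: int) -> List[int]:
    """Count overlapping k-mers; windows with other characters are ignored."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in kmer_vocabulary(k)]


vocab6 = kmer_vocabulary(6)
print(len(vocab6))  # number of possible 6-mers

# Small k for readability: "ACGTAC" contains AC, CG, GT, TA, AC.
vec = kmer_count_vector("ACGTAC", 2)
```

A sequence of length L contributes L - k + 1 overlapping windows, so short inputs produce very sparse vectors at k = 6.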

Step 5 — Dimensionality reduction & distance matrix

After encoding, Covary computes a pairwise Euclidean distance matrix between all sequence vectors. This matrix drives the clustering and dendrogram construction downstream.

[Distance matrix: 6 × 6, rows and columns labeled T. brevis G05, T. brevis G02, T. sediminis, T. thermophilus, T. aquaticus, T. scotoductus; values are generated interactively]
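A pairwise Euclidean distance matrix of the kind described in this step can be computed with the standard library alone. The sketch below assumes each sequence has already been encoded as an equal-length numeric vector; the function name and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean_distance_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric n x n matrix of Euclidean distances with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])  # requires Python 3.8+
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy vectors (hypothetical): the first and third are identical.
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
dist = euclidean_distance_matrix(toy_vectors)
for row in dist:
    print(row)
```

Identical vectors, such as two entries with the same sequence, yield a distance of zero, which is what places them at the same point in downstream clustering.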

Step 6 — Embedding projection (t-SNE)

Covary reduces the high-dimensional embeddings to 2D using PCA, t-SNE, and UMAP for visualization. Below is the simulated t-SNE scatter plot, in which sequences cluster by species-level similarity: the two T. brevis entries appear in close proximity, and T. thermophilus and T. aquaticus form a distinct cluster.

[Figure: simulated t-SNE scatter plot; axes Dim-1 and Dim-2]

Step 7 — Deep learning & dendrogram

Covary trains a deep learning autoencoder on the embedding representations, refining the distance structure. The refined distances are used to construct hierarchical dendrograms using Ward, Average, Complete, and Single linkage methods.

Deep learning with autoencoder ─────────────────────────────────

Dendrogram — Ward linkage (t-SNE)
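The linkage methods named in this step differ only in how the distance between two clusters is defined. The sketch below is a naive agglomerative procedure over a precomputed distance matrix, covering single, complete, and average linkage. Ward linkage and the autoencoder refinement are intentionally left out, and the function name is hypothetical. It illustrates the general technique and is not Covary's implementation.

```python
from typing import List, Sequence, Tuple

# How to summarize all pairwise distances between two clusters.
LINKAGE_RULES = {
    "single": min,                                # closest pair
    "complete": max,                              # farthest pair
    "average": lambda ds: sum(ds) / len(ds),      # mean over all pairs
}

Merge = Tuple[Tuple[int, ...], Tuple[int, ...], float]


def agglomerate(dist: Sequence[Sequence[float]], method: str = "average") -> List[Merge]:
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    rule = LINKAGE_RULES[method]
    clusters: List[Tuple[int, ...]] = [(i,) for i in range(len(dist))]
    merges: List[Merge] = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_dists = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = rule(pair_dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Toy data (hypothetical): four points on a line at 0, 1, 5, 6.
points = [0.0, 1.0, 5.0, 6.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
single_merges = agglomerate(toy_dist, "single")
complete_merges = agglomerate(toy_dist, "complete")
average_merges = agglomerate(toy_dist, "average")
```

On the toy data all three methods pair the same points first and differ only in the height of the final merge, which is the quantity a dendrogram draws on its distance axis.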

Step 8 — Results & downloads

Covary outputs are automatically packaged as a ZIP file. In the real notebook, results download automatically or can be retrieved from /content/covary_results. The simulated run produced the following:

Summary counters: sequences processed, reduction methods, linkage dendrograms, output files.

Output file manifest — covary_results.zip

[Embeddings]
✓ pca_embeddings.tsv
✓ tsne_embeddings.tsv
✓ umap_embeddings.tsv
[Scatter plots]
✓ pca_scatter.png
✓ tsne_scatter.png
✓ umap_scatter.png
[Heatmaps]
✓ pca_heatmap.png
✓ tsne_heatmap.png
✓ umap_heatmap.png
[Dendrograms]
✓ tsne_dendrogram_ward.png
✓ tsne_dendrogram_average.png
✓ tsne_dendrogram_complete.png
✓ tsne_dendrogram_single.png
────────────────────────────────
✓ Simulation complete — 36 files generated
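Packaging a results folder into a ZIP archive, as described in this step, can be done with Python's standard `zipfile` module. The sketch below is an illustration under assumptions: the function name `package_results` is hypothetical, the demo writes to a temporary directory rather than /content/covary_results, and the TSV header is a placeholder rather than Covary's real column layout.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def package_results(results_dir, zip_path) -> List[str]:
    """Zip every file under results_dir; return the archived relative paths."""
    results_dir = Path(results_dir)
    # Collect files first so the archive itself is never picked up mid-write.
    files = sorted(p for p in results_dir.rglob("*") if p.is_file())
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, arcname=path.relative_to(results_dir).as_posix())
    return [p.relative_to(results_dir).as_posix() for p in files]


# Demo in a temporary directory with a single placeholder output file.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "covary_results"
    results.mkdir()
    (results / "pca_embeddings.tsv").write_text("id\tDim-1\tDim-2\n")  # placeholder header
    archive = Path(tmp) / "covary_results.zip"
    packaged = package_results(results, archive)
    with zipfile.ZipFile(archive) as zf:
        names_in_zip = zf.namelist()

print(packaged)
```

Storing paths relative to the results folder keeps the archive free of machine-specific directory prefixes when it is unpacked elsewhere.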

This was a simulation based on predefined data. To run a real Covary analysis on your own sequences:

Open Covary v2.1 on Colab → · View on GitHub · Version notes


How the real Covary pipeline works

1. Input: Upload a multi-FASTA file of DNA sequences (ATCG; for RNA, convert U to T).

2. Encode: k-mer, translation-aware embeddings via Covary-encoder; no alignment needed.

3. Learn: A deep learning autoencoder (assisted by TIPs-VF representation logic) refines the embeddings into a latent space.

4. Visualize: PCA, t-SNE, and UMAP scatter plots · distance heatmaps · dendrograms
