Evolocity¶
Evolocity constructs ordered protein sequence landscapes by using the local evolutionary predictions enabled by language models to enable global evolutionary insight.
Evolocity uses the change in languge model likelihoods to estimate directionality between two biological sequences. Then, over an entire sequence similarity network, this procedure is used to direct network edges. Finally, network diffusion analysis can identify roots, order sequences in pseudotime, and identify mutations driving the velocity.
Evolocity is a fork of the scVelo tool for RNA velocity and relies on many aspects of the Scanpy library for high-dimensional biological data analysis. Like Scanpy and scVelo, evolocity makes use of anndata, an extremely convenient way to store and organizing biological data.
Quick Start¶
API example¶
Below is a quick Python example of using evolocity to load and analyze sequences in a FASTA file.
import evolocity as evo
import scanpy as sc
# Load sequences and compute language model embeddings.
fasta_fname = 'data.fasta'
adata = evo.pp.featurize_fasta(fasta_fname)
# Construct sequence similarity network.
evo.pp.neighbors(adata)
# Run evolocity analysis.
evo.tl.velocity_graph(adata)
# Embed network and velocities in two-dimensions and plot.
sc.tl.umap(adata)
evo.tl.velocity_embedding(adata)
evo.pl.velocity_embedding_grid(adata)
evo.pl.velocity_embedding_stream(adata)
Installation¶
We recommend Python v3.6 or higher.
PyPI, Virtualenv, or Anaconda¶
Install evolocity from PyPI using:
python -m pip install evolocity
If you get a Permission denied
error, use python -m pip install evolocity --user
instead.
evolocity - Evolutionary velocity with protein language models
API¶
Import evolocity as:
import evolocity as evo
After reading the data (evo.pp.featurize_fasta
) or loading an in-built dataset (evo.datasets.*
),
the typical workflow consists of subsequent calls of
preprocessing (evo.pp.*
), analysis tools (evo.tl.*
), and plotting (evo.pl.*
).
Some utilities (evo.utils.*
) are also provided to facilitate data analysis.
Preprocessing (pp)¶
Featurization (language model embedding)
|
Embeds a list of sequences. |
|
Embeds a FASTA file. |
Landscape (nearest neighbors graph construction)
|
Construct sequence similarity neighborhood graph. |
Tools (tl)¶
Velocity estimation
|
Computes velocity scores at each edge in the graph. |
|
Projects the velocities into any embedding. |
Pseudotime and trajectory inference
|
Computes terminal states (root and end points). |
|
Computes pseudotime based on the evolocity graph. |
Interpretation
|
Aligns and one-hot-encodes sequences. |
|
Score mutations by associated evolocity. |
|
Runs a random walk on the evolocity graph. |
Plotting (pl)¶
Also see scanpy’s plotting API for additional visualization functionality, including UMAP scatter plots.
Velocity embeddings
|
Scatter plot of velocities on the embedding. |
|
Scatter plot of velocities on a grid. |
|
Stream plot of velocities on the embedding. |
|
Contour plot of pseudotime with velocity grid. |
Mutation interpretation
|
Heat map of per-residue velocity scores. |
|
Scatter plot of mutations. |
Datasets¶
Influenza A nucleoprotein. |
|
Eukaryotic cytochrome c. |
Settings¶
|
Set resolution/size, styling and format of figures. |
Evolocity of influenza A nucleoprotein¶
Here, we will walk through the steps for running an evolocity analysis on nucleoprotein sequences as described in our paper. This notebook is also available on Colab.
First, we will need to install scanpy and evolocity:
[ ]:
!pip install scanpy evolocity
Now, import the necessary packages
[13]:
import evolocity as evo
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
We have provided the nucleoprotein sequences, and their corresponding language model embeddings, within an AnnData object, which is a convenient way to store high-dimensional biological data.
We have also already constructed the nearest neighbors graph, which we used to learn a two-dimensional UMAP embedding and perform graph-based Louvain clustering. We have also provided the associated metadata including subtype and sampling year.
[6]:
adata = evo.datasets.nucleoprotein()
adata
[6]:
AnnData object with n_obs × n_vars = 3304 × 1280
obs: 'n_seq', 'seq', 'gene_id', 'embl_id', 'subtype', 'year', 'date', 'country', 'host', 'resist_adamantane', 'resist_oseltamivir', 'virulence', 'transmission', 'seqlen', 'homology', 'gong2013_step', 'louvain'
uns: 'louvain', 'neighbors', 'umap'
obsm: 'X_umap'
obsp: 'connectivities', 'distances'
Now, we can use scanpy to plot the UMAP and color according to sampling year.
[7]:
evo.set_figure_params(dpi_save=500, figsize=(5, 5))
sc.pl.umap(adata, color='year', edges=True,)

We can also color the same UMAP according to subtype:
[6]:
sc.pl.umap(adata, color='subtype', edges=True,)

Now we need to install and download the ESM-1b language model. We first install the package:
[8]:
!pip install fair-esm
Collecting fair-esm
Downloading https://files.pythonhosted.org/packages/b7/40/3e757e044a9283a9b80b872aab32246fb0e4d725a57a0ef6ab7b9ef5e1c9/fair_esm-0.3.1-py3-none-any.whl
Installing collected packages: fair-esm
Successfully installed fair-esm-0.3.1
Now we need to use evolocity to compute scores for each edge in the nearest neighbors graph.
The command below is the most time-consuming step of this tutorial. First, the command below will download ESM-1b (which takes about 25 minutes).
Then, it will compute the sequence likelihoods and the velocity scores. Likelihood computation runs much faster with GPU hardware acceleration (which you can set by going to “Edit” -> “Notebook settings”). Likelihood computation takes about 25 minutes (on a GPU) and score computation takes about 40 minutes.
[9]:
evo.tl.velocity_graph(adata)
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt" to /root/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt" to /root/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S-contact-regression.pt
100%|██████████| 3304/3304 [26:33<00:00, 2.07it/s]
0%| | 0/3304 [00:00<?, ?it/s]
100%|██████████| 3304/3304 [38:49<00:00, 1.42it/s]
Now, we can project the velocities into two-dimensional UMAP space and visualize as a streamplot:
[15]:
evo.tl.velocity_embedding(adata, basis='umap', scale=1.)
ax = evo.pl.velocity_embedding_stream(
adata, basis='umap', min_mass=4., smooth=1., density=1.2,
color='year', show=False,
)
sc.pl._utils.plot_edges(ax, adata, 'umap', 0.1, '#aaaaaa')
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)

We can also compute terminal states (the root and end nodes) according to a diffusion process and visualize them:
[38]:
evo.tl.terminal_states(adata)
sc.pl.umap(
adata, color=[ 'root_nodes', 'end_points' ],
color_map=plt.cm.get_cmap('magma').reversed(),
edges=True,
)
computing terminal states
identified 7 regions of root nodes and 5 regions of end points .
finished (0:00:00) --> added
'root_nodes', root nodes of Markov diffusion process (adata.obs)
'end_points', end points of Markov diffusion process (adata.obs)

Now, we can use the root nodes to compute and visualize diffusion pseudotime on the graph:
[39]:
evo.tl.velocity_pseudotime(adata)
sc.pl.umap(
adata, color='velocity_pseudotime', edges=True, cmap='inferno',
)

Finally, we can compute the correlation between pseudotime and sampling time to confirm that evolocity captures the temporal evolution of nucleoprotein!
[40]:
nnan_idx = (np.isfinite(adata.obs['year']) &
np.isfinite(adata.obs['velocity_pseudotime']))
adata_nnan = adata[nnan_idx]
print('Pseudotime-time Spearman r = {}, P = {}'
.format(*ss.spearmanr(adata_nnan.obs['velocity_pseudotime'],
adata_nnan.obs['year'],
nan_policy='omit')))
Pseudotime-time Spearman r = 0.48792075278872044, P = 4.1389753064609613e-197
Evolocity of cytochrome c¶
Here we will analyze cytochrome c sequences from eukaryotes as obtained from UniProt. This tutorial is also available on Colab.
Also, see our tutorial on how to perform evolocity analysis on influenza A nucleoprotein for a more detailed walkthrough.
First, we need to install the required packages:
[ ]:
!pip install scanpy evolocity
Now we need to import the packages:
[2]:
import evolocity as evo
import matplotlib.pyplot as plt
import numpy as np
import scanpy as sc
import seaborn as sns
And now we can download the data for cytochrome c.
We have already computed the language model embeddings for these sequences and also parsed the metadata in the 'tax_group'
entry in adata.obs
. We have also constructed the nearest neighbors graph and run the UMAP algorithm to embed the graph into two dimensions.
[ ]:
adata = evo.datasets.cytochrome_c()
Now, we can plot the graph in UMAP space colored by taxonomy.
[4]:
evo.set_figure_params(dpi_save=500, figsize=(6, 4))
sc.pl.umap(adata, color='tax_group', edges=True,)

Now we need to install ESM-1b so we can compute the velocity scores
[ ]:
!pip install fair-esm
Now we can compute the velocity scores for each edge in the sequence landscape. Downloading the model can take up to 25 minutes. Computing the likelihoods with GPU acceleration takes around 2 minutes and computing the velocity scores also takes around 2 minutes.
[ ]:
evo.tl.velocity_graph(adata)
Now, we can project the velocities into two-dimensional UMAP space and then visualize the velocities as a streamplot:
[30]:
evo.tl.velocity_embedding(adata, basis='umap', scale=1.)
evo.pl.velocity_embedding_stream(
adata, basis='umap', min_mass=1., smooth=1., linewidth=1.,
color='tax_group',
)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
