Evolocity

Evolocity constructs ordered protein sequence landscapes by using the local evolutionary predictions enabled by language models to enable global evolutionary insight.

'Evolocity overview'

Evolocity uses the change in languge model likelihoods to estimate directionality between two biological sequences. Then, over an entire sequence similarity network, this procedure is used to direct network edges. Finally, network diffusion analysis can identify roots, order sequences in pseudotime, and identify mutations driving the velocity.

Evolocity is a fork of the scVelo tool for RNA velocity and relies on many aspects of the Scanpy library for high-dimensional biological data analysis. Like Scanpy and scVelo, evolocity makes use of anndata, an extremely convenient way to store and organizing biological data.

Quick Start

Installation

You should be able to install evolocity using pip:

python -m pip install evolocity

API example

Below is a quick Python example of using evolocity to load and analyze sequences in a FASTA file.

import evolocity as evo
import scanpy as sc

# Load sequences and compute language model embeddings.
fasta_fname = 'data.fasta'
adata = evo.pp.featurize_fasta(fasta_fname)

# Construct sequence similarity network.
evo.pp.neighbors(adata)

# Run evolocity analysis.
evo.tl.velocity_graph(adata)

# Embed network and velocities in two-dimensions and plot.
sc.tl.umap(adata)
evo.tl.velocity_embedding(adata)
evo.pl.velocity_embedding_grid(adata)
evo.pl.velocity_embedding_stream(adata)

Installation

We recommend Python v3.6 or higher.

PyPI, Virtualenv, or Anaconda

Install evolocity from PyPI using:

python -m pip install evolocity

If you get a Permission denied error, use python -m pip install evolocity --user instead.

Dependencies

Using fast neighbor search via hnswlib further requires (optional):

pip install pybind11 hnswlib

evolocity - Evolutionary velocity with protein language models

API

Import evolocity as:

import evolocity as evo

After reading the data (evo.pp.featurize_fasta) or loading an in-built dataset (evo.datasets.*), the typical workflow consists of subsequent calls of preprocessing (evo.pp.*), analysis tools (evo.tl.*), and plotting (evo.pl.*). Some utilities (evo.utils.*) are also provided to facilitate data analysis.

Preprocessing (pp)

Featurization (language model embedding)

pp.featurize_seqs(seqs[, model_name, mkey, …])

Embeds a list of sequences.

pp.featurize_fasta(fname[, model_name, …])

Embeds a FASTA file.

Landscape (nearest neighbors graph construction)

pp.neighbors(adata[, n_neighbors, n_pcs, …])

Construct sequence similarity neighborhood graph.

Tools (tl)

Velocity estimation

tl.velocity_graph(adata[, model_name, mkey, …])

Computes velocity scores at each edge in the graph.

tl.velocity_embedding(data[, basis, vkey, …])

Projects the velocities into any embedding.

Pseudotime and trajectory inference

tl.terminal_states(data[, vkey, groupby, …])

Computes terminal states (root and end points).

tl.velocity_pseudotime(adata[, vkey, …])

Computes pseudotime based on the evolocity graph.

Interpretation

tl.onehot_msa(adata[, reference, …])

Aligns and one-hot-encodes sequences.

tl.residue_scores(adata[, basis, scale, …])

Score mutations by associated evolocity.

tl.random_walk(data[, root_node, …])

Runs a random walk on the evolocity graph.

Plotting (pl)

Also see scanpy’s plotting API for additional visualization functionality, including UMAP scatter plots.

Velocity embeddings

pl.velocity_embedding(adata[, basis, vkey, …])

Scatter plot of velocities on the embedding.

pl.velocity_embedding_grid(adata[, basis, …])

Scatter plot of velocities on a grid.

pl.velocity_embedding_stream(adata[, basis, …])

Stream plot of velocities on the embedding.

pl.velocity_contour(adata[, ptkey, …])

Contour plot of pseudotime with velocity grid.

Mutation interpretation

pl.residue_scores(adata[, percentile_keep, …])

Heat map of per-residue velocity scores.

pl.residue_categories(adata[, positions, …])

Scatter plot of mutations.

Datasets

datasets.nucleoprotein()

Influenza A nucleoprotein.

datasets.cytochrome_c()

Eukaryotic cytochrome c.

Settings

set_figure_params([style, dpi, dpi_save, …])

Set resolution/size, styling and format of figures.

References

Code: Github repo

Paper: If you use evolocity, please consider citing our paper

Evolocity of influenza A nucleoprotein

Here, we will walk through the steps for running an evolocity analysis on nucleoprotein sequences as described in our paper. This notebook is also available on Colab.

First, we will need to install scanpy and evolocity:

[ ]:
!pip install scanpy evolocity

Now, import the necessary packages

[13]:
import evolocity as evo
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss

We have provided the nucleoprotein sequences, and their corresponding language model embeddings, within an AnnData object, which is a convenient way to store high-dimensional biological data.

We have also already constructed the nearest neighbors graph, which we used to learn a two-dimensional UMAP embedding and perform graph-based Louvain clustering. We have also provided the associated metadata including subtype and sampling year.

[6]:
adata = evo.datasets.nucleoprotein()
adata

[6]:
AnnData object with n_obs × n_vars = 3304 × 1280
    obs: 'n_seq', 'seq', 'gene_id', 'embl_id', 'subtype', 'year', 'date', 'country', 'host', 'resist_adamantane', 'resist_oseltamivir', 'virulence', 'transmission', 'seqlen', 'homology', 'gong2013_step', 'louvain'
    uns: 'louvain', 'neighbors', 'umap'
    obsm: 'X_umap'
    obsp: 'connectivities', 'distances'

Now, we can use scanpy to plot the UMAP and color according to sampling year.

[7]:
evo.set_figure_params(dpi_save=500, figsize=(5, 5))
sc.pl.umap(adata, color='year', edges=True,)
_images/evolocity_nucleoprotein_7_0.png

We can also color the same UMAP according to subtype:

[6]:
sc.pl.umap(adata, color='subtype', edges=True,)
_images/evolocity_nucleoprotein_9_0.png

Now we need to install and download the ESM-1b language model. We first install the package:

[8]:
!pip install fair-esm
Collecting fair-esm
  Downloading https://files.pythonhosted.org/packages/b7/40/3e757e044a9283a9b80b872aab32246fb0e4d725a57a0ef6ab7b9ef5e1c9/fair_esm-0.3.1-py3-none-any.whl
Installing collected packages: fair-esm
Successfully installed fair-esm-0.3.1

Now we need to use evolocity to compute scores for each edge in the nearest neighbors graph.

The command below is the most time-consuming step of this tutorial. First, the command below will download ESM-1b (which takes about 25 minutes).

Then, it will compute the sequence likelihoods and the velocity scores. Likelihood computation runs much faster with GPU hardware acceleration (which you can set by going to “Edit” -> “Notebook settings”). Likelihood computation takes about 25 minutes (on a GPU) and score computation takes about 40 minutes.

[9]:
evo.tl.velocity_graph(adata)
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt" to /root/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt" to /root/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S-contact-regression.pt
100%|██████████| 3304/3304 [26:33<00:00,  2.07it/s]
  0%|          | 0/3304 [00:00<?, ?it/s]

100%|██████████| 3304/3304 [38:49<00:00,  1.42it/s]

Now, we can project the velocities into two-dimensional UMAP space and visualize as a streamplot:

[15]:
evo.tl.velocity_embedding(adata, basis='umap', scale=1.)
ax = evo.pl.velocity_embedding_stream(
    adata, basis='umap', min_mass=4., smooth=1., density=1.2,
    color='year', show=False,
)
sc.pl._utils.plot_edges(ax, adata, 'umap', 0.1, '#aaaaaa')
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/evolocity_nucleoprotein_15_1.png

We can also compute terminal states (the root and end nodes) according to a diffusion process and visualize them:

[38]:
evo.tl.terminal_states(adata)
sc.pl.umap(
    adata, color=[ 'root_nodes', 'end_points' ],
    color_map=plt.cm.get_cmap('magma').reversed(),
    edges=True,
)
computing terminal states
    identified 7 regions of root nodes and 5 regions of end points .
    finished (0:00:00) --> added
    'root_nodes', root nodes of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)
_images/evolocity_nucleoprotein_17_1.png

Now, we can use the root nodes to compute and visualize diffusion pseudotime on the graph:

[39]:
evo.tl.velocity_pseudotime(adata)
sc.pl.umap(
    adata, color='velocity_pseudotime', edges=True, cmap='inferno',
)
_images/evolocity_nucleoprotein_19_0.png

Finally, we can compute the correlation between pseudotime and sampling time to confirm that evolocity captures the temporal evolution of nucleoprotein!

[40]:
nnan_idx = (np.isfinite(adata.obs['year']) &
            np.isfinite(adata.obs['velocity_pseudotime']))

adata_nnan = adata[nnan_idx]

print('Pseudotime-time Spearman r = {}, P = {}'
      .format(*ss.spearmanr(adata_nnan.obs['velocity_pseudotime'],
                            adata_nnan.obs['year'],
                            nan_policy='omit')))
Pseudotime-time Spearman r = 0.48792075278872044, P = 4.1389753064609613e-197

Evolocity of cytochrome c

Here we will analyze cytochrome c sequences from eukaryotes as obtained from UniProt. This tutorial is also available on Colab.

Also, see our tutorial on how to perform evolocity analysis on influenza A nucleoprotein for a more detailed walkthrough.

First, we need to install the required packages:

[ ]:
!pip install scanpy evolocity

Now we need to import the packages:

[2]:
import evolocity as evo
import matplotlib.pyplot as plt
import numpy as np
import scanpy as sc
import seaborn as sns

And now we can download the data for cytochrome c.

We have already computed the language model embeddings for these sequences and also parsed the metadata in the 'tax_group' entry in adata.obs. We have also constructed the nearest neighbors graph and run the UMAP algorithm to embed the graph into two dimensions.

[ ]:
adata = evo.datasets.cytochrome_c()

Now, we can plot the graph in UMAP space colored by taxonomy.

[4]:
evo.set_figure_params(dpi_save=500, figsize=(6, 4))
sc.pl.umap(adata, color='tax_group', edges=True,)
_images/evolocity_cytochrome_c_7_0.png

Now we need to install ESM-1b so we can compute the velocity scores

[ ]:
!pip install fair-esm

Now we can compute the velocity scores for each edge in the sequence landscape. Downloading the model can take up to 25 minutes. Computing the likelihoods with GPU acceleration takes around 2 minutes and computing the velocity scores also takes around 2 minutes.

[ ]:
evo.tl.velocity_graph(adata)

Now, we can project the velocities into two-dimensional UMAP space and then visualize the velocities as a streamplot:

[30]:
evo.tl.velocity_embedding(adata, basis='umap', scale=1.)
evo.pl.velocity_embedding_stream(
    adata, basis='umap', min_mass=1., smooth=1., linewidth=1.,
    color='tax_group',
)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/evolocity_cytochrome_c_13_1.png

Indices and tables