evolocity.pp.featurize_seqs

evolocity.pp.featurize_seqs(seqs, model_name='esm1b', mkey='model', embed_batch_size=3000, use_cache=False, cache_namespace='protein')

Embeds a list of sequences.

Takes a list of sequences and returns an AnnData object with the sequence embeddings stored in the adata.X matrix.

Parameters

seqs : list: List of protein sequences.
model_name : str (default: ‘esm1b’): Language model used to compute the embeddings.
mkey : str (default: ‘model’): Key in adata.uns under which the language model is stored.
embed_batch_size : int (default: 3000): Batch size used when embedding sequences. Lower it to fit into GPU memory.
use_cache : bool (default: False): Cache embeddings to disk for faster future loading.
cache_namespace : str (default: ‘protein’): Namespace under which the cache is stored.

Returns

An AnnData object with the following attributes:
.X – Matrix where rows correspond to sequences and columns are language model embedding dimensions
seq (.obs) – Sequences corresponding to rows in adata.X
model (.uns) – Language model, stored under the key given by mkey
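
Example

A minimal usage sketch, assuming evolocity is installed and imported as evo. The toy sequences and the embed helper are hypothetical and only illustrate how the parameters and returned attributes listed above fit together; they are not part of the library.

```python
# Hypothetical toy sequences; any list of protein strings could be used.
seqs = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",
    "MKTAYLAKQR",
]


def embed(seqs, batch_size=3000):
    """Embed sequences with featurize_seqs and return the AnnData object."""
    # Imported lazily because evolocity and its language models are
    # heavy dependencies (assumption: the package is installed).
    import evolocity as evo

    adata = evo.pp.featurize_seqs(
        seqs,
        model_name="esm1b",           # language model (the default)
        mkey="model",                 # key used in adata.uns
        embed_batch_size=batch_size,  # lower this if GPU memory runs out
        use_cache=True,               # cache embeddings to disk
        cache_namespace="protein",    # namespace for the cache
    )
    return adata


# Usage (runs the language model, so it is left commented out here):
#   adata = embed(seqs)
#   adata.X             # embedding matrix, one row per sequence
#   adata.obs["seq"]    # sequences corresponding to the rows of adata.X
#   adata.uns["model"]  # the language model, stored under mkey
```

If embedding fails with an out-of-memory error on the GPU, calling embed with a smaller batch_size is the adjustment suggested by the embed_batch_size description.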