evolocity.pp.featurize_seqs

evolocity.pp.featurize_seqs(seqs, model_name='esm1b', mkey='model', embed_batch_size=3000, use_cache=False, cache_namespace='protein')

Embeds a list of sequences.

Takes a list of sequences and returns an AnnData object with sequence embeddings in the adata.X matrix.

- Parameters
- seqs : list
List of protein sequences.
- model_name : str (default: 'esm1b')
Language model used to compute the sequence embeddings.
- mkey : str (default: 'model')
Key under which the language model is stored in the returned object.
- embed_batch_size : int (default: 3000)
Batch size used when embedding sequences. Lower this value if batches do not fit into GPU memory.
- use_cache : bool (default: False)
Cache embeddings to disk for faster future loading.
- cache_namespace : str (default: 'protein')
Namespace under which to store the cache.
- Returns
Returns an AnnData object with the following attributes:
- .X – Matrix where rows correspond to sequences and columns are language model embedding dimensions
- seq (.obs) – Sequences corresponding to rows in adata.X
- model (.uns) – Language model
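
A minimal usage sketch may help tie the parameters to the returned attributes. It assumes evolocity is installed and importable as `evo`. The helper name `featurize_example` and the toy sequences in `EXAMPLE_SEQS` are hypothetical and used only for illustration. The import is placed inside the helper so that defining it does not require the library or the language model to be available.

```python
def featurize_example(seqs, batch_size=3000):
    """Embed `seqs` and return (adata, embedding matrix, sequences).

    Hypothetical helper for illustration; not part of evolocity.
    """
    # Imported lazily so that defining this helper does not require evolocity.
    import evolocity as evo

    adata = evo.pp.featurize_seqs(
        seqs,
        model_name='esm1b',
        embed_batch_size=batch_size,  # lower this if GPU memory is exhausted
        use_cache=False,
    )

    # Rows are sequences; columns are language model embedding dimensions.
    embeddings = adata.X
    # Sequences corresponding to the rows of adata.X.
    sequences = list(adata.obs['seq'])
    return adata, embeddings, sequences


# Toy protein sequences (made up for this sketch).
EXAMPLE_SEQS = ['MKTAYIAKQR', 'MKTAYIAKQK', 'MKSAYIAKQR']
```

Calling `featurize_example(EXAMPLE_SEQS)` would run the embedding step, so it needs the language model to be available locally. With the default `mkey='model'`, the language model should then be reachable as `adata.uns['model']`, following the Returns list above.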