evolocity.pp.featurize_seqs

evolocity.pp.featurize_seqs(seqs, model_name='esm1b', mkey='model', embed_batch_size=3000, use_cache=False, cache_namespace='protein')

Embeds a list of sequences.

Takes a list of protein sequences and returns an AnnData object whose adata.X matrix holds the sequence embeddings.

Parameters
seqs : list

List of protein sequences.

model_name : str (default: 'esm1b')

Name of the language model used to compute the embeddings.

mkey : str (default: 'model')

Key in adata.uns under which the language model is stored.

embed_batch_size : int (default: 3000)

Batch size used when embedding sequences. Lower this value if the default does not fit into GPU memory.

use_cache : bool (default: False)

Cache embeddings to disk for faster future loading.

cache_namespace : str (default: 'protein')

Namespace under which the cache is stored.

Returns

Returns an AnnData object with the following attributes:

  • .X – Matrix where rows correspond to sequences and columns are language model embedding dimensions

  • seq (.obs) – Sequences corresponding to rows in adata.X

  • model (.uns) – Language model, stored under the key given by mkey
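
The following is a minimal usage sketch, not taken from the library's own documentation. It assumes evolocity is installed and importable as `evo`. The helper names `embed_example` and `summarize` are hypothetical and exist only for illustration; the keyword arguments and the `.X`, `.obs['seq']`, and `.uns` fields come from the signature and Returns list above.

```python
def embed_example(seqs, batch_size=1000):
    """Embed sequences with featurize_seqs (hypothetical wrapper)."""
    # Imported lazily so this snippet loads even without evolocity installed.
    import evolocity as evo

    # A smaller embed_batch_size than the default 3000 is used here on the
    # assumption that GPU memory is limited.
    return evo.pp.featurize_seqs(
        seqs,
        model_name='esm1b',
        mkey='model',
        embed_batch_size=batch_size,
        use_cache=False,
    )


def summarize(adata, mkey='model'):
    """Collect the documented parts of the result into a plain dict."""
    n_seqs, n_dims = adata.X.shape  # rows are sequences, columns are embedding dimensions
    return {
        'n_seqs': n_seqs,
        'n_dims': n_dims,
        'seqs': list(adata.obs['seq']),  # sequences aligned with the rows of adata.X
        'has_model': mkey in adata.uns,  # language model stored under mkey
    }
```

Under these assumptions, `summarize(embed_example(['MKTAYIAK', 'MKTAYIAR']))` would report two rows in `adata.X` and list the same two sequences (made-up peptides used only as placeholders). Keeping the summary separate from the embedding call makes it possible to inspect any AnnData-like object without re-running the model.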