evolocity.pp.featurize_fasta

evolocity.pp.featurize_fasta(fname, model_name='esm1b', mkey='model', embed_batch_size=3000, fasta_metadata_record=False, use_cache=True, cache_namespace=None)

Embeds a FASTA file.

Takes a FASTA file containing sequences and returns an AnnData object with one sequence embedding per row of the adata.X matrix.

An optional argument (fasta_metadata_record) allows for loading metadata directly from the FASTA file.
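A minimal usage sketch, assuming evolocity is installed and imported as evo. The two sequences, the file name, and the helper names write_example_fasta and embed_fasta are made up for illustration; the embedding call is left commented out because it loads the language model.

```python
from pathlib import Path
import tempfile


def write_example_fasta(path):
    """Write a two-record FASTA file (made-up sequences) and return its path."""
    records = {"seq1": "MKTAYIAKQR", "seq2": "MKTAYIAKQK"}
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")
    return path


def embed_fasta(path):
    """Call featurize_fasta with the arguments shown in the signature above."""
    # Imported lazily so the rest of the sketch runs without evolocity.
    import evolocity as evo
    return evo.pp.featurize_fasta(
        str(path),
        model_name="esm1b",
        embed_batch_size=3000,
    )


fasta_path = write_example_fasta(Path(tempfile.mkdtemp()) / "example.fasta")

# Uncomment to compute embeddings (requires evolocity and the model weights):
# adata = embed_fasta(fasta_path)
```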

Parameters

fname : str: Path to FASTA file.
model_name : str (default: ‘esm1b’): Language model used to compute embeddings.
mkey : str (default: ‘model’): Name at which language model is stored.
embed_batch_size : int (default: 3000): Batch size to embed sequences. Lower to fit into GPU memory.
fasta_metadata_record : bool (default: False): If True, assumes metadata is stored in each FASTA record as key=value pairs separated by vertical bar “|” characters. Otherwise, does not attempt to load metadata from the FASTA file.
use_cache : bool (default: True): Cache embeddings to disk for faster future loading.
cache_namespace : str (default: None): Namespace under which to store the cache.
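The record layout that fasta_metadata_record expects can be illustrated with a small standalone parser. This is only a sketch of the key=value, “|”-separated format described above, not evolocity's own parsing code; the function name parse_metadata_header and the example keys (id, host, year) are hypothetical.

```python
def parse_metadata_header(header):
    """Split a FASTA header of the form '>k1=v1|k2=v2' into a dict."""
    fields = {}
    for part in header.lstrip(">").strip().split("|"):
        # Skip fragments that do not follow the key=value convention.
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


example_header = ">id=seq1|host=human|year=2009"
print(parse_metadata_header(example_header))
# {'id': 'seq1', 'host': 'human', 'year': '2009'}
```

Values are kept as strings here; any type conversion is left to the caller.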

Returns

Returns an AnnData object with the attributes:
.X – Matrix where rows correspond to sequences and columns are language model embedding dimensions
seq (.obs) – Sequences corresponding to rows in adata.X
model (.uns) – Language model, stored under the key given by mkey
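The returned fields can be read as in the following sketch. It assumes adata.X is a dense, row-indexable matrix and that the model sits under the default mkey of 'model'. The helper summarize_embedding is hypothetical, and the SimpleNamespace object is only a stand-in so the snippet runs without evolocity; with a real result, pass the object returned by featurize_fasta instead.

```python
from types import SimpleNamespace


def summarize_embedding(adata, mkey="model"):
    """Collect the documented fields of the returned object into a plain dict."""
    return {
        "n_sequences": len(adata.X),          # one row per sequence
        "embedding_dim": len(adata.X[0]),     # one column per embedding dimension
        "sequences": list(adata.obs["seq"]),  # sequences aligned with rows of X
        "has_model": mkey in adata.uns,       # language model stored under mkey
    }


# Stand-in with the same attribute names as the documented return value.
fake_adata = SimpleNamespace(
    X=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    obs={"seq": ["MKTAYIAKQR", "MKTAYIAKQK"]},
    uns={"model": object()},
)

summary = summarize_embedding(fake_adata)
print(summary["n_sequences"], summary["embedding_dim"])
```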