Language Models for Protein Data: Evolutionary Scale Modeling

About this Event

Protein language models (pLMs) such as ESM-2 represent a significant advance in computational biology. ESM-2, an encoder-only model that uses Rotary Position Embeddings (RoPE), is tailored to protein sequences comprising standard and non-standard amino acids, deletion markers, and complex formation indicators. ESMFold extends ESM-2’s capabilities, enabling rapid protein structure prediction without reliance on multiple sequence alignments and offering a substantial speedup over AlphaFold2. This presentation surveys a range of ESM-2 applications, from assessing mutation effects and evolutionary trajectories to predicting protein-protein interactions and performing in silico directed evolution with EvoProtGrad. We will also discuss fine-tuning ESM-2 for sequence and token classification tasks such as predicting gene ontology terms, binding sites, and post-translational modification sites. Finally, we will briefly discuss geometric compression as measured by the intrinsic dimension of ESM-2 embeddings: how it tracks information-theoretic compression, how it relates to Low Rank Adaptation (LoRA), its potential for detecting AI-generated proteins, and its applications to curriculum learning strategies.
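To give a flavor of the kind of workflow covered in the talk, the sketch below scores a single point mutation with ESM-2 using the masked-marginal heuristic: mask the mutated position and compare the model's log-probabilities of the mutant and wild-type residues. This is an illustrative example rather than material from the talk itself; the checkpoint name, toy sequence, and mutation are placeholders.

```python
# Minimal sketch: scoring a point mutation with ESM-2 via Hugging Face `transformers`.
# The checkpoint, sequence, and mutation below are illustrative placeholders.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t12_35M_UR50D"  # small checkpoint; larger ones exist
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)
model.eval()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT"  # toy sequence
position = 10          # 0-based index into `sequence`
wild_type, mutant = sequence[position], "A"

inputs = tokenizer(sequence, return_tensors="pt")
# Token 0 is <cls>, so residue i sits at token index i + 1.
token_index = position + 1
inputs["input_ids"][0, token_index] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits
log_probs = torch.log_softmax(logits[0, token_index], dim=-1)

wt_id = tokenizer.convert_tokens_to_ids(wild_type)
mut_id = tokenizer.convert_tokens_to_ids(mutant)
# Positive scores suggest the model favors the mutant residue at this position.
score = (log_probs[mut_id] - log_probs[wt_id]).item()
print(f"{wild_type}{position + 1}{mutant} masked-marginal score: {score:.3f}")
```

Looping the same computation over all positions and residues yields a full single-mutation scan, which is one common way pLMs are used to estimate mutation effects without any structural input.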

Speakers

Amelie Schreiber

Amelie Schreiber holds a Master’s in Mathematics and has a keen interest in the practical applications of protein language models (pLMs). Her work involves fine-tuning these models with techniques such as Low Rank Adaptation (LoRA) and its quantized variant (QLoRA) to better predict protein function, binding sites, and post-translational modifications. She brings a mathematical perspective to the internal structure of pLMs, using tools such as persistent homology to estimate the intrinsic dimension of model embeddings in order to (1) inform curriculum learning strategies for both large language models and protein language models, (2) choose appropriate ranks for LoRA and QLoRA, and (3) better understand the relationship between information-theoretic compression, geometric compression, and the generalization capabilities of models. Amelie is also exploring ways to adapt large language models to annotate proteins through instruction fine-tuning with QLoRA, which could help bridge gaps in computational biology. As an independent researcher, she focuses on the intersection of mathematics, biology, and natural language processing to contribute to our understanding of proteins.
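As a concrete illustration of the LoRA fine-tuning mentioned above, here is a minimal sketch (assumed usage of the Hugging Face `peft` and `transformers` APIs, not the speaker's actual training code) of wrapping ESM-2 with a LoRA adapter for a token-level task such as binding-site prediction; the checkpoint, label count, and hyperparameters are placeholders.

```python
# Minimal sketch: attaching a LoRA adapter to ESM-2 for token classification.
# Checkpoint, labels, and hyperparameters are illustrative placeholders.
from transformers import AutoTokenizer, EsmForTokenClassification
from peft import LoraConfig, get_peft_model

model_name = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = EsmForTokenClassification.from_pretrained(
    model_name, num_labels=2  # e.g. binding site vs. non-binding site
)

lora_config = LoraConfig(
    task_type="TOKEN_CLS",
    r=8,                       # LoRA rank; the talk discusses informing this choice via intrinsic dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in the ESM encoder
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters and the classification head are trainable
```

From here, the wrapped model can be trained with the standard `transformers` Trainer on tokenized sequences with per-residue labels, with the frozen ESM-2 backbone untouched.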