Arun Balajiee

PhD Student, Intelligent Systems Program, University of Pittsburgh

Machine learning-based design of proteins (small molecules and beyond)

16 Sep 2020 - Arun Balajiee

Speaker: Jennifer Listgarten

Date: 09/16/2020

Dr. Listgarten is a Professor at UC Berkeley and is well known for her work on protein engineering with machine learning. The concept behind her work stems from the idea of collecting richer data from 3D spatial and temporal images in medical scans such as MRIs. She showed that it is possible to build models that learn specific features from these images, as has been demonstrated with Hybrid Variational AutoEncoder-Linear Mixed Models.

Before getting into the main motive of her talk, she presented an overview of a few applications of building a black-box oracle, or predictor-generator model, that learns to predict on-target editing locations in DNA sequences. Here the regression model outputs a value in a different space, which is then converted to a proxy value that can be interpreted as the correct offset in the DNA sequence. This kind of complex mapping requires harmonizing the model so that it learns jointly from the target-site input and from the actual dataset used to generate the editing sequence. Her group discusses this harmonization in more depth in their work on using latent variable models to put the two groups on the same scale.

She then explained the concepts behind engineering proteins in DNA sequences using machine learning techniques, for purposes such as enhancing certain dormant genetic features (for example, fluorescence in insects), gene therapy through virus delivery, antibody-based drug design, genetic editing, and many other use cases in genetics. The idea is not to use a neural model, but instead a generative model that is an ensemble of all possible optimal solutions of discrete classifiers. This also avoids the problem that a gradient-based solution is unusable here. Among the many challenges in the field is representing the different amino acids in a protein as something richer than plain strings.
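To make the representation point a little more concrete for myself, here is a minimal sketch of one common baseline for moving beyond raw strings: one-hot encoding each residue. This is my own illustration and not something shown in the talk; the 20-letter alphabet ordering and the `one_hot` helper are assumptions made for the example.

```python
# Assumed alphabet: the 20 standard amino acids in one-letter codes.
# The ordering is arbitrary and chosen only for this illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Turn a protein string into a list of 20-dimensional 0/1 rows.

    Hypothetical helper: each row has a single 1 at the position of
    that residue in AMINO_ACIDS.
    """
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACD")
```

A one-hot matrix is still a fairly shallow view of a protein, since it carries no information about chemistry or 3D structure, which is presumably part of why representation remains a challenge.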

She further touched on two families of models suited to these use cases: evolutionary models and reinforcement learning techniques. In evolutionary models, the pool of genetic features judged to be causal factors for an outcome, such as a disease or an engineered protein property, is iteratively filtered across successive generations of data samples.
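The generation-by-generation filtering can be sketched roughly as a select-and-mutate loop. This is only a toy version under my own assumptions, not the method from the talk: `toy_oracle` is a hypothetical stand-in for a learned property predictor (it just counts matches against a made-up target sequence), and the population size, mutation rate, and number of generations are arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_oracle(seq, target):
    """Hypothetical fitness score: fraction of positions matching target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def mutate(seq, rng, rate=0.1):
    """Resample each position with probability `rate` (assumed value)."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def evolve(target, pop_size=50, keep=10, generations=40, seed=0):
    """Keep the top `keep` sequences each generation and refill by mutation."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(AMINO_ACIDS) for _ in target) for _ in range(pop_size)
    ]
    history = []  # best score seen in each generation
    for _ in range(generations):
        ranked = sorted(
            population, key=lambda s: toy_oracle(s, target), reverse=True
        )
        survivors = ranked[:keep]  # the filtering step
        history.append(toy_oracle(survivors[0], target))
        children = [
            mutate(rng.choice(survivors), rng) for _ in range(pop_size - keep)
        ]
        population = survivors + children
    best = max(population, key=lambda s: toy_oracle(s, target))
    return best, history


best, history = evolve("MKTAYIAKQR")
```

Because the survivors are carried over unchanged, the best score in `history` never decreases from one generation to the next. In a real setting the oracle would be a trained model or a wet lab measurement rather than a known target.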

The part of the talk I found most interesting was how neural models learn some features of the training dataset, and how what they generate can be quite different from what the model was originally tasked to learn. She illustrated this with an example from computer vision: a neural model is often trained on a dataset of images, such as fruits, and the learnt features are then blocked off to produce what the model “thinks” is the true image of a fruit (for example, a banana). In this case, the model actually does a poor job of reproducing the image of a banana. This is striking, considering that many state-of-the-art neural models that outperform other machine learning models are in fact not really artificially “intelligent”.

Another notable aspect of the talk was that these generative models had to be assumed to be optimal. Validation datasets were simply not feasible in these cases, because the goal was literally to achieve breakthrough solutions and explore things that have no precedent. This makes it equally impossible to decide whether the training data should be simulated or collected from real-world wet lab conditions.

Dr. Listgarten concluded with a short discussion of her results, which seemed to show that the model she and her collaborators built could not only optimize the choice of a good oracle with less uncertainty, but also generate a near-perfect genetic sequence that could be applied in practice.

From this talk, there are many interesting takeaways:

Overall, it was an informative talk, and most of the time was spent learning new concepts that I probably wouldn’t have learnt just by reading about topics in ML or attending a class with a prescribed syllabus. The talk dealt more with the uncertainties of validating a model that has to be applied to a very complex and intricate task such as genetic sequencing.