Proteins play a crucial role in organisms and are involved in almost all biological processes. Because a protein's biological function and its structure are intertwined, studying protein structures can significantly advance the life sciences.
AI-based protein structure prediction has improved dramatically in recent years, with pipelines such as AlphaFold2 achieving near-experimental accuracy. These sophisticated algorithms rely mainly on multiple sequence alignments (MSAs) and structural templates to extract co-evolutionary information from homologous sequences. However, searching protein databases for MSAs and templates is computationally intensive and usually takes several hours per target.
Using only primary protein sequences, researchers from Baidu Inc. and BioMap are testing the limits of rapid protein structure prediction. They propose HelixFold-Single, an end-to-end, MSA-free protein structure prediction pipeline. The model combines two key components: a large-scale protein language model (PLM) as its foundation and the essential folding modules of AlphaFold2.
The researchers assert that the co-evolutionary knowledge needed for MSA-free prediction can be learned by a large-scale protein language model (PLM) instead of being extracted from MSAs and templates. Large-scale language models have had great success in natural language processing in recent years, a field with notable parallels to protein research, and their capacity to learn grows significantly as model parameters are scaled up.
PLMs have been adopted in recent work to improve performance on many downstream tasks, including secondary-structure and function prediction. Through self-supervised learning on large collections of unlabeled protein sequences, PLMs can capture long-range relationships along a protein sequence and thereby improve downstream protein-related tasks.
To encode domain knowledge, the PLM maps a primary sequence into a single per-residue representation and a pairwise residue-residue representation. These representations are then processed by an EvoFormer-like module and AlphaFold2's structure module, which learn geometric information and predict atomic coordinates. Wiring the two components together produces an end-to-end differentiable model.
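The dataflow just described can be sketched in a few lines. Everything below is illustrative: the function names, toy dimensions, and random stand-ins for the learned modules are assumptions for exposition, not the authors' actual implementation.

```python
import random

random.seed(0)
L, d_single, d_pair = 8, 16, 4  # toy sizes; the real model uses far larger dims

def plm_encode(length):
    """Stand-in for the large-scale PLM: returns a per-residue ("single")
    representation and a pairwise residue-residue representation."""
    single = [[random.random() for _ in range(d_single)] for _ in range(length)]
    pair = [[[random.random() for _ in range(d_pair)]
             for _ in range(length)] for _ in range(length)]
    return single, pair

def structure_module(single, pair):
    """Stand-in for the EvoFormer-like blocks plus AlphaFold2-style
    structure module: maps representations to one 3-D coordinate per residue."""
    return [(random.random(), random.random(), random.random())
            for _ in range(len(single))]

# End-to-end pass: sequence representation in, atomic coordinates out.
single, pair = plm_encode(L)
coords = structure_module(single, pair)
```

Because both stages are (in the real model) differentiable, the structure loss can backpropagate all the way into the PLM.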
There are two training stages in HelixFold-Single.
- In the first stage, the large-scale PLM is trained on a masked-language prediction task over billions of unlabeled primary sequences.
- In the second stage, the entire model is trained end to end on experimental ground-truth protein structures augmented with AlphaFold2-generated predictions.
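The first stage's masked-language objective works by corrupting a fraction of residues and asking the model to recover them. A minimal sketch of the data-preparation side, with an assumed mask token and masking rate (real vocabularies and rates may differ):

```python
import random

random.seed(42)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # placeholder mask token (assumed)
SEQ = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up example sequence

def mask_sequence(seq, mask_rate=0.15):
    """Randomly hide a fraction of residues, returning the corrupted
    sequence and the positions/labels the PLM must recover."""
    masked = list(seq)
    targets = {}
    for i, residue in enumerate(seq):
        if random.random() < mask_rate:
            targets[i] = residue  # ground-truth label at this position
            masked[i] = MASK
    return "".join(masked), targets

corrupted, targets = mask_sequence(SEQ)
```

The PLM would then be trained to predict `targets[i]` at each masked position `i`, which is how it absorbs statistical (and, the authors argue, co-evolutionary) regularities from unlabeled sequences alone.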
The researchers evaluated their method on the CASP14 and CAMEO datasets, comparing it to AlphaFold2 and RoseTTAFold. The results show that on proteins with sufficient homologous sequences, HelixFold-Single achieves accuracy comparable to these approaches.
The team states that HelixFold-Single outperforms MSA-based techniques in prediction efficiency and could be used for protein-related tasks requiring large numbers of predictions. They also examine its performance on targets with varying numbers of homologous sequences. The results suggest that HelixFold-Single can make accurate structure predictions for most of the studied proteins.
This article is written as a research summary by Marktechpost staff based on the research paper 'HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative'. All credit for this research goes to the researchers on this project. Check out the paper and code.
Tanushree Shenwai is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of Artificial Intelligence in various fields. She is passionate about exploring new technological advancements and applying them to real life.