Powerful algorithms can ‘predict’ the biological language of cancer and Alzheimer’s disease

Powerful algorithms used by Netflix, Amazon and Facebook can ‘predict’ the biological language of cancer and neurodegenerative diseases like Alzheimer’s, scientists have discovered.

Big data produced over decades of research was fed into a computer language model to see whether artificial intelligence can make more advanced discoveries than humans.

Academics based at St John’s College, University of Cambridge, have discovered that machine learning technology can decipher the ‘biological language’ of cancer, Alzheimer’s disease and other neurodegenerative diseases.

Their ground-breaking study was published in the scientific journal PNAS today (April 8, 2021) and could be used in the future to “correct grammatical errors within disease-causing cells.”

Professor Tuomas Knowles, lead author and associate at St John’s College, said: “The introduction of machine learning technology into neurodegenerative disease and cancer research is an absolute game-changer. Ultimately, the goal will be to use artificial intelligence to develop targeted drugs that will dramatically alleviate symptoms or prevent dementia altogether.”

Every time Netflix recommends a series to watch or Facebook suggests someone to befriend, the platforms use powerful machine learning algorithms to make highly educated guesses about what people will do next. Voice assistants like Alexa and Siri can even recognize individual people and instantly ‘talk’ back to you.

Dr. Kadi Liis Saar, the first author and researcher at St John’s College, used similar machine learning technology to train a comprehensive language model to look at what happens when something goes wrong with proteins in the body to cause disease.

She said: “The human body is home to thousands and thousands of proteins, and scientists do not yet know the function of many of them. We asked a neural network-based language model to learn the language of proteins.

“We specifically asked the program to learn the language of shape-shifting biomolecular condensates – droplets of proteins found in cells – that scientists really need to understand to crack the language of biological function and malfunction that cause cancer and neurodegenerative diseases like Alzheimer’s. We found that it could learn, without being explicitly told, what scientists have already discovered about the language of proteins over decades of research.”

Proteins are large, complex molecules that play many crucial roles in the body. They do most of the work in cells and are needed for the structure, function and regulation of body tissues and organs – antibodies, for example, are proteins that work to protect the body.

Alzheimer’s, Parkinson’s and Huntington’s disease are the three most common neurodegenerative diseases, but scientists believe there are several hundred.

In Alzheimer’s disease, which affects 50 million people worldwide, proteins go rogue, form clumps and kill healthy nerve cells. A healthy brain has a quality control system that effectively removes these potentially dangerous masses of proteins, known as aggregates.

Scientists now think that some disordered proteins also form liquid-like droplets of proteins, called condensates, that have no membrane and merge freely with each other. Unlike protein aggregates, which are irreversible, protein condensates can form and reform and are often compared to the blobs of shape-shifting wax in lava lamps.

Professor Knowles said: “Protein condensates have been attracting a lot of attention in the scientific world lately because they control key events in the cell, such as gene expression – how our DNA is converted into proteins – and protein synthesis – how cells make proteins.

“Any defects connected with these protein droplets can lead to diseases such as cancer. That is why bringing natural language processing technology into research on the molecular origins of protein malfunction is vital if we want to be able to correct grammatical errors within disease-causing cells.”

Dr Saar said: “We have fed the algorithm all the data stored about known proteins so that it can learn and predict the language of proteins in the same way these models learn about human language and how WhatsApp knows how to suggest words to you.

“Then we could ask it about the specific grammar that leads only some proteins to form condensates inside cells. It’s a very challenging problem, and unlocking it will help us learn the rules of the language of disease.”

Machine learning technology is evolving at a rapid pace due to increasing data availability, increased computing power, and technical advances that have created more powerful algorithms.

Further use of machine learning could transform future research into cancer and neurodegenerative diseases. Discoveries could be made beyond what scientists currently know and speculate about diseases, and potentially beyond what the human brain can understand without the help of machine learning.

Dr. Saar explained, “Machine learning can be freed from the constraints of what researchers think are the goals of scientific research, and that will mean finding new connections that we haven’t even imagined yet. It’s really exciting.”


St John’s College, University of Cambridge

Journal reference:

Saar, K. L., et al. (2021) Learning the molecular grammar of protein condensates from sequence determinants and embeddings. PNAS. doi.org/10.1073/pnas.2019053118.