Using Machine Learning to Detect Mutations Occurring in RNA Splicing

6 min read · Apr 28, 2019

Retinitis pigmentosa (RP) is one of the most common inherited retinal dystrophies in the world, affecting roughly 1 in 3,000 people.

Retinitis Pigmentosa 38 — variation caused by the MERTK defective gene

Retinal dystrophies are inherited genetic diseases that cause severe loss in vision over time, as a result of loss of function in the retina.

For many genetic diseases like RP, promising research suggests that mutations affecting the splicing process may be a cause of the disease. According to a paper in Nature, nearly 38% of autosomal dominant forms of RP (where one mutant copy of a gene from one parent is enough to cause the disease, even alongside a healthy copy from the other) showed mutations impacting splicing. These mutations affected genes coding for spliceosome factors, which are crucial to the process of splicing.

A patient with Hutchinson-Gilford progeria

However, for many rare diseases that are likely caused by mutations affecting splicing, such as Hutchinson-Gilford progeria, it’s unlikely that a drug will be available anytime soon, due to a lack of data. Few people with forms of progeria live past the age of 13.

These children are unlikely to make it to high school.

Splicing-related mutations are also behind dozens of other genetic disorders.

How can we solve this problem, and make the process of developing drugs for genetic diseases caused by splicing-related mutations more affordable?

Before we look into how ML is being used to assess the impacts of mutations on splice sites and identify key targets for disease, let’s go deep into understanding splicing.

How Does RNA Splicing Work?

As DNA is transcribed into RNA, the RNA needs to be edited to remove non-coding regions, or introns, shown in green. This editing process is called splicing: the introns are removed, leaving only the yellow, protein-coding regions, called exons. The main goal of RNA splicing is to remove the non-coding regions from the transcript.

RNA splicing begins with the assembly of helper proteins at the intron/exon borders.

One end of the intron is cut, forming a loop between both sides of the intron

These splicing factors guide small nuclear ribonucleoproteins (snRNPs) to form a splicing machine called the spliceosome, which is made up of RNA and about 150 proteins. The job of the spliceosome is to bring the two exons on either side of the intron very close together so the RNA strand can be cut. One end of the intron is cut and then folded back on itself, forming a loop.

The edited RNA is released by the spliceosome

The spliceosome then cuts the other end of the intron, releasing the intron loop, and joins the two exons together. The edited RNA and the intron are then released, and the spliceosome’s various proteins disassemble. This process is repeated for every intron in the RNA.
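The cut-and-join steps above can be sketched in a few lines of Python. This is only a toy illustration, not how a spliceosome or any real tool works: the sequence, the intron coordinates, and the `splice` helper are all made up for this example, with exons written in uppercase and introns in lowercase so the result is easy to check by eye.

```python
def splice(pre_mrna, introns):
    """Remove introns and join the remaining exons.

    `introns` is a list of (start, end) positions, where `start` is the
    index of the first intron base and `end` is one past the last one.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        # Keep everything between the previous intron and this one.
        exons.append(pre_mrna[position:start])
        position = end
    # Keep the final exon after the last intron.
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Hypothetical transcript: uppercase = exon, lowercase = intron.
pre_mrna = "AUGGCC" + "guaaguuuuag" + "GAUUAC" + "guccccag" + "UGA"
introns = [(6, 17), (23, 31)]

mature_mrna = splice(pre_mrna, introns)
print(mature_mrna)
```

Because the loop runs once per intron, it mirrors the idea that splicing is repeated for every intron in the RNA.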

Each intron is bounded by two splice sites: the donor site, located at the end of the first exon where the intron begins, and the acceptor site, located at the beginning of the second exon where the intron ends. Often, due to genetic variants, the positions of the donor and acceptor sites where splicing occurs shift, and parts of the intron are included in the final RNA. These mutations are potentially disease-causing, which is why accurate splicing is so important.

Different locations of the donor and acceptor site occur due to different variants in the gene

If you’d like to go deeper into splicing, how the spliceosome functions, and more, I’d recommend this video from the Cold Spring Harbor Laboratory.

Understanding COSSMO — Predicting Competitive Alternative Splice Site Selection using Deep Learning

In genetics and computational biology, splicing codes are computational models that predict the sites at which splicing occurs. Splicing codes help us better understand the different sequences being produced and their functions.

Other models predict whether an exon is constitutively spliced or alternatively spliced. Constitutive splicing is where every transcript is spliced the same way when removing introns, so the exons keep the same order every time and produce the same protein. Alternative splicing, on the other hand, is when particular exons from the gene are included or excluded in creating mRNA, producing different proteins.

COSSMO, on the other hand, predicts a usage distribution over multiple competing splice sites in a given gene: over alternative acceptor sites conditional on a constitutive donor site, and vice versa.
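To make the difference concrete, here is a minimal Python sketch. The exon sequences and the `build_mrna` helper are hypothetical and chosen only for illustration; real alternative splicing involves far more than switching exons on and off.

```python
def build_mrna(exons, included):
    """Join the exons whose matching flag in `included` is True."""
    return "".join(exon for exon, keep in zip(exons, included) if keep)


# Hypothetical exons of a single gene, in genomic order.
exons = ["AUGGCC", "GAUUAC", "UGA"]

# Constitutive splicing: every exon is kept, in the same order, every time.
constitutive = build_mrna(exons, [True, True, True])

# Alternative splicing: here the middle exon is skipped,
# which gives a different mRNA and therefore a different protein.
exon_skipped = build_mrna(exons, [True, False, True])

print(constitutive)
print(exon_skipped)
```

The same gene yields two different mRNAs depending only on which exons are included, which is the behaviour these models try to predict.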

In other words, the model predicts which of the candidate donor and acceptor sites are actually used when an intron is cut out, through either alternative or constitutive splicing. To do this, it has to capture the inherent strength of each splice site, because in alternative splicing numerous splice sites compete for the spliceosome’s recognition and not all exons are included in the mRNA. This competition can produce many different protein variants, so it is important to model it, especially when there are mutations in the gene.

The PSI among different acceptor sites

This is why COSSMO also predicts the percent-selected index (PSI) for each site, that is, the chance of that site being used, by applying the softmax activation function to the scores produced for the competing splice sites.
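As a rough sketch of that last step, the snippet below applies softmax to a handful of splice site scores to turn them into a usage distribution. The scores are invented for this example, and this is not COSSMO’s actual code: in the real model the scores come from a neural network that reads the sequence around each site.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the result.
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three competing acceptor sites.
site_scores = [2.0, 0.5, -1.0]

# Percent-selected index for each site: the predicted chance it is used.
selection_probs = softmax(site_scores)

for score, prob in zip(site_scores, selection_probs):
    print(f"score={score:+.1f} -> selected {prob:.1%} of the time")
```

Because softmax normalizes across all candidates, raising one site’s score automatically lowers the predicted usage of its competitors, which matches the idea of splice sites competing for the spliceosome.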

COSSMO will allow us to learn more about splicing errors and mutations affecting splicing, which cause about 15% of genetic diseases. You can even try COSSMO right over here.

Imagine being able to detect which mutations affect splicing and go on to cause diseases like retinitis pigmentosa or Hutchinson-Gilford progeria! We’d be able to create therapeutics far more accurately and make it far more affordable to develop a drug!

And that’s exactly what Deep Genomics, the company behind COSSMO, is working on: developing oligonucleotide therapeutics using AI. Their goal is to bring down the cost of creating therapeutics by incorporating AI into drug discovery and development. Other companies such as Cyclica and Atomwise are focusing on using AI for creating drugs as well.

The implications of COSSMO and splicing codes are huge, and I can’t wait to see what Deep Genomics focuses on next!

Next steps

If you enjoyed this article, be sure to follow these steps to keep in touch with my future projects and articles!

Connect with me on LinkedIn to hear about my future developments and projects. I’m currently looking into computational biology and machine learning!
My website is now up with all my content, as well as my GitHub.
Be sure to subscribe to my monthly newsletter to see new projects, conferences I go to, and articles I put out!
Feel free to email me at seyonec@gmail.com to talk about this project and more!

Written by Seyone Chithrananda