Apriori Algorithm on Covid-19 virus genome sequence

6 min readJun 24, 2020

Covid-19 pandemic has occupied the whole world and it has ignited many research and development project ranging from vaccine development to app which ensures handling social distancing.

I this post we will see how we can use one of well known data mining algorithm Apriori algorithm.

Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. If you are new to this topic refer the below wiki post on Apriori algorithm, so that it will you a breif idea about it.

Apriori algorithm

Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It…

en.wikipedia.org

World has already seen outbreak of SARC and Mers visruses. And lot of research has already been carried out on them. So if we can correlate SARC and Mers genomic sequence with Covid-19 genomic sequence then we can use them to fight against this pandemic.

Most of these type of viruses uses receptors present on there cell body to enter into human cell and use our human cell as factory to produce more viral bodies. There are mainly two types of cycle carried out by virus in our body :

The lytic cycle involves the reproduction of viruses using a host cell to manufacture more viruses; the viruses then burst out of the cell.
The lysogenic cycle involves the incorporation of the viral genome into the host cell genome, infecting it from within.

If virus is blocked from entering the cell we can easily counter its attack as they can not start their replication cycle. This can achieved by blocking those receptor protein structure on virus cell body which acts as a lock and key mechanism to enter into human cell.

And also also by finding similarities in the cellular structure of Covid-19 with other viruses we can probably use the use the existing vaccines to get immunity against the Covid-19. This process will help to save lot of time and resources in developing new vaccines.

So below i am share a simple approach how we can do this using well known Apriori algorithm.

Here i will be using python and BioPython module for this analysis.

You can get BioPython module from this link :

Biopython

See also our News feed and Twitter. Biopython is a set of freely available tools for biological computation written in…

biopython.org

To start using this module lets us download genomic sequence of Covid-19, SRAC and Mers from Kaggle repository or NCBI website.

Now let us dive into the code

This is to import the required library and load the genome data

import Bio as b
from Bio import SeqIO
from Bio import pairwise2
from Bio.pairwise2 import format_alignments=[]
s.append(SeqIO.read("genome/cov2.fasta","fasta"))
s.append(SeqIO.read("genome/mers.fasta","fasta"))
s.append(SeqIO.read("genome/sars.fasta","fasta"))

Now we shall identify the subsequnece which codes for protein using start and stop codons. These codons acts as a delimiter in genomic sequence.

aminoAcid=[]
for s in RnaSeq:
    temp=[]
    for i in range(len(s)-3):
        if(s[i]=='A' and s[i+1]=='U' and s[i+2]=='G'):
            temp.append('*')
        elif(s[i]=='U' and s[i+1]=='A' and s[i+2]=='A'):
            temp.append('*')
        elif(s[i]=='U' and s[i+1]=='A' and s[i+2]=='G'):
            temp.append('*')
        elif(s[i]=='U' and s[i+1]=='G' and s[i+2]=='A'):
            temp.append('*')
        
        temp.append(s[i])
    aminoAcid.append(temp)

With some knowledge we can say that these cell receptor proteins are large in size hence they are encoded by a larger genomic sequence. Hence we can ignore the smaller sequences which encodes for some other proteins this will speed up of process. Here i am randomly taking only those sequence which are larger than 20 but always we change this and try out with other specific lenght.

proteins=[]
for ar in aminoAcid:
    temp=[]
    tempstr=''
    i=0
    while(i<len(ar)):
        if(ar[i]=='*'):
            if(len(tempstr)>=20): # taking only those sequence which are greater than 20.
                temp.append(tempstr)
            tempstr=''
            i+=3
        else:
            tempstr+=ar[i]
            i+=1
    proteins.append(temp)

Now let us perform pair wise comparisons of subsequences among Covid-19, SARC and Mers viruses using pairwise2 method of BioPython library.

commonProteins=[]
for i in range(1,len(proteins)):
    temp=[]
    for j in range(len(proteins[0])):
        for k in range(len(proteins[i])):
            matchResult=pairwise2.align.globalxx(proteins[0][j],proteins[i][k],one_alignment_only = True)
            for m in matchResult:
                if(m[2]/m[4]>0.5):
                    temp.append(tuple([proteins[0][j],proteins[i][k],m[0],m[1],m[2]/m[4]]))
    commonProteins.append(temp)

We can even further preprocess this data so that it matches our requirements. And lets us store this preprocessed data in a csv file which can be used later while applying Apriori algorithm.

import pandas as pddf1=pd.DataFrame(commonProteins[0])
df2=pd.DataFrame(commonProteins[1])df1.to_csv("covid-sarc.csv",index=False)
df2.to_csv("covid-mers.csv",index=False)

Now let us see how we can apply this preprocessed data in Apriori algorithm

We will be considering each sequence as unique item. And we will find out the association between these item using Ariori algorithm.

To understand this lets us take an example

Let us consider a customer buys a laptop from a shop, now the owner of shop wants to findout what are the chances of that customer buying laptop bag or some software for it from his shop in near future. Here he runs Apriori algorithm to find out this based on the data which he has collect from the past transactions. In the same way we shall consider sequence from SARC or Mers as laptop in the above example and laptop bag would represent our respective Covid-19 genomic sequence. In this way we can find the association between SARC, Mers and Covid-19.

Below is the code snippet for Apriori algorithm

import pandas as pd
import itertoolsdata = pd.read_csv('path to your csv file')minimum_support_count = 5
records = []
for i in range(0, 20):
    records.append([str(data.values[i,j]) for j in range(0, 72)])items = sorted([item for sublist in records for item in sublist if item != 'nan'])def stage_1(items, minimum_support_count):
    c1 = {i:items.count(i) for i in items}
    l1 = {}
    for key, value in c1.items():
        if value >= minimum_support_count:
           l1[key] = value 
    
    return c1, l1def stage_2(l1, records, minimum_support_count):
    l1 = sorted(list(l1.keys()))
    L1 = list(itertools.combinations(l1, 2))
    c2 = {}
    l2 = {}
    for iter1 in L1:
        count = 0
        for iter2 in records:
            if sublist(iter1, iter2):
                count+=1
        c2[iter1] = count
    for key, value in c2.items():
        if value >= minimum_support_count:
            if check_subset_frequency(key, l1, 1):
                l2[key] = value 
    
    return c2, l2
    
def stage_3(l2, records, minimum_support_count):
    l2 = list(l2.keys())
    L2 = sorted(list(set([item for t in l2 for item in t])))
    L2 = list(itertools.combinations(L2, 3))
    c3 = {}
    l3 = {}
    for iter1 in L2:
        count = 0
        for iter2 in records:
            if sublist(iter1, iter2):
                count+=1
        c3[iter1] = count
    for key, value in c3.items():
        if value >= minimum_support_count:
            if check_subset_frequency(key, l2, 2):
                l3[key] = value 
        
    return c3, l3def stage_4(l3, records, minimum_support_count):
    l3 = list(l3.keys())
    L3 = sorted(list(set([item for t in l3 for item in t])))
    L3 = list(itertools.combinations(L3, 4))
    c4 = {}
    l4 = {}
    for iter1 in L3:
        count = 0
        for iter2 in records:
            if sublist(iter1, iter2):
                count+=1
        c4[iter1] = count
    for key, value in c4.items():
        if value >= minimum_support_count:
            if check_subset_frequency(key, l3, 3):
                l4[key] = value 
        
    return c4, l4def sublist(lst1, lst2):
    return set(lst1) <= set(lst2)
    
def check_subset_frequency(itemset, l, n):
    if n>1:    
        subsets = list(itertools.combinations(itemset, n))
    else:
        subsets = itemset
    for iter1 in subsets:
        if not iter1 in l:
            return False
    return Truec1, l1 = stage_1(items, minimum_support_count)
c2, l2 = stage_2(l1, records, minimum_support_count)
c3, l3 = stage_3(l2, records, minimum_support_count)
c4, l4 = stage_4(l3, records, minimum_support_count)
print("L1 => ", l1,c1)
print("L2 => ", l2)
print("L3 => ", l3)
print("L4 => ", l4)itemlist = {**l1, **l2,**l3,**l4}def support_count(itemset, itemlist):
    return itemlist[itemset]sets = []
for iter1 in list(l2.keys()):
    subsets = list(itertools.combinations(iter1, 2))
    sets.append(subsets)list_l3 = list(l2.keys())
for i in range(0, len(list_l3)):
    for iter1 in sets[i]:
        a = iter1
        b = set(list_l3[i]) - set(iter1)
        confidence = (support_count(list_l3[i], itemlist)/support_count(iter1, itemlist))*100
        print("Confidence{}->{} = ".format(a,b), confidence)

This is my first ever post on medium, so if you have and suggestion or comments please put in the response section. So that it will help me to make better contents in future.

Thanks & Regards,

Manthan M Kulakarni