Multi Hit, Dropoff Percentage and NCM-2: Three Improvements in BLAST
Correspondence Address :
Dr Deepak Garg,Computer Science Department, Thapar University,
Patilala-147004, India. Email : deep108@yahoo.com, Ph: +91-9815599654
Various algorithms are in use in medical processes to improve the speed, sensitivity and accuracy of the computations and analyses involved in those experiments.
The aim of this paper is to suggest three improvements, namely Multi Hit, Dropoff percentage and NCM-2 in the BLAST algorithm.
BLAST (Basic Local Alignment Search Tool) is a popular tool used for determining the patterns in genomic sequences. As the data is increasing exponentially, the need for advanced and complex algorithms for improving the accuracy, speed and sensitivity of pattern discovery tools in bioinformatics is also increasing.
First Improvement: The initialization of the word matches in a pairwise sequence alignment works either on single hit or two-hit algorithms. Instead, if we use a 3-hit or n-hit in general then the results improve in general and improve dramatically for some specific species and sequences.
Second Improvement: BLAST is using a drop-off score to calculate the highest scoring pairs between two sequences. A change has been proposed to calculate the threshold score that determines the inclusion of the subsequence in the result. Instead of using a drop-off score, if we use a drop-off percentage, it gives better results for some sequences.
Third Improvement: We propose an NCM-2 approach for normalizing BLAST values for simple regions. This approach is based upon the natural properties of the Amino acid sequences. The algorithms have been run on Linux ES platform with Compaq Presario 2GB RAM and compared to the original BLAST.
Nowadays, medical practitioners are increasingly relying on sequence comparison tools to understand the various properties of the genomes under consideration. BLAST is being used for pairwise alignments instead of multisequence alignments. BLAST can perform local as well as global alignments.
In case of global alignments, it sacrifices some of the advantages of the local alignments. It is the most popular tool being used by the biologists and by people involved in the clinical trials of various drugs. Detection of the SARS disease source and then finding a remedy for it is a good example of success of BLAST-like tools.
These kinds of algorithms are also being used in pulse oximetry to remove the background noise. These tools have been used for solving various other medical problems. The examples and references are available at NCBI [http://ncbi.nlm.nih.gov].
The central idea of the BLAST algorithm is that a statistically significant alignment that will be of use to the doctors is likely to contain a high-scoring pair of aligned words. BLAST first scans the database of genomes for words that score at least T when aligned with some word within the query genome sequence. Any aligned word pair satisfying this condition is called a `hit`. In the second part, BLAST checks whether each hit lies within an alignment with a score sufficient for the sequence to be part of the result set. It will be achieved by extending a hit in both directions until the running alignment’s score has dropped more than X below the maximum score yet attained. This extension step is computationally quite expensive. The extension step typically accounts for two third time of BLAST execution time. It is therefore desirable to reduce the number of extensions performed (1),(2).
A two-hit algorithm was made to solve this problem. It is observed that an HSP (High Scoring Pairs) of interest is much longer than a single word pair and may therefore entail multiple hits on the same diagonal. The multiple hits should be in relatively short distance of one another. Specifically, a window length A is chosen, and it invokes an extension only when two non-overlapping hits are found within the distance A of one another on the same diagonal. Any hit that overlaps with the most recent one is ignored. We require two hits rather than one to invoke an extension. Therefore, the threshold parameter T must be lowered to retain comparable sensitivity. The effect is that many more single hits are found, but only a small fraction has an associated second hit on the same diagonal that triggers an extension. The great majority of hits may be dismissed after the minor calculation of looking up, for the appropriate diagonal, the co-ordinate of the most recent hit. After checking whether it is within distance A of the current hit’s coordinate, the old is finally replaced with the new co-ordinate. Empirically, the computation saved by requiring fewer extensions more than offsets the extra computation required to process the larger number of hits.
Multi hit
There are very rare chances that the occurrence of disease patterns will only be at one or two places in the genome sequence. Generally, it will be spread at various places in the genome sequence. When we look up the database with the query sequences to find some similarity between the sequence under scrutiny and the other sequences present in the database, the result will have the details of any functional or relational match between the two (3),(4).
The proposed new approach extends a two-hit algorithm to an N-hit algorithm. Here, the value of N can be given by the user, depending upon the requirements. The value of N will then act as a tradeoff between speed and sensitivity. According to the N-hit algorithm, we can choose a window length A. The algorithm step invokes an extension only when N non-overlap
A test set was constructed by selecting the 15 families from the Pfam database that are high in ranking. The Pfam has about 5000 protein families. Forty protein sequences from these families were chosen randomly. This made up a total of 600 sequences. This test set contains fragments as well as full protein sequences. Some proteins have multiple domains and some do not have any domain at all.
This data set has both protein fragments and full protein sequences. The SEG was run with the parameter values W = 12, K(1) = 2.2 bit and K(2) = 2.5 bit. The threshold score was 40 for CAST. The cutoff = 0.01, max-search offset =4 and min-search-offset=1 was used for XNU.
XNU, SEG and CAST use masking operations. Here, DR represents the detection ratio and is calculated as
DR = (Number of domains in masked sequence/ Number of domains in unmasked given sequence) * 100
HR represents Hit ratio and is calculated as
HR = (number of masked residues outside the domains of the sequence/number of masked residues in the filtered sequence) * 100
The objective of filtering is to mask non-domain regions without masking the domain regions. The detection ratio may decrease by masking the domain regions. We get a maximum detection ratio. NCM-2 does not have any masking operation and the Pfam database entries themselves have their domains identified with the algorithm inbuilt into them. Therefore, NCM-2 detects 100% of the simple regions. Any algorithm without masking will identify 100% of the sequences. The objective of masking was to filter out high scoring database subsequences coming in the result of the alignment showing higher similarity. The same task is now being performed by the semantic introduced above. It is based on the composition of the sequence and also depends on the number of matches, mismatches, neutral matches and conservative matches. Time is saved because there is no need to perform the masking operation. The same saving in time can be utilized to calculate the probabilities related to C,N,M+,M- of NCM-2. As shown in Table/Fig 4], NCM-2 will detect 100 percent of the sequences due to non masking as compared to XNU(97.2 ), SEG (93.4 ), CAST(98.4). This method also takes care of the non-genuine sequences and eliminates them from the list of significant matches.
By introducing the new approach of multi-hit with drop off percentage, there was a significant improvement in the running time of BLAST. The biologist also has the flexibility to get the result of his own choice by changing either the value of the number of hits or the drop off percentage or both. The introduction of multiple hits will have an effect on the sensitivity if the number of hits selected is more. In the case of three hits, the algorithm shows an improvement ranging from 11 to 26 percent on different sequences. The introduction of the drop off percentage results in less number of calculations because there is no need to go to the scoring matrix and the resulting improvement ranges from 1 to 5 percent. The combined algorithm implementation of multi-hit with drop off percentage shows an improvement in the range of 7 to 21 percent. Therefore, using this method, the biologist can derive the dual benefits of flexibility and speed without compromising on any other aspect. Due to NCM-2 no information will be lost and genuine sequences will travel to the result of the homology searches.
- Emerging Sources Citation Index (Web of Science, thomsonreuters)
- Index Copernicus ICV 2017: 134.54
- Academic Search Complete Database
- Directory of Open Access Journals (DOAJ)
- Embase
- EBSCOhost
- Google Scholar
- HINARI Access to Research in Health Programme
- Indian Science Abstracts (ISA)
- Journal seek Database
- Popline (reproductive health literature)
- www.omnimedicalsearch.com