Algorithmic Complexity

Lots of people write code these days, but how many of them have developed the algorithmic thinking necessary for efficient and robust coding? Algorithmic complexity is one of the key topics everyone who codes should consider as a starting point when learning efficient, robust, and modular coding. This blog post is split into four sections:

  • Introduction
  • Types of Notations for Time Complexity
  • How to Calculate Time Complexity
  • Common Running Times

Introduction

Algorithmic complexity describes how fast or slow an algorithm performs. The complexity is defined as a numerical function T(n): time versus the input size n. Ideally, we want to define the time taken by an algorithm without depending on implementation details; in practice, however, the actual running time does depend on the implementation. The same algorithm will take different amounts of time on the same inputs depending on factors such as processor speed, instruction set, disk speed, and compiler. The way around this is to estimate the efficiency of each algorithm asymptotically. Time T(n) is measured as the number of elementary “steps” (defined in any reasonable way), provided each such step takes constant time.

For example, let’s consider the addition of two integers (a, b). Two integers are added digit by digit (or bit by bit), and this defines a “step” in our computational model. Therefore, we say that the addition of two n-bit integers takes n steps. Consequently, the total computational time is T(n) = t * n, where t is the time taken by the addition of two bits. On different computers, the addition of two bits might take different amounts of time, say t1 and t2; thus the addition of two n-bit integers takes T(n) = t1 * n and T(n) = t2 * n respectively. Different machines give different slopes, but in both cases the time T(n) grows linearly as the input size increases.
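Here is a minimal Python sketch of this step-counting model (my own illustration, not from any library; the function name is made up for this post). It counts one “step” per bit position, so the step count grows linearly with n:

def add_n_bit_integers(a_bits, b_bits):
    """Add two n-bit integers given as bit lists (least significant bit first)."""
    carry = 0
    result = []
    steps = 0
    for x, y in zip(a_bits, b_bits):   # one elementary "step" per bit position
        total = x + y + carry
        result.append(total % 2)
        carry = total // 2
        steps += 1
    result.append(carry)               # final carry bit
    return result, steps               # steps == n, so T(n) = t * n

bits_a = [1, 0, 1, 1]                  # 13 in binary, least significant bit first
bits_b = [1, 1, 0, 1]                  # 11 in binary, least significant bit first
print(add_n_bit_integers(bits_a, bits_b))   # ([0, 0, 0, 1, 1], 4) -> 24, in 4 steps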

Donald Ervin Knuth writes in Leaders in Computing: Changing the Digital World:

“An algorithm must be seen to be believed.”

So let’s understand how to calculate the time complexity of algorithms with some examples.

Types of Notations for Time Complexity

  1. Big Oh [O(n)] -> “fewer than or the same as” iterations
  2. Big Omega [Ω(n)] -> “more than or the same as” iterations
  3. Big Theta [Θ(n)] -> “the same as” iterations
  4. Little Oh [o(n)] -> “fewer than” iterations
  5. Little Omega [ω(n)] -> “more than” iterations

The rest of the post focuses on Big Oh complexity, because it guarantees an upper bound on the running time of an algorithm. It describes the worst possible case: we take the highest-order term of the polynomial function and ignore all constant factors, since they are not influential for sufficiently large inputs. Big Oh naturally gives a limit on how bad the worst case can be, not on how good the algorithm can be; that is why you use it to put an upper bound on the degree of badness. Similarly, when you want to give a lower bound on how good the best case is (i.e., even on good inputs there is still a limit on how fast the algorithm can go; what is that limit?), Big Omega comes in, which is associated with the best case. A small illustration follows below.
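For instance, here is a small linear-search illustration (my own example, not from the post): the worst case bounds the running time from above at O(n), while the best case bounds it from below at Ω(1).

def linear_search(items, target):
    for i, value in enumerate(items):
        if value == target:
            return i        # best case: found immediately -> Omega(1)
    return -1               # worst case: scanned all n items -> O(n)

data = [7, 3, 9, 1, 5]
print(linear_search(data, 7))   # best case: one comparison
print(linear_search(data, 5))   # worst case: n comparisons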

How to calculate Time Complexity

Basic operations (arithmetic, comparisons, accessing array elements, assignment):

The running time is always constant: O(1).

Example:

read(n)                        # O(1)
a = 10                         # O(1)
a = 1000000000000000000        # O(1)

If-then-else statements: take the maximum running time among the branches.

Example:

ball = int(input())                         # (1+1) = 2
game = ''                                   # 1
guest = 0                                   # 1
if ball < 2:                                # 1
    game = "No game!"                       # 1
else:
    game = "Start the game!"                # 1
    guest = guest + 1                       # 1+1 = 2

So, the complexity of the above code is T(n) = 2 + 1 + 1 + 1 + max(1, 1 + 2) = 8. With that being said, its Big Oh is still constant: T(n) = O(1).

Loops (for, while, repeat): the running time is the number of iterations multiplied by the running time of the statements inside the loop.

Example:

total = 0                     # 1
for i in range(1, ball):      # (1+1)*n = 2n
    total = total + i         # (1+1)*n = 2n

Its complexity is T(n) = 1 + 4n. So, T(n) = O(n).

Nested loops (a loop inside a loop): since there is at least one loop inside the main loop, the running time is the product of the sizes of all the loops, e.g. O(n^2) for two nested loops or O(n^3) for three.

Example:

for i in range(1, age):           # (1+1)*n = 2n
    for j in range(1, age):       # (1+1)*n*n = 2n^2
        x = x + 1                 # (1+1)*n*n = 2n^2
        print(x)                  # n*n = n^2

The time complexity of the nested loop above is T(n) = 2n + 5n^2, so T(n) = O(n^2).

Common Running Times

Okay, let’s go through an example to understand the most common running times. Assume we have a genetics book with chapters (the “Important”) which have unique names and mini sections (the “Less Important”) which may not have unique names. An acronym is assigned to at most one mini section or chapter. We will also assume that it takes constant time to flip to a specific page.

  • O(1) (worst case): Given the page that a chapter’s name is on and the chapter name, find the acronym.
  • O(1) (average case): Given the page that a mini section’s name is on and its name, find the acronym.
  • O(log n): Given a mini section’s name, find the acronym by picking a random point about halfway through the part of the book you haven’t searched yet, then checking whether the mini section’s name is at that point. Then repeat the process about halfway through the part of the book where the mini section’s name lies. (This is a binary search for a mini section’s name; see the sketch after this list.)
  • O(n): Find all mini sections whose acronyms contain the digit “3”.
  • O(n): Given an acronym, find the mini section or chapter with that label.
  • O(n log n): Our genetics book was spiral-bound, and by the time the binding lost its function, you dropped the book and all the pages fell out. You quickly collected all the pages and inserted them in a random order. Fix the ordering so that it’s correct by looking at the first word of the mini section name on each page and then putting that page in the appropriate spot in a new, empty book.
  • O(n^2) – Quadratic Time – the Bubble Sort algorithm (sketched after this list)
  • O(n^3) – Cubic Time – the same strategy as with O(n^2), one loop deeper
  • O(2^n) – Exponential Time – brute-force algorithms
  • O(n!) – Factorial Time – the Travelling Salesman Problem
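To make the O(log n) and O(n^2) items concrete, here are two minimal Python sketches. They are my own illustrations, assuming the mini section names are kept in a sorted list; the function names are made up for this post.

def binary_search(sorted_names, target):
    low, high = 0, len(sorted_names) - 1
    while low <= high:
        mid = (low + high) // 2            # check the halfway point
        if sorted_names[mid] == target:
            return mid                     # found the mini section
        elif sorted_names[mid] < target:
            low = mid + 1                  # repeat in the upper half
        else:
            high = mid - 1                 # repeat in the lower half
    return -1                              # not in the book

sections = ["ATP", "DNA", "RNA", "tRNA"]
print(binary_search(sections, "RNA"))      # halves the search range each step

And Bubble Sort for the O(n^2) item: two nested passes over the list give roughly n * n comparisons in the worst case.

def bubble_sort(items):
    n = len(items)
    for i in range(n):                     # outer pass: n iterations
        for j in range(n - i - 1):         # inner pass: up to n iterations
            if items[j] > items[j + 1]:    # swap neighbours that are out of order
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

print(bubble_sort([5, 1, 4, 2, 8]))        # [1, 2, 4, 5, 8]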

Words of conclusion

I hope this post helps you get a basic understanding of algorithmic complexity. From first-hand experience, I know that understanding algorithmic complexity is more than important for a quality piece of code. You don’t want to spend your youth waiting for a function to execute :P. An efficient, systematically analyzed, and robust algorithm is definitely what defines quality.


Useful Oneliners in Bioinformatics

In the past few months I have been working with a lot of biological data, which ended up being ~400 GB in size. Big Data on the road! While working with all these different types of biological data, such as ChIP-seq, RNA-seq, DNase-seq, and DNA-seq, and all the different formats each of the data types uses, I have learned that your preprocessing scripts have to be efficient and quick.

In this post I will share some of the (bash) one-liner scripts I have been using in my work. They might not be the most optimized ones, but using them was definitely faster than writing a code block in Python. The scripts below use command-line utilities such as awk, sed, and grep.

  • AWK: created by Aho, Weinberger & Kernighan
  • SED: stream editor
  • GREP: global regular expression print

One-liner scripts


# converts the fasta sequence lower case nucleotides to upper case 
$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' chr4.fa > upper_chr4.fa 

# removes lines with particular string
$ sed '/variableStep chrom=chr/d' K562_dnase_signals.wig > K562_dnase_signals_noHeader.wig

# prints the difference between the 3rd and 2nd column, then sorts it numerically and prints just the unique values
$ awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' sorted_peaks_tresholds.bed | sort -n | uniq

# loop for cluster job submission, extracting certain columns from multiple files
for ch in `seq 1 22`
do
  bsub -q short -W 2:00 -R "rusage[mem=20000]" -e error.err -N "cut -f1,2,3,5 dnase_peak_chr${ch}.narrowPeak > dnase_peaks_HCT116_chr${ch}.narrowPeak"
  echo "Chromosome${ch} is done ..."
done

# removes FASTA header lines from a file
$ grep -v ">" file.fasta > file_without_header.txt


# sorts file numerically by the second column
$ sort -k2 -n test.wig > K562_chip_signal_chr21_filled.wig

# finds exact match for the 3rd column
$ awk '$3 == "chr1"' file

# splits a human genome fasta file into chromosomes
$ cat hg19.genome.fa | awk 'BEGIN { chr="" } { if ($1~"^>") chr=substr($1,2); print $0 > (chr ".fa") }'

# splits dnase file into chromosomes
$ awk '{ print >> ("dnase_" $1 ".wig") }' DNASE.K562.fc.signal.noHeader.wig

# removes first line from the file
$ sed '1d' test.dat > tmp.dat

# renames multiple files
$ find . -name "*.bg" -exec sh -c 'mv "$1" "${1%.bg}.wig"' _ {} \;

# finds the distance between start and end in the file
$ awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' DNASE.HeLa-S3.peak | less

# creates a new file with the lines between chromosome 1 and chromosome 2 
$ sed -n "/.*chromosome 1.*/,/.*chromosome 2.*/p" HumanGenome38.fa > hg38_chr1.fa

# removes the last line of the file (in place)
$ sed -i '$ d' hg38_chr1.fa

# takes only chromosome 1 from whole data
$ grep -P 'chr1\t' test27.bg > K27ac_chr1.bg

# searches for a specific word within your file
$ grep -r 'your_text' your.file

# prepends "chr" to the name in the first column
$ awk -v OFS='\t' '{$1="chr"$1; print}' your.file | less -S

# removes lines containing a specific word and saves the result as new.file
$ grep -v "your_text" your.file > new.file

# extracts every 1000th line of a file
$ awk -v OFS='\t' 'NR%1000==1 {print}' your.file

# gets all the paired-end reads (mates), i.e. lines where column 7 is "="
$ awk -v OFS='\t' '$7=="=" {print}'

# extracts columns after skipping the first line of the file
$ awk 'NR>1 {print $1 "," $2 "," $3 "," $4}' Dnase.K562.relaxed.peaks.bed > relaxed_peaks_K562.bed

I hope these one-liners will be helpful and handy shortcuts in your data processing as well. Enjoy your work and stay tuned for more :)
