Useful Oneliners in Bioinformatics

In the past few months I have been working with a lot of biological data, which ended up totalling ~400 GB. Big Data on the road! While working with these different types of biological data, such as ChIP-seq, RNA-seq, DNase-seq, and DNA-seq, and the different file formats each of them uses, I have learned that preprocessing scripts have to be efficient and quick.

In this post I will share some of the (bash) oneline scripts I have been using in my work. They may not be the most optimized ones, but using them was definitely faster than writing a block of code in Python. The scripts below use command-line utilities such as awk, sed, and grep.

AWK: created by Aho, Weinberger & Kernighan
SED: stream editor
GREP: global regular expression print

Oneline scripts


# converts the fasta sequence lower case nucleotides to upper case 
$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' chr4.fa > upper_chr4.fa 

# removes lines containing a particular string
$ sed '/variableStep chrom=chr/d' K562_dnase_signals.wig > K562_dnase_signals_noHeader.wig

# prints the difference between the 3rd and 2nd column, then sorts it numerically, and prints just unique values
$ awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' sorted_peaks_tresholds.bed | sort -n | uniq
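A small sketch of why the sort flag matters here: without -n, sort compares the widths as text, so 1000 lands before 20. The interval values below are made up for illustration.

```shell
# Toy BED-like intervals (made-up values): chrom, start, end, tab-separated.
widths=$(printf 'chr1\t100\t300\nchr1\t500\t520\nchr2\t40\t240\nchr2\t10\t1010\n' |
  awk 'BEGIN{FS=OFS="\t"} {print $3-$2}')

# The same widths in text order and in numeric order.
lexical=$(printf '%s\n' "$widths" | sort | uniq | paste -sd' ' -)
numeric=$(printf '%s\n' "$widths" | sort -n | uniq | paste -sd' ' -)

echo "text order:    $lexical"
echo "numeric order: $numeric"
```

Since uniq only collapses adjacent duplicates, it has to come after the sort either way.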

# loop for cluster job submission for extracting certain columns from multiple files
for ch in `seq 1 22`
do
 bsub -q short -W 2:00 -R "rusage[mem=20000]" -e error.err -N "cut -f1,2,3,5 dnase_peak_chr${ch}.narrowPeak > dnase_peaks_HCT116_chr${ch}.narrowPeak"
 echo "Job for chromosome ${ch} is submitted ..."
done

# removes headers from a file
$ grep -v ">" file.fasta > file_without_header.txt


# sorts file by second column numerically
$ sort -k2 -n test.wig > K562_chip_signal_chr21_filled.wig

# finds exact match for the 3rd column
$ awk '$3 == "chr1"' file

# splits human genome fasta file into chromosomes (one file per header line)
$ awk 'BEGIN { chr="" } { if ($1 ~ /^>/) chr=substr($1,2); print $0 > (chr ".fa") }' hg19.genome.fa
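To see what the split produces, here is a minimal sketch on a toy two-sequence FASTA. The file names and sequences are made up, and the awk program is a slightly restructured variant that also closes each output file before opening the next one, which assumes every sequence name appears only once.

```shell
tmpdir=$(mktemp -d)
printf '>chrA\nACGT\nTTGA\n>chrB\nGGCC\n' > "$tmpdir/toy.genome.fa"

# On each header line, switch the output file to "<sequence name>.fa";
# every line (header included) is written to the current output file.
(
  cd "$tmpdir" &&
  awk '/^>/ { if (out) close(out); out = substr($1, 2) ".fa" } { print > out }' toy.genome.fa
)

lines_a=$(grep -c . "$tmpdir/chrA.fa")
lines_b=$(grep -c . "$tmpdir/chrB.fa")
echo "chrA.fa: $lines_a lines, chrB.fa: $lines_b lines"
rm -r "$tmpdir"
```

Only the first word of the header is used for the file name, so a header like ">chr1 some description" would end up in chr1.fa.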

# splits dnase file into chromosomes (by the first column); appends, so remove old output files before rerunning
$ awk '{ print >> ("dnase_" $1 ".wig") }' DNASE.K562.fc.signal.noHeader.wig

# removes first line from the file
$ sed '1d' test.dat > tmp.dat

# renames multiple files
$ find . -name "*.bg" -exec sh -c 'mv "$1" "${1%.bg}.wig"' _ {} \;
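The renaming relies on the shell parameter expansion "${1%.bg}", which strips a trailing ".bg" from the file name. A minimal sketch of the same idea with a plain loop, on made-up file names in a temporary directory (unlike find, the loop does not descend into subdirectories):

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/sample1.bg" "$tmpdir/sample2.bg" "$tmpdir/notes.txt"

# "${f%.bg}" removes the shortest trailing match of ".bg" from $f,
# so sample1.bg becomes sample1 and then sample1.wig.
for f in "$tmpdir"/*.bg; do
  mv "$f" "${f%.bg}.wig"
done

renamed=$(ls "$tmpdir" | paste -sd' ' -)
echo "$renamed"
rm -r "$tmpdir"
```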

# finds the distance between start and end in the file
$ awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' DNASE.HeLa-S3.peak | less

# creates a new file with the lines between chromosome 1 and chromosome 2 
$ sed -n "/.*chromosome 1.*/,/.*chromosome 2.*/p" HumanGenome38.fa > hg38_chr1.fa

# removes the last line (edits the file in place)
$ sed -i '$ d' hg38_chr1.fa
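The two sed commands above presumably belong together: a sed range prints through the line that matches the end pattern, so the extracted file ends with the header of the next chromosome, and deleting the last line removes it. A minimal sketch on a made-up FASTA:

```shell
tmpdir=$(mktemp -d)
printf '>chromosome 1\nACGT\nGGTA\n>chromosome 2\nTTTT\n' > "$tmpdir/toy.fa"

# Range print: from the first header through the second header, inclusive.
with_next_header=$(sed -n '/chromosome 1/,/chromosome 2/p' "$tmpdir/toy.fa")

# Dropping the last line removes the trailing ">chromosome 2" header.
chr1_only=$(printf '%s\n' "$with_next_header" | sed '$d')

printf '%s\n' "$chr1_only"
rm -r "$tmpdir"
```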

# takes only the chromosome 1 lines from the whole data
$ grep -P '^chr1\t' test27.bg > K27ac_chr1.bg

# searches for a specific text within your file
$ grep 'your_text' your.file

# changes the name in the first column by prepending "chr"
$ awk -v OFS='\t' '{$1="chr"$1; print}' your.file | less -S

# removes the lines containing a specific word and saves the rest as new.file
$ grep -v "your_text" your.file > new.file 

# extracts every 1000th line of a file (lines 1, 1001, 2001, ...)
$ awk -v OFS='\t' 'NR%1000==1 {print}' your.file

# gets all the paired-ends (mates) mapped to the same reference, assuming SAM-formatted input on standard input (column 7 is "=")
$ awk -v OFS='\t' '$7=="=" {print}'
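One thing to watch in awk patterns: a single = assigns, while == compares. A minimal sketch of the difference on three made-up, SAM-like records (shortened to eight columns for readability):

```shell
# Made-up records; column 7 mimics the SAM RNEXT field.
records=$(printf 'r1\t99\tchr1\t100\t60\t50M\t=\t300\nr2\t65\tchr1\t500\t60\t50M\tchr2\t900\nr3\t4\t*\t0\t0\t*\t*\t0\n')

# Comparison: keeps only the rows whose 7th field is exactly "=".
compared=$(printf '%s\n' "$records" | awk -F'\t' '$7 == "="' | cut -f1 | paste -sd' ' -)

# Assignment: overwrites field 7 in every row and, because the assigned
# string "=" is non-empty (true), prints every row.
assigned=$(printf '%s\n' "$records" | awk 'BEGIN{FS=OFS="\t"} ($7 = "=")' | cut -f1 | paste -sd' ' -)

echo "with ==: $compared"
echo "with = : $assigned"
```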

# extracts the first four columns (comma-separated) after skipping the first line of the file
$ awk 'NR>1 {print $1 "," $2 "," $3 "," $4}' Dnase.K562.relaxed.peaks.bed > relaxed_peaks_K562.bed
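As a quick check of the header-skipping idea, here is a minimal sketch on a made-up peak table. It sets the output separator through OFS instead of concatenating "," by hand, which is just an alternative way to get the same output.

```shell
# Toy peak table with a header row (made-up values).
peaks=$(printf 'chrom\tstart\tend\tname\tscore\nchr1\t100\t300\tp1\t8.5\nchr2\t40\t240\tp2\t3.1\n')

# NR > 1 skips the first record; OFS="," joins the printed fields with commas.
csv=$(printf '%s\n' "$peaks" | awk 'BEGIN{FS="\t"; OFS=","} NR > 1 {print $1, $2, $3, $4}')

printf '%s\n' "$csv"
```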

I hope these oneliners will be helpful and handy shortcuts in your data processing as well. Enjoy your work and stay tuned for more :)

Written on July 28, 2016