Useful Oneliners in Bioinformatics
In the past few months I have been working with a lot of biological data, which added up to ~400GB. Big Data on the road! While working with all these different types of biological data, such as ChIP-seq, RNA-seq, DNase-seq and DNA-seq, and all the different formats each data type uses, I have learned that your preprocessing scripts have to be efficient and quick.
In this post I will share some of the (bash) oneline scripts I have been using in my work. They might not be the most optimized ones, but using them was definitely faster than writing a code block in Python. The scripts below use command-line utilities such as awk, sed, and grep.
- AWK: created by Aho, Weinberger & Kernighan
- SED: stream editor
- GREP: global regular expression print
Oneline scripts
# converts lower case nucleotides in a fasta sequence to upper case (header lines are left as they are)
$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' chr4.fa > upper_chr4.fa
# removes lines containing a particular string
$ sed '/variableStep chrom=chr/d' K562_dnase_signals.wig > K562_dnase_signals_noHeader.wig
# prints the difference between the 3rd and 2nd column, sorts it numerically, and prints just the unique values
$ awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' sorted_peaks_tresholds.bed | sort -n | uniq
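Here is a small sketch of the same pipeline on toy data, to show why the sort has to be numeric: a plain `sort` compares text, so 9 would land after 100 and the duplicate values would not necessarily end up next to each other in numeric order. The file name and the coordinates are made up for illustration.

```shell
# Toy BED file (hypothetical data): chrom, start, end, separated by tabs.
workdir=$(mktemp -d)
printf 'chr1\t100\t110\nchr1\t200\t300\nchr1\t500\t509\nchr1\t700\t710\n' > "$workdir/peaks.bed"

# The peak widths are 10, 100, 9 and 10.
# sort -n orders them as numbers; uniq then collapses the adjacent duplicates.
widths=$(awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' "$workdir/peaks.bed" | sort -n | uniq)
echo "$widths"
# prints:
# 9
# 10
# 100
```

If you only need the unique values, `sort -n -u` should give the same result in one step.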
# loop for cluster job submission, extracting certain columns from multiple files
for ch in `seq 1 22`
do
bsub -q short -W 2:00 -R "rusage[mem=20000]" -e error.err -N "cut -f1,2,3,5 dnase_peak_chr${ch}.narrowPeak > dnase_peaks_HCT116_chr${ch}.narrowPeak"
echo "Job for chromosome ${ch} is submitted ..."
done
# removes header lines (the ones containing ">") from a fasta file
$ grep -v ">" file.fasta > file_without_header.txt
# sorts file by second column numerically (-k2,2 limits the sort key to the second column only)
$ sort -k2,2n test.wig > K562_chip_signal_chr21_filled.wig
# finds exact match for the 3rd column
$ awk '$3 == "chr1"' file
# splits human genome fasta file into chromosomes (one file per header, named after the header)
$ awk '/^>/ { if (out) close(out); out = substr($1,2) ".fa" } { print > out }' hg19.genome.fa
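To make the splitting logic easier to follow, here is a minimal sketch on a toy multi-fasta file. The file names and sequences are invented, and the `dir` variable is only there so that the output lands in a temporary directory. Every header line switches the output file, and every line (header included) is written to the file that is currently open.

```shell
# Toy genome with two "chromosomes" (hypothetical data).
workdir=$(mktemp -d)
printf '>chr1\nACGT\nTTGA\n>chr2\nGGCC\n' > "$workdir/toy.genome.fa"

# On a header line: close the previous output file and derive a new file name
# from the header (without the leading ">"). Then print every line to it.
awk -v dir="$workdir" '
  /^>/ { if (out) close(out); out = dir "/" substr($1, 2) ".fa" }
  { print > out }
' "$workdir/toy.genome.fa"

cat "$workdir/chr1.fa"
# prints:
# >chr1
# ACGT
# TTGA
```

Closing the previous file is optional for a handful of chromosomes, but it keeps the number of open files low when a fasta file has many sequences.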
# splits dnase file into chromosomes (the first column is used in the output file name)
$ awk '{ print >> ("dnase_" $1 ".wig") }' DNASE.K562.fc.signal.noHeader.wig
# removes first line from the file
$ sed '1d' test.dat > tmp.dat
# renames multiple files
$ find . -name "*.bg" -exec sh -c 'mv "$1" "${1%.bg}.wig"' _ {} \;
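The renaming relies on the shell parameter expansion `${1%.bg}`, which strips the shortest match of `.bg` from the end of the file name before `.wig` is appended. A quick sketch on two empty, made-up files in a temporary directory:

```shell
# Two empty placeholder files (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/sample_a.bg" "$workdir/sample_b.bg"

# For each *.bg file, find starts a small shell in which $1 is the file path;
# ${1%.bg} is that path with the trailing ".bg" removed.
find "$workdir" -name "*.bg" -exec sh -c 'mv "$1" "${1%.bg}.wig"' _ {} \;

ls "$workdir"
# prints:
# sample_a.wig
# sample_b.wig
```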
# finds the distance between start and end in the file
$ awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' DNASE.HeLa-S3.peak | less
# creates a new file with the lines between chromosome 1 and chromosome 2
$ sed -n "/.*chromosome 1.*/,/.*chromosome 2.*/p" HumanGenome38.fa > hg38_chr1.fa
# removes the last line in place (the chromosome 2 header picked up by the previous command)
$ sed -i '$ d' hg38_chr1.fa
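The two sed commands above work as a pair: the address range is inclusive, so the extracted block ends with the header of the next chromosome, and that is the line that has to go. A minimal sketch on toy data (made-up headers and sequences), writing to a new file instead of using `-i` so that it behaves the same with different sed versions:

```shell
# Toy fasta with two records (hypothetical data).
workdir=$(mktemp -d)
printf '>chromosome 1\nACGT\nGGTA\n>chromosome 2\nTTTT\n' > "$workdir/toy.fa"

# Print everything from the first header up to and including the second header.
sed -n '/chromosome 1/,/chromosome 2/p' "$workdir/toy.fa" > "$workdir/chr1.fa"

# The last line is now ">chromosome 2"; drop it.
sed '$d' "$workdir/chr1.fa" > "$workdir/chr1.clean.fa"

cat "$workdir/chr1.clean.fa"
# prints:
# >chromosome 1
# ACGT
# GGTA
```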
# takes only chromosome 1 from whole data (the ^ anchors the match to the start of the line)
$ grep -P '^chr1\t' test27.bg > K27ac_chr1.bg
# searches for a specific string within your file (with -r, a directory is searched recursively)
$ grep -r 'your_text' your.file
# changes the name in the first column by prepending "chr" to it
$ awk 'BEGIN{FS=OFS="\t"} {$1="chr"$1; print}' your.file | less -S
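A small sketch of the column renaming on toy data (hypothetical file and values). Note that the field is changed with a single `=`, which assigns; a double `==` would only compare the values and leave the column untouched.

```shell
# Toy BED-like file where chromosomes are given as bare numbers (hypothetical data).
workdir=$(mktemp -d)
printf '1\t100\t200\n2\t300\t400\n' > "$workdir/toy.bed"

# Assign the new value to the first field, then print the rebuilt line.
# Setting OFS keeps the output tab-separated after the field is modified.
awk 'BEGIN{FS=OFS="\t"} {$1 = "chr" $1; print}' "$workdir/toy.bed" > "$workdir/toy.chr.bed"

cat "$workdir/toy.chr.bed"
```

With this input, the first column becomes chr1 and chr2 and the other columns stay as they were.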
# removes the lines containing a specific string and saves the rest as new.file
$ grep -v "your_text" your.file > new.file
# extracts every 1000th line of a file (lines 1, 1001, 2001, ...)
$ awk 'NR%1000==1' your.file
# gets all the paired-ends (mates): keeps the lines whose 7th column is "="
$ awk -F'\t' '$7=="="' your.file
# extracts the first four columns (comma-separated) after skipping the first line of the file
$ awk -F'\t' 'NR>1 {print $1 "," $2 "," $3 "," $4}' Dnase.K562.relaxed.peaks.bed > relaxed_peaks_K562.bed
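As a quick check of the header-skipping pattern, here is a sketch on a toy tab-separated file with one header line (file name and values are invented). `NR>1` is true for every line except the first, and setting `OFS` to a comma is just a shorter way to write the same comma-joined output.

```shell
# Toy peaks file with a header line and five columns (hypothetical data).
workdir=$(mktemp -d)
printf 'chrom\tstart\tend\tname\tscore\nchr1\t10\t20\tp1\t5\nchr2\t30\t40\tp2\t7\n' > "$workdir/toy.peaks.bed"

# Skip line 1, then print the first four columns joined by commas.
awk 'BEGIN{FS="\t"; OFS=","} NR>1 {print $1, $2, $3, $4}' "$workdir/toy.peaks.bed" > "$workdir/toy.out"

cat "$workdir/toy.out"
# prints:
# chr1,10,20,p1
# chr2,30,40,p2
```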
I hope these oneliners will be a helpful and handy shortcut in your data processing as well. Enjoy your work and stay tuned for more :)