GENOMEDATA
A format for efficient storage of multiple tracks of numeric data anchored to a genome.
Genomedata provides a way to store and access large-scale functional genomics data in a format that is both space-efficient and supports efficient random access.
Genomedata is implemented as one or more HDF5 files, but it provides a transparent interface to the underlying data, so you do not have to repeatedly parse large data files or keep them entirely in memory for random access.
- The Genomedata hierarchy:
Each Genome contains many Chromosomes.
Each Chromosome contains many Supercontigs.
Each Supercontig contains one continuous data set.
Each continuous data set is a numpy.array of floating point numbers with a column for each data track and a row for each base in the data set.
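The hierarchy can be pictured with a small stand-alone model. This is only an illustrative sketch: it uses plain Python lists in place of numpy arrays and HDF5 storage, and the names `Supercontig`, `Chromosome`, `start`, and `value` are hypothetical, not the Genomedata API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Supercontig:
    # Hypothetical field: 0-based chromosome coordinate of the first row.
    start: int
    # One row per base, one column per data track.
    continuous: List[List[float]]

    @property
    def end(self) -> int:
        return self.start + len(self.continuous)


@dataclass
class Chromosome:
    name: str
    supercontigs: List[Supercontig] = field(default_factory=list)

    def value(self, position: int, track: int) -> float:
        """Return the value at a base, or NaN if no supercontig covers it."""
        for sc in self.supercontigs:
            if sc.start <= position < sc.end:
                return sc.continuous[position - sc.start][track]
        return float("nan")


# A genome maps chromosome names to chromosomes.
genome: Dict[str, Chromosome] = {
    "chr1": Chromosome("chr1", [
        Supercontig(start=0, continuous=[[0.1, 1.0], [0.2, 2.0]]),
        Supercontig(start=100, continuous=[[0.3, 3.0]]),
    ]),
}

print(genome["chr1"].value(1, 1))   # base 1, second track
print(genome["chr1"].value(50, 0))  # not covered by any supercontig
```

Positions that fall between supercontigs are reported as NaN here, which mirrors the idea that undefined stretches are simply not stored.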
- Why have Supercontigs?
Genomic data seldom covers the entire genome; instead, it tends to be defined in large but scattered regions. To avoid storing the undefined data between those regions, chromosomes are divided into separate supercontigs when regions of defined data are far enough apart. Supercontigs also serve as convenient chunks, since each one can usually fit entirely in memory.
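The splitting idea can be sketched as follows. This is a minimal illustration, not Genomedata's actual procedure: the function name `group_into_supercontigs` and the `max_gap` threshold are assumptions made for the example, and the text above does not say what distance counts as "far enough apart".

```python
from typing import List, Tuple

Region = Tuple[int, int]  # 0-based, half-open (start, end)


def group_into_supercontigs(regions: List[Region], max_gap: int) -> List[Region]:
    """Merge defined regions whose separating gap is at most max_gap bases."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend that supercontig.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far away: start a new supercontig and skip the gap entirely.
            merged.append((start, end))
    return merged


regions = [(0, 1_000), (1_200, 5_000), (2_000_000, 2_050_000)]
supercontigs = group_into_supercontigs(regions, max_gap=10_000)
print(supercontigs)  # [(0, 5000), (2000000, 2050000)]
```

In this example the first two regions share one supercontig, while the nearly two million undefined bases before the third region are never stored.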
- Implementation
Genomedata archives are implemented as one or more HDF5 files. The API handles both single-file and directory archives transparently; the two options exist for performance reasons.
Use a directory with few chromosomes/scaffolds:
- Parallel load/access
- Smaller file sizes
Use a single file with many chromosomes/scaffolds:
- More efficient access with many chromosomes/scaffolds
- Easier archive distribution
Note:
The default behavior is to implement the Genomedata archive as a directory if there are fewer than 100 sequences being loaded and as a single file otherwise.
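The default rule in the note can be written as a one-line decision. This is a sketch of the rule as stated here, not code taken from Genomedata; the function name `default_archive_layout` and its return values are hypothetical.

```python
def default_archive_layout(num_sequences: int, threshold: int = 100) -> str:
    """Return "directory" for fewer than `threshold` sequences, else "file"."""
    return "directory" if num_sequences < threshold else "file"


print(default_archive_layout(2))     # e.g. chr1 and chrY only
print(default_archive_layout(5000))  # e.g. an assembly with many scaffolds
```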
Creation
A Genomedata archive contains sequence and may also contain numerical data associated with that sequence. You can easily load sequence and numerical data into a Genomedata archive with the genomedata-load command (see command details for additional details):
genomedata-load [-t trackname=signalfile]... [-s sequencefile]... GENOMEDATAFIL
The underlying commands are still installed and may be used if more fine-grained control is required (for instance, parallel data loading or adding additional tracks later). The commands and required ordering are:
- genomedata-load-seq
- genomedata-open-data
- genomedata-load-data
- genomedata-close-data
Entire data tracks can later be replaced with the following pipeline:
- genomedata-erase-data
- genomedata-load-data
- genomedata-close-data
Genomedata archives must include the underlying genomic sequence and can only be created with genomedata-load-seq. A Genomedata archive can be created without any tracks, however, using the following pipeline:
- genomedata-load-seq
- genomedata-close-data
Note:
A call to h5repack after genomedata-close-data may be used to transparently compress the data.
Let's look at a short example of creating a Genomedata archive.
Create a Genomedata archive from the following files:
- Two sequence files:
1. A text file, chr1.fa
2. A compressed text file, chrY.fa.gz
- Two signal files:
1. signal_low.wigFix
2. signal_high.bed.gz
- A text file, chr1.fa:
>chr1
taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
accctaaccctaaccctaaccctaaccct
- A compressed text file, chrY.fa.gz:
>chrY
ctaaccctaaccctaaccctaaccctaaccctaaccctCTGaaagtggac
- A signal file signal_low.wigFix:
fixedStep chrom=chr1 start=5 step=1
0.372
-2.540
0.371
-2.611
0.372
-2.320
- A signal file signal_high.bed.gz:
chrY 0 12 4.67
chrY 20 23 9.24
chr1 1 3 2.71
chr1 3 6 1.61
chr1 6 24 3.14
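To make the two signal formats concrete, the sketch below expands each into per-base values. It assumes the standard UCSC conventions (fixedStep `start` is 1-based; BED intervals are 0-based and half-open) and ignores the optional wiggle `span` parameter. The function names are hypothetical, and this is not how Genomedata itself reads the files.

```python
from typing import Iterator, Tuple

Value = Tuple[str, int, float]  # (chrom, 0-based position, value)

WIGFIX = """fixedStep chrom=chr1 start=5 step=1
0.372
-2.540
0.371
"""

BED = """chrY 0 12 4.67
chr1 1 3 2.71
"""


def wigfix_values(text: str) -> Iterator[Value]:
    """Yield per-base values from fixedStep wiggle text."""
    chrom, pos, step = "", 0, 1
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Header fields look like key=value after the first word.
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"]) - 1  # convert 1-based start to 0-based
            step = int(fields.get("step", "1"))
        else:
            yield chrom, pos, float(line)
            pos += step


def bed_values(text: str) -> Iterator[Value]:
    """Yield per-base values from four-column BED text (chrom, start, end, value)."""
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, value = line.split()[:4]
        for pos in range(int(start), int(end)):  # end is exclusive
            yield chrom, pos, float(value)


print(list(wigfix_values(WIGFIX)))
print(len(list(bed_values(BED))))  # 12 bases on chrY plus 2 on chr1
```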
The following command creates a Genomedata archive (genomedata.test) from the files defined above:
genomedata-load -s chr1.fa -s chrY.fa.gz -t low=signal_low.wigFix \
-t high=signal_high.bed.gz genomedata.test
Alternatively, the same archive can be created with the following pipeline:
genomedata-load-seq genomedata.test chr1.fa chrY.fa.gz
genomedata-open-data genomedata.test low high
genomedata-load-data genomedata.test low < signal_low.wigFix
zcat signal_high.bed.gz | genomedata-load-data genomedata.test high
genomedata-close-data genomedata.test
Note:
chr1.fa and chrY.fa.gz could also be combined into a single sequence file with two sequences.
- It is important that the sequence names (chrY, chr1) in the signal files match the sequence identifiers in the sequence files exactly.
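A quick pre-flight check for mismatched names could look like the sketch below. It is an illustration only: the helper names are hypothetical, it handles BED-style signal lines (not wigFix headers), and it takes the sequence identifier to be the first word after ">" in each FASTA header, which is an assumption about how identifiers are derived.

```python
from typing import Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect the first word after '>' from each FASTA header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.add(words[0])
    return names


def bed_names(lines: Iterable[str]) -> Set[str]:
    """Collect the first column (sequence name) of each non-empty BED line."""
    return {line.split()[0] for line in lines if line.strip()}


def unmatched_names(fasta: Iterable[str], bed: Iterable[str]) -> Set[str]:
    """Return signal sequence names that have no exactly matching FASTA identifier."""
    return bed_names(bed) - fasta_names(fasta)


fasta = [">chr1", "taaccctaaccc", ">chrY", "ctaaccctaacc"]
good_bed = ["chrY 0 12 4.67", "chr1 1 3 2.71"]
bad_bed = ["chry 0 12 4.67", "1 1 3 2.71"]

print(unmatched_names(fasta, good_bed))          # set()
print(sorted(unmatched_names(fasta, bad_bed)))   # ['1', 'chry']
```

The comparison is case-sensitive and prefix-sensitive, so both "chry" and "1" are flagged. Compressed inputs such as chrY.fa.gz can be read as text lines with Python's `gzip.open(path, "rt")`.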