|
|
@@ -23,37 +23,48 @@
|
|
|
% where are limits (e.g. BAM)
|
|
|
% what is our focus (and maybe 'why')
|
|
|
|
|
|
-\chapter{Datatypes}
|
|
|
-\label{chap:datatypes}
|
|
|
-As described in previous chapters \ac{DNA} can be represented by a String with the buildingblocks A,T,G and C. Using a common fileformat for saving text would be impractical because the ammount of characters or symbols in the used alphabet, defines how many bits are used to store each single symbol.\\
|
|
|
-Storing a single \textit{A} with \ac{ascii} encoding requires 8 bit (\,excluding magic bytes and the bytes used to mark \ac{EOF})\, since there are at least $2^8$ or 128 displayable symbols. Since the \ac{DNA} buildingblocks only require a minimum of four letters, two bits are needed e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. Depending on the sequencing method, more than four letters are used. The complex process of sequencing \ac{DNA} is not 100\% preceice, so additional Letters are used to mark nucelotides that could not or could only partly get determined.\\
|
|
|
+\chapter{File Types Used to Store DNA}
|
|
|
+\label{chap:filetypes}
|
|
|
+As described in previous chapters, \ac{DNA} can be represented by a string over the building blocks A, T, G and C. Using a common file format for saving text would be impractical, because the number of symbols in the alphabet determines how many bits are needed to store each symbol.\\
|
|
|
+The \ac{ascii} table is a character set that was registered in 1975 and is still used today to encode text digitally. For communication purposes, larger character sets have replaced \ac{ascii}, but it is still used in situations where storage is scarce.
|
|
|
+% todo reason why ascii was superseded -> too few representable characters. Advantage today -> less overhead per character
|
|
|
+Storing a single \textit{A} with \ac{ascii} encoding requires 8 bits (excluding magic bytes and the bytes used to mark \ac{EOF}), because \ac{ascii} defines $2^7 = 128$ symbols and each symbol is usually stored in a full byte. The building blocks of \ac{DNA} require a minimum of four letters, so two bits per symbol are sufficient, since $2^2 = 4$.
|
|
|
+% cut out examples. Might be needed later or elsewhere
|
|
|
+% \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}.
|
|
|
+In most tools, more than four symbols are used. This is due to the complexity of sequencing \ac{DNA}: the process is not 100\% precise, so additional symbols are used to mark nucleotides that could not, or could only partly, be determined. Furthermore, a so-called quality score indicates, for each single nucleotide, how certain it is that the stored symbol was sequenced correctly.\\
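+
+The bit counts discussed above follow from the size of the alphabet. A fixed-length code over an alphabet $\Sigma$ needs $b(\Sigma)$ bits per symbol, as sketched in the following worked example. The notation $b(\Sigma)$, the five-symbol alphabet with one extra symbol N for an undetermined nucleotide, and the sequence length $n = 3 \cdot 10^9$ are illustrative assumptions and not values taken from any of the formats discussed here.
+\[
+  b(\Sigma) = \lceil \log_2 |\Sigma| \rceil, \qquad
+  b(\{A, C, G, T\}) = \lceil \log_2 4 \rceil = 2, \qquad
+  b(\{A, C, G, T, N\}) = \lceil \log_2 5 \rceil = 3
+\]
+% assumption: n is a hypothetical sequence length chosen only for illustration
+\[
+  \frac{8 \cdot n}{8}\,\mathrm{bytes} = 3 \cdot 10^9\,\mathrm{bytes}
+  \qquad \mbox{versus} \qquad
+  \frac{2 \cdot n}{8}\,\mathrm{bytes} = 0.75 \cdot 10^9\,\mathrm{bytes}
+\]
+Under these assumptions, storing one nucleotide per byte takes four times the space of a two-bit code, and a single additional symbol already raises the fixed-length cost from two to three bits per nucleotide.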
|
|
|
 More common everyday text encodings, such as the Unicode encoding UTF-16, require at least 16 bits per letter. Settling for \ac{ascii} therefore leaves room for improvement but is, on the other hand, more efficient than bulkier alternatives like Unicode.\\
|
|
|
|
|
|
-Several people and groups have developed different fileformats to store genomes. Unfortunally for this work, there is no defined standard filetype or set of filetypes therefor one has to gather information on which types exist and how they function by themself. In order to not go beyond scope, this work will focus only on fileformats that fullfill following factors:\\
|
|
|
+Several people and groups have developed different file formats to store genomes. Unfortunately for this work, there is no defined standard file type or set of file types, so one has to gather the information oneself. In order not to go beyond its scope, this work focuses only on file formats that fulfill the following criteria:\\
|
|
|
\begin{itemize}
|
|
|
  \item{the format has a reputation, either through a scientific paper that proved its superiority to other relevant tools or through broad usage of the format.}
|
|
|
- \item{the format does not include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
|
|
|
- \item{the format is open source.}
|
|
|
+  \item{the format should not specialize in only one type of \ac{DNA}.}
|
|
|
+  \item{the format mainly stores nucleotide sequences and does not necessarily include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
|
|
|
+  \item{the format is open source. Otherwise, improvements cannot be tested without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
|
|
|
  \item{the compression method used in the format is based on probabilities.}
|
|
|
\end{itemize}
|
|
|
|
|
|
-Some common fileformats would be:
|
|
|
+Information on available formats was gathered from various Internet platforms \autocite{ensembl, ucsc, ga4gh}.
|
|
|
+Some of the common file formats found are:
|
|
|
\begin{itemize}
|
|
|
% which is relevant?
|
|
|
- \item{FASTA}
|
|
|
- \item{\ac{FASTQ}}
|
|
|
- \item{twoBit}
|
|
|
- \item{SAM/BAM}
|
|
|
- \item{VCF}
|
|
|
- \item{BED}
|
|
|
+ \item{BED} % \autocite{bed}
|
|
|
+ \item{CRAM} % \autocite{cram}
|
|
|
+ \item{FASTA} % \autocite{}
|
|
|
+ \item{\ac{FASTQ}} % \autocite{}
|
|
|
+ \item{GFF} % \autocite{}
|
|
|
+ \item{SAM/\ac{BAM}} % \autocite{}
|
|
|
+ \item{twoBit}% \autocite{}
|
|
|
+ \item{VCF}% \autocite{}
|
|
|
+
|
|
|
\end{itemize}
|
|
|
% src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
|
|
|
-
|
|
|
-Since methods to store this kind of Data are still in development, there are many more filetypes. The few, mentioned above are used by different organisations and researchers and are backed by a scientific publication. % todo find sources to both points in last sentence
|
|
|
-%rewrite:
|
|
|
-In order to not go beyond the scope, this paper will only focuse on compression tools which are using standard formats.
|
|
|
+Since methods to store this kind of data are still in development, there are many more file types. The few mentioned above are used by different organisations and
|
|
|
+%todo calc percentage
|
|
|
+are backed by scientific papers.\\
|
|
|
+Considering the first criterion, a search through anonymously accessible \ac{ftp} servers showed that only two formats are in common use: FASTA, or its extension \ac{FASTQ}, and the \ac{BAM} format. %todo <- add ftp servers to cite
|
|
|
|
|
|
\section{\ac{FASTQ}}
|
|
|
+% todo add some fasta knowledge
|
|
|
 \ac{FASTQ} is a text-based format for storing sequenced data. It saves nucleotides as letters and, in addition, stores their quality values.
|
|
|
 \ac{FASTQ} files are organised in groups of four lines; each group contains the information for one sequence. The exact structure of the \ac{FASTQ} format is as follows:
|
|
|
\texttt{
|
|
|
@@ -76,33 +87,3 @@ The quality value shows the estimated probability of error in the sequencing pro
|
|
|
\caption{SAM/BAM file structure example}
|
|
|
\label{k_datatypes:bam-struct}
|
|
|
\end{figure}
|
|
|
-
|
|
|
-\subsection{Valid Symbols}
|
|
|
-%- char set restrictions: \ac{ascii} in range ! to ~ apart from '@'
|
|
|
-%- ... as regex (todo: describe in words):
|
|
|
-% This RegEx -> \[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
|
|
|
-The regulare expression, shown above, filters touple of characters from a to z in lower and uppercase, numbers and several special characters. There are minor differences between the special chars first and second element of the touple, in the latter '*' and '=' are allowed which are not allowed in the first.
|
|
|
-
|
|
|
-%\subsection{Index File}
|
|
|
-
|
|
|
-% notes
|
|
|
-%- may contian metadata: magic string, ref seq, bins, offsets, unmapped reads
|
|
|
-%- allows viewing BAM data (localy and remote via ftp/http)
|
|
|
-%- file extention: <filename>.bam.bai
|
|
|
-
|
|
|
-%- stores more data than \ac{FASTQ}
|
|
|
-
|
|
|
-% src: https://support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm
|
|
|
-%- allignment section includes
|
|
|
-% - RG Read Group
|
|
|
-% - BC Barcode Tag
|
|
|
-% - SM Single-end alignment quality
|
|
|
-% - AS Paired-end alignment quality
|
|
|
-% - NM Edit distance tag
|
|
|
-% - XN amplicon name tag
|
|
|
-
|
|
|
-%- BAM index files nameschema: <filename>.bam.bai
|
|
|
-
|
|
|
-\section{Compressed Reference-oriented Ailgnment Map}
|
|
|
-\ac{CRAM} was developed as an alternative to the \ac{SAM} and \ac{BAM} Format. It specification is maintained by \ac{GA4GH}. It features both lossy and lossless compression mode. Since it is not relevant to this work, the lossy compression is ignored from here on. Even though it is part of \ac{GA4GH} suite, the file format can be used independently.\\
|
|
|
-The format saves data in containers which consist out of slices. Each slice is represented by a line in the file. Container and slices each store metadata in a header. Data is stored as blocks in slices, in a compressed form.
|