
added "bigger" files. need to write on results and create some graphs

u 3 years ago
parent
commit
1302948d38

+ 1 - 4
latex/tex/kapitel/k2_dna_structure.tex

@@ -29,9 +29,6 @@ To strengthen the understanding of how and where biological information is store
 \end{figure}
 
 All living organisms, like plants and animals, are made of cells (a human body can consist of several trillion cells) \cite{cells}.\\
-\autocite{cells}\\
-\textcite{cells}\\
-\footcite{cells}\\
 A cell is itself a living organism; the smallest one possible. It consists of two layers, of which the inner one is called the nucleus. The nucleus contains chromosomes, and those chromosomes hold the genetic information in the form of \ac{DNA}.
  
 \acs{DNA} is often seen in the form of a double helix. A double helix consists, as the name suggests, of two single helices.
@@ -39,7 +36,7 @@ A cell in itself is a living organism; The smallest one possible. It consists ou
 \begin{figure}[ht]
   \centering
   \includegraphics[width=15cm]{k2/dna.png}
-  \caption{A purely diagrammatic figure of the components \acs{DNA} is made of. The smaller, inner rods symbolize nucleotide links and the outer ribbons the phosphate-sugar chains \autocite{dna_structure}.}
+  \caption{A purely diagrammatic figure of the components \acs{DNA} is made of. The smaller, inner rods symbolize nucleotide links and the outer ribbons the phosphate-sugar chains \cite{dna_structure}.}
   \label{k2:dna-struct}
 \end{figure}
 

+ 14 - 14
latex/tex/kapitel/k3_datatypes.tex

@@ -34,7 +34,7 @@ In most tools, more than four symbols are used. This is due to the complexity in
 More common everyday text encodings like Unicode require 16 bits per letter. So settling for \acs{ASCII} leaves room for improvement but is, on the other hand, more efficient than using bulkier alternatives like Unicode.\\
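+
+To make this gap concrete, consider a short, hypothetical calculation (a minimal sketch in Python, not taken from any of the discussed tools; the symbol count is an assumed order of magnitude for one human chromosome):
+\begin{verbatim}
+# Space required to store n nucleotides under a 2-bit code,
+# 8-bit ASCII and a 16-bit encoding such as UTF-16.
+def storage_bits(n: int) -> dict:
+    return {"2-bit": 2 * n, "ASCII": 8 * n, "UTF-16": 16 * n}
+
+# Assumption: on the order of 10^8 nucleotides per chromosome.
+for name, bits in storage_bits(10**8).items():
+    print(name, bits // (8 * 10**6), "MB")
+\end{verbatim}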
 
 % differences between information that is store
-Formats for storing uncompressed genomic data, can be sorted into several categories. Three noticable ones would be \autocite{survey}:
+Formats for storing uncompressed genomic data can be sorted into several categories. Three noticeable ones would be \cite{survey}:
 \begin{itemize}
 	\item sequenced reads
 	\item aligned data
@@ -46,7 +46,7 @@ Aligned data is somwhat simliar to sequenced reads with the difference that inst
 The focus of this work lies on the compression of sequenced data, not on the likelihood of the data being accurate. Therefore, only formats that include sequenced reads will be worked with.\\
 
 % my criteria
-Several people and groups have developed different file formats to store genomes. Unfortunaly, the only standard for storing genomic data is fairly new \autocite{isompeg, mpeg}. Therefore, formats and tools implementing this standard are mostly still in development. In order to not go beyond scope, this work will focus only on file formats that fulfill following criteria:\\
+Several people and groups have developed different file formats to store genomes. Unfortunately, the only standard for storing genomic data is fairly new \cite{isompeg, mpeg}. Therefore, formats and tools implementing this standard are mostly still in development. In order to not go beyond scope, this work will focus only on file formats that fulfill the following criteria:\\
 \begin{itemize}
   \item{the format has a reputation}
 	\begin{itemize}
@@ -54,47 +54,47 @@ Several people and groups have developed different file formats to store genomes
 		\item through broad usage of the format, determined by its use on \acs{ftp} servers that focus on supporting scientific research.
 	\end{itemize}
   \item{the format should not specialize in only one type of \acs{DNA}.}
-  \item{the format stores nucleotide seuqences and does not neccesarily include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
+  \item{the format stores nucleotide sequences and does not necessarily include \ac{IUPAC} codes besides A, C, G and T \cite{iupac}.}
   \item{the format is open source. Otherwise, improvements cannot be tested without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
 \end{itemize}
 
-Information on available formats where gathered through various Internet platforms \autocite{ensembl, ucsc, ga4gh}. 
+Information on available formats was gathered through various Internet platforms \cite{ensembl, ucsc, ga4gh}.
 Some common file formats found:
 \begin{itemize}
 % which is relevant? 
   \item{\ac{FASTA}} 
   \item{\ac{FASTq}} 
-  \item{\ac{SAM}/\ac{BAM}} \autocite{bam, sam12}
-  \item{\ac{CRAM}} \autocite{bam, sam12}
-  \item{twoBit} \autocite{twobit}
+  \item{\ac{SAM}/\ac{BAM}} \cite{bam, sam12}
+  \item{\ac{CRAM}} \cite{bam, sam12}
+  \item{twoBit} \cite{twobit}
   %\item{VCF} genotype format -> anaylses differences of two seq
 \end{itemize}
 
 % groups: sequence data, alignment data, haplotypic
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
-Since methods to store this kind of Data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of a inoficial standard for sequenced reads \autocite{survey, geco, vertical, cram-origin}. \\
+Since methods to store this kind of data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of an unofficial standard for sequenced reads \cite{survey, geco, vertical, cram-origin}. \\
 Considering the first criterion, by searching through anonymously accessible \acs{ftp} servers, only two formats are commonly used: \acs{FASTA} or its extension \acs{FASTq} and the \acs{BAM} format \cite{ftp-igsr, ftp-ncbi, ftp-ensembl}.
 
 
 \subsection{\acs{FASTA} and \acs{FASTq}}
-The rather simple \acs{FASTA} format consists of two repeated sections. The first section consists of one line and stores metadata about the sequenced genome and the file itself. This line, also called header, contains a comment section starting with \texttt{>} followed by a custom text \autocite{alok17, Cock_2009}. The comment section is usually used to store information about the sequenced genome and sometimes metadata about the file itself like its size in bytes.\\
+The rather simple \acs{FASTA} format consists of two repeated sections. The first section consists of one line and stores metadata about the sequenced genome and the file itself. This line, also called the header, contains a comment section starting with \texttt{>} followed by custom text \cite{alok17, Cock_2009}. The comment section is usually used to store information about the sequenced genome and sometimes metadata about the file itself, like its size in bytes.\\
 The other section contains the sequenced genome, in which each nucleotide is represented by one of the characters \texttt{A, C, G or T}. There are three more nucleotide characters that store additional information, and some characters for representing amino acids, but in order to not go beyond scope, only \texttt{A, C, G and T} will be paid attention to.\\
 The second section can span multiple lines and is terminated by an empty line. After that, either the end of the file is reached or another tuple of header and sequence follows.\\
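+
+To make the described layout concrete, a minimal reader for this structure could look like the following sketch (illustrative only; it assumes the \texttt{>} header convention described above and ignores amino-acid characters):
+\begin{verbatim}
+# Minimal FASTA reader: yields (header, sequence) tuples.
+# A header line starts with '>' and opens a new record;
+# the sequence section may span multiple lines.
+def read_fasta(path):
+    header, seq = None, []
+    with open(path) as handle:
+        for line in handle:
+            line = line.strip()
+            if line.startswith(">"):
+                if header is not None:
+                    yield header, "".join(seq)
+                header, seq = line[1:], []
+            elif line:
+                seq.append(line)
+    if header is not None:
+        yield header, "".join(seq)
+\end{verbatim}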
 % fastq
-In addition to its predecessor, \acs{FASTq} files contain a quality score. The file content consists of four sections, wherby no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formated in this order \autocite{illufastq}:
+In addition to its predecessor, \acs{FASTq} files contain a quality score. The file content consists of four sections, whereby no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formatted in this order (a sample record follows the list) \cite{illufastq}:
 \begin{itemize}
 	\item Line 1: Sequence identifier (aka title), starting with an \texttt{@}, plus an optional description.
 	\item Line 2: The sequence, consisting of nucleotides, symbolized by A, T, G and C.
 	\item Line 3: A '+' that functions as a separator, optionally followed by the content of Line 1.
 	\item Line 4: The quality line, consisting of letters and special characters in the \acs{ASCII} range.
 \end{itemize}
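+
+A hypothetical record following this layout (identifier, sequence and quality characters are invented for illustration):
+\begin{verbatim}
+@SEQ_1 optional description
+GATTACAGATTACA
++
+!''*((((***+))
+\end{verbatim}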
-The quality scores have no fixed format. To name a few, there is the sanger format, the solexa format introduced by Solexa Inc., the Illumina and the QUAL format which is generated by the PHRED software \autocite{Cock_2009}.\\
+The quality scores have no fixed format. To name a few, there is the Sanger format, the Solexa format introduced by Solexa Inc., the Illumina format and the QUAL format, which is generated by the PHRED software \cite{Cock_2009}.\\
 The quality value shows the estimated probability of error in the sequencing process.
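+For the PHRED-style scores this relation is commonly given as (stated here for context; see \cite{Cock_2009} for the variants and their offsets):
+\[ Q = -10 \cdot \log_{10}(P) \]
+where $P$ is the estimated error probability; in the Sanger variant, the score $Q$ is stored as the \acs{ASCII} character with code $Q+33$.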
 
-\label{r2:sam}
+\label{k2:sam}
 \subsection{Sequence Alignment Map}
 % src https://github.com/samtools/samtools
-\acs{SAM} often seen in its compressed, binary representation \acs{BAM} with the fileextension \texttt{.bam}, is part of the SAMtools package, a uitlity tool for processing SAM/BAM and CRAM files. The SAM/BAM file is a text based format delimited by TABs \autocite{bam}. It uses 7-bit US-ASCII, to be precise Charset ANSI X3.4-1968 \autocite{rfcansi}. The structure is more complex than the one in \acs{FASTq} and described best, accompanied by an example:
+\acs{SAM}, often seen in its compressed, binary representation \acs{BAM} with the file extension \texttt{.bam}, is part of the SAMtools package, a utility for processing SAM/BAM and CRAM files. The \acs{SAM} file is a text-based format delimited by TABs \cite{bam}. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 \cite{rfcansi}. The structure is more complex than that of \acs{FASTq} and is best described accompanied by an example:
 
 \begin{figure}[ht]
   \centering
@@ -103,5 +103,5 @@ The quality value shows the estimated probability of error in the sequencing pro
   \label{k2:bam-struct}
 \end{figure}
 
-Compared to \acs{FASTA} \acs{SAM} and further compression forms, store more information. As displayed in \ref{k_2:bam-struct} this is done by adding, identifier for Reads e.g. \textbf{+r003}, aligning subsequences and writing additional symbols like dots e.g. \textbf{ATAGCT......} in the split alignment +r004 \autocite{bam}. A full description of the information stored in \acs{SAM} files would be of little value to this work, therefore further information on is left out but can be found in \autocite{sam12} or at \autocite{bam}.
+Compared to \acs{FASTA}, \acs{SAM} and its further compression forms store more information. As displayed in \ref{k2:bam-struct}, this is done by adding identifiers for reads, e.g. \textbf{+r003}, aligning subsequences and writing additional symbols like dots, e.g. \textbf{ATAGCT......} in the split alignment +r004 \cite{bam}. A full description of the information stored in \acs{SAM} files would be of little value to this work; therefore, further information is left out but can be found in \cite{sam12} or at \cite{bam}.
 Samtools provides the feature of converting a \acs{FASTA} file into the \acs{SAM} format. Since there is no way to calculate the mentioned additional information from the information stored in \acs{FASTA}, the converted files only store two lines. The first one stores metadata about the file and the second stores the nucleotide sequence in just one line.
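+
+Because each alignment line is TAB-delimited, its mandatory columns can be taken apart mechanically. A sketch (illustrative; it assumes the eleven mandatory fields defined in \cite{sam12} and collects any optional tags separately):
+\begin{verbatim}
+# The eleven mandatory, TAB-delimited fields of a SAM alignment line.
+SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
+              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
+
+def parse_alignment(line: str) -> dict:
+    values = line.rstrip("\n").split("\t")
+    record = dict(zip(SAM_FIELDS, values))
+    record["TAGS"] = values[11:]  # optional tags, if any
+    return record
+\end{verbatim}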

+ 21 - 21
latex/tex/kapitel/k4_algorithms.tex

@@ -32,8 +32,8 @@ Data contains information. In digital data  clear, physical limitations delimit
 % excurs information vs data
 The boundaries of information, when it comes to storing capabilities, can be illustrated with the example mentioned above. A drive with a capacity of 1 gigabyte could contain a book in the form of images, where the content of each page is stored in a single image. Another, more resourceful way would be storing just the text of every page in \acs{UTF-16}. The information the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without losing any information.\\
 % excurs end
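+For illustration, assume a book of 500 pages with 2,000 characters per page (both counts are invented for this example): stored as \acs{UTF-16} text it occupies $500 \cdot 2000 \cdot 2\,$B $= 2\,$MB, as \acs{ASCII} half of that, while a single scanned page can easily exceed a megabyte on its own.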
-In contrast to lossless compression, lossy compression might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typically not necessary to persist the origin information. This works with certain audio and pictures formats, and in network protocols \autocite{cnet13}.
-For \acs{DNA} a lossless compression is needed. To be precise a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two mayor approaches are known: the dictionary coding and the entropy coding. Methods from both fields, that aquired reputation, are described in detail below \autocite{cc14, moffat20, moffat_arith, alok17}.\\
+In contrast to lossless compression, lossy compression might exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to preserve the original information. This works with certain audio and picture formats, and in network protocols \cite{cnet13}.
+For \acs{DNA}, a lossless compression is needed. To be precise, a lossy compression is not possible because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression, two major approaches are known: dictionary coding and entropy coding. Methods from both fields that acquired a reputation are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
 
 \subsection{Dictionary coding}
 \textbf{Disclaimer}
@@ -58,18 +58,18 @@ The computer scientist Abraham Lempel and the electrical engineere Jacob Ziv cre
 
 
 \subsection{Shannons Entropy}
-The founder of information theory Claude Elwood Shannon described entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition. 
+The founder of information theory, Claude Elwood Shannon, described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal, and his findings are useful beyond information transmission.
 
 % todo insert Fig. 1 shannon_1948
 \begin{figure}[H]
   \centering
   \includegraphics[width=15cm]{k4/com-sys.png}
-  \caption{Schematic diagram of a general communication system by Shannons definition. \autocite{Shannon_1948}}
+  \caption{Schematic diagram of a general communication system by Shannon's definition. \cite{Shannon_1948}}
   \label{k4:comsys}
 \end{figure}
 
 Altering \ref{k4:comsys} shows how this can be applied to other technology like compression. The information source and destination are left unchanged; one has to keep in mind that both can be represented by the same physical actor.
-Transmitter and receiver would be changed to compression/encoding and decompression/decoding and inbetween ther is no signal but any period of time \autocite{Shannon_1948}.\\
+Transmitter and receiver are changed to compression/encoding and decompression/decoding, and in between there is no signal but an arbitrary period of time \cite{Shannon_1948}.\\
 
 Shannon's entropy provides a formula to determine the 'uncertainty of a probability distribution' over a finite probability space.
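+
+As an illustration (a minimal sketch, not taken from the cited works), this definition can be evaluated directly for a nucleotide distribution:
+\begin{verbatim}
+import math
+
+# Shannon entropy in bits per symbol:
+# H = sum over x of prob(x) * log2(1/prob(x)).
+def entropy(text: str) -> float:
+    probs = [text.count(s) / len(text) for s in set(text)]
+    return sum(p * math.log2(1 / p) for p in probs)
+
+# A uniform distribution over A, C, G, T yields 2 bits per symbol.
+print(entropy("ACGT" * 1000))  # -> 2.0
+\end{verbatim}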
 
@@ -88,7 +88,7 @@ Shannons Entropy provides a formular to determine the 'uncertainty of a probabil
 %  \label{k4:entropy}
 %\end{figure}
 
-He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite probability space. Then x in X are possible final states of an probability experimen over X. Every state that actually occours, while executing the experiment generates infromation which is meassured in \textit{Bits} with the part of the formular displayed in \ref{eq:info-in-bits} \autocite{delfs_knebl,Shannon_1948}:
+He defined entropy as shown in \eqref{eq:entropy}. Let X be a finite probability space. Then the x in X are the possible final states of a probability experiment over X. Every state that actually occurs while executing the experiment generates information, which is measured in \textit{bits} with the part of the formula displayed in \eqref{eq:info-in-bits} \cite{delfs_knebl, Shannon_1948}:
 
 \begin{equation}\label{eq:info-in-bits}
 \log_2\left(\frac{1}{prob(x)}\right) \equiv - \log_2(prob(x)).
@@ -109,8 +109,8 @@ He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite prob
 \label{k4:arith}
 \subsection{Arithmetic coding}
 This coding method is an approach to solve the problem of wasting memory due to the overhead created by encoding certain alphabet lengths in binary. For example: encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: less storage would be required if it were possible to encode more than one letter in two bits.\\
-Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \autocite{ris76}. % Besides information theory and math, he also published stuff about dna
-This works goal was to define an algorithm that requires no blocking. Meaning the input text could be encoded as one instead of splitting it and encoding the smaller texts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \autocite{ris76}.  
+Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \cite{ris76}. % Besides information theory and math, he also published stuff about dna
+This work's goal was to define an algorithm that requires no blocking, meaning the input text could be encoded as a whole instead of splitting it and encoding the smaller texts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.
 
 % unusable because equation is only half correct
 \mycomment{
@@ -147,13 +147,13 @@ To store as few informations as possible and due to the fact that fractions in b
 For the decoding process to work, the \ac{EOF} symbol must be present as the last symbol in the text. The compressed file will store the probabilities of each alphabet symbol as well as the floating-point number. The decoding process executes in a similar procedure to the encoding. The stored probabilities determine intervals. Those get subdivided, using the encoded floating-point number as guidance, until the \ac{EOF} symbol is found. By noting in which interval the floating-point number is found for every new subdivision, and projecting the probabilities associated with the intervals onto the alphabet, the original text can be read.\\
 % rescaling
 % math and computers
-In computers, arithmetic operations on floating point numbers are processed with integer representations of given floating point number \autocite{ieee-float}. The number 0.4 + would be represented by $4\cdot 10^-1$.\\
+In computers, arithmetic operations on floating-point numbers are processed with integer representations of the given floating-point number \cite{ieee-float}. The number 0.4 would be represented by $4\cdot 10^{-1}$.\\
 Intervals for the first symbol would be represented by natural numbers between 0 and 100 and $\ldots \cdot 10^{-x}$. \texttt{x} starts with the value 2 and grows as the integers grow in length, meaning only if an odd number is divided. For example: dividing an odd number like $5\cdot 10^{-1}$ by two will result in $25\cdot 10^{-2}$. On the other hand, subdividing $4\cdot 10^{y}$ by two, with any negative real number as y, would not result in a greater \texttt{x}; the length required to display the result will match the length required to display the input number.\\
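+
+The interval subdivision itself can be sketched with a toy encoder (illustrative only: no \ac{EOF} handling, no rescaling, and Python floats are finite-precision, unlike the integer scheme just described):
+\begin{verbatim}
+# Toy arithmetic encoder: narrows [low, high) once per symbol.
+def encode(text, probs):
+    low, high = 0.0, 1.0
+    for symbol in text:
+        width = high - low
+        cum = 0.0
+        for s, p in probs.items():
+            if s == symbol:
+                low, high = low + cum * width, low + (cum + p) * width
+                break
+            cum += p
+    return (low + high) / 2  # any number inside the final interval
+
+print(encode("ACG", {"A": 0.5, "C": 0.25, "G": 0.25}))
+\end{verbatim}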
 % example
 \begin{figure}[H]
   \centering
   \includegraphics[width=15cm]{k4/arith-resize.png}
-  \caption{Illustrative rescaling in arithmetic coding process. \autocite{witten87}}
+  \caption{Illustrative rescaling in the arithmetic coding process. \cite{witten87}}
   \label{k4:rescale}
 \end{figure}
 
@@ -164,9 +164,9 @@ The described coding is only feasible on machines with infinite percission. As s
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
 D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a coding procedure developed by Shannon and Fano, named after its developers, which worked similarly. Shannon-Fano coding is not used today, due to Huffman coding's superiority in both efficiency and effectiveness. % todo any source to last sentence. Rethink the use of finite in the following text
-Even though his work was released in 1952, the method he developed is in use  today. Not only tools for genome compression but in compression tools with a more general ussage \autocite{rfcgzip}.\\ 
-Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, on waste through unused bits, for certain alphabet lengths. Huffman did not save more than one symbol in one bit, like it is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for symbols, used in the text that should get compressed \autocite{huf52}. 
-As with other codings, a set of symbols must be defined. For any text constructed with symbols from mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for given text. The binary tree will be constructed after following guidelines \autocite{alok17}:
+Even though his work was released in 1952, the method he developed is still in use today, not only in tools for genome compression but also in compression tools with more general usage \cite{rfcgzip}.\\
+Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, of waste through unused bits for certain alphabet lengths. Huffman did not store more than one symbol per bit, as is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for the symbols used in the text that is to be compressed \cite{huf52}.
+As with other codings, a set of symbols must be defined. For any text constructed with symbols from the mentioned alphabet, a binary tree is constructed, which determines how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for the given text. The binary tree is constructed following these guidelines \cite{alok17}:
 % greedy algo?
 \begin{itemize}
   \item Every symbol of the alphabet is one leaf.
@@ -177,14 +177,14 @@ As with other codings, a set of symbols must be defined. For any text constructe
 \end{itemize}
 %todo tree building explanation
 % storytime might need to be rearranged
-A often mentioned difference between Shannon-Fano and Huffman coding, is that first is working top down while the latter is working bottom up. This means the tree starts with the lowest weights. The nodes that are not leafs have no value ascribed to them. They only need their weight, which is defined by the weights of their individual child nodes \autocite{moffat20, alok17}.\\
+An often mentioned difference between Shannon-Fano and Huffman coding is that the former works top down while the latter works bottom up. This means the tree starts with the lowest weights. The nodes that are not leaves have no value ascribed to them; they only need their weight, which is defined by the weights of their individual child nodes \cite{moffat20, alok17}.\\
 
-Given \texttt{K(W,L)} as a node structure, with the weigth or probability as \texttt{$W_{i}$} and codeword length as \texttt{$L_{i}$} for the node \texttt{$K_{i}$}. Then will \texttt{$L_{av}$} be the average length for \texttt{L} in a finite chain of symbols, with a distribution that is mapped onto \texttt{W} \autocite{huf}.
+Given \texttt{K(W,L)} as a node structure, with the weight or probability \texttt{$W_{i}$} and the codeword length \texttt{$L_{i}$} for the node \texttt{$K_{i}$}, \texttt{$L_{av}$} is the average length of \texttt{L} in a finite chain of symbols with a distribution that is mapped onto \texttt{W} \cite{huf}.
 \begin{equation}\label{eq:huf}
   L_{av}=\sum_{i=0}^{n-1}w_{i}\cdot l_{i}
 \end{equation}
 The equation \eqref{eq:huf} describes the path, to the desired state, for the tree. The upper bound \texttt{n} is the number of symbols in the alphabet. The tuple in any node \texttt{K} consists of a weight \texttt{$w_{i}$}, which also references a symbol, and the length of a codeword \texttt{$l_{i}$}. This codeword will later encode a single symbol from the alphabet. Working with digital codewords, an element in \texttt{l} contains a sequence of zeros and ones. Since in this coding method there is no fixed length for codewords, the premise of \texttt{prefix free code} must be adhered to. This means there can be no codeword that matches any prefix of another codeword. To illustrate this: 0, 10, 11 would be a set of valid codewords, but adding a codeword like 01 or 00 would make the set invalid because of the prefix 0, which is already a single codeword.\\
-With all important elements described: the sum that results from this equation is the average length a symbol in the encoded input text will require to be stored \autocite{huf52, moffat20}.
+With all important elements described, the sum that results from this equation is the average length a symbol in the encoded input text will require to be stored \cite{huf52, moffat20}.
 
 % example
 % todo illustrate
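+
+To illustrate the bottom-up construction and the average length from \eqref{eq:huf}, a compact sketch using a priority queue (illustrative; real implementations such as the one referenced in \cite{rfcgzip} differ in detail):
+\begin{verbatim}
+import heapq
+
+# Bottom-up Huffman: repeatedly merge the two lowest-weight nodes.
+def huffman_codes(weights):
+    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(weights.items())]
+    heapq.heapify(heap)
+    tie = len(heap)  # tie-breaker so equal weights never compare dicts
+    while len(heap) > 1:
+        w1, _, c1 = heapq.heappop(heap)
+        w2, _, c2 = heapq.heappop(heap)
+        merged = {s: "0" + c for s, c in c1.items()}
+        merged.update({s: "1" + c for s, c in c2.items()})
+        heapq.heappush(heap, (w1 + w2, tie, merged))
+        tie += 1
+    return heap[0][2]
+
+weights = {"A": 0.45, "C": 0.3, "G": 0.15, "T": 0.1}
+codes = huffman_codes(weights)          # prefix-free codes per symbol
+print(sum(weights[s] * len(c) for s, c in codes.items()))  # L_av
+\end{verbatim}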
@@ -218,7 +218,7 @@ Since original versions of the files licensed by University of Aveiro could not
 Following function calls in all three files led to the conclusion that the most important function is defined as \texttt{arithmetic\_encode} in \texttt{arith.c}. In this function the actual arithmetic encoding is executed. This function has no redirects to other files, only one function call, \texttt{ENCODE\_RENORMALISE}; the remaining code consists of arithmetic operations only.
 % if there is a chance for improvement, this function should be consindered as a entry point to test improving changes.
 
-%useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \autocite{moffat_arith}.
+%useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \cite{moffat_arith}.
 %\subsection{genie} % genie
 \subsection{Samtools} % samtools 
 \subsubsection{BAM}
@@ -233,7 +233,7 @@ Data is split into blocks. Each block stores a header consisting of three bits.
 	\item \texttt{01}		Compressed with a fixed set of Huffman codes.	
 	\item \texttt{10}		Compressed with dynamic Huffman codes.
 \end{itemize}
-The last combination \texttt{11} is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \autocite{rfc1951}.
+The last combination, \texttt{11}, is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}.
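+
+A sketch of how these three header bits are read (illustrative; buffering details of real inflate implementations are omitted, and \cite{rfc1951} packs these bits starting at the least-significant bit of each byte):
+\begin{verbatim}
+# Read the 3-bit DEFLATE block header at a given bit position.
+# First bit flags the final block, the next two give the block type:
+# 00 = stored, 01 = fixed Huffman, 10 = dynamic Huffman, 11 = reserved.
+def block_header(data: bytes, bit_pos: int):
+    def bit(i):
+        return (data[i // 8] >> (i % 8)) & 1
+    is_final = bit(bit_pos)
+    block_type = bit(bit_pos + 1) | (bit(bit_pos + 2) << 1)
+    return is_final, block_type
+\end{verbatim}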
 % lz77 part
 As described in \ref{k4:lz77}, a compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between a pointer and the literal it points to.
 The \acs{LZ77} algorithm is executed before the Huffman algorithm. Further compression steps differ from the already described algorithm and will extend to the end of this section.
@@ -258,14 +258,14 @@ For a text consisting out of \texttt{C} and \texttt{G}, following codes would be
 \rmfamily
 
 Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented with a 0. To maintain prefix-free codes, the two remaining codes are not allowed to start with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is set to represent \texttt{C}.
-With this simple rules, the alphabet can be compressed too. Instead of storing codes itself, only the codelength stored \autocite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing blocks of data, a samller alphabet can make a relevant difference.\\
+With these simple rules, the alphabet can be compressed too. Instead of storing the codes themselves, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing blocks of data, a smaller alphabet can make a relevant difference.\\
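+
+The reconstruction from code lengths follows the canonical procedure in \cite{rfc1951}: shorter codes come first, and symbols with equal lengths receive consecutive values in lexicographic order. A compact sketch, reproducing the example above:
+\begin{verbatim}
+# Rebuild canonical Huffman codes from code lengths (RFC 1951, 3.2.2).
+def canonical_codes(lengths):
+    codes, code, prev = {}, 0, 0
+    for sym, length in sorted(lengths.items(),
+                              key=lambda kv: (kv[1], kv[0])):
+        code <<= length - prev   # grow the code when the length grows
+        codes[sym] = format(code, "0{}b".format(length))
+        code += 1
+        prev = length
+    return codes
+
+# Only the lengths {A: 1, C: 2, G: 2} need to be stored to
+# recover A = 0, C = 10, G = 11.
+print(canonical_codes({"A": 1, "C": 2, "G": 2}))
+\end{verbatim}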
 
 % example header, alphabet, data block?
 BGZF extends this by creating a series of blocks. Each cannot exceed a limit of 64 kilobytes. Each block contains a standard gzip file header, followed by compressed data.\\
 
 \subsubsection{CRAM}
-The improvement of \acs{BAM} \autocite{cram-origin} called \acs{CRAM}, also features a block structure \autocite{bam}. The whole file can be seperated into four sections, stored in ascending order: File definition, a CRAM Header Container, multiple Data Container and a final CRAM EOF Container.\\
-The File definition consists of 26 uncompressed bytes, storing formating information and a identifier. The CRAM header contains meta information about Data Containers and is optionally compressed with gzip. This container can also contain a uncompressed zero-padded section, reseved for \acs{SAM} header information \autocite{bam}. This saves time, in case the compressed file is altered and its compression need to be updated. The last container in a \acs{CRAM} file serves as a indicator that the \acs{EOF} is reached. Since in addition information about the file and its structure is stored, a maximum of 38 uncompressed bytes can be reached.\\
+The improvement of \acs{BAM} \cite{cram-origin}, called \acs{CRAM}, also features a block structure \cite{bam}. The whole file can be separated into four sections, stored in ascending order: a file definition, a CRAM header container, multiple data containers and a final CRAM EOF container.\\
+The file definition consists of 26 uncompressed bytes, storing formatting information and an identifier. The CRAM header contains meta information about the data containers and is optionally compressed with gzip. This container can also contain an uncompressed, zero-padded section, reserved for \acs{SAM} header information \cite{bam}. This saves time in case the compressed file is altered and its compression needs to be updated. The last container in a \acs{CRAM} file serves as an indicator that the \acs{EOF} is reached. Since, in addition, information about the file and its structure is stored, a maximum of 38 uncompressed bytes can be reached.\\
 A Data Container can be split into three sections. Of these sections, the one storing the actual sequence consists of blocks itself, displayed in \ref FIGURE as the bottom row.
 \begin{itemize}
 	\item Container Header.

+ 1 - 1
latex/tex/kapitel/k5_feasability.tex

@@ -169,7 +169,7 @@ Following criteria is reqiured for test data to be appropriate:
   \item{The file is publicly available and free to use (for research).}
 \end{itemize}
 A second, bigger set of test files was required. This would verify that the test results are not limited to small files. With a size of 1 gigabyte per file, it would hold over five times as much data as the first set.
-Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The ensembl database featured defined criteria, so the first available set called Homo\_sapiens.GRCh38.dna.chromosome were chosen \autocite{ftp-ensembl}. This sample includes over 20 chromosomes, whereby considering the filenames, one chromosome is contained in each single file. After retrieving and unpacking the files, write privileges on them was withdrawn. So no tool could alter any file contents.
+Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set was rather easy. The Ensembl database fit the defined criteria, so the first available set, called Homo\_sapiens.GRCh38.dna.chromosome, was chosen \cite{ftp-ensembl}. This sample includes over 20 chromosomes, whereby, judging by the filenames, one chromosome is contained in each single file. After retrieving and unpacking the files, write privileges on them were withdrawn, so no tool could alter any file contents.
 Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since the available servers \cite{ftp-ensembl, ftp-ncbi, ftp-isgr} offer several thousand files, stored in varying, deep directory structures, mapping file size, file type and file path takes too much time and resources to be done in this work. This problem, combined with an easily triggered overflow in the Samtools library, resulted in a set of several manually searched and tested files, which lacks quantity. The variety of different species' \acs{DNA} in this set provides an additional, interesting factor.\\
  
 % todo make sure this needs to stay.

+ 71 - 7
latex/tex/kapitel/k6_results.tex

@@ -1,13 +1,22 @@
 \chapter{Results and Discussion}
-The two tables \ref{t:effectivity}, \ref{t:efficiency} contain raw measurement values for the two goals, described in \ref{k5:goals}. The first table visualizes how long each compression procedure took, in milliseconds. The second one contains file sizes in bytes. Each row contains information about one of the \texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa} files. To improve readability, the filename in all tables were replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder with the number following \texttt{File}.\\
+The two tables \ref{t:effectivity} and \ref{t:efficiency} contain raw measurement values for the two goals described in \ref{k5:goals}. The first table visualizes how long each compression procedure took, in milliseconds. The second one contains file sizes in bytes. Each row contains information about one of the files following this naming scheme:
+
+\texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa}
+
+To improve readability, the filenames in all tables were replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder with the number following \texttt{File}.\\
 
 \section{Interpretation of Results}
-The units milliseconds and bytes store a high persicion for measurements. Unfortunately they are harder to read and compare to the human eye. Therefore, starting with comparing sizes, \ref{t:sizepercent} contains each file size in percentage, in relation to the respective source file. The compression with \acs{GeCo} with the file Homo\_sapiens.GRCh38.dna.chromosome.11.fa resulted in a file that were only 17.6\% as big.\\
+The units milliseconds and bytes preserve a high measurement precision. Unfortunately, they are hard to read and compare by eye alone. Therefore, the data was transformed. Sizes in \ref{t:sizepercent} are displayed as a percentage of the respective source file. This means the compression with \acs{GeCo} on:
+
+Homo\_sapiens.GRCh38.dna.chromosome.11.fa 
 
+resulted in a compressed file which was only 17.6\% as big.
+Runtimes in \ref{t:time} were converted into seconds and have been rounded to two decimal places.
+Additionally, a line was added to the bottom of each table, showing the average percentage or runtime for each process.\\
 \label{t:sizepercent}
 \sffamily
 \begin{footnotesize}
-  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
     \caption[Compression Effectiveness]                       % caption for the list of tables
         {File sizes in different compression formats in \textbf{percent}} % caption for the table itself
         \\
@@ -40,8 +49,8 @@ The units milliseconds and bytes store a high persicion for measurements. Unfort
     \bottomrule
   \end{longtable}
 \end{footnotesize}
-
 \rmfamily
+
 Overall, Samtools \acs{BAM} resulted in a 71.76\% size reduction; the \acs{CRAM} method improved this by roughly 2.5\%. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4\% comes with a comparatively great sacrifice in time.\\
 
 \label{t:time}
@@ -84,13 +93,68 @@ Overall, Samtools \acs{BAM} resulted in 71.76\% size reduction, the \acs{CRAM} m
 
 As \ref{t:time} shows, the average compression duration for \acs{GeCo} is 42.57s. That is a little over 33s, or 78\%, longer than the average runtime of Samtools for compressing into the \acs{CRAM} format.\\
 Since \acs{CRAM} requires a file in \acs{BAM} format, the third row is calculated by adding the time needed to compress into \acs{BAM} to the time needed to compress into \acs{CRAM}.
-While \acs{SAM} format is required for compressing a \acs{FASTA} into \acs{BAM} and further into \acs{CRAM}, in itself it does not features no compression. However, the conversion from \acs{SAM} to \acs{FASTA} can result in a decrease in size. At first this might be contra intuitive, since as described in \ref{k2:sam} \acs{SAM} stores more information than \acs{FASTA}. This can be explained by comparing the sequence storing mechanism. A \acs{FASTA} sequence section can be spread over multiple lines whereas \acs{SAM} files store a sequence in just one line, converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
+While the \acs{SAM} format is required for compressing a \acs{FASTA} file into \acs{BAM} and further into \acs{CRAM}, it features no compression in itself. However, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might be counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. This can be explained by comparing the sequence storing mechanisms: a \acs{FASTA} sequence section can be spread over multiple lines, whereas \acs{SAM} files store a sequence in just one line, so converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
 % (hi)storytime
 Before interpreting this data further, a quick look at the development processes: \acs{GeCo} development stopped in 2016, while Samtools has been under development since 2015, to this day, with over 70 people contributing.\\
-% interpret bit files and compare
+% todo interpret bit files and compare
+
+% big tables
+Reviewing \ref{t:recal-time}, one will notice that \acs{GeCo} reached a runtime of over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an \texttt{m} indicates how many minutes each run took.
+
+\label{t:recal-size}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectiveness for greater files]                       % caption for the list of tables
+        {File sizes in different compression formats in \textbf{percent}} % caption for the table itself
+        \\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
+    \midrule
+			%geco bam and cram in percent
+			File 1& 1.00& 6.28& 5.38\\
+			File 2& 0.98& 6.41& 5.52\\
+			File 3& 1.21& 8.09& 7.17\\
+			File 4& 1.20& 7.70& 6.85\\
+			File 5& 1.08& 7.58& 6.72\\
+			File 6& 1.09& 7.85& 6.93\\
+			File 7& 0.96& 5.83& 4.63\\
+      &&&\\
			\textbf{Total}	& 1.07& 7.11& 6.17\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+\label{t:recal-time}
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Efficiency for greater files]                       % caption for the list of tables
+        {Compression duration in seconds} % caption for the table itself
+        \\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
+    \midrule
+			%compress time for geco, bam and cram in seconds
+			File 1 & 1m58.427& 16.248& 23.016\\
+			File 2 & 1m57.905& 15.770& 22.892\\
+			File 3 & 1m09.725& 07.732& 12.858\\
+			File 4 & 1m13.694& 08.291& 13.649\\
+			File 5 & 1m51.001& 14.754& 23.713\\
+			File 6 & 1m51.315& 15.142& 24.358\\
+			File 7 & 2m02.065& 16.379& 23.484\\
+      &&&\\
+			\textbf{Total}	 & 1m43.447& 13.474& 20.567\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+In both tables, \ref{t:recal-time} and \ref{t:recal-size}, the already identified pattern can be observed. Looking at the compression ratio in \ref{t:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven was the one with the greatest size ($\sim$1.3 gigabytes), closely followed by files one and two ($\sim$1.2 gigabytes).
+% todo greater filesize means better compression
 
 \section{View on Possible Improvements}
-% todo explain new findings
 S. Petukhov described new findings about the distribution of nucleotides. With the probability of one nucleotide, in a sequence of sufficient length, information about its direct neighbours is revealed. For example, from the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N} including \texttt{C} can be determined:\\
%\%C ≈ Σ\%CN ≈ Σ\%NC ≈ Σ\%CNN ≈ Σ\%NCN ≈ Σ\%NNC ≈ Σ\%CNNN ≈ Σ\%NCNN ≈ Σ\%NNCN ≈ Σ\%NNNC\\