
extended feasibility and rewrote chap 1-3. Messed up k4

u 3 years ago
parent
commit
9cecb1ede9

BIN
latex/result/thesis.pdf


+ 2 - 1
latex/tex/kapitel/abkuerzungen.tex

@@ -13,7 +13,8 @@
   \acro{EOF}{End of File}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
   \acro{IUPAC}{International Union of Pure and Applied Chemistry}
-  \acro{LZ77}{Lempel Ziv 77}
+  \acro{LZ77}{Lempel Ziv 1977}
+  \acro{LZ78}{Lempel Ziv 1978}
   \acro{SAM}{Sequence Alignment Map}
   \acro{BAM}{Binary Alignment Map}
 \end{acronym}

+ 27 - 5
latex/tex/kapitel/k1_introduction.tex

@@ -1,9 +1,31 @@
 \chapter{Introduction}
 % general information and intro
-Understanding how things in our cosmos work, was and still is a pleasure the human being always wants to fullfill. Getting insights into the rawest form of organic live is possible through storing and studying information, embeded in genecode. Since live is complex, there is a lot information, which requires a lot of memory.\\
-% [...] Communication with other researches means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to erorrs.\\
+Understanding how things in our cosmos work was and still is a desire humans have always wanted to fulfill. Insights into the rawest form of organic life are possible by storing and studying the information embedded in genetic codes. Since life is complex, there is a lot of information, which requires a lot of memory.\\
+% ...Communication with other researchers means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to errors.\\
 % compression values and goals
-With compression tools this problem is reduced. Compressed data requires less space and less time to be tranported over networks. This advantage is scaleable and since there is much to discover about genomes, new findings in this field are nothing unusuall. From some of this findings, new tools can be developed which optimally increase two factors: the speed at which data is compressed and the compresseion ratio, meaning the difference between uncompressed and compressed data.\\
-% [...]
+Compression tools mitigate this storage problem. Compressed data requires less space and therefore less time to be transported over networks. This advantage is scalable, and since genetic information needs a lot of storage even in a compressed state, improvements are welcome. Compared to fields like computing theory and general compression approaches, this field is relatively new, so there is much to discover and new findings are not unusual. From some of these findings, new tools can be developed. They ideally improve two factors: the speed at which data is compressed and the compression ratio, meaning the ratio between uncompressed and compressed size.\\
+% ...
 % more exact explanation
-New discoveries in the universal rules of stochastical organisation of genomes might provide a base for new algoriths and therefore new tools for genome compression. The aim of this work is to analyze the current state of the art for probabilistic compression tools and their algorithms, and ultimately determine whether mentioned discoveries are already used. If this is not the case, there will be an analysation of how this new approach could improve compression methods.\\
+
+% actual focus in short and simple terms
+New discoveries in the universal rules of the stochastic organisation of genomes might provide a base for new algorithms and therefore for new genome compression tools or improvements of existing ones. The aim of this work is to analyze the current state of the art of probabilistic compression tools and their algorithms, and ultimately to determine whether the mentioned discoveries are already used. \texttt{might be thrown out due to time limitations} -> If this is not the case, there will be an analysis of how and where this new approach could be implemented and whether it would improve compression methods.\\
+
+% focus and structure of work in greater detail 
+To establish common ground, the first pages give the reader a quick overview of the structure of human DNA. There will also be a brief explanation of some basic terms used in biology and computer science. The knowledge base of this work is formed by describing the differences between file formats used to store genome data. In addition to this, a section relevant to compression will follow, going through the state of the art in coding theory.\\
+In order to measure an improvement, a baseline must first be set. Therefore the efficiency and effectiveness of suitable state-of-the-art tools will be measured. To be as precise as possible, the main part of this work focuses on setting up an environment, picking input data, installing and executing tools, and finally measuring and documenting the results.\\
+With this information, a static code analysis of the mentioned tools follows. This will show where a possible new algorithm or an improvement to an existing one could be implemented. Running a proof-of-concept implementation under the same conditions and comparing its runtime and compression ratio against the defined baseline shows the potential of the new approach for compression with probability-based algorithms.
+
+% todo: 
+%   explain: coding 
+%   find uniform representation for: {letter;symbol;char} {dna;genome;sequence}
+
+%- COMPRESSION APPROACHES
+% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
+% - HUFFMAN ENCODING
+% - PROBABILITY APPROACHES (WITH BASE?)
+%
+%- COMPARING TOOLS
+% 
+%- POSSIBLE IMPROVEMENT
+% - \acs{DNA}S STOCHASTICAL ATTRIBUTES 
+% - IMPACT ON COMPRESSION

+ 6 - 8
latex/tex/kapitel/k2_dna_structure.tex

@@ -17,14 +17,12 @@
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
 
-\chapter{Structure Of Biological Data}
+\chapter{Structure of Human Genetic Data}
 To strengthen the understanding of how and where biological information is stored, this chapter starts with a quick and general rundown of the structure of any living organism.\\
 % todo add picture
 All living organisms, like plants and animals, are made of cells (a human body can consist of several trillion cells) \cite{cells}.
-A cell in itself is a living organism; The smalles one possible. It has two layers from which the inner one is called nucleus. The nucleus contains chromosomes and those chromosomes hold the genetic information in form of \ac{DNA}. 
-% nucelosome and histone?
+A cell is itself a living organism; the smallest one possible. It consists of two layers, of which the inner one is called the nucleus. The nucleus contains chromosomes, and those chromosomes hold the genetic information in the form of \ac{DNA}. 
  
-\section{DNA}
 \ac{DNA} is often seen in the form of a double helix. A double helix consists, as the name suggests, of two single helices. 
 
 \begin{figure}[ht]
@@ -34,9 +32,9 @@ A cell in itself is a living organism; The smalles one possible. It has two laye
   \label{k2:dna-struct}
 \end{figure}
 
-Each of them consists of two main components: the Suggar Phosphat backbone, which is irelavant for this work and the Bases. The arrangement of Bases represents the Information stored in the \ac{DNA}. A base is an organic molecule, they are called Nucleotides \cite{dna_structure}. %Nucleotides have special attributes and influence other Nucleotides in the \acs{DNA} Sequence
+Each of them consists of two main components: the sugar-phosphate backbone, which is not relevant for this work, and the bases. The arrangement of the bases represents the information stored in the \ac{DNA}. A base is an organic molecule; these molecules are called nucleotides \cite{dna_structure}. \\
 % describe Genomes?
 
-\section{Nucleotides}
-For this work, nucleotides are the most important parts of the \acs{DNA}. A Nucleotide can have one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them got a Counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine. For someone who whishes to persist this information, it means the content of one helix can be determined by ``inverting'' the other one, in other words: the nucleotides of only one (entire) helix needs to be stored physically, to save the information of the whole \ac{DNA}. The counterpart for e.g.: \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial: \texttt{AGA} in one Helix, \texttt{TCT} in the other.
-
+For this work, nucleotides are the most important parts of the \ac{DNA}. A nucleotide can occur in one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them has a counterpart with which a bond can be established: adenine bonds with thymine, guanine bonds with cytosine.\\
+From the perspective of a computer scientist, the content of only one helix must be stored to persist the full information. In more practical terms: the nucleotides of only one (entire) helix need to be stored physically to save the information of the whole \ac{DNA}, because the other half can be determined by ``inverting'' the stored one. For example, the counterpart of an \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial. So the example becomes \texttt{AGA} in one helix and \texttt{TCT} in the other.\\
+This representation is commonly used to store \ac{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required, but for now 'A', 'C', 'G' and 'T' are the only concern.
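The 'inverting' described above is, in programming terms, a simple complement mapping over the stored strand. A minimal sketch in plain Python (not part of the thesis sources) of how the second helix can be recovered from the stored one:

    # Complement mapping between the four nucleotides.
    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def complement_strand(strand: str) -> str:
        """Recover the second helix by 'inverting' the stored one."""
        return "".join(COMPLEMENT[base] for base in strand)

    assert complement_strand("AGA") == "TCT"  # the example from the text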

+ 29 - 48
latex/tex/kapitel/k3_datatypes.tex

@@ -23,37 +23,48 @@
 % where are limits (e.g. BAM)
 % what is our focus (and maybe 'why')
 
-\chapter{Datatypes}
-\label{chap:datatypes}
-As described in previous chapters \ac{DNA} can be represented by a String with the buildingblocks A,T,G and C. Using a common fileformat for saving text would be impractical because the ammount of characters or symbols in the used alphabet, defines how many bits are used to store each single symbol.\\
-Storing a single \textit{A} with \ac{ascii} encoding requires 8 bit (\,excluding magic bytes and the bytes used to mark \ac{EOF})\, since there are at least $2^8$ or 128 displayable symbols. Since the \ac{DNA} buildingblocks only require a minimum of four letters, two bits are needed e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. Depending on the sequencing method, more than four letters are used. The complex process of sequencing \ac{DNA} is not 100\% preceice, so additional Letters are used to mark nucelotides that could not or could only partly get determined.\\
+\chapter{File Types Used to Store DNA}
+\label{chap:filetypes}
+As described in previous chapters, \ac{DNA} can be represented by a string with the building blocks A, T, G and C. Using a common file format for saving text would be impractical, because the amount of characters or symbols in the used alphabet defines how many bits are used to store each single symbol.\\
+The \ac{ascii} table is a character set, first standardized in 1963 and to this day still in use to encode text digitally. For most communication purposes, bigger character sets have replaced \ac{ascii}, but it is still used in situations where storage is scarce.
+% todo: reason ascii was replaced -> too few representable characters. Advantage today -> less overhead per character
+Storing a single \textit{A} with \ac{ascii} encoding requires 8 bits (\,excluding magic bytes and the bytes used to mark \ac{EOF})\,, since the character set defines $2^7$ or 128 symbols, each stored in one byte. The building blocks of \ac{DNA} require a minimum of four letters, so two bits per symbol would suffice.
+% cout out examples. Might be needed later or elsewhere
+% \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. 
+In most tools, more than four symbols are used. This is due to the complexity of sequencing \ac{DNA}. It is not 100\% precise, so additional symbols are used to mark nucleotides that could not, or could only partly, be determined. Furthermore, a so-called quality score is stored to indicate, for each single nucleotide, the certainty that it was sequenced correctly.\\
 More common everyday text encodings like Unicode, in its UTF-16 form, require 16 bits per letter. So settling for \ac{ascii} leaves room for improvement but is, on the other hand, more efficient than using bulkier alternatives.\\
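To make the two-bit argument concrete, here is a minimal sketch in plain Python (outside the thesis sources) that packs a nucleotide string using the commented-out code table above; the table itself is an arbitrary assumed assignment:

    # Pack a nucleotide string at 2 bits per symbol (00->A, 01->T, 10->G, 11->C).
    CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}

    def pack(sequence: str) -> bytes:
        bits = 0
        for base in sequence:
            bits = (bits << 2) | CODE[base]
        # Round up to whole bytes; a real format would also store the length.
        return bits.to_bytes((2 * len(sequence) + 7) // 8, "big")

    print(pack("ATGC"))  # four symbols in a single byte: 0b00011011 -> b'\x1b'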
 
-Several people and groups have developed different fileformats to store genomes. Unfortunally for this work, there is no defined standard filetype or set of filetypes therefor one has to gather information on which types exist and how they function by themself. In order to not go beyond scope, this work will focus only on fileformats that fullfill following factors:\\
+Several people and groups have developed different file formats to store genomes. Unfortunately for this work, there is no defined standard file type or set of file types, therefore one has to gather that information oneself. In order not to go beyond scope, this work will focus only on file formats that fulfill the following criteria:\\
 \begin{itemize}
   \item{the format has a reputation, either through a scientific paper that proved its superiority over other relevant formats, or through broad usage of the format.}
-  \item{the format does not include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
-  \item{the format is open source.}
+  \item{the format should not specialize in only one type of \ac{DNA}.}
+  \item{the format mainly stores nucleotide sequences and does not necessarily include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
+  \item{the format is open source. Otherwise improvements could not be tested without buying the software and/or requesting permission to disassemble and reverse engineer it or parts of it.}
   \item{the compression method used in the format is based on probabilities.}
 \end{itemize}
 
-Some common fileformats would be:
+Information on available formats was gathered through various internet platforms \autocite{ensembl, ucsc, ga4gh}. 
+Some common file formats found:
 \begin{itemize}
 % which is relevant? 
-  \item{FASTA}
-  \item{\ac{FASTQ}}
-  \item{twoBit}
-  \item{SAM/BAM}
-  \item{VCF}
-  \item{BED}
+  \item{BED}  % \autocite{bed}
+  \item{CRAM} % \autocite{cram}
+  \item{FASTA} % \autocite{}
+  \item{\ac{FASTQ}} % \autocite{}
+  \item{GFF} % \autocite{}
+  \item{SAM/\ac{BAM}} % \autocite{}
+  \item{twoBit}% \autocite{}
+  \item{VCF}% \autocite{}
+
 \end{itemize}
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
-
-Since methods to store this kind of Data are still in development, there are many more filetypes. The few, mentioned above are used by different organisations and researchers and are backed by a scientific publication. % todo find sources to both points in last sentence
-%rewrite:
-In order to not go beyond the scope, this paper will only focuse on compression tools which are using standard formats.
+Since methods to store this kind of data are still in development, there are many more file types. The few mentioned above are used by different organisations and 
+%todo calc percentage 
+are backed by scientific papers.\\
+Considering the first criterion and searching through anonymously accessible \ac{ftp} servers, only two formats are commonly used: FASTA or its extension \ac{FASTQ}, and the \ac{BAM} format. %todo <- add ftp servers to cite
 
 \section{\ac{FASTQ}}
+% todo add some fasta knowledge
 \ac{FASTQ} is a text-based format for storing sequenced data. It saves the nucleotides as letters and, in addition to that, the quality values.
 \ac{FASTQ} files are split into groups of four lines; each group contains the information for one sequence. The exact structure of the \ac{FASTQ} format is as follows:
 \texttt{
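For reference, a complete four-line \ac{FASTQ} record follows this pattern (line 1: '@' plus a sequence identifier, line 2: the nucleotides, line 3: a '+' separator, line 4: one quality character per nucleotide); the identifier and values below are invented for illustration:

    @SEQ_ID_1
    GATTTGGGGTT
    +
    !''*((((***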
@@ -76,33 +87,3 @@ The quality value shows the estimated probability of error in the sequencing pro
   \caption{SAM/BAM file structure example}
   \label{k_datatypes:bam-struct}
 \end{figure}
-
-\subsection{Valid Symbols}
-%- char set restrictions: \ac{ascii} in range ! to ~ apart from '@'
-%- ... as regex (todo: describe in words): 
-% This RegEx -> \[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
-The regulare expression, shown above, filters touple of characters from a to z in lower and uppercase, numbers and several special characters. There are minor differences between the special chars first and second element of the touple, in the latter '*' and '=' are allowed which are not allowed in the first.
-
-%\subsection{Index File}
- 
-% notes
-%- may contian metadata: magic string, ref seq, bins, offsets, unmapped reads
-%- allows viewing BAM data (localy and remote via ftp/http)
-%- file extention: <filename>.bam.bai
-
-%- stores more data than \ac{FASTQ}
- 
-% src: https://support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm
-%- allignment section includes
-% - RG Read Group
-% - BC Barcode Tag
-% - SM Single-end alignment quality
-% - AS Paired-end alignment quality
-% - NM Edit distance tag
-% - XN amplicon name tag
-
-%- BAM index files nameschema: <filename>.bam.bai 
-
-\section{Compressed Reference-oriented Ailgnment Map}
-\ac{CRAM} was developed as an alternative to the \ac{SAM} and \ac{BAM} Format. It specification is maintained by \ac{GA4GH}. It features both lossy and lossless compression mode. Since it is not relevant to this work, the lossy compression is ignored from here on. Even though it is part of \ac{GA4GH} suite, the file format can be used independently.\\
-The format saves data in containers which consist out of slices. Each slice is represented by a line in the file. Container and slices each store metadata in a header. Data is stored as blocks in slices, in a compressed form.

+ 34 - 29
latex/tex/kapitel/k4_algorithms.tex

@@ -16,38 +16,35 @@
 %# POSSIBLE IMPROVEMENT
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
+\newcommand{\mycomment}[1]{}
+
+% entropy: see fim doku grundlagen2
+% merge dna and nucleotide into one chapter -> structure of dna. also drop the chapter (too generic)
+% file structure <-> datatypes. describe at greater length: e.g. file formats to store dna
+% remove 3.2.1
 
 \chapter{Compression approaches}
 The process of compressing data serves the goal of generating an output that is smaller than its input. In many cases, as in gene compression, the compression is ideally lossless. This means that for any compressed data, it is possible to recover the full information that was available in the original data by decompressing it. Lossy compression, on the other hand, may exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to convey the original information. This works for certain audio and picture files or for network protocols used to transmit live video/audio streams.
-For \acs{DNA} a lossless compression is needed. To be preceice a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete.\\
-
-% begin with entropy encoding/shannons source coding theorem
-\section{Shannons Entropy}
-The great mathematician, electrical engineer and cryptographer Claude Elwood Shannon developed information entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition. 
-
-% todo insert Fig. 1 shannon_1948
-
-Altering this figure shows how it can be used for other technology like compression.\\
-The Information source and destination are left unchanged, one has to keep in mind, that it is possible that both are represented by the same phyiscal actor. 
-transmitter and receiver are changed to compression/encoding and decompression/decoding and inbetween ther is no signal but any period of time.
+For \acs{DNA}, lossless compression is needed. To be precise, lossy compression is not possible, because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression, two major approaches are known: dictionary coding and entropy coding. Both are described in detail below.\\
 
-Shannons Entropy provides a formular to determine the '(un)certainty of a probability distribution'. This is used today to find the maximum amount of bits needed to store information. 
-% alphabet, chain of symbols, kurz entropy erklären
+\section{Dictionary coding}
+Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of it. The following digression is explained briefly for a better understanding of dictionary coding, but is of no great relevance to the focus of this work:
+% digression
+Looking at the string 'stationary', it might be smart to store 'station' and 'ary' as separate dictionary entries. Which way is more efficient depends on the text that should be compressed. 
+% end digression
+The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is built from the input data itself. Each first occurrence is left uncompressed; every occurrence of a string after the first one points back to its first occurrence. Since this 'pointer' needs less space than the string it points to, the size decreases.\\
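A minimal sketch of this back-pointer idea in plain Python (independent of the tools analyzed later; the tuple layout and window size are illustrative assumptions, and real implementations pack the pointers at the bit level):

    # Replace repeated substrings with (offset, length) pointers to their
    # first occurrence inside a sliding window; literals stay uncompressed.
    def dictionary_encode(data: str, window: int = 255):
        out, i = [], 0
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(data) and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_len, best_off = k, i - j
            if best_len > 2:              # only repetitions long enough to pay off
                out.append((best_off, best_len))
                i += best_len
            else:
                out.append(data[i])       # first occurrence: emit a literal
                i += 1
        return out

    print(dictionary_encode("AGAGAGAGT"))  # -> ['A', 'G', (2, 6), 'T']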
 
-\section{Arithmetic coding}
-Arithmetic coding is an approach to solve the problem of waste of memory, due to the overhead which is created by encoding certain lenghts of alphabets in binary. Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required, if there would be a possibility to encode two letters in the alphabet with one bit and the other one with a two byte combination. This approache is not possible because the letters would not be clearly distinguishable. The two bit letter could be interpreted either as the letter it should represent or as two one bit letters.
-% check this wording 'simulating' with sources 
-% this is called subdividing
-Arithmetic coding works by simulating a n-letter binary encoding for a n-letter alphabet. This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an intervall between two floating point number in the space between 0.0 and 1.0 (exclusively), which is determined by its distribution in the input text (intervall start) and the the start of the next character (intervall end). To encode a sequence of characters, the intervall start of the character is noted, its intervall is split into smaller intervalls with the ratios of the initial intervalls between 0.0 and 1.0. With this, the second character is choosen. This process is repeated for until a intervall for the last character is choosen.\\
-To encode in binary, the binary floating point representation of a number inside the intervall, for the last character is calculated, by using a similar process, described above, called subdividing.
-% its finite subdividing because processors bottleneck floatingpoints 
-
-\subsection{\ac{CABAC}}
-% a form of entropy coding
-% https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
+% unusable due to the lack of probability
+\mycomment{
+% - known algo
+\subsection{The LZ Family}
+The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their names, like \texttt{LZ77} and \texttt{LZ78}, which are short for Lempel Ziv 1977 and 1978. The number at the end indicates when the algorithm was published. Today \ac{LZ77} derivatives are widely used in Unix compression solutions like gzip, and such tools are also used in compressing \ac{DNA}.\\
+\ac{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. Typically this reference is stored in two bytes, whereby more than one byte can be used to point to the first occurrence, because usually less than one byte is required to store the length.\\
+% example 
+}
 
-\section{Huffman encoding}
-% list of algos and the tools that use them
+\section{Entropy coding}
+\subsection{Huffman coding}
 The well-known Huffman coding is used in several tools for genome compression. This subsection should give the reader a general impression of how this algorithm works, without going into detail. To use Huffman coding, one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree, and a few simple rules apply to it:
 % binary view for alphabet
 % length n of sequence to compromize
@@ -61,9 +58,16 @@ The well known Huffman coding, is used in several Tools for genome compression.
 The process of building the tree starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to a tree in which the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, which therefore occurs the most in the input data.
 Following one path from the root results in the binary representation of one symbol. For an alphabet like the one described above, the binary representation encoded in ASCII would be \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weight is assigned to each character, namely its number of occurrences, and the two nodes with the lowest weights are merged repeatedly. With the corresponding tree created from these weights, the binary code for each symbol changes to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.
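A minimal sketch in plain Python (independent of the tools analyzed later) that reproduces the codes from the example above by repeatedly merging the two lowest-weight nodes; the exact 0/1 patterns depend on tie-breaking:

    import heapq

    # Build Huffman codes for symbol weights, merging lowest weights first.
    def huffman_codes(counts: dict) -> dict:
        # Heap entries: (weight, tiebreaker, tree); a tree is a symbol or a pair.
        heap = [(w, i, s) for i, (s, w) in enumerate(sorted(counts.items()))]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)  # the two lowest-weight nodes ...
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, i, (t1, t2)))  # ... become one node
            i += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, str):
                codes[tree] = prefix or "0"
            else:
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"A": 10, "C": 8, "G": 4, "T": 2}))
    # -> {'A': '0', 'T': '100', 'G': '101', 'C': '11'}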
 
-% (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
-\subsection{\ac{LZ77}}
-\ac{LZ77} basically works, by removing all repetition of a string or substring and replacing them with information where to find the first occurence and how long it is. Typically it is stored in two bytes, whereby more than one one byte can be used to point to the first occurence because usually less than one byte is required to store the length.
+\section{Implementations}
+\subsection{} % geco
+\subsection{} % genie
+\subsection{} % samtools 
+
+\mycomment{
+\subsection{\ac{CABAC}}
+% a form of entropy coding
+% https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
+
 
 \section{Implementations}
 % SAM - LZ4 src: https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
@@ -76,3 +80,4 @@ Following one path results in the binary representation for one symbol. For an a
 The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. To be more specific, the raw data is compressed with \ac{LZ77} and the resulting data is shortened by using Huffman coding. 
 % huffman - little endian
 % lz77 compressed - big endian (least significant byte first/most left)
+}

+ 1 - 1
latex/tex/kapitel/k5_feasability.tex

@@ -72,7 +72,7 @@ For an initial test, a small pool of three tools was choosen.
   \item GeCo
   \item genie
 \end{itemize}
-Each of this tools comply with the criteria choosen in \autoref{chap:datatypes}.\\
+Each of these tools complies with the criteria chosen in \autoref{chap:filetypes}.\\
 To test each tool, the same set of data was used. The genome of a homo sapiens, id GRCh38, was chosen due to its size. TODO: find more exact criteria for testdata.
 The test data is available via an open FTP server, hosted by Ensembl. Source: \url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
 Test parameters that were focused on:

+ 35 - 0
latex/tex/literatur.bib

@@ -88,4 +88,39 @@
   publisher    = {Institute of Electrical and Electronics Engineers ({IEEE})},
 }
 
+@Online{ucsc,
+  author  = {{UCSC - University of California, Santa Cruz}},
+  date    = {2022-10-28},
+  title   = {UCSC Genome Browser},
+  url     = {https://genome.ucsc.edu/},
+  urldate = {2022-10-28},
+}
+
+@Online{ensembl,
+  author = {Paul Flicek},
+  date   = {2022-10-24},
+  title  = {ENSEMBL Project},
+  url    = {http://www.ensembl.org/},
+}
+
+@Online{ga4gh,
+  date  = {2022-10-10},
+  title = {Global Alliance for Genomics and Health},
+  url   = {https://github.com/samtools/hts-specs},
+}
+
+@Online{cram,
+  author = {Markus Hsi-Yang Fritz and Vadim Zalunin},
+  date   = {2022-10-28},
+  title  = {CRAM Compressed Reference-oriented Alignment Map},
+  url    = {https://samtools.github.io/hts-specs/CRAMv3.pdf},
+}
+
+@Online{bed,
+  author = {{Sanger Institute, Genome Research Limited}},
+  date   = {2022-10-20},
+  title  = {BED Browser Extensible Data},
+  url    = {https://samtools.github.io/hts-specs/BEDv1.pdf},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}