
2 Commits 4e59b22625 ... c8451e2d66

Author SHA1 Message Date
  u c8451e2d66 corr. until S. 20 3 years ago
  u d0a6e965a0 worked on lib 3 years ago

+ 0 - 200
latex/tex/&!make

@@ -1,200 +0,0 @@
-\chapter{Results and Discussion}
-The two tables \ref{t:effectivity} and \ref{t:efficiency} contain the raw measurement values for the two goals described in \ref{k5:goals}. Table \ref{t:efficiency} shows how long each compression procedure took, in milliseconds, while table \ref{t:effectivity} contains the file sizes in bytes. Each row contains information about one of the files, following this naming scheme:
-
-\texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa}
-
-To improve readability, the filenames in all tables were replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder x with the number following \texttt{File}.\\
-
-\section{Interpretation of Results}
-The units milliseconds and bytes offer high precision. Unfortunately, they are hard to read and compare by eye alone. Therefore, the data was transformed. Sizes in \ref{t:sizepercent} are displayed as percentages relative to the respective source file. This means that the compression with \acs{GeCo} on:
-
-Homo\_sapiens.GRCh38.dna.chromosome.11.fa 
-
-resulted in a compressed file that was only 17.6\% as big as the original.
-Runtimes in \ref{t:time} were converted into seconds and rounded to two decimal places.
-In addition, a line was added to the bottom of each table, showing the average percentage or runtime for each process.\\
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity]                       % Caption für das Tabellenverzeichnis
-        {File sizes in different compression formats in \textbf{percent}} % Caption für die Tabelle selbst
-        \label{t:sizepercent}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
-    \midrule
-			File 1& 18.32& 24.51& 22.03\\
-			File 2& 20.15& 26.36& 23.7\\
-			File 3& 19.96& 26.14& 23.69\\
-			File 4& 20.1& 26.26& 23.74\\
-			File 5& 17.8& 22.76& 20.27\\
-			File 6& 17.16& 22.31& 20.11\\
-			File 7& 16.21& 21.69& 19.76\\
-			File 8& 17.43& 23.48& 21.66\\
-			File 9& 18.76& 25.16& 23.84\\
-			File 10& 20.0& 25.31& 23.63\\
-			File 11& 17.6& 24.53& 23.91\\
-			File 12& 20.28& 26.56& 23.57\\
-			File 13& 19.96& 25.6& 23.67\\
-			File 14& 16.64& 22.06& 20.44\\
-			File 15& 79.58& 103.72& 92.34\\
-			File 16& 19.47& 25.52& 22.6\\
-			File 17& 19.2& 25.25& 22.57\\
-			File 18& 19.16& 25.04& 22.2\\
-			File 19& 18.32& 24.4& 22.12\\
-			File 20& 18.58& 24.14& 21.56\\
-			File 21& 16.22& 22.17& 19.96\\
-      &&&\\
-			\textbf{Total}& 21.47& 28.24& 25.59\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-Overall, Samtools \acs{BAM} resulted in a 71.76\% size reduction; the \acs{CRAM} method improved this by roughly 2.5 percentage points. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4 percentage points comes with a comparatively great sacrifice in time.\\
-
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Efficiency]                       % Caption für das Tabellenverzeichnis
-        {Compression duration in seconds} % Caption für die Tabelle selbst
-        \label{t:time}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
-    \midrule
-			File 1 & 23.5& 3.786& 16.926\\
-			File 2 & 24.65& 3.784& 17.043\\
-			File 3 & 2.016& 3.123& 13.999\\
-			File 4 & 19.408& 3.011& 13.445\\
-			File 5 & 18.387& 2.862& 12.802\\
-			File 6 & 17.364& 2.685& 12.015\\
-			File 7 & 15.999& 2.503& 11.198\\
-			File 8 & 14.828& 2.286& 10.244\\
-      File 9 & 12.304& 2.078& 9.21\\
-			File 10 & 13.493& 2.127& 9.461\\
-			File 11 & 13.629& 2.132& 9.508\\
-			File 12 & 13.493& 2.115& 9.456\\
-			File 13 & 99.902& 1.695& 7.533\\
-			File 14 & 92.475& 1.592& 7.011\\
-			File 15 & 85.255& 1.507& 6.598\\
-			File 16 & 82.765& 1.39& 6.089\\
-			File 17 & 82.081& 1.306& 5.791\\
-			File 18 & 79.842& 1.277& 5.603\\
-			File 19 & 58.605& 0.96& 4.106\\
-			File 20 & 64.588& 1.026& 4.507\\
-			File 21 & 41.198& 0.721& 3.096\\
-      &&&\\
-      \textbf{Total}&42.57&2.09&9.32\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-As \ref{t:time} shows, the average compression duration for \acs{GeCo} is 42.57s. That is a little over 33s more than the average runtime of samtools for compressing into the \acs{CRAM} format; samtools therefore needs less than a quarter of \acs{GeCo}'s time.\\
-Since \acs{CRAM} requires a file in \acs{BAM} format, the \acs{CRAM} column is calculated by adding the time needed to compress into \acs{BAM} to the time needed to compress into \acs{CRAM}.
-While the \acs{SAM} format is required for compressing a \acs{FASTA} file into \acs{BAM} and further into \acs{CRAM}, it does not itself feature any compression. However, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might be counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. It can be explained by comparing the sequence storing mechanisms: a \acs{FASTA} sequence section can be spread over multiple lines, whereas \acs{SAM} files store a sequence in just one line, so converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
-% (hi)storytime
-Before interpreting this data further, a quick look at the development processes: development of \acs{GeCo} stopped in 2016, while Samtools has been in development since 2015 and is still maintained to this day, with over 70 people contributing.\\
-% todo interpret bit files and compare
-
-% big tables
-Reviewing \ref{t:recal-time}, one will notice that \acs{GeCo} reached a runtime of over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an m indicates how many minutes each run took.
-
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity for larger files]                       % Caption für das Tabellenverzeichnis
-        {File sizes in different compression formats in \textbf{percent}} % Caption für die Tabelle selbst
-        \label{t:recal-size}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
-    \midrule
-			%geco bam and cram in percent
-			File 1& 1.00& 6.28& 5.38\\
-			File 2& 0.98& 6.41& 5.52\\
-			File 3& 1.21& 8.09& 7.17\\
-			File 4& 1.20& 7.70& 6.85\\
-			File 5& 1.08& 7.58& 6.72\\
-			File 6& 1.09& 7.85& 6.93\\
-			File 7& 0.96& 5.83& 4.63\\
-      &&&\\
-			\textbf{Total}& 1.07& 7.11& 6.17\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Efficiency for larger files]                       % Caption für das Tabellenverzeichnis
-        {Compression duration in seconds} % Caption für die Tabelle selbst
-        \label{t:recal-time}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
-    \midrule
-			%compress time for geco, bam and cram in seconds
-			File 1 & 1m58.427& 16.248& 23.016\\
-			File 2 & 1m57.905& 15.770& 22.892\\
-			File 3 & 1m09.725& 07.732& 12.858\\
-			File 4 & 1m13.694& 08.291& 13.649\\
-			File 5 & 1m51.001& 14.754& 23.713\\
-			File 6 & 1m51.315& 15.142& 24.358\\
-			File 7 & 2m02.065& 16.379& 23.484\\
-      &&&\\
-			\textbf{Total}	 & 1m43.447& 13.474& 20.567\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-In both tables \ref{t:recal-time} and \ref{t:recal-size} the already identified pattern can be observed. Looking at the compression ratio in \ref{t:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven was the one with the greatest size ($\sim$1.3 gigabytes), closely followed by files one and two ($\sim$1.2 gigabytes).
-% todo greater filesize means better compression
-
-\section{View on Possible Improvements}
-S. Petukhov described new findings about the distribution of nucleotides \cite{pet21}. From the probability of one nucleotide in a sequence of sufficient length, information about its direct neighbours is revealed. For example, from the probability of \texttt{C}, the summed probabilities of sets (n-plets) containing \texttt{C}, with \texttt{N} standing for any nucleotide, can be determined:\\
-\[ \%\mathrm{C} \approx \sum\%\mathrm{CN} \approx \sum\%\mathrm{NC} \approx \sum\%\mathrm{CNN} \approx \sum\%\mathrm{NCN} \approx \sum\%\mathrm{NNC} \approx \sum\%\mathrm{CNNN} \approx \sum\%\mathrm{NCNN} \approx \sum\%\mathrm{NNCN} \approx \sum\%\mathrm{NNNC} \]
-
-% begin optimization 
-Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good start to close the large gap in compression duration. Combined with a tool that is developed to today's standards, there is a possibility that even greater improvements could be achieved.\\
-% simple theoretical approach
-What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probability of each symbol in the alphabet. The simplest way to do that is to parse the whole sequence from start to end and increase a counter for each nucleotide that is read.
-Taking the new findings by S. Petukhov into consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine probabilities. A possible approach would be to use the probability of one nucleotide to determine the probabilities of the other nucleotides by a calculation, rather than by counting each one.
-This approach raises a few questions that need to be answered in order to plan an implementation:
-\begin{itemize}
-	\item How many probabilities are needed to calculate the others?
-	\item Is there space for improvement in the parsing/counting process?
-	%\item Is there space for visible improvements, when only counting one nucleotide?
-	\item How can the variation between probabilities be determined?
-\end{itemize}
-
-The second question must be asked because the improvement gained by counting only one nucleotide, in comparison to counting three, might be too little to be called relevant.
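-
-One elementary relation already illustrates why not every symbol has to be counted explicitly: the four probabilities must sum to one, so the last one follows from the other three. This identity is basic probability and is not part of Petukhov's findings:
-\[ p(A) + p(C) + p(G) + p(T) = 1 \quad\Rightarrow\quad p(T) = 1 - p(A) - p(C) - p(G) \]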
-%todo compare time needed: to store a variable <-> parsing the sequence
-To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately, it only covers loops and conditions as a whole. Therefore, a more detailed view of the operations must be created:
-Consider a single-threaded loop whose purpose is to count every nucleotide in a sequence. The process of counting can be split into several operations, defined by the following pseudocode.
-
-%todo use GeCo arith function with bigO
-while (sequence not end):\\
-\hspace*{1cm} next\_nucleotide = read\_next\_nucleotide(sequence)\\
-\hspace*{1cm} for (symbol in alphabet):\\
-\hspace*{2cm} if (symbol equals next\_nucleotide)\\
-\hspace*{3cm} count[symbol] = count[symbol] + 1\\
-\hspace*{2cm} fi\\
-\hspace*{1cm} rof\\
-elihw\\
-
-This loop will iterate over a whole sequence, counting each nucleotide. In line three, an inner loop can be found which iterates over the alphabet to determine which symbol should be increased. Considering the findings described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. The Big-O notation for this code, for any sequence of length $n$ and an alphabet of size $m$, would be decreased from O($n \cdot m$) to O($n$) \cite{big-o}. Even though $m$ is a small constant, dropping the inner comparisons reduces the number of operations per nucleotide and therefore also the runtime.\\
-The runtime of the calculations for the other symbols' probabilities must be considered as well and compared against the nested loop to be certain that the overall runtime was improved; a minimal sketch of the two counting variants follows below.
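-
-To make the comparison concrete, the following minimal sketch shows both counting variants in Python. It is only an illustration; the names and the reduced four-letter alphabet are assumptions for this example and are not taken from \acs{GeCo} or any other measured tool.
-\begin{verbatim}
-ALPHABET = "ACGT"
-
-def count_with_inner_loop(sequence):
-    # nested variant: compare every read nucleotide against the whole alphabet
-    counts = {symbol: 0 for symbol in ALPHABET}
-    for nucleotide in sequence:
-        for symbol in ALPHABET:          # inner loop over the alphabet
-            if symbol == nucleotide:
-                counts[symbol] += 1
-    return counts
-
-def count_direct(sequence):
-    # improved variant: one increment per nucleotide, no inner loop
-    counts = {symbol: 0 for symbol in ALPHABET}
-    for nucleotide in sequence:
-        counts[nucleotide] += 1
-    return counts
-
-def probabilities(counts):
-    # turn the absolute counts into the symbol probabilities used by entropy coding
-    total = sum(counts.values())
-    return {symbol: count / total for symbol, count in counts.items()}
-
-print(probabilities(count_direct("ACGTACGTAA")))
-\end{verbatim}
-Both variants produce identical counts; the difference lies only in the number of comparisons per read nucleotide.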
-% more realistic view on parsing todo need cites
-In practice, smarter methods are obviously used to determine probabilities, for example splitting the sequence into multiple parts and parsing each subsequence asynchronously. The results can either be summed up into global probabilities or be used individually on each associated subsequence. Either way, the presented improvement approach should be applicable to both parsing methods; a chunked variant is sketched below.\\
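-
-A chunked variant could look like the following Python sketch. The chunk size, the use of a process pool and the example call are assumptions made purely for illustration; real tools implement their own, more elaborate parsing strategies.
-\begin{verbatim}
-from concurrent.futures import ProcessPoolExecutor
-
-def count_chunk(chunk):
-    # count the nucleotides of one subsequence
-    counts = {symbol: 0 for symbol in "ACGT"}
-    for nucleotide in chunk:
-        counts[nucleotide] += 1
-    return counts
-
-def count_parallel(sequence, chunk_size=1_000_000):
-    # split the sequence into chunks and count them in parallel processes
-    chunks = [sequence[i:i + chunk_size]
-              for i in range(0, len(sequence), chunk_size)]
-    totals = {symbol: 0 for symbol in "ACGT"}
-    with ProcessPoolExecutor() as pool:
-        for counts in pool.map(count_chunk, chunks):
-            # sum the per-chunk counts into global counts
-            for symbol, value in counts.items():
-                totals[symbol] += value
-    return totals
-
-if __name__ == "__main__":
-    print(count_parallel("ACGT" * 10, chunk_size=8))
-\end{verbatim}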
-
-
-% how is data interpreted
-% why did the tools result in this, what can we learn
-% improvements
-% - goal: less time to compress
-% 	- approach: optimize probability determination
-% 	-> how?

+ 6 - 6
latex/tex/docinfo.tex

@@ -8,7 +8,7 @@
 \newcommand{\hsmatitelde}{Vergleich von Kompressionswerkzeugen für biologische Daten und Analyse von Verbesserungsmöglichkeiten}
 
 % Titel der Arbeit auf Englisch
-\newcommand{\hsmatitelen}{Comparison of compression tools for biological data and analysis of possible improvements}
+\newcommand{\hsmatitelen}{Comparison of Compression Tools for Biological Data and Analysis of Possible Optimization}
 
 % Weitere Informationen zur Arbeit
 \newcommand{\hsmaort}{Mannheim}          % Ort
@@ -17,8 +17,8 @@
 \newcommand{\hsmadatum}{01.12.22}      % Datum der Abgabe
 \newcommand{\hsmajahr}{2022}             % Jahr der Abgabe
 %\newcommand{\hsmafirma}{Paukenschlag GmbH, Mannheim} % Firma bei der die Arbeit durchgeführt wurde
-\newcommand{\hsmabetreuer}{Prof. Elena Fimmel, Hochschule Mannheim} % Betreuer an der Hochschule
-\newcommand{\hsmazweitkorrektor}{TBD}   % Betreuer im Unternehmen oder Zweitkorrektor
+\newcommand{\hsmabetreuer}{Prof. Dr. Elena Fimmel, Hochschule Mannheim} % Betreuer an der Hochschule
+\newcommand{\hsmazweitkorrektor}{Prof. Dr. Markus Gumbel}   % Betreuer im Unternehmen oder Zweitkorrektor
 \newcommand{\hsmafakultaet}{I}    % I für Informatik oder E, S, B, D, M, N, W, V
 \newcommand{\hsmastudiengang}{IB} % IB IMB UIB CSB IM MTB (weitere siehe titleblatt.tex)
 
@@ -39,8 +39,8 @@
 %          erkannt.
 
 % Kurze (maximal halbseitige) Beschreibung, worum es in der Arbeit geht auf Deutsch
-\newcommand{\hsmaabstractde}{Verschiedene Algorithmen werden verwendet, um sequenzierte DNA zu speichern. Eine neue Entdeckung darüber wie die Bausteine der DNA angeordnet sind, bietet die Möglichkeit vorhandene kompressions methoden zum speichern von sequenzierten DNA zu verbessern.\\
-Diese Arbeit vergleicht vier weit verbreitete Kompressionsmethoden und analysiert deren Verwendung von Algorithmen. Durch die Ergebnisse lässt sich der Schluss ziehen, dass Verbesserungen in der Implementation von arithmetischer codierung möglich sind. Die abschließende Diskussion betrachtet mögliche vorgehensweisen zur Vebesserung und welche Aufgaben diese mit sich ziehen könnten.}
+\newcommand{\hsmaabstractde}{Verschiedene Algorithmen werden verwendet, um sequenzierte DNA zu speichern. Eine neue Entdeckung darüber, wie die Bausteine der DNA angeordnet sind, bietet die Möglichkeit, vorhandene Kompressionsmethoden zum Speichern sequenzierter DNA zu verbessern.\\
+Diese Arbeit vergleicht vier weit verbreitete Kompressionsmethoden und analysiert deren Verwendung von Algorithmen. Durch die Ergebnisse lässt sich der Schluss ziehen, dass Verbesserungen in der Implementation von arithmetischer Codierung möglich sind. Die abschließende Diskussion betrachtet mögliche Vorgehensweisen zur Verbesserung und welche Aufgaben diese mit sich ziehen könnten.}
 
 % Kurze (maximal halbseitige) Beschreibung, worum es in der Arbeit geht auf Englisch
-\newcommand{\hsmaabstracten}{A variety of algorithms is used to compress sequenced DNA. New findings in the patterns of how the building blocks of DNA are distributed, might provide a chance to improve long used compression algorithms. The comparison of four widely used compression methods and a analysis on their implemented algorithms, leads to the conclusion that improvements are feasable. The closing discussion provides insights, in possible improvement approaches and which challenges they might involve.}
+\newcommand{\hsmaabstracten}{A variety of algorithms is used to compress sequenced DNA. New findings in the patterns of how the building blocks of DNA are distributed might provide a chance to improve long-used compression algorithms. The comparison of four widely used compression methods and an analysis of their implemented algorithms leads to the conclusion that improvements are feasible. The closing discussion provides insights into possible optimization approaches and the challenges they might involve.}

+ 1 - 1
latex/tex/kapitel/Abstract.tex

@@ -1 +1 @@
-A variety of algorithms is used to compress sequenced DNA. New findings in the patterns of how nucleotides are distributed in DNA, might provide a chance to improve long used compression algorithms. The comparison of four widely used compression methods and a analysis on their implemented algorithms, leads to the conclusion that improvements are feasable. The closing discussion provides insights, in possible improvement approaches and which challenges they might involve. 
+A variety of algorithms is used to compress sequenced DNA. New findings in the patterns of how nucleotides are distributed in DNA might provide a chance to improve long-used compression algorithms. The comparison of four widely used compression methods and an analysis of their implemented algorithms leads to the conclusion that improvements are feasible. The closing discussion provides insights into possible improvement approaches and the challenges they might involve.

+ 1 - 1
latex/tex/kapitel/a6_results.tex

@@ -4,7 +4,7 @@
 	\label{a6:compr-time}
   \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
     \caption[Compression efficiency in milliseconds]                       % Caption für das Tabellenverzeichnis
-        {\textbf{Compression duration of various tools, meassured in milliseconds}} % Caption für die Tabelle selbst
+        {\textbf{Compression duration of various tools, measured in milliseconds}} % Caption für die Tabelle selbst
         \\
     \toprule
      \textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}} & \textbf{Samtools \acs{CRAM}} \\

+ 1 - 0
latex/tex/kapitel/abkuerzungen.tex

@@ -6,6 +6,7 @@
 %          sortieren. Das passiert nicht automatisch.
 \begin{acronym}[IEEE]
   \acro{ANS}{Arithmetic Numeral System}
+  \acro{ANSI}{American National Standards Institute}
   \acro{ASCII}{American Standard Code for Information Interchange}
   \acro{BAM}{Binary Alignment Map}
   \acro{CABAC}{Context-Adaptive Arithmetic Coding}

+ 9 - 8
latex/tex/kapitel/k1_introduction.tex

@@ -1,22 +1,23 @@
 \chapter{Introduction}
 % general information and intro
 %Understanding how things in our cosmos work, was and still is a pleasure, that the human being always wants to fulfill. 
-Understanding the biological code of living things, is a alsways developing taks which is important for multiple aspekts of our live. The results of reasearch in this area provides knowledge that helps development in the medical sector, agriculture and more \cite{ju_21, wang-22, mo_83}.
-Getting insights into this biological code is possible through storing and studying information, embedded in genonmes \cite{dna_structure}. Since live is complex, there is a lot of information, which requires a lot of memory \cite{alok17, survey}.\\
+Understanding the biological code of living things is a constantly developing task which plays a significant part in multiple aspects of our lives. The results of research in this area provide knowledge that helps development in the medical sector, in agriculture and more \cite{ju_21, wang_22, mo_83}.
+Getting insights into this biological code is possible through storing and studying the information embedded in genomes \cite{dna_structure}. Since life is complex, there is a lot of information, which requires a lot of memory space \cite{alok17, survey}.\\
 % ...Communication with other researches means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to erorrs.\\
 % compression values and goals
-With compression algorithms and their implementation in tools, the problem of storing information got smaller. Compressed data requires less space and therefore less time to be transported over networks \cite{Shannon_1948}. This advantage is scalable, and since genetic information needs a lot of storage, even in a compressed state, improvements are welcomed \cite{moffat_arith}. Since this field is, compared to others, like computer theory which created the foundation for compression algorithms, relatively new, there is much to discover and new findings are not unusual \cite{Shannon_1948}. From some of this findings, new tools can be developed. In general they focus on increasing at least one of two factors: the speed at which data is compressed and the compresseion ratio, meaning the difference between uncompressed and compressed data \cite{moffat_arith, alok17, Shannon_1948}.\\
+%With compression algorithms and their implementation in tools, the problem of storing information got smaller.
+Compression algorithms and their implementations have helped towards resolving the problem of storing information. Compressed data requires less space and therefore less time to be transported over networks \cite{Shannon_1948}. This advantage is scalable and, since genetic information needs a lot of storage even in a compressed state, improvements are welcomed \cite{moffat_arith}. Since this field is relatively new compared to others, such as computer theory, which created the foundation for compression algorithms, there is much to discover and new findings are not unusual \cite{Shannon_1948, sam12, geco}. From some of these findings, new tools can be developed. In general they focus on increasing at least one of two factors: the speed at which data is compressed and the compression ratio, meaning the difference between uncompressed and compressed data \cite{moffat_arith, alok17, Shannon_1948}.\\
 % ...
 % more exact explanation
 
 % actual focus in short and simple terms
-New discoveries in the universal rules of stochastical organization of genomes might provide a base for new algorithms and therefore new tools or an improvement of existing ones for genome compression \cite{pet21}. The aim of this work is to analyze the current state of the art for compression tools for biological data and implemented, probabilistic algorithms. Further this work will determine if there is room for improvement.\\
-The discussion will include a superficial analysation of how and where this new approach could get implemented and what problems possibly need to be taken care of in the process.\\
+New discoveries in the universal rules of the stochastic organization of genomes might provide a base for new algorithms and therefore new tools, or an improvement of existing ones, for genome compression \cite{pet21}. The aim of this work is to analyze the current state of the art of compression tools for biological data and the probabilistic algorithms they implement. Furthermore, this work will determine if there is room for optimization.\\
+The discussion will include a superficial analysis of how and where this new approach could be implemented and what problems possibly need to be taken care of in the process.\\
 
 % focus and structure of work in greater detail 
-	To reach a common ground, the first pages will give the reader a quick overview on the structure of human DNA. There will also be an fundamental explanation for some basic terms, used in biology and computer science. The first step into the theory of genome compression will be taken, by describing differences in common file formats, used to store genome information. From there, a section which is relevant for understanding compression will follow. It will analyze differences between compression approaches, go over some history of coding theory and lead to a deeper look into the fundamentals of state of the art compression algorithms. The chapter will end with a few pages about implementations of compression algorithms in tools relevant.\\
-In order to meassure a improvement, a baseline must be set. Therefore the efficiency and effecitity of suiteable state of the art tools will be meassured. To be as precise as possible, the middle part of this work focuses on setting up an environment, picking input data, installing and executing tools and finaly meassuring and documenting the results.\\
-The results of this compared with the understanding of how the tools work, will show if there is the need of a improvement and on what factor it should focus. The end of this work will be used to discuss the properties of a a possible improvement, how feasability could be determined and which problems such a project woudl need overcome.\\
+To reach a common ground, the first pages will give the reader a quick overview of the structure of human DNA. This will include explanations of some basic terms used in biology and computer science. The first step into the theory of genome compression will be taken by describing differences in common file formats used to store genome information. From there, a section which is relevant for understanding compression will follow. It will analyze differences between compression approaches, go over some history of coding theory and lead to a deeper look into the fundamentals of state-of-the-art compression algorithms. The chapter will end with a few pages about implementations of compression algorithms in relevant tools.\\
+In order to measure an optimization, a baseline must be set. Therefore, the efficiency and effectivity of suitable state-of-the-art tools will be measured. To be as precise as possible, the middle part of this work focuses on setting up an environment, picking input data, installing and executing tools and finally measuring and documenting the results.\\
+These results, compared with the understanding of how the tools work, will show whether there is a need for improvement and on which factor it should focus. The end of this work will be used to discuss the properties of a possible optimization, how its feasibility could be determined and which problems such a project would need to overcome.\\
 
 % todo: 
 %   explain: coding 

+ 9 - 9
latex/tex/kapitel/k2_dna_structure.tex

@@ -17,9 +17,9 @@
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
 
-\chapter{The Structure of the Human Genome and how its Digital Form is Compressed}
+\chapter{The Structure of the Human Genome and How its Digital Form is Compressed}
 \section{Structure of Human \acs{DNA}}
-To strengthen the understanding of how and where biological information is stored, this section starts with a quick and general rundown on the structure of any living organism.\\
+To strengthen the understanding of how and where biological information is stored, this section starts with a quick and general rundown of the structure of any living organism.\\
 
 \begin{figure}[ht]
   \centering
@@ -28,9 +28,9 @@ To strengthen the understanding of how and where biological information is store
   \label{k2:gene-overview}
 \end{figure}
 
-All living organisms, like plants and animals, are made of cells. To get a rough impression, a human body can consist out of several trillion cells.
-A cell in itself, is the smallest living organism. Most cells consists of a outer section and a core which is a called nucleus. In \ref{k2:gene-overview} the nucleus is illustrated as a purple, cirlce like scheme, inside a lighter circle. The nucleus contains chromosomes. Those chromosomes contain genetic information, about its organism, in form of \ac{DNA} \cite{cells}.\\
-\acs{DNA} is often seen in the form of a double helix, like shown in \ref{k2:dna-struct}. A double helix consists, as the name suggests, of two single helix \cite{dna_structure}. 
+All living organisms, like plants and animals, are made of cells. To get a rough impression, a human body can consist of several trillion cells.
+A cell in itself is the smallest living organism. Most cells consist of an outer section and a core which is called the nucleus. In \ref{k2:gene-overview} the nucleus is illustrated as a purple, circle-like shape inside a lighter circle. The nucleus contains chromosomes. Those chromosomes contain genetic information about their organism in the form of \ac{DNA} \cite{cells}.\\
+\acs{DNA} is often seen in the form of a double helix, as shown in \ref{k2:dna-struct}. A double helix consists, as the name suggests, of two single helixes \cite{dna_structure}. 
 
 \begin{figure}[ht]
   \centering
@@ -41,8 +41,8 @@ A cell in itself, is the smallest living organism. Most cells consists of a oute
 
 Each of them consists of two main components: the sugar phosphate backbone, which is not relevant for this work, and the bases. The sugar phosphate backbones are illustrated as flat stripes, circulating around the horizontal line in \ref{k2:dna-struct}. Pairs of bases are symbolized as vertical bars between the sugar phosphates.
 The arrangement of bases represents the information stored in the \acs{DNA}. What is described here as a base is an organic molecule, which is also called a nucleotide \cite{dna_structure}.\\
-For this work, nucleotides are the most important parts of the \acs{DNA}. A nucleotide can occur in one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them got a Counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine.\\
-From the perspective of an computer scientist: The content of one helix must be stored, to persist the full information. In more practical terms: The nucleotides of only one (entire) helix needs to be stored physically, to save the information of the whole \acs{DNA}. The other half can be determined by ``inverting'' the stored one. 
+For this work, nucleotides are the most important parts of the \acs{DNA}. A nucleotide can occur in one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them has a counterpart with which a bond can be established: adenine can bond with thymine; guanine can bond with cytosine.\\
+From the perspective of a computer scientist: The content of one helix must be stored to persist the full information. In more practical terms: The nucleotides of only one (entire) helix need to be stored physically to save the information of the whole \acs{DNA}. The other half can be determined by ``inverting'' the stored one. 
 % todo OPT -> figure?
-An example would show the counterpart for e.g.: \texttt{adenine, guanine, adenine} chain which would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initiat. So the example would change to \texttt{AGA} in one Helix, \texttt{TCT} in the other.\\
-This representation ist commonly used to store \acs{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required but for now 'A', 'C', 'G' and 'T' should be the only concern.
+As an example, the counterpart of an \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial. So, the example would change to \texttt{AGA} in one helix, \texttt{TCT} in the other.\\
+This representation is commonly used to store \acs{DNA} digitally. Depending on the sequencing procedure and other factors, more information is stored and therefore more characters are required but for now 'A', 'C', 'G' and 'T' should be the only concern.

+ 31 - 32
latex/tex/kapitel/k3_datatypes.tex

@@ -24,42 +24,42 @@
 
 \section{File Formats used to Store DNA}
 \label{chap:file formats}
-As described in previous chapters \ac{DNA} can be represented by a string with the buildingblocks A,T,G and C. Using a common file format for saving text would be impractical because the ammount of characters or symbols in the used alphabet, defines how many bits are used to store each single symbol.\\
-The \ac{ASCII} \cite{iso-ascii} table is a character set, registered in 1975 and to this day still in use to encode texts digitally. For the purpose of communication bigger character sets replaced \acs{ASCII}. It is still used in situations where storage is short.\\
+As described in previous chapters, \ac{DNA} can be represented by a string with the building blocks A, T, G and C. Using a common file format for saving text would be impractical because the number of characters or symbols in the used alphabet defines how many bits are used to store each single symbol.\\
+The \ac{ASCII} \cite{iso-ascii} table is a character set registered in 1975, and to this day it is still in use to encode texts digitally. For the purpose of communication, larger character sets replaced \acs{ASCII}. It is still used in situations where storage is scarce.\\
 % grund dass ASCII abgelöst wurde -> zu wenig darstellungsmöglichkeiten. Pro heute -> weniger overhead pro character
-The buildingblocks of \acs{DNA} require a minimum of four letters, so at least two bits are needed. Storing a single \textit{A} with \acs{ASCII} encoding, requires 8 bit (\,excluding magic bytes and the bytes used to mark the \ac{EOF})\. Since there are at least $2^8$ or 128 displayable symbols with \acs{ASCII} encoding, this leaves a great overhead of unused combination.\\
+The building blocks of \acs{DNA} require a minimum of four letters, so at least two bits are needed. Storing a single \textit{A} with \acs{ASCII} encoding requires 8 bits (excluding magic bytes and the bytes used to mark the \ac{EOF}). Since \acs{ASCII} encoding defines $2^7$ or 128 symbols, this leaves a great overhead of unused combinations.\\
 % cout out examples. Might be needed later or elsewhere
 % \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. 
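+To illustrate this overhead, the following minimal Python sketch packs a nucleotide string into two bits per symbol, using a fixed mapping of 00 for A, 01 for T, 10 for G and 11 for C. The mapping and the function name are assumptions for this example and are not taken from any of the discussed formats.
+\begin{verbatim}
+CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}
+
+def pack_two_bit(sequence):
+    # pack up to four nucleotides into each byte instead of one per ASCII byte
+    packed = bytearray()
+    for i in range(0, len(sequence), 4):
+        byte = 0
+        for nucleotide in sequence[i:i + 4]:
+            byte = (byte << 2) | CODE[nucleotide]
+        packed.append(byte)
+    return bytes(packed)
+
+# 8 nucleotides occupy 8 bytes in ASCII but only 2 bytes packed
+print(len("ACGTACGT"), "->", len(pack_two_bit("ACGTACGT")))
+\end{verbatim}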
-In most tools, more than four symbols are used. This is due to the complexity in sequencing \acs{DNA}. It is not 100\% preceice, so additional symbols are used to mark nucelotides that could not or could only partly get determined. Further a so called quality score is used to indicate the certainty, for each single nucleotide, that it got sequenced correctly \cite{survey, Cock_2009}.\\
-More common everyday-usage text encodings like unicode require 16 bits per letter. So settling with \acs{ASCII} has improvement capabilities but is, on the other side, more efficient than using bulkier alternatives like unicode.\\
+In most tools, more than four symbols are used. This is due to the complexity of sequencing \acs{DNA}. It is not 100\% precise, so additional symbols are used to mark nucleotides that could not or could only partly be determined. Furthermore, a so-called quality score is used to indicate the certainty of correct sequencing for each nucleotide \cite{survey, Cock_2009}.\\
+More common everyday text encodings like \acs{UTF}-16 require 16 bits per letter. So, settling on \acs{ASCII} leaves room for improvement but is, on the other hand, more efficient than using bulkier alternatives.\\
 
 % differences between information that is store
-Formats for storing uncompressed genomic data, can be sorted into several categories. Three noticable ones would be \cite{survey}:
+Formats for storing uncompressed genomic data can be sorted into several categories. Three noticeable ones would be \cite{survey}:
 \begin{itemize}
-	\item Sequenced reads
-	\item Aligned data
 	\item Sequence variation
+	\item Aligned data
+	\item Sequenced reads
 \end{itemize}
-The categories are listed on their complexity, considering their usecase and data structure, in ascending order. Starting with sequence variation, also called haplotype describes formats storing graph based structures that focus on analysing variations in different genomes \cite{haplo, sam12}. 
-Sequenced reads focus on storing continous protein chains from a sequenced genome \cite{survey}.
-Aligned data is somwhat simliar to sequenced reads with the difference that instead of a whole chain of genomes, overlapping subsequences are stored. This could be described as a rawer form of sequenced reads. This way aligned data stores additional information on how certain a specific part of a genome is read correctly \cite{survey, sam12}.
-The focus of this work lays on compression of sequenced data but not on the likelyhood of how accurate the data might be. Therefore, only formats that are able to store sequenced reads will be worked with. Note that some alginged data formats are also able to store aligned reads, since latter is just a less informative representation of first \cite{survey, sam12}.\\
+The categories are listed in descending order based on their complexity, considering their use case and data structure. Starting with sequence variation, also called haplotype data, which describes formats storing graph-based structures that focus on analyzing variations in different genomes \cite{haplo, sam12}.
+Sequenced reads focus on storing continuous nucleotide chains from a sequenced genome \cite{survey}.
+Aligned data is somewhat similar to sequenced reads, with the difference that instead of one whole chain, overlapping subsequences are stored. This could be described as a rawer form of sequenced reads. This way aligned data stores additional information on how certain it is that a specific part of a genome was read correctly \cite{survey, sam12}.
+The focus of this work is the compression of sequenced data but not the likelihood of how accurate the data might be. Therefore, only formats that are able to store sequenced reads will be worked with. Note that some aligned data formats are also able to store sequenced reads, since the latter is just a less informative representation of the former \cite{survey, sam12}.\\
 
 % ausschluss criteria
-Several people and groups have developed different file formats to store genomes. Unfortunaly, the only standard for storing genomic data is fairly new \cite{isompeg, mpeg}. Therefore, formats and tools implementing this standard are mostly still in development. In order to not go beyond scope, this work will focus only on file formats that fulfill following criteria:\\
+Several people and groups have developed different file formats to store genomes. Unfortunately, the only standard for storing genomic data is fairly new \cite{isompeg, mpeg}. Therefore, formats and tools implementing this standard are mostly still in development. In order to not go beyond scope, this work will focus only on file formats that fulfill the following criteria:\\
 \begin{itemize}
   \item{The format has reputation. This can be indicated through:}
 	\begin{itemize}
 		\item{A scientific paper, that proved its superiority to other relevant tools.}
-		\item{A broad ussage of the format determined by its use on ftp servers, which focus on supporting scientific research.}
+		\item{A broad usage of the format determined by its use on ftp servers, which focus on supporting scientific research.}
 	\end{itemize}
   \item{The format should not specialize on only one type of \acs{DNA} or target a specific technology.}
-  \item{The format stores nucleotide seuqences and does not neccesarily include \ac{IUPAC} codes besides A, C, G and T \cite{iupac}.}
-  \item{The format is open source. Otherwise, improvements can not be tested, without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
+  \item{The format stores nucleotide sequences and does not necessarily include \ac{IUPAC} codes besides A, C, G and T \cite{iupac}.}
+  \item{The format is open source. Otherwise, optimizations cannot be tested without buying the software and/or requesting permission to disassemble and reverse engineer the software or parts of it.}
 \end{itemize}
 
-	Information on available formats where gathered through various Internet platforms \cite{ensembl, ucsc, ga4gh} and scientific papers \cite{survey, sam12, Cock_2009}. 
-Some common file formats found:\\
+Information on available formats was gathered through various Internet platforms \cite{ensembl, ucsc, ga4gh} and scientific papers \cite{survey, sam12, Cock_2009}. 
+Some common file formats are:\\
 
 \begin{itemize}
 % which is relevant? 
@@ -72,12 +72,12 @@ Some common file formats found:\\
 
 % groups: sequence data, alignment data, haplotypic
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
-Since methods to store this kind of Data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of a inoficial standard for sequenced reads \cite{survey, geco, sam12, vertical, cram-origin}.\\
-Considering the first criteria, by searching through anonymously accesable \acs{FTP} servers, only two formats are used commonly: FASTA or its extension \acs{FASTq} and the \acs{BAM} Format \cite{ftp-igsr, ftp-ncbi, ftp-ensembl}.
+Since methods to store this kind of data are still in development, there are many more file formats. From the selection listed above, \acs{FASTA} and \acs{FASTq} seem to have established the reputation of an unofficial standard for sequenced reads \cite{survey, geco, sam12, vertical, cram-origin}.\\
+Considering the first criterion, and searching through anonymously accessible \acs{FTP} servers, only two formats are commonly used: \acs{FASTA}, or its extension \acs{FASTq}, and the \acs{BAM} format \cite{ftp-igsr, ftp-ncbi, ftp-ensembl}.
 % todo Explain twobit: The last format, called twoBit is also included, because it is 
 
 \subsection{\acs{FASTA} and \acs{FASTq}}
-The rather simple \acs{FASTA} format, are widely used when it comes to storing sequenced reads, without a quality store \cite{sam12, survey}. Since it is a uncompressed format, \acs{FASTA} files are often transmitted compressed with an external tool like gzip \cite{ftp-ensembl, ftp-ncbi}.\\
+The rather simple \acs{FASTA} format is widely used when it comes to storing sequenced reads without a quality score \cite{sam12, survey}. Since it is an uncompressed format, \acs{FASTA} files are often transmitted compressed with an external tool like gzip \cite{ftp-ensembl, ftp-ncbi}.\\
 
 \begin{figure}[h]
   \centering
@@ -87,18 +87,17 @@ The rather simple \acs{FASTA} format, are widely used when it comes to storing s
 \end{figure}
 
 The format consists of two repeated sections. The first section consists of one line and stores metadata about the sequenced genome and the file itself. This line, also called header, contains a comment section starting with \texttt{>} followed by a custom text \cite{alok17, Cock_2009}. The comment section is usually used to store information about the sequenced genome and sometimes metadata about the file itself like its size in bytes.\\
-The other section contains the sequenced genome whereas each nucleotide is represented by character \texttt{A, C, G or T}. There are more nucleotide characters that store additional information and some characters for representing amino acids, but in order to not go beyond scope, only \texttt{A, C, G, and T} will be paid attention to \cite{iupac}.\\
-The second section can have multiple lines of sequences. A simliar format is the Multi-\acs{FASTA} file format, it consists of concatenated \acs{FASTA} files.\cite{survey}.\\
+The other section contains the sequenced genome whereas each nucleotide is represented by the character \texttt{A, C, G or T}. There are more nucleotide characters that store additional information and some characters for representing amino acids, but in order to not go beyond scope, only \texttt{A, C, G, and T} will be paid attention to \cite{iupac}.\\
+The second section can have multiple lines of sequences. A similar format is the Multi-\acs{FASTA} file format, which consists of concatenated \acs{FASTA} files \cite{survey}.\\
 % fastq
-In addition to its predecessor, \acs{FASTq} files contain a quality score. The file content consists of four sections, wherby no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formated in this order \cite{Cock_2009}:
+In addition to its predecessor, \acs{FASTq} files contain a quality score. The file content consists of four sections, where no section is stored in more than one line. All four lines contain information about one sequence. The exact structure of \acs{FASTq} is formatted in this order \cite{Cock_2009}:
 \begin{itemize}
 	\item Line 1: Sequence identifier aka. Title, starting with an @ and an optional description.
-	\item Line 2: The seuqence consisting of nucleoids, symbolized by A, T, G and C.
-	\item Line 3: A '+' that functions as a seperator or delmitier. Optionally followed by content of Line 1.
-	\item Line 4: quality line(s). consisting of letters and special characters in the \acs{ASCII} scope.
+	\item Line 2: The sequence consisting of nucleotides, symbolized by A, T, G and C.
+	\item Line 3: A '+' that functions as a separator or delimiter. Optionally followed by the content of line 1.
+	\item Line 4: Quality line(s), consisting of letters and special characters in the \acs{ASCII} scope.
 \end{itemize}
-The quality scores have no fixed format. To name a few, there is the sanger format, the solexa format introduced by Solexa Inc., the Illumina and the QUAL format which is generated by the PHRED software.\\
-The quality value shows the estimated probability of error in the sequencing process \cite{Cock_2009}.\\
+The quality scores have no fixed format. To name a few, there is the Sanger format, the Solexa format introduced by Solexa Inc., the Illumina and the QUAL format which is generated by the PHRED software \cite{Cock_2009}.\\
 
 \begin{figure}[h]
   \centering
@@ -108,12 +107,12 @@ The quality value shows the estimated probability of error in the sequencing pro
 \end{figure}
 
 In \ref{k3:fasta-struct} the described structure is illustrated. The sequence and the delimiter section were altered to illustrate the structure of this format better.
-In the header section, \texttt{SRR002906.1} is the sequence identifiert, the following text is a description. In the delimiter line, the header section without the leading @ could be written again. The last line shows the header for the second sequence.\\
+In the header section, \texttt{SRR002906.1} is the sequence identifier; the text that follows is a description. In the delimiter line, the header section without the leading @ could be written again. The last line shows the header for the second sequence.\\
 
 \label{k2:sam}
 \subsection{Sequence Alignment Map}
 % src https://github.com/samtools/samtools
-\acs{SAM} often seen in its compressed, binary representation \acs{BAM} with the fileextension \texttt{.bam}, is part of the SAMtools package, a uitlity tool for processing SAM/BAM and CRAM files. The SAM/BAM file is a text based format delimited by the whitespace character called tabulation or \textbf{tab} for short \cite{sam12}. It uses 7-bit US-ASCII, to be precise Charset ANSI X3.4-1968 \cite{rfcansi}. The structure is more complex than the one in \acs{FASTq} and described best, accompanied by an example:
+\acs{SAM}, often seen in its compressed binary representation \acs{BAM} with the file extension \texttt{.bam}, is part of the SAMtools package, a utility tool for processing SAM/BAM and CRAM files. The SAM file is a text-based format delimited by the whitespace character called tabulation, or \textbf{tab} for short \cite{sam12}. It uses 7-bit US-\acs{ASCII}; to be precise, the charset \acs{ANSI} X3.4-1968 \cite{rfcansi}. The structure is more complex than the one in \acs{FASTq} and is best described accompanied by an example:
 
 \begin{figure}[h]
   \centering
@@ -123,4 +122,4 @@ In the header section, \texttt{SRR002906.1} is the sequence identifiert, the fol
 \end{figure}
 
 Compared to \acs{FASTA}, \acs{SAM} and its further compressed forms store more information. As displayed in \ref{k2:bam-struct}, this is done by adding identifiers for reads, e.g. \textbf{+r003}, aligning subsequences and writing additional symbols like dots, e.g. \textbf{ATAGCT......} in the split alignment +r004 \cite{survey}. A full description of the information stored in \acs{SAM} files would be of little value to this work; therefore, further information is left out but can be found in \cite{sam12} or at \cite{bam}.\\
-Samtools provide the feature to convert a \acs{FASTA} file into \acs{SAM} format. Since there is no way to calulate mentioned, additional information from the information stored in \acs{FASTA}, the converted files only store two lines. The first one stores metadata about the file and the second stores the nucleotide sequence in just one line.
+Samtools provides the feature to convert a \acs{FASTA} file into \acs{SAM} format. Since there is no way to calculate the mentioned additional information from the information stored in \acs{FASTA}, the converted files only store two lines. The first line stores metadata about the file and the second stores the nucleotide sequence in just one line.

+ 55 - 54
latex/tex/kapitel/k4_algorithms.tex

@@ -23,53 +23,51 @@
 % file structure/format <-> datatypes. länger beschreiben: e.g. File formats to store dna
 % 3.2.1 raus
 
-\section{Compression aproaches}
+\section{Compression Approaches}
 The process of compressing data serves the goal to generate an output that is smaller than its input \cite{dict}.\\
-In many cases, like in gene compressing, the compression is idealy lossless. This means it is possible with any compressed data, to receive the full information that were available in the origin data, by decompressing it. Lossy compression on the other hand, might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typicaly not necessary to transmit the origin information. This works with certain audio and pictures files or with network protocols like \ac{UDP} which are used to transmit video/audio streams live \cite{rfc-udp, cnet13}.\\
-For storing \acs{DNA} a lossless compression is needed. To be preceice a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its exact position is needed for the sequence to be complete and usefull.\\
+In many cases, like in gene compression, the compression is ideally lossless. This means it is possible to recover the full information that was available in the original data by decompressing the compressed data. Lossy compression on the other hand might exclude parts of the data in the compression process, in order to increase the compression rate. The excluded parts are typically not necessary to transmit the original information. This works with certain audio and picture files, or with network protocols like \ac{UDP} which are used to transmit video/audio streams live \cite{rfc-udp, cnet13}.\\
+For storing \acs{DNA}, a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its exact position are needed for the sequence to be complete and useful.\\
 Before going on, the difference between information and data should be emphasized.\\
 % excurs data vs information
-Data contains information. In digital data  clear, physical limitations delimit what and how much of something can be stored. A bit can only store 0 or 1, eleven bit can store up to $2^{11}$ combinations of bit and a 1 \acs{GB} drive can store no more than 1 \acs{GB} data. Information on the other hand, is limited by the way how it is stored. What exactly defines informations, depends on multiple factors. The context in which information is transmitted and the source and destination of the information. This can be in form of a signal, transfered from one entity to another or information that is persisted so it can be obtained at a later point in time.\\
+Data contains information. In digital data, clear physical limitations delimit what and how much of something can be stored. A bit can only store 0 or 1, eleven bits can store up to $2^{11}$ combinations and a 1~\acs{GB} drive can store no more than 1~\acs{GB} of data. Information on the other hand is limited by the way it is stored. What exactly defines information depends on multiple factors: the context in which information is transmitted, and the source and destination of the information. This can be in the form of a signal, transferred from one entity to another, or information that is persisted, so it can be obtained at a later point in time.\\
 % excurs information vs data
-For the scope of this work, information will be seen as the type and position of nucleotides, sequenced from \acs{DNA}. To get even more preceise, it is a chain of characters from a alphabet of \texttt{A, C, G, and T}, since this is the \textit{de facto} standard for digital persistence of \acs{DNA} \cite{isompeg}.
-The boundaries of information, when it comes to storing capabilities, can be illustrated by using the example mentioned above. A drive with the capacity of 1 \acs{GB} could contain a book in form of images, where the content of each page is stored in a single image. Another, more resourceful way would be storing just the text of every page in \acs{UTF}-16 \cite{isoutf}. The information, the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without loosing any information.\\
+For the scope of this work, information will be seen as the type and position of nucleotides, sequenced from \acs{DNA}. To be even more precise, it is a chain of characters from an alphabet of \texttt{A, C, G, and T}, since this is the \textit{de facto} standard for digital persistence of \acs{DNA} \cite{isompeg}.
+When it comes to storing capabilities, the boundaries of information can be illustrated using the example mentioned above. A drive with a capacity of 1~\acs{GB} could contain a book in the form of images, where the content of each page is stored in a single image. Another, more resourceful way would be storing just the text of every page in \acs{UTF}-16 \cite{isoutf}. The information the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without losing any information.\\
 % excurs end
-For \acs{DNA} a lossless compression is needed. To be precise a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two mayor approaches are known: the dictionary coding and the entropy coding. Methods from both fields, that aquired reputation, are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
+As established above, for \acs{DNA} a lossless compression is needed. For lossless compression, two major approaches are known: dictionary coding and entropy coding. Methods from both fields that acquired a reputation are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
 
-\subsection{Dictionary coding}
+\subsection{Dictionary Coding}
 
 \label{k4:dict}
-Dictionary coding, as the name suggest, uses a dictionary to eliminate redundand occurences of strings. Strings are a chain of characters representing a full word or just a part of it. For a better understanding this should be illustrated by a short example:
+Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. Strings are chains of characters representing a full word or just a part of it. For a better understanding, this should be illustrated by a short example:
 % demo substrings
-Looking at the string 'stationary' it might be smart to store 'station' and 'ary' as seperate dictionary enties. Which way is more efficient depents on the text that should get compressed. 
+Looking at the string 'stationary', it might be smart to store 'station' and 'ary' as separate dictionary entries rather than the whole word. Which way is more efficient depends on the text that is to be compressed.
 % end demo
-The dictionary should only store strings that occur in the input data. Also storing a dictionary in addition to the (compressed) input data, would be a waste of resources. Therefore the dicitonary is part of the text. Each first occurence is left uncompressed. Each occurence of a string, after the first one, points either to to its first occurence or to the last replacement of its occurence.\\ 
-\ref{k4:dict-fig} illustrates how this process is executed. The bar on top of the figure, which extends over the full widht, symbolizes any text. The squares inside the text are repeating occurences of text segments. 
-In the dictonary coding process, the square annotated as \texttt{first occ.} is added to the dictionary. \texttt{second} and \texttt{third occ.} get replaced by a structure \texttt{<pos, len>} consisting of a pointer to the position of the first occurence \texttt{pos} and the length of that occurence \texttt{len}.
-The bar at the bottom of the figure shows how the compressed text for this example would be structured. The dotted lines would only consist of two bytes, storing position and lenght, pointing to \texttt{first occ.}. Decompressing this text would only require to parse the text from left to right and replace every \texttt{<pos, len>} with the already parsed word from the dictionary. This means jumping back to the parsed position stored in the replacement, reading for as long as the length dictates, copying the read section, jumping back and pasting the section.\\
+The dictionary should only store strings that occur in the input data. Storing a dictionary in addition to the (compressed) input data would also be a waste of resources; therefore the dictionary is part of the text. Each first occurrence is left uncompressed. Each occurrence of a string after the first one points either to its first occurrence or to the last replacement of its occurrence. Which method is used depends on the algorithm.\\ 
+\ref{k4:dict-fig} illustrates how this process is executed. The bar on top of the figure, which extends over the full width, symbolizes any text. The squares inside the text are repeating occurrences of text segments. 
+In the dictionary coding process, the square annotated as \texttt{first occ.} is added to the dictionary. \texttt{Second} and \texttt{third occ.} get replaced by a structure \texttt{<pos, len>} consisting of a pointer to the position of the first occurrence \texttt{pos} and the length of that occurrence \texttt{len}.
+The bar at the bottom of the figure shows how the compressed text for this example would be structured. The dotted lines would only consist of two bytes, storing position and length, pointing to \texttt{first occ.}. Decompressing this text would only require parsing the text from left to right and replacing every \texttt{<pos, len>} with the already parsed word from the dictionary. This means jumping back to the parsed position stored in the replacement, reading for as long as the length dictates, copying the read section, jumping back and pasting the section.\\
 % offsets are volatile when replacing
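+To make this decompression step more tangible, the following minimal sketch (in Python, with an illustrative token stream that is not taken from any of the tested tools) walks a token list from left to right and replaces every \texttt{<pos, len>} pair with text that has already been reconstructed:
+\begin{verbatim}
+# Minimal sketch: tokens are either literal strings (first occurrences)
+# or (pos, len) tuples pointing back into the already reconstructed text.
+def decompress(tokens):
+    text = ""
+    for token in tokens:
+        if isinstance(token, tuple):            # a <pos, len> back-reference
+            pos, length = token
+            text += text[pos:pos + length]      # copy the earlier occurrence
+        else:
+            text += token                       # first occurrence, stored as-is
+    return text
+
+# "station" occurs again; its repetition is replaced by the pair (0, 7)
+print(decompress(["stationary ", (0, 7), " master"]))
+# -> stationary station master
+\end{verbatim}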
 
 \begin{figure}[H]
   \centering
   \includegraphics[width=15cm]{k4/dict-coding.png}
-  \caption{Schematic sketch, illustrating the replacement of multiple occurences done in dictionary coding.}
+  \caption{Schematic sketch, illustrating the replacement of multiple occurrences done in dictionary coding.}
   \label{k4:dict-fig}
 \end{figure}
 
 \label{k4:lz}
 \subsubsection{The LZ Family}
-The computer scientist Abraham Lempel and the electrical engineere Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in its name, like \texttt{LZ77 and LZ78} which are short for Lempel Ziv 1977 and 1978 \cite{lz77}. The number at the end indictates when the algorithm was published. Today LZ78 is widely used in unix compression solutions like gzip and bz2. Those tools are also used in compressing \ac{DNA}.\\
+The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their name, such as \texttt{LZ77} and \texttt{LZ78}, which are short for Lempel Ziv 1977 and 1978 \cite{lz77}. The number at the end indicates when the algorithm was published. Today, members of the LZ family are widely used in general-purpose compression formats such as zip, gzip and rar \cite{rfcgzip}. Some of these are also used to compress \ac{DNA}.\\
 
-\acs{LZ77} basically works, by removing all repetition of a string or substring 
-and replacing them with information where to find the first occurence and how long it is. Lempel and Ziv described restricted the pointer in a range to integers. Today a pointer, length pair is typically stored in two bytes. One bit is reseverd to indicate that the next 15 bit are a position, lenght pair. More than 8 bit are available to store the pointer and the rest is reserved for storing the length. Exact amounts depend on the implementation \cite{rfc1951, lz77}.
+\acs{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information about where to find the first occurrence and how long it is. The distance between the first occurrence and a replacement is limited, because each pointer has a static amount of storage available. A pointer, length pair is typically stored in two bytes. One bit is reserved to indicate that the next 15 bits are a position, length pair. More than 8 bits are available to store the pointer and the rest is reserved for storing the length. Exact amounts depend on the implementation \cite{rfc1951, lz77}.
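+As an illustration, the following sketch assumes one possible split of the two bytes (one flag bit, eleven bits for the back-pointer and four bits for the length); the concrete numbers are an assumption for this example, not the layout of any specific implementation:
+\begin{verbatim}
+# Hypothetical 16-bit layout: 1 flag bit | 11 position bits | 4 length bits
+POS_BITS, LEN_BITS = 11, 4
+
+def pack(pos, length):
+    assert pos < 2 ** POS_BITS and length < 2 ** LEN_BITS
+    return (1 << 15) | (pos << LEN_BITS) | length   # top bit marks a pair
+
+def unpack(word):
+    assert word >> 15 == 1                           # flag bit must be set
+    return (word >> LEN_BITS) & (2 ** POS_BITS - 1), word & (2 ** LEN_BITS - 1)
+
+word = pack(pos=1234, length=9)
+print(f"{word:016b}", unpack(word))   # 1100110100101001 (1234, 9)
+\end{verbatim}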
 % rewrite and implement this:
-%This method is limited by the space a pointer is allowed to take. Other variants let the replacement store the offset to the last replaced occurence, therefore it is harder to reach a position where the space for a pointer runs out.
-
-Unfortunally, implementations like the ones out of LZ Family, do not use probabilities to compress and are therefore not in the main scope for this work. To strenghten the understanding of compression algortihms this section will remain. Also it will be usefull for the explanation of a hybrid coding method, which will get described later in this chapter.\\
+%This method is limited by the space a pointer is allowed to take. Other variants let the replacement store the offset to the last replaced occurrence, therefore it is harder to reach a position where the space for a pointer runs out.
 
+Unfortunately, implementations like those of the LZ family do not use probabilities to compress and are therefore not in the main scope of this work. To strengthen the general understanding of compression algorithms, and because it is part of hybrid coding implementations, this section remains.\\
 
\subsection{Shannon's Entropy}
-The founder of information theory Claude Elwood Shannon described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition.
+The founder of information theory, Claude Elwood Shannon, described entropy and published his work in 1948 \cite{Shannon_1948}. There, he focused on transmitting information, but his theorem is applicable to almost any form of communication signal, and his findings are useful beyond information transmission.
 
 % todo insert Fig. 1 shannon_1948
 \begin{figure}[H]
@@ -79,10 +77,10 @@ The founder of information theory Claude Elwood Shannon described entropy and pu
   \label{k4:comsys}
 \end{figure}
 
-Altering \ref{k4:comsys} would show how this can be applied to other technology like compression. The Information source and destination are left unchanged, one has to keep in mind, that it is possible that both are represented by the same physical actor.\\
+Altering \ref{k4:comsys} would show how this can be applied to other technology like compression. The information source and destination are left unchanged; one has to keep in mind that both may be represented by the same physical actor.\\
Transmitter and receiver would be changed to compression/encoding and decompression/decoding. In between those two, there is no signal but an arbitrary period of time \cite{Shannon_1948}.\\
 
-Shannons Entropy provides a formula to determine the 'uncertainty of a probability distribution' in a finite field.
+Shannon's Entropy provides a formula to determine the ``uncertainty of a probability distribution'' over a finite probability space.
 
 \begin{equation}\label{eq:entropy}
 %\resizebox{.9 \textwidth}{!}
@@ -99,7 +97,7 @@ Shannons Entropy provides a formula to determine the 'uncertainty of a probabili
 %  \label{k4:entropy}
 %\end{figure}
 
-He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite probability space. Then $x\in X$ are possible final states of an probability experiment over X. Every state that actually occurs, while executing the experiment generates information which is meassured in \textit{Bits} with the part of the equation displayed in \ref{eq:info-in-bit} \cite{delfs_knebl,Shannon_1948}:
+He defined entropy as shown in \eqref{eq:entropy}. Let $X$ be a finite probability space. Then $x\in X$ are the possible final states of a probability experiment over $X$. Every state that actually occurs while executing the experiment generates information, which is measured in binary digits (\textit{bits} for short) with the part of the equation displayed in \ref{eq:info-in-bit} \cite{delfs_knebl,Shannon_1948}:
 
 \begin{equation}\label{eq:info-in-bit}
 \log_2\left(\frac{1}{prob(x)}\right) \equiv - \log_2(prob(x)).
@@ -112,6 +110,8 @@ He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite prob
 %  \label{f4:info-in-bit}
 %\end{figure}
 
+%Notable here is that \textit{Bits} refers to the unit of information entropy. Even though they store the same form of data, no indications could be found that there is a direct connection to the binary digit (\,bit)\, that describes the physical unit to store information in computer science.
+
 %todo explain 2.2 second bulletpoint of delfs_knebl. Maybe read gumbl book
 
 %This can be used to find the maximum amount of bit needed to store information.\\ 
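+As a small illustration of \eqref{eq:entropy}, the following sketch computes the entropy for two assumed symbol distributions; the probabilities are illustrative only and not taken from any of the test files:
+\begin{verbatim}
+from math import log2
+
+def entropy(probabilities):
+    # H(X) = sum over x of prob(x) * log2(1 / prob(x)), in bits per symbol
+    return sum(p * log2(1 / p) for p in probabilities if p > 0)
+
+print(entropy([0.5, 0.25, 0.25]))         # 1.5 bits per symbol
+print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits for a uniform 4-symbol alphabet
+\end{verbatim}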
@@ -119,7 +119,7 @@ He defined entropy as shown in figure \eqref{eq:entropy}. Let X be a finite prob
 
 \label{k4:arith}
 \subsection{Arithmetic coding}
-This coding method is an approach to solve the problem of wasting memeory due to the overhead which is created by encoding certain lenghts of alphabets in binary \cite{ris76, moffat_arith}. For example: Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bit, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: Less storage would be required, if there would be a possibility to encode more than one letter in two bit.\\
+This coding method is an approach to solve the problem of wasting memory due to the overhead created by encoding alphabets of certain lengths in binary \cite{ris76, moffat_arith}. For example: encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations of two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: less storage would be required if there was a possibility to encode more than one letter in two bits.\\
 Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \cite{ris76}. % Besides information theory and math, he also published stuff about dna
This work's goal was to define an algorithm that requires no blocking, meaning the input text could be encoded as a whole instead of splitting it and encoding the smaller parts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.
 
@@ -141,10 +141,10 @@ The coding algorithm works with probabilities for symbols in an alphabet. From a
 \end{equation}
 }
 % math and computers
-Bevore getting into the arithmetic coding algorithm, the following section will go over some details on how digital fractions are handled by computers. This knowledge will be helpfull in understanding how arithmetic coding works.\\
-In computers, arithmetic operations on floating point numbers are processed with integer representations of given floating point number \cite{ieee-float}. The number 0.4 + would be represented by $4\cdot 10^{-1}$.\\
-A interval would be represented by natural numbers between 0 and 100 and $... \cdot 10^-x$. \texttt{x} starts with the value 2 and grows as the intgers grow in length, meaning only if a uneven number is divided. For example: Dividing a uneven number like $5\cdot 10^{-1}$ by two, will result in $25\cdot 10^{-2}$. On the other hand, subdividing $4\cdot 10^y$ by two, with any negativ real number as y would not result in a greater \texttt{x} the length required to display the result will match the length required to display the input number \cite{witten87, moffat_arith}.\\
-Binary fractions are limited in from of representing decimal fractions. This is due to the fact that every other digit, adds zero or half of the value before. In other terms: $b \cdot 2^{-n}$ determines the value of $b \in {0,1}$ at position n behind the decimal point.\\
+Before getting into the arithmetic coding algorithm, the following section will go over some details on how digital fractions are handled by computers. This knowledge will be helpful in understanding how arithmetic coding works.\\
+In computers, arithmetic operations on floating point numbers are processed with integer representations \cite{ieee-float}. The number 0.4 for example would be represented by $4\cdot 10^{-1}$.\\
+An interval would be represented by natural numbers between 0 and 100, scaled by a factor $10^{-x}$. \texttt{x} starts with the value 2 and grows as the integers grow in length, which happens only if an odd number is divided. For example: dividing an odd number like $5\cdot 10^{-1}$ by two will result in $25\cdot 10^{-2}$. On the other hand, subdividing $4\cdot 10^{y}$ by two, with any negative real number as $y$, would not result in a greater \texttt{x}; the length required to display the result will match the length required to display the input number \cite{witten87, moffat_arith}.\\
+Binary fractions are limited in their ability to represent decimal fractions. This is due to the fact that every further digit adds either zero or half of the value of the digit before it. In other terms: $b \cdot 2^{-n}$ determines the value of $b \in \{0,1\}$ at position $n$ behind the decimal point.\\
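+The following small sketch illustrates this: every further binary digit contributes half the value of the one before it, so some decimal fractions (such as 0.4) have no finite binary representation, while the value 0.0110101 used in the arithmetic coding example below evaluates to exactly 0.4140625:
+\begin{verbatim}
+def binary_fraction(digits):
+    # interpret a string of binary digits after the point, e.g. "0110101"
+    return sum(int(b) * 2 ** -(n + 1) for n, b in enumerate(digits))
+
+print(binary_fraction("0110101"))  # 0.4140625
+print(binary_fraction("01"))       # 0.25; 0.4 itself has no finite expansion
+\end{verbatim}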
 
 
 %todo example including figure
@@ -156,8 +156,8 @@ Binary fractions are limited in from of representing decimal fractions. This is
   \label{k4:arith-unscaled}
 \end{figure}
 
-The encoding of the input text, or a sequence is possible by projecting it on a binary encoded fraction between 0 and 1. To get there, each character in the alphabet is represented by an interval between two fractions, in the space between 0.0 and 1.0. In \ref{k4:arith-unscaled} this space is illustraded by the line in the upper center, with a scaling form 0.0 on the left, to 1.0 on the right side. The interval for each symbol is determined by its distribution, in the input text (interval start) and the the start of the next character (interval end). The sum of all intervals will result in one \cite{moffat_arith}.\\
-In order, to remain in a presentable range, the example in \ref{k4:arith-unscaled} uses an alphabet of only three characters: \texttt{A, C and G}. For the sequence \texttt{AGAC} a probability distribution as shown in the upper left corner and listed in \ref{t:arith-prob} was calculated. The intervals resulting from this probabilities, are visualized by the three sections marked by outwards pointing arrows at the top. The interval for \texttt{A} extends from 0.0 until the start of \texttt{C} at 0.5, which extends to the start of \texttt{G} at 0.75 and so on.\\
+Encoding the input text, or a sequence, is possible by projecting it onto a binary encoded fraction between 0 and 1. To get there, each character in the alphabet is represented by an interval between two fractions in the space between 0.0 and 1.0. In \ref{k4:arith-unscaled} this space is illustrated by the line in the upper center, with a scale from 0.0 on the left to 1.0 on the right side. The interval for each symbol is determined by its distribution in the input text (interval start) and the start of the next character's interval (interval end). The sum of all interval widths is one \cite{moffat_arith}.\\
+In order to remain in a presentable range, the example in \ref{k4:arith-unscaled} uses an alphabet of only three characters: \texttt{A, C and G}. For the sequence \texttt{AGAC} a probability distribution as shown in the upper left corner and listed in \ref{t:arith-prob} was calculated. The intervals resulting from these probabilities are visualized by the three sections marked by outwards pointing arrows at the top. The interval for \texttt{A} extends from 0.0 until the start of \texttt{C} at 0.5, which extends to the start of \texttt{G} at 0.75 and so on.\\
 
 \label{t:arith-prob}
 \sffamily
@@ -169,32 +169,32 @@ In order, to remain in a presentable range, the example in \ref{k4:arith-unscale
     \toprule
      \textbf{Symbol} & \textbf{Probability} & \textbf{Interval}\\
     \midrule
-			A & $\frac{2}{4}=0.11$ & [0.0, 0.5[ \\ %${x\in \mathbb{Q} | 0.0 <= x < 0.5}$\\
-			C & $\frac{1}{4}=0.71$ & [0.5, 0.75[ \\ %${x\in \mathbb{Q} | 0.5 <= x < 0.75}$\\
-			G & $\frac{1}{4}=0.13$ & [0.75, 1.0[ \\ %${x\in \mathbb{Q} | 0.75 <= x < 1.0}$\\
+			A & $\frac{2}{4}=0.5$ & $[ 0.0,$ $ 0.5 ) $ \\ %${x\in \mathbb{Q} | 0.0 <= x < 0.5}$\\
+			C & $\frac{1}{4}=0.25$ & $[ 0.5,$ $ 0.75 ) $ \\ %${x\in \mathbb{Q} | 0.5 <= x < 0.75}$\\
+			G & $\frac{1}{4}=0.25$ & $[ 0.75,$ $ 1.0 ) $ \\ %${x\in \mathbb{Q} | 0.75 <= x < 1.0}$\\
     \bottomrule
   \end{longtable}
 \end{footnotesize}
 \rmfamily
 
-In the encoding process, the first symbol read from the sequence determines a interval, its symbol is associated with. Every following symbol determines a subinterval, which is formed by subdividing the previous interval into sections proportional to the probabilities from \ref{t:arith-prob}.
-Starting with \texttt{A}, the most left interval in \ref{k4:arith-unscaled} is subdivided into intervals visulaized below. Leaving a available space of $[0.0, 0.5)$. From there the interval, representing \texttt{G} is subdivided, and so on until the last symbol \texttt{C} is processed. This leaves a interval of $[0.40625, 0.421275)$. This is marked in \ref{k4:arith-unscaled} with a red line. Since the interval is comparably small, in the illustration it seems like a point in the interval is marked. This is not the case, the red line shows the position of the last mentioned interval.\\
+In the encoding process, the first symbol read from the sequence determines the interval its symbol is associated with. Every following symbol determines a subinterval, which is formed by subdividing the previous interval into sections proportional to the probabilities from \ref{t:arith-prob}.
+Starting with \texttt{A}, the leftmost interval in \ref{k4:arith-unscaled} is subdivided into the intervals visualized below it, leaving an available space of $[0.0, 0.5)$. From there, the interval representing \texttt{G} is subdivided, and so on until the last symbol \texttt{C} is processed. This leaves an interval of $[0.40625, 0.421875)$, which is marked in \ref{k4:arith-unscaled} with a red line. Since the interval is comparably small, in the illustration it seems like a point in the interval is marked. This is not the case; the red line shows the position of the last mentioned interval.\\
 %To encode a text, subdividing is used, step by step on the text symbols from start to the end
-To store the encoding result in as few bits as possible, only a single number,between upper and lower end of the last intervall will be stored. To encode in binary, the binary floating point representation of any number inside the interval, for the last character is calculated.\\
-For this example, the number \texttt{0.41484375} in decimal, or \texttt{0.0110101} in binary, would be calculated.\\
+To store the encoding result in as few bits as possible, only a single number between the upper and the lower end of the last interval will be stored. To encode in binary, the binary floating point representation of any number inside the interval for the last character is calculated.\\
+In this example, the number \texttt{0.4140625} in decimal, or \texttt{0.0110101} in binary, would be calculated.\\
 %todo compression ratio
 To summarize the encoding process in short \cite{moffat_arith, witten87}:\\
  
 \begin{itemize}
 	\item The interval representing the first character is noted. 
 	\item Its interval is split into smaller intervals, with the ratios of the initial intervals between 0.0 and 1.0. 
-	\item The interval representing the second character is choosen.
-	\item This process is repeated, until a interval for the last character is determined.
-	\item A binary floating point number is determined wich lays in between the interval that represents the represents the last symbol.\\
+	\item The interval representing the second character is chosen.
+	\item This process is repeated until an interval for the last character is determined.
+	\item A binary floating point number is determined which lies within the interval that represents the last symbol (a short sketch of the whole procedure in code follows this list).\\
 \end{itemize}
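+The following minimal sketch implements these steps for the example alphabet and the probabilities from \ref{t:arith-prob}; it uses exact fractions instead of the finite-precision variant discussed below:
+\begin{verbatim}
+from fractions import Fraction
+
+# symbol -> (interval start, interval end), taken from the example table
+INTERVALS = {"A": (Fraction(0), Fraction(1, 2)),
+             "C": (Fraction(1, 2), Fraction(3, 4)),
+             "G": (Fraction(3, 4), Fraction(1))}
+
+def encode(sequence):
+    low, high = Fraction(0), Fraction(1)
+    for symbol in sequence:
+        width = high - low
+        sym_low, sym_high = INTERVALS[symbol]
+        # subdivide the current interval proportionally to the symbol's interval
+        low, high = low + width * sym_low, low + width * sym_high
+    return low, high
+
+low, high = encode("AGAC")
+print(float(low), float(high))   # 0.40625 0.421875
+\end{verbatim}
+Any number inside the returned interval, such as \texttt{0.4140625}, identifies the sequence \texttt{AGAC}.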
 
 % its finite subdividing because of the limitation that comes with processor architecture
-For the decoding process to work, the \ac{EOF} symbol must be be present as the last symbol in the text. The compressed file will store the probabilies of each alphabet symbol as well as the floatingpoint number. The decoding process executes in a simmilar procedure as the encoding. The stored probabilies determine intervals. Those will get subdivided, by using the encoded floating point as guidance, until the \ac{EOF} symbol is found. By noting in which interval the floating point is found, for every new subdivision, and projecting the probabilies associated with the intervals onto the alphabet, the origin text can be read \cite{witten87, moffat_arith, ris76}.\\
+For the decoding process to work, the \ac{EOF} symbol must be present as the last symbol in the text. The compressed file will store the probabilities of each alphabet symbol as well as the floating point number. The decoding process executes in a similar procedure as the encoding. The stored probabilities determine intervals. Those will get subdivided by using the encoded floating point number as guidance until the \ac{EOF} symbol is found. By noting in which interval the floating point number is found for every new subdivision, and projecting the probabilities associated with the intervals onto the alphabet, the original text can be read \cite{witten87, moffat_arith, ris76}.\\
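+A matching decoding sketch, reusing the \texttt{INTERVALS} table and the \texttt{Fraction} import from the encoding sketch above, could look as follows; for brevity it decodes a fixed number of symbols instead of stopping at an \ac{EOF} symbol:
+\begin{verbatim}
+def decode(value, length):
+    low, high = Fraction(0), Fraction(1)
+    symbols = []
+    for _ in range(length):
+        width = high - low
+        for symbol, (sym_low, sym_high) in INTERVALS.items():
+            if low + width * sym_low <= value < low + width * sym_high:
+                symbols.append(symbol)
+                low, high = low + width * sym_low, low + width * sym_high
+                break
+    return "".join(symbols)
+
+print(decode(Fraction(53, 128), 4))   # 53/128 = 0.4140625 -> "AGAC"
+\end{verbatim}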
 % rescaling
 
 \begin{figure}[H]
@@ -206,25 +206,26 @@ For the decoding process to work, the \ac{EOF} symbol must be be present as the
 \end{figure}
 
 % finite percission
-The described coding is only feasible on machines with infinite percission \cite{witten87}. As soon as finite precission comes into play, the algorithm must be extendet, so that a certain length in the resulting number will not be exceeded. Since digital datatypes are limited in their capacity, like unsigned 64-bit integers which can store up to $2^64-1$ bit or any number between 0 and 18,446,744,073,709,551,615. That might seem like a great ammount at first, but considering a unfavorable alphabet, that extends the results lenght by one on each symbol that is read, only texts with the length of 63 can be encoded (62 if \acs{EOF} is exclued) \cite{moffat_arith}. For the compression with finite percission, rescaling is used. This method works by scaling up the intervals which results from subdividing. With that. The process for this is illustrated in \ref{k4:arith-scaled}. The red lines indicate the final interval.
+The described coding is only feasible on machines with infinite precision \cite{witten87}. As soon as finite precision comes into play, the algorithm must be extended so that a certain length in the resulting number will not be exceeded. This is due to the fact that digital datatypes are limited in their capacity; for example, an unsigned 64-bit integer can store any number between 0 and $2^{64}-1$, i.e. 18,446,744,073,709,551,615. That might seem like a great amount at first, but considering an unfavorable alphabet that extends the result's length by one bit on each symbol that is read, only sequences with a length of 63 can be encoded (62 if \acs{EOF} is excluded) \cite{moffat_arith}. For compression with finite precision, rescaling is used. This method works by scaling up the intervals which result from subdividing. The upscaling process is illustrated in \ref{k4:arith-scaled}. The vertical lines illustrate the interval of each step. The smaller, black lines between them indicate which previous section was scaled up. The red lines indicate the final interval and the letters at the bottom indicate which symbol gets encoded in each step.
 
 \label{k4:huff}
-\subsection{Huffman encoding}
+\subsection{Huffman Encoding}
 % list of algos and the tools that use them
-D. A. Huffmans work focused on finding a method to encode messages with a minimum of redundance. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similar. The Shannon-Fano coding is not used today, due to the superiority in both efficiency and effectivity, in comparison to Huffman. % todo any source to last sentence. 
+D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a coding procedure developed by Shannon and Fano, named after its developers, which worked similarly. The Shannon-Fano coding is not used today due to the superiority of Huffman's algorithm in both efficiency and effectivity \cite{moffat_arith}.\\
Even though his work was released in 1952, the method he developed is still in use today, not only in tools for genome compression but also in compression tools with a more general usage \cite{rfcgzip}.\\ 
-Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, on waste through unused bit, for certain alphabet lengths. Huffman did not save more than one symbol in one bit, like it is done in arithmetic coding, but he decreased the number of bit used per symbol in a message. This is possible by setting individual bit lengths for symbols, used in the text that should get compressed \cite{huf52}. 
-As with other codings, a set of symbols must be defined. For any text constructed with symbols from mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for given text. The binary tree will be constructed after following guidelines \cite{alok17}:
+Compression with the Huffman algorithm also provides a solution to the problem described at the beginning of \ref{k4:arith}: waste through unused bits for certain alphabet lengths. Huffman did not encode more than one symbol per bit, as is effectively done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for the symbols used in the text that should get compressed \cite{huf52}. 
+As with other codings, a set of symbols must be defined. For any text constructed with symbols from this alphabet, a binary tree is constructed, which will determine how each individual symbol will be encoded. The binary tree is constructed following these guidelines \cite{alok17}:
 % greedy algo?
 \begin{itemize}
   \item Every symbol of the alphabet is one leaf.
  \item The right branch from every node is marked as a 1, the left one is marked as a 0.
-  \item Every symbol got a weight, the weight is defined by the frequency the symbol occurs in the input text. This might be a fraction between 0 and 1 or an integer. In this scenario it will described as the first.
-  \item The less weight a leaf has, the higher the probability is, that this node is read next in the symbol sequence.
-  \item The leaf with the lowest probability is most left and the one with the highest probability is most right in the tree. 
+  \item Every symbol has a weight. The weight is defined by the frequency with which the symbol occurs in the input text. This might be a fraction between 0 and 1 or an integer. In this scenario it will be described as the former.
+  \item The less weight a leaf has, the lower the probability that its symbol is read next in the symbol sequence, and the longer its codeword will be.
+	\item Pairs of the lowest-weighted nodes are formed. Such a pair is from there on represented by a node whose weight is equal to the sum of the weights of its child nodes.
+	\item Higher-weighted nodes are positioned on the left, lower-weighted ones on the right (a sketch of this construction in code follows the list).
 \end{itemize}
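+A minimal sketch of this bottom-up construction, using a priority queue and illustrative occurrence counts that are not taken from the thesis' test data, could look like this:
+\begin{verbatim}
+import heapq
+from itertools import count
+
+def huffman_codes(weights):
+    tiebreak = count()   # keeps comparisons well-defined for equal weights
+    heap = [(w, next(tiebreak), (sym,)) for sym, w in weights.items()]
+    heapq.heapify(heap)
+    codes = {sym: "" for sym in weights}
+    while len(heap) > 1:
+        # pair the two lowest-weighted nodes; the heavier becomes the left child
+        w_lo, _, lo = heapq.heappop(heap)
+        w_hi, _, hi = heapq.heappop(heap)
+        for sym in hi:                        # left branch is marked 0
+            codes[sym] = "0" + codes[sym]
+        for sym in lo:                        # right branch is marked 1
+            codes[sym] = "1" + codes[sym]
+        heapq.heappush(heap, (w_lo + w_hi, next(tiebreak), lo + hi))
+    return codes
+
+print(huffman_codes({"A": 45, "T": 30, "C": 15, "G": 10}))
+# -> {'A': '1', 'T': '00', 'C': '010', 'G': '011'}
+\end{verbatim}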
 %todo tree building explanation
-A often mentioned difference between Shannon-Fano and Huffman coding, is that first is working top down while the latter is working bottom up. This means the tree starts with the lowest weights. The nodes that are not leafs have no value ascribed to them. They only need their weight, which is defined by the weights of their individual child nodes \cite{moffat20, alok17}.\\
+An often-mentioned difference between Shannon-Fano and Huffman coding is that the former works top down while the latter works bottom up, meaning Shannon-Fano starts with the highest probabilities while Huffman starts with the lowest \cite{moffat20, alok17}.\\
 
Given \texttt{K(W,L)} as a node structure, with the weight or probability as \texttt{$W_{i}$} and the codeword length as \texttt{$L_{i}$} for the node \texttt{$K_{i}$}, \texttt{$L_{av}$} is the average length of \texttt{L} in a finite chain of symbols with a distribution that is mapped onto \texttt{W} \cite{huf52}.
 \begin{equation}\label{eq:huf}
@@ -257,7 +258,7 @@ The average length for any symbol encoded in \acs{ASCII} is eight, while only us
 \end{footnotesize}
 \rmfamily
 
-The exact input text is not relevant, since only the resulting probabilities are needed. To make this example more illustrative, possible occurences are listed in the most right column of \ref{t:huff-pre}. The probability for each symbol is calculated by dividing the message length by the times the symbol occured. This and the resulting probabilities on a scale between 0.0 and 1.0, for this example are shown in \ref{t:huff-pre} \cite{huf52}.\\ 
+The exact input text is not relevant, since only the resulting probabilities are needed. To make this example more illustrative, possible occurrences are listed in the rightmost column of \ref{t:huff-pre}. The probability for each symbol is calculated by dividing the number of times the symbol occurred by the message length. This and the resulting probabilities on a scale between 0.0 and 1.0 are shown for this example in \ref{t:huff-pre} \cite{huf52}.\\ 
Creating a tree will be done bottom up. In the first step, for each symbol from the alphabet, a node without any connection is formed.\\
 
 \texttt{<A>, <T>, <C>, <G>}\\
@@ -372,7 +373,7 @@ With this simple rules, the alphabet can be compressed too. Instead of storing c
BGZF extends this by creating a series of blocks. Each block cannot exceed a limit of 64 kilobytes. Each block contains a standard gzip file header, followed by compressed data.\\
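+The block idea itself can be sketched as follows; this is only an illustration of splitting data into independently gzip-compressed chunks of at most 64 kilobytes and omits the extra header field that real BGZF adds to every gzip member:
+\begin{verbatim}
+import gzip
+
+BLOCK_SIZE = 64 * 1024   # upper limit per block
+
+def to_blocks(data):
+    # one independently decompressible gzip member per chunk
+    for offset in range(0, len(data), BLOCK_SIZE):
+        yield gzip.compress(data[offset:offset + BLOCK_SIZE])
+
+blocks = list(to_blocks(b"ACGT" * 50_000))    # 200 kB of toy sequence data
+print(len(blocks))                            # 4 blocks
+\end{verbatim}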
 
 \subsubsection{CRAM}
-The improvement of \acs{BAM} \cite{cram-origin} called \acs{CRAM}, also features a block structure \cite{bam}. The whole file can be seperated into four sections, stored in ascending order: File definition, a CRAM Header Container, multiple Data Container and a final CRAM EOF Container.\\
+The improvement of \acs{BAM} \cite{cram-origin}, called \acs{CRAM}, also features a block structure \cite{bam}. The whole file can be separated into four sections, stored in ascending order: a File definition, a CRAM Header Container, multiple Data Containers and a final CRAM EOF Container.\\
The complete structure is displayed in \ref{k4:cram-struct}. The following paragraph will give a brief description of the high-level view of a \acs{CRAM} file, illustrated as the uppermost bar, followed by a closer look at the Data Container, whose components are listed in the bar at the center of \ref{k4:cram-struct}. The most in-depth explanation will be given for the bottom bar, which shows the structure of so-called slices.\\
 
 \begin{figure}[H]

+ 1 - 1
latex/tex/kapitel/k5_feasability.tex

@@ -136,7 +136,7 @@ To leave the testing environment in a consistent state, not project specific pro
 Due to following circumstances, a current Linux distribution was chosen as a suitable operating system:
 \begin{itemize}
   \item{factors that interfere with a consistent efficiency value should be avoided}
-  \item{packages, support and user experience should be present to an reasonable ammount}
+  \item{packages, support and user experience should be present to a reasonable amount}
 \end{itemize}
Some background processes will run while the compression analysis is done. This is owed to the demand of an increasingly complex operating system to execute complex programs. Considering that different tools will be executed in this environment, minimizing the background processes would require building a custom operating system or configuring an existing one to fit this specific use case. The time limitation of this work rules out these alternatives. 
 %By comparing the values of explained factors, a sweet spot can be determined:

+ 4 - 4
latex/tex/kapitel/k6_results.tex

@@ -1,5 +1,5 @@
 \chapter{Results and Discussion}
-The tables \ref{a6:compr-size} and \ref{a6:compr-time} contain raw measurement values for the two goals, described in \ref{k5:goals}. The table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds. \ref{a6:compr-size} contains file sizes in bytes. In these tables, as well as in the other ones associated with tests in the scope of this work, the a name scheme is used, to improve readability. The filenames were replaced by \texttt{File} followed by two numbers seperated by a point. For the first test set, the number prefix \texttt{1.} was used, the second set is marked with a \texttt{2.}. For example, the fourth file of each test, in tables are named like this \texttt{File 1.4} and \texttt{File 2.4}. The name of the associated source file for the first set is:
+The tables \ref{a6:compr-size} and \ref{a6:compr-time} contain raw measurement values for the two goals described in \ref{k5:goals}. The table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds. \ref{a6:compr-size} contains file sizes in bytes. In these tables, as well as in the other ones associated with tests in the scope of this work, a naming scheme is used to improve readability. The filenames were replaced by \texttt{File} followed by two numbers separated by a point. For the first test set, the number prefix \texttt{1.} was used; the second set is marked with a \texttt{2.}. For example, the fourth file of each test set is named \texttt{File 1.4} or \texttt{File 2.4} in the tables. The name of the associated source file for the first set is:
 
 \texttt{Homo\_sapiens.GRCh38.dna.chromosome.\textbf{4}.fa}
 
@@ -222,7 +222,7 @@ S.V. Petoukhov described his prepublished findings, which are under ongoing rese
 \texttt{\% C $\approx$ $\sum$\%CN $\approx$ $\sum$\%NC $\approx$ $\sum$\%CNN $\approx$ $\sum$\%NCN $\approx$ $\sum$\%NNC $\approx$ $\sum$\%CNNN $\approx$ $\sum$\%NCNN $\approx$ $\sum$\%NNCN $\approx$ $\sum$\%NNNC ...}
 
The number of elements in each sum grows with an increasing n in the n-plet; to be precise, $4^n$ describes the growth of the combinations.
-For example, with the probability of \texttt{C}, the probabilities for sets (n-plets) of \texttt{N} as a placeholder for any nucleotide of \texttt{A, C, G or T}, and including at least one \texttt{C} might be determinable without counting them \cite{pet21}. \texttt{$\sum$\%CN} means the probability \texttt{\%C} determines a estimation of all occurences \texttt{\%CC + \%CA + \%CG + \%CT}.\\
+For example, with the probability of \texttt{C}, the probabilities for sets (n-plets) of \texttt{N}, as a placeholder for any nucleotide of \texttt{A, C, G or T}, including at least one \texttt{C}, might be determinable without counting them \cite{pet21}. \texttt{$\sum$\%CN} means the probability \texttt{\%C} determines an estimation of all occurrences \texttt{\%CC + \%CA + \%CG + \%CT}.\\
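+This relation can be checked on any sequence with a short counting sketch; the sequence below is random toy data, so the close agreement only illustrates the kind of comparison that would have to be made on real chromosomes:
+\begin{verbatim}
+import random
+from collections import Counter
+
+random.seed(1)
+seq = "".join(random.choices("ACGT", weights=[3, 2, 2, 3], k=100_000))
+
+mono = Counter(seq)
+di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
+
+p_c = mono["C"] / len(seq)
+p_cn = sum(v for k, v in di.items() if k.startswith("C")) / (len(seq) - 1)
+print(round(p_c, 4), round(p_cn, 4))   # the two values agree closely
+\end{verbatim}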
Further, he described that there might be a similarity between nucleotides. 
 
 \begin{figure}[H]
@@ -248,7 +248,7 @@ This approach throws a few questions that need to be answered in order to plan a
 \end{itemize}
 
 % first bulletpoint
-The question for how many probabilities are needed, needs to be answered, to start working on any kind of implementation. This question will only get answered by theoretical prove. It could happen in form of a mathematical equation, which proves that counting all occurences of one nucleotide reveals can be used to determin all probabilities. 
+The question of how many probabilities are needed has to be answered before starting to work on any kind of implementation. This question can only be answered by theoretical proof. It could happen in the form of a mathematical equation which proves that counting all occurrences of one nucleotide can be used to determine all probabilities. 
 %Since this task is time and resource consuming and there is more to discuss, finding a answer will be postponed to another work. 
 %One should keep in mind that this is only one of many approaches. Any prove of other approaches which reduces the probability determination, can be taken in instead. 
 
@@ -292,7 +292,7 @@ $\%G = \%A$\\
 
 The mapping, mentioned in the last point would be identical to the first and last declaration.\\
 % actual programming approach
-Working with probabilities, in fractions as well as in percent, would probabily mean rounding values. To increase the accuracity, the actual value resulting form the counted symbol could be used. For this to work, the ammount of overall symbols had to be determined, for \texttt{\%A+\%G} to be calculateable. Since counting all symbols while counting the one nucleotide, would have an impact on the runtime, the value could be calculated. With s beeing the size of the parsed sequence in bytes and c beeing the bytes per character $\frac{s}{c}$ would result in the amount of symbols in the sequence.\\
+Working with probabilities, in fractions as well as in percent, would probably mean rounding values. To increase the accuracy, the actual value resulting from the counted symbol could be used. For this to work, the amount of overall symbols has to be determined for \texttt{\%A+\%G} to be calculable. Since counting all symbols while counting the one nucleotide would have an impact on the runtime, the value could be calculated instead. With $s$ being the size of the parsed sequence in bytes and $c$ being the bytes per character, $\frac{s}{c}$ would result in the amount of symbols in the sequence (a short sketch of this calculation follows below).\\
 They obviously differ in several categories: runtime, since each is parsing more of the sequence, and the grade of heuristic which is taken into account.
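+A minimal sketch of the $\frac{s}{c}$ calculation, assuming a plain text file with one byte per character (header lines and line breaks are ignored in this simplification):
+\begin{verbatim}
+import os
+
+def symbol_count(path, bytes_per_char=1):
+    # s / c: file size in bytes divided by bytes per character
+    return os.path.getsize(path) // bytes_per_char
+
+# print(symbol_count("Homo_sapiens.GRCh38.dna.chromosome.4.fa"))
+\end{verbatim}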
 
 % more realistic view on parsing todo need cites

+ 101 - 100
latex/tex/literatur.bib

@@ -1,3 +1,102 @@
+@TechReport{rfc-udp,
+  author       = {J. Postel},
+  date         = {1980-08-28},
+  institution  = {RFC Editor},
+  title        = {User Datagram Protocol},
+  doi          = {10.17487/RFC0768},
+  number       = {768},
+  pagetotal    = {3},
+  url          = {https://www.rfc-editor.org/info/rfc768},
+  howpublished = {RFC 768},
+  month        = {aug},
+  publisher    = {RFC Editor},
+  series       = {Request for Comments},
+  year         = {1980},
+}
+
+@TechReport{rfcgzip,
+  author       = {L. Peter Deutsch and Jean-Loup Gailly and Mark Adler and Glenn Randers-Pehrson},
+  date         = {1996-05},
+  title        = {GZIP file format specification version 4.3},
+  number       = {1952},
+  type         = {RFC},
+  howpublished = {Internet Requests for Comments},
+  issn         = {2070-1721},
+  month        = {May},
+  publisher    = {RFC},
+  year         = {1996},
+}
+
+@TechReport{rfcansi,
+  author       = {K. Simonsen},
+  title        = {Character Mnemonics and Character Sets},
+  number       = {1345},
+  type         = {RFC},
+  howpublished = {Internet Requests for Comments},
+  issn         = {2070-1721},
+  month        = {June},
+  year         = {1992},
+}
+
+@TechReport{isompeg,
+  author      = {ISO/IEC 23092-1:2020/CD},
+  date        = {2020-10},
+  institution = {International Organization for Standardization, Geneva, Switzerland.},
+  title       = {Information technology — Genomic information representation — Part 1: Transport and storage of genomic information — Amendment 1: Support for Part 6},
+  type        = {Standard},
+  url         = {https://www.iso.org/standard/23092.html},
+  year        = {2019},
+}
+
+@TechReport{iso-ascii,
+  author      = {ISO/IEC JTC 1/SC 2 Coded character sets},
+  date        = {1998-04},
+  institution = {International Organization for Standardization},
+  title       = {Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1},
+  type        = {Standard},
+  address     = {Geneva, CH},
+  key         = {ISO8859-1:1998},
+  volume      = {1998},
+  year        = {1998},
+}
+
+@TechReport{isoutf,
+  author = {ISO},
+  title  = {ISO/IEC 10646:2020 UTF},
+}
+
+@Article{ju_21,
+  author       = {Philomin Juliana and Ravi Prakash Singh and Jesse Poland and Sandesh Shrestha and Julio Huerta-Espino and Velu Govindan and Suchismita Mondal and Leonardo Abdiel Crespo-Herrera and Uttam Kumar and Arun Kumar Joshi and Thomas Payne and Pradeep Kumar Bhati and Vipin Tomar and Franjel Consolacion and Jaime Amador Campos Serna},
+  date         = {2021-03},
+  journaltitle = {Scientific Reports},
+  title        = {Elucidating the genetics of grain yield and stress-resilience in bread wheat using a large-scale genome-wide association mapping study with 55,568 lines},
+  doi          = {10.1038/s41598-021-84308-4},
+  volume       = {11},
+  publisher    = {Springer Science and Business Media {LLC}},
+}
+
+@Article{mo_83,
+  author       = {Arno G. Motulsky},
+  date         = {1983-01},
+  journaltitle = {Science},
+  title        = {Impact of Genetic Manipulation on Society and Medicine},
+  doi          = {10.1126/science.6336852},
+  pages        = {135--140},
+  volume       = {219},
+  publisher    = {American Association for the Advancement of Science ({AAAS})},
+}
+
+@Article{wang_22,
+  author       = {Si-Wei Wang and Chao Gao and Yi-Min Zheng and Li Yi and Jia-Cheng Lu and Xiao-Yong Huang and Jia-Bin Cai and Peng-Fei Zhang and Yue-Hong Cui and Ai-Wu Ke},
+  date         = {2022-02},
+  journaltitle = {Molecular Cancer},
+  title        = {Current applications and future perspective of {CRISPR}/Cas9 gene editing in cancer},
+  doi          = {10.1186/s12943-022-01518-8},
+  number       = {1},
+  volume       = {21},
+  publisher    = {Springer Science and Business Media {LLC}},
+}
+
 @Article{alok17,
   author       = {Anas Al-Okaily and Badar Almarri and Sultan Al Yami and Chun-Hsi Huang},
   date         = {2017-04-01},
@@ -134,19 +233,6 @@
   publisher    = {Foundation of Computer Science},
 }
 
-@TechReport{rfcgzip,
-  author       = {L. Peter Deutsch and Jean-Loup Gailly and Mark Adler and L. Peter Deutsch and Glenn Randers-Pehrson},
-  date         = {1996-05},
-  title        = {GZIP file format specification version 4.3},
-  number       = {1952},
-  type         = {RFC},
-  howpublished = {Internet Requests for Comments},
-  issn         = {2070-1721},
-  month        = {May},
-  publisher    = {RFC},
-  year         = {1996},
-}
-
 @Article{huf52,
   author      = {Huffman, David A.},
   title       = {A Method for the Construction of Minimum-Redundancy Codes},
@@ -203,6 +289,7 @@
 }
 
 @Article{ieee-float,
   title   = {IEEE Standard for Floating-Point Arithmetic},
   doi     = {10.1109/IEEESTD.2019.8766229},
   pages   = {1-84},
@@ -244,18 +331,6 @@
 }
 
 
-
-@TechReport{rfcansi,
-  author       = {K. Simonsen and},
-  title        = {Character Mnemonics and Character Sets},
-  number       = {1345},
-  type         = {RFC},
-  howpublished = {Internet Requests for Comments},
-  issn         = {2070-1721},
-  month        = {June},
-  year         = {1992},
-}
-
 @Article{witten87,
   author       = {Ian H. Witten and Radford M. Neal and John G. Cleary},
   date         = {1987-06},
@@ -310,17 +385,7 @@
   volume       = {15},
   publisher    = {Public Library of Science ({PLoS})},
 }
-@TechReport{isompeg,
-  author      = {{ISO Central Secretary}},
-  date        = {2020-10},
-  institution = {International Organization for Standardization},
-  title       = {MPGE-G},
-  language    = {en},
-  number      = {ISO/IEC 23092-1:2020},
-  type        = {Standard},
-  url         = {https://www.iso.org/standard/23092.html},
-  year        = {2019},
-}
+
 
 @Article{mpeg,
   author    = {Claudio Albert and Tom Paridaens and Jan Voges and Daniel Naro and Junaid J. Ahmad and Massimo Ravasi and Daniele Renzi and Giorgio Zoia and Paolo Ribeca and Idoia Ochoa and Marco Mattavelli and Jaime Delgado and Mikel Hernaez},
@@ -359,17 +424,6 @@
   publisher = {{MDPI} {AG}},
 }
 
-@TechReport{iso-ascii,
-  author      = {ISO/IEC JTC 1/SC 2 Coded character sets},
-  date        = {1998-04},
-  institution = {International Organization for Standardization},
-  title       = {Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1},
-  type        = {Standard},
-  address     = {Geneva, CH},
-  key         = {ISO8859-1:1998},
-  volume      = {1998},
-  year        = {1998},
-}
 
 @Book{dict,
   author    = {McIntosh, Colin},
@@ -380,26 +434,6 @@
   publisher = {Cambridge University Press},
 }
 
-@TechReport{rfc-udp,
-  author       = {J. Postel},
-  date         = {1980-08-28},
-  institution  = {RFC Editor},
-  title        = {User Datagram Protocol},
-  doi          = {10.17487/RFC0768},
-  number       = {768},
-  pagetotal    = {3},
-  url          = {https://www.rfc-editor.org/info/rfc768},
-  howpublished = {RFC 768},
-  month        = aug,
-  publisher    = {RFC Editor},
-  series       = {Request for Comments},
-  year         = {1980},
-}
-
-@TechReport{isoutf,
-  author = {ISO},
-  title  = {ISO/IEC 10646:2020 UTF},
-}
 
 @Article{lz77,
   author  = {Ziv, J. and Lempel, A.},
@@ -412,39 +446,6 @@
   year    = {1977},
 }
 
-@Article{wang_22,
-  author       = {Si-Wei Wang and Chao Gao and Yi-Min Zheng and Li Yi and Jia-Cheng Lu and Xiao-Yong Huang and Jia-Bin Cai and Peng-Fei Zhang and Yue-Hong Cui and Ai-Wu Ke},
-  date         = {2022-02},
-  journaltitle = {Molecular Cancer},
-  title        = {Current applications and future perspective of {CRISPR}/Cas9 gene editing in cancer},
-  doi          = {10.1186/s12943-022-01518-8},
-  number       = {1},
-  volume       = {21},
-  publisher    = {Springer Science and Business Media {LLC}},
-}
-
-@Article{ju_21,
-  author       = {Philomin Juliana and Ravi Prakash Singh and Jesse Poland and Sandesh Shrestha and Julio Huerta-Espino and Velu Govindan and Suchismita Mondal and Leonardo Abdiel Crespo-Herrera and Uttam Kumar and Arun Kumar Joshi and Thomas Payne and Pradeep Kumar Bhati and Vipin Tomar and Franjel Consolacion and Jaime Amador Campos Serna},
-  date         = {2021-03},
-  journaltitle = {Scientific Reports},
-  title        = {Elucidating the genetics of grain yield and stress-resilience in bread wheat using a large-scale genome-wide association mapping study with 55,568 lines},
-  doi          = {10.1038/s41598-021-84308-4},
-  number       = {1},
-  volume       = {11},
-  publisher    = {Springer Science and Business Media {LLC}},
-}
-
-@Article{mo_83,
-  author       = {Arno G. Motulsky},
-  date         = {1983-01},
-  journaltitle = {Science},
-  title        = {Impact of Genetic Manipulation on Society and Medicine},
-  doi          = {10.1126/science.6336852},
-  number       = {4581},
-  pages        = {135--140},
-  volume       = {219},
-  publisher    = {American Association for the Advancement of Science ({AAAS})},
-}
 
 @Online{bam,
   title   = {Sequence Alignment/Map Format Specification},

+ 1 - 1
latex/tex/thesis.tex

@@ -182,7 +182,7 @@
 \cleardoublepage
 \begin{flushleft}
 \let\clearpage\relax % Fix für leere Seiten (issue #25)
-\printbibliography
+\printbibliography[nottype=online]
\printbibliography[type=online, sorting=nud, title={Online Sources}]
 \end{flushleft}
 \endgroup