
struggling with results

u 3 years ago
parent
commit
b85e1794e9

+ 200 - 0
latex/tex/&!make

@@ -0,0 +1,200 @@
+\chapter{Results and Discussion}
+Tables \ref{t:effectivity} and \ref{t:efficiency} contain the raw measurement values for the two goals described in \ref{k5:goals}. The first table shows how long each compression procedure took, in milliseconds; the second one contains the file sizes in bytes. Each row contains the measurements for one of the files, which follow this naming scheme:
+
+\texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa}
+
+To improve readability, the filenames in all tables were replaced by \texttt{File} followed by a number. To determine which file was compressed, simply substitute this number for the placeholder x in the naming scheme above.\\
+
+\section{Interpretation of Results}
+The units milliseconds and bytes preserve a high precision, but they are hard to read and compare by eye alone. Therefore the data was transformed for presentation. Sizes in \ref{t:sizepercent} are displayed as percentages relative to the respective source file. This means that the compression with \acs{GeCo} on:
+
+Homo\_sapiens.GRCh38.dna.chromosome.11.fa 
+
+resulted in a compressed file that is only 17.6\% as big as the original.
+Runtimes in \ref{t:time} were converted from milliseconds into seconds.
+In addition, a line was added to the bottom of each table, showing the average percentage or runtime, rounded to two decimal places, for each compression method.\\
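+The conversion from the raw values to the displayed ones is plain arithmetic; the following small C sketch (hypothetical, not part of the actual measurement scripts, with assumed example values) illustrates it:
+
+\begin{lstlisting}[language=C]
+#include <stdio.h>
+
+int main(void)
+{
+    /* assumed example values, not taken from the measurement logs */
+    long original_bytes   = 135086622; /* size of the source FASTA file   */
+    long compressed_bytes = 23775245;  /* size of the compressed output   */
+    long runtime_ms       = 23500;     /* measured compression time in ms */
+
+    double percent = 100.0 * (double)compressed_bytes / (double)original_bytes;
+    double seconds = runtime_ms / 1000.0;
+
+    printf("size: %.2f %%, runtime: %.2f s\n", percent, seconds);
+    return 0;
+}
+\end{lstlisting}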
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectivity]                       % caption for the list of tables
+        {File sizes in different compression formats in \textbf{percent}} % caption for the table itself
+        \label{t:sizepercent}\\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
+    \midrule
+			File 1& 18.32& 24.51& 22.03\\
+			File 2& 20.15& 26.36& 23.7\\
+			File 3& 19.96& 26.14& 23.69\\
+			File 4& 20.1& 26.26& 23.74\\
+			File 5& 17.8& 22.76& 20.27\\
+			File 6& 17.16& 22.31& 20.11\\
+			File 7& 16.21& 21.69& 19.76\\
+			File 8& 17.43& 23.48& 21.66\\
+			File 9& 18.76& 25.16& 23.84\\
+			File 10& 20.0& 25.31& 23.63\\
+			File 11& 17.6& 24.53& 23.91\\
+			File 12& 20.28& 26.56& 23.57\\
+			File 13& 19.96& 25.6& 23.67\\
+			File 14& 16.64& 22.06& 20.44\\
+			File 15& 79.58& 103.72& 92.34\\
+			File 16& 19.47& 25.52& 22.6\\
+			File 17& 19.2& 25.25& 22.57\\
+			File 18& 19.16& 25.04& 22.2\\
+			File 19& 18.32& 24.4& 22.12\\
+			File 20& 18.58& 24.14& 21.56\\
+			File 21& 16.22& 22.17& 19.96\\
+      &&&\\
+			\textbf{Average}& 21.47& 28.24& 25.59\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+Overall, Samtools \acs{BAM} resulted in an average size reduction of 71.76\%; the \acs{CRAM} method improved this by roughly 2.5 percentage points. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4 percentage points comes at a comparatively high cost in time.\\
+
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Efficiency]                        % caption for the list of tables
+        {Compression duration in seconds} % caption for the table itself
+        \label{t:time}\\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
+    \midrule
+			File 1 & 23.5& 3.786& 16.926\\
+			File 2 & 24.65& 3.784& 17.043\\
+			File 3 & 20.16& 3.123& 13.999\\
+			File 4 & 19.408& 3.011& 13.445\\
+			File 5 & 18.387& 2.862& 12.802\\
+			File 6 & 17.364& 2.685& 12.015\\
+			File 7 & 15.999& 2.503& 11.198\\
+			File 8 & 14.828& 2.286& 10.244\\
+      File 9 & 12.304& 2.078& 9.21\\
+			File 10 & 13.493& 2.127& 9.461\\
+			File 11 & 13.629& 2.132& 9.508\\
+			File 12 & 13.493& 2.115& 9.456\\
+			File 13 & 99.902& 1.695& 7.533\\
+			File 14 & 92.475& 1.592& 7.011\\
+			File 15 & 85.255& 1.507& 6.598\\
+			File 16 & 82.765& 1.39& 6.089\\
+			File 17 & 82.081& 1.306& 5.791\\
+			File 18 & 79.842& 1.277& 5.603\\
+			File 19 & 58.605& 0.96& 4.106\\
+			File 20 & 64.588& 1.026& 4.507\\
+			File 21 & 41.198& 0.721& 3.096\\
+      &&&\\
+      \textbf{Average}& 42.57& 2.09& 9.32\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+As \ref{t:time} shows, the average compression duration for \acs{GeCo} is 42.57\,s. That is a little over 33\,s more than the average runtime of Samtools for compressing into the \acs{CRAM} format; Samtools needs only about 22\% of \acs{GeCo}'s time.\\
+Since \acs{CRAM} requires a file in \acs{BAM} format, the third column is calculated by adding the time needed to compress into \acs{BAM} to the time needed to compress further into \acs{CRAM}. For file one, for example, the displayed 16.926\,s include the 3.786\,s of the \acs{BAM} step.
+While the \acs{SAM} format is required for compressing a \acs{FASTA} file into \acs{BAM} and further into \acs{CRAM}, it features no compression itself. However, the conversion from \acs{FASTA} to \acs{SAM} can already result in a decrease in size. At first this might be counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. It can be explained by comparing how the sequences are stored: a \acs{FASTA} sequence section can be spread over multiple lines, whereas \acs{SAM} files store a sequence in a single line, so the conversion can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
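+This explanation could be checked quickly by counting the line breaks inside the sequence sections, since every removed line break saves one byte. A hypothetical sketch of such a check (not part of the actual test setup):
+
+\begin{lstlisting}[language=C]
+#include <stdio.h>
+
+/* Estimate how many bytes a FASTA file loses when its wrapped sequence
+ * lines are joined into a single line, as SAM does: roughly one byte per
+ * removed newline character. Header lines (starting with '>') are skipped. */
+long count_sequence_newlines(FILE *fa)
+{
+    long newlines = 0;
+    int ch, at_line_start = 1, in_header = 0;
+
+    while ((ch = fgetc(fa)) != EOF) {
+        if (at_line_start)
+            in_header = (ch == '>');
+        if (ch == '\n' && !in_header)
+            newlines++;
+        at_line_start = (ch == '\n');
+    }
+    return newlines;
+}
+\end{lstlisting}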
+% (hi)storytime
+Before interpreting this data further, a quick look at the development history: the analyzed \acs{GeCo} version dates from 2016, while Samtools has been under continuous development since 2015, with over 70 people contributing to this day.\\
+% todo interpret bit files and compare
+
+% big tables
+Reviewing \ref{t:recal-time}, one will notice that \acs{GeCo} reached a runtime of over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an \texttt{m} indicates how many minutes each run took.
+
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectivity for larger files]      % caption for the list of tables
+        {File sizes in different compression formats in \textbf{percent}} % caption for the table itself
+        \label{t:recal-size}\\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
+    \midrule
+			%geco bam and cram in percent
+			File 1& 1.00& 6.28& 5.38\\
+			File 2& 0.98& 6.41& 5.52\\
+			File 3& 1.21& 8.09& 7.17\\
+			File 4& 1.20& 7.70& 6.85\\
+			File 5& 1.08& 7.58& 6.72\\
+			File 6& 1.09& 7.85& 6.93\\
+			File 7& 0.96& 5.83& 4.63\\
+      &&&\\
+			\textbf{Average}& 1.07& 7.11& 6.17\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Efficiency for larger files]       % caption for the list of tables
+        {Compression duration; a leading number followed by \texttt{m} denotes minutes, the remainder seconds} % caption for the table itself
+        \label{t:recal-time}\\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
+    \midrule
+			%compress time for geco, bam and cram in seconds
+			File 1 & 1m58.427& 16.248& 23.016\\
+			File 2 & 1m57.905& 15.770& 22.892\\
+			File 3 & 1m09.725& 07.732& 12.858\\
+			File 4 & 1m13.694& 08.291& 13.649\\
+			File 5 & 1m51.001& 14.754& 23.713\\
+			File 6 & 1m51.315& 15.142& 24.358\\
+			File 7 & 2m02.065& 16.379& 23.484\\
+      &&&\\
+			\textbf{Average}	 & 1m43.447& 13.474& 20.567\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+In both tables, \ref{t:recal-time} and \ref{t:recal-size}, the pattern identified above can be observed. Looking at the compression ratios in \ref{t:recal-size}, a maximum size reduction of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven was the one with the greatest size ($\sim$1.3 gigabyte), closely followed by files one and two ($\sim$1.2 gigabyte).
+% todo greater filesize means better compression
+
+\section{View on Possible Improvements}
+S.\,V. Petoukhov described his findings about the distribution of nucleotides \cite{pet21}. The probability of one nucleotide, in a sequence of sufficient length, reveals information about its direct neighbours. For example, from the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N} that include \texttt{C} can be determined without counting them \cite{pet21}.\\
+%\%C ≈ Σ\%CN ≈ Σ\%NС ≈ Σ\%CNN ≈ Σ\%NCN ≈ Σ\%NNC ≈ Σ\%CNNN ≈ Σ\%NCNN ≈ Σ\%NNCN ≈ Σ\%NNNC\\
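+Written out for \texttt{C}, with \texttt{N} standing for any of the four nucleotides, this relation from \cite{pet21} reads approximately:
+\[
+\%C \;\approx\; \sum\%CN \;\approx\; \sum\%NC \;\approx\; \sum\%CNN \;\approx\; \sum\%NCN \;\approx\; \sum\%NNC \;\approx\; \dots
+\]
+where, for example, $\sum\%CN$ denotes the summed relative frequencies of all doublets starting with \texttt{C}, and the pattern continues analogously for the longer n-plets.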
+
+% begin optimization 
+Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good start to close the large gap in compression duration. Combined with a tool that is developed to today's standards, even greater improvements could possibly be achieved.\\
+% simple theoretical approach
+What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probability of each symbol in the alphabet. The simplest way to do this is to parse the whole sequence from start to end, incrementing a counter for each nucleotide encountered.
+Taking the findings of Petoukhov into consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine these probabilities. A possible approach would be to use the probability of one nucleotide to determine the probabilities of the other nucleotides by calculation, rather than by counting each one.
+This approach raises a few questions that need to be answered in order to plan an implementation:
+\begin{itemize}
+	\item How many probabilities are needed to calculate the others?
+	\item Is there space for improvement in the parsing/counting process?
+	%\item Is there space for visible improvements, when only counting one nucleotide?
+	\item How can the variation between probabilities be determined?
+\end{itemize}
+
+The second point must be asked because the improvement of counting only one nucleotide in comparison to counting three would be too little to be called relevant.
+%todo compare time needed: to store a variable <-> parsing the sequence
+To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately, it only covers loops and conditions as a whole. Therefore a more detailed view of the individual operations must be created:
+Considering a single-threaded loop whose purpose is to count every nucleotide in a sequence, the counting process can be split into several operations, described by the following pseudocode.
+
+%todo use GeCo arith function with bigO
+while (sequence not end)\\
+do\\
+\-\hspace{0.5cm} next\_nucleotide = read\_next\_nucleotide(sequence)\\
+\-\hspace{0.5cm} for (element in alphabet\_probabilities)\\
+\-\hspace{0.5cm} do\\
+\-\hspace{1cm} if (element equals next\_nucleotide)\\
+\-\hspace{1.5cm} element = element + 1\\
+\-\hspace{1cm} fi\\
+\-\hspace{0.5cm} done\\
+done\\
+
+This loop will iterate over a whole sequence, counting each nucleotide. In line four, an inner loop can be found which iterates over the alphabet to determine which counter should be increased. Considering the findings described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. The Big-O notation for this code, for any sequence of length n, would thus decrease from O($n^2$) to O($n\cdot 1$), or simply O($n$) \cite{big-o}, which is clearly an improvement in complexity and therefore also in runtime.\\
+The runtime of the calculations for the other symbols' probabilities must be considered as well and compared against the nested loop, to be certain that the overall runtime was improved.
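+To make the difference concrete, a minimal C sketch follows. It is not taken from \acs{GeCo}'s code base and the function names are placeholders: the first variant mirrors the nested loop from the pseudocode, the second performs exactly one comparison per nucleotide and counts a single symbol only, leaving the remaining probabilities to be derived by calculation.
+
+\begin{lstlisting}[language=C]
+#include <stddef.h>
+
+static const char ALPHABET[4] = {'A', 'C', 'G', 'T'};
+
+/* Variant 1: compare every read nucleotide against the whole alphabet. */
+static void count_with_inner_loop(const char *seq, size_t n, long counts[4])
+{
+    for (size_t i = 0; i < n; i++)       /* outer loop over the sequence */
+        for (int j = 0; j < 4; j++)      /* inner loop over the alphabet */
+            if (seq[i] == ALPHABET[j])
+                counts[j]++;
+}
+
+/* Variant 2: count only one nucleotide; the remaining probabilities would
+ * have to be derived by calculation, as discussed above. */
+static long count_single_symbol(const char *seq, size_t n, char symbol)
+{
+    long count = 0;
+    for (size_t i = 0; i < n; i++)
+        if (seq[i] == symbol)            /* exactly one comparison per symbol */
+            count++;
+    return count;
+}
+\end{lstlisting}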
+% more realistic view on parsing todo need cites
+In practice, smarter ways are obviously used to determine probabilities, such as splitting the sequence into multiple parts and parsing each subsequence asynchronously. The results can either be summed up into global probabilities or be used individually on each associated subsequence. Either way, the presented improvement approach should be applicable to both parsing methods.\\
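+If multithreading is available, for example through OpenMP, such a counting loop can be split over several threads and the partial counts merged afterwards. A sketch under that assumption, again not taken from \acs{GeCo}:
+
+\begin{lstlisting}[language=C]
+#include <stddef.h>
+
+/* Chunked, parallel counting: each thread counts part of the sequence and
+ * the reduction clause sums the per-thread counters, which corresponds to
+ * the "summed up into global probabilities" case described above. */
+void count_nucleotides_parallel(const char *seq, size_t n, long counts[4])
+{
+    long a = 0, c = 0, g = 0, t = 0;
+
+    #pragma omp parallel for reduction(+:a, c, g, t)
+    for (long i = 0; i < (long)n; i++) {
+        switch (seq[i]) {
+        case 'A': a++; break;
+        case 'C': c++; break;
+        case 'G': g++; break;
+        case 'T': t++; break;
+        default:  break;   /* skip N, newlines and header characters */
+        }
+    }
+
+    counts[0] = a; counts[1] = c; counts[2] = g; counts[3] = t;
+}
+\end{lstlisting}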
+
+
+% how is data interpreted
+% why did the tools result in this, what can we learn
+% improvements
+% - goal: less time to compress
+% 	- approach: optimize probability determination
+% 	-> how?

+ 6 - 5
latex/tex/kapitel/k4_algorithms.tex

@@ -116,7 +116,6 @@ This works goal was to define an algorithm that requires no blocking. Meaning th
 \mycomment{
 The coding algorithm works with probabilities for symbols in an alphabet. From any text, the alphabet is defined by the set of individual symbols, used in the text. The probability for a single symbol is defined as its distribution. In a \texttt{n} symbol long text, with the first letter in the alphabet occuring \texttt{o} times, its probability is $\frac{o}{n}$.\\
 
-% todo rethink this equation stuff (and compare it to the original work <-compl.)
 \begin{itemize}
   \item $p_{i}$ represents the probability of the symbol, at position \texttt{i} in the alphabet.
   \item L will contain the bit sequence with which the text is encoded. This sequence can be seen as a single value. A fraction between 0 and 1 which gets more precise with every processed symbol.
@@ -163,7 +162,7 @@ The described coding is only feasible on machines with infinite percission. As s
 \label{k4:huff}
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
-D. A. Huffmans work focused on finding a method to encode messages with a minimum of redundance. He referenced a coding procedure developed by Shannon and Fano and named after its developers, which worked similar. The Shannon-Fano coding is not used today, due to the superiority in both efficiency and effectivity, in comparison to Huffman. % todo any source to last sentence. Rethink the use of finite in the following text
+D.\,A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a similar coding procedure developed by Shannon and Fano and named after its developers. Shannon-Fano coding is not used today, due to Huffman coding's superiority in both efficiency and effectivity. % todo any source to last sentence. 
 Even though his work was released in 1952, the method he developed is in use  today. Not only tools for genome compression but in compression tools with a more general ussage \cite{rfcgzip}.\\ 
 Compression with the Huffman algorithm also provides a solution to the problem, described at the beginning of \ref{k4:arith}, on waste through unused bits, for certain alphabet lengths. Huffman did not save more than one symbol in one bit, like it is done in arithmetic coding, but he decreased the number of bits used per symbol in a message. This is possible by setting individual bit lengths for symbols, used in the text that should get compressed \cite{huf52}. 
 As with other codings, a set of symbols must be defined. For any text constructed with symbols from mentioned alphabet, a binary tree is constructed, which will determine how the symbols will be encoded. As in arithmetic coding, the probability of a letter is calculated for given text. The binary tree will be constructed after following guidelines \cite{alok17}:
@@ -176,7 +175,6 @@ As with other codings, a set of symbols must be defined. For any text constructe
   \item The leaf with the lowest probability is most left and the one with the highest probability is most right in the tree. 
 \end{itemize}
 %todo tree building explanation
-% storytime might need to be rearranged
 A often mentioned difference between Shannon-Fano and Huffman coding, is that first is working top down while the latter is working bottom up. This means the tree starts with the lowest weights. The nodes that are not leafs have no value ascribed to them. They only need their weight, which is defined by the weights of their individual child nodes \cite{moffat20, alok17}.\\
 
 Given \texttt{K(W,L)} as a node structure, with the weigth or probability as \texttt{$W_{i}$} and codeword length as \texttt{$L_{i}$} for the node \texttt{$K_{i}$}. Then will \texttt{$L_{av}$} be the average length for \texttt{L} in a finite chain of symbols, with a distribution that is mapped onto \texttt{W} \cite{huf}.
@@ -204,9 +202,10 @@ Leaving the theory and entering the practice, brings some details that lessen th
 \section{Implementations in Relevant Tools}
 This section should give the reader a quick overview, how a small variety of compression tools implement described compression algorithms. 
 
+\label{k4:geco}
 \subsection{\ac{GeCo}} % geco
-% geco.c: analyze data/open files, parse to determine fileformat, create alphabet
-
+% differences between geco geco2 and geco3
+This tool has three development stages; the first \acs{GeCo} was released in 2016 \cite{gecoRepo}. It happens to have the smallest codebase, with only eleven C files. The two following extensions, \acs{GeCo}2, released in 2020, and the latest version, \acs{GeCo}3, have bigger codebases. They also provide features such as the usage of a neural network, which are of no help for this work. Since the file providing the arithmetic coding functionality does not differ between the three versions, the first release was analyzed.\\
 % explain header files
 The header files, that this tool includes in \texttt{geco.c}, can be split into three categories: basic operations, custom operations and compression algorithms. 
 The basic operations include header files for general purpose functions, that can be found in almost any c++ Project. The provided functionality includes operations for text-output on the command line inferface, memory management, random number generation and several calculations on up to real numbers.\\
@@ -218,6 +217,8 @@ Since original versions of the files licensed by University of Aveiro could not
 Following function calls in all three files led to the conclusion that the most important function is defined as \texttt{arithmetic\_encode} in \texttt{arith.c}. In this function the actual artihmetic encoding is executed. This function has no redirects to other files, only one function call \texttt{ENCODE\_RENORMALISE} the remaining code consists of arithmetic operations only.
 % if there is a chance for improvement, this function should be consindered as a entry point to test improving changes.
 
+Following the function calls in the \texttt{compressor} section of \texttt{geco.c}, down to the call into \texttt{arith.c}, no sign of multithreading could be identified. This fact leaves additional optimization possibilities and will be discussed in \ref{k6:results}.
+
 %useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \cite{moffat_arith}.
 %\subsection{genie} % genie
 \subsection{Samtools} % samtools 

+ 5 - 3
latex/tex/kapitel/k5_feasability.tex

@@ -168,9 +168,10 @@ Following criteria is reqiured for test data to be appropriate:
   \item{The test file is in a format that all or at least most of the tools can work with, meaning \acs{FASTA} or \acs{FASTq} files.}
   \item{The file is publicly available and free to use (for research).}
 \end{itemize}
-A second, bigger set of testfiles were required. This would verify the test results are not limited to small files. The size of 1 Gigabyte per file, would hold over five times as much data as the first set.
+A second, bigger set of test files was required to verify that the test results are not limited to small files. A minimum average file size of one gigabyte was set as a boundary, which corresponds to over five times the size of the first set.\\
+% data gathering
 Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable first set is rather easy. The ensembl database featured defined criteria, so the first available set called Homo\_sapiens.GRCh38.dna.chromosome were chosen \cite{ftp-ensembl}. This sample includes over 20 chromosomes, whereby considering the filenames, one chromosome is contained in each single file. After retrieving and unpacking the files, write privileges on them was withdrawn. So no tool could alter any file contents.
-Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their position. Since available servers \acs{ftp-ensembl, ftp-ncbi, ftp-isgr} offer several thousand files, stored in variating, deep directory structures, mapping filesize, filetype and file path takes too much time and resources to be done in this work. This problematic combined with a easily triggered overflow in the samtools library, resulted in a set of several, manualy searched and tested files which lacks in quantity. The variety of different species in this set \acs{DNA} provide a additional, interesting factor.\\
+Finding a second, bigger set happened to be more complicated. \acs{FTP} offers no fast, reliable way to sort files according to their size, regardless of their location. Since the available servers \cite{ftp-ensembl, ftp-ncbi, ftp-isgr} offer several thousand files, stored in varying, deep directory structures, mapping file size, file type and file path would take too much time and resources for the scope of this work. This problem, combined with an easily triggered overflow in the Samtools library, resulted in a set of several manually searched and tested \acs{FASTq} files. Compared to the first set, there is a noticeable lack of quantity, but the file sizes happen to have a fortunate distribution: with pairs of two files around 0.6, 1.1 and 1.2 gigabyte and one file with a size of 1.3 gigabyte, effects of scaling sizes should be clearly visible.\\
  
 % todo make sure this needs to stay.
 \noindent Following tools and parameters where used in this process:
@@ -181,7 +182,8 @@ Finding a second, bigger set happened to be more complicated. \acs{FTP} offers n
 \end{lstlisting}
 
 The chosen tools are able to handle the \acs{FASTA} format. However Samtools must convert \acs{FASTA} files into their \acs{SAM} format bevor the file can be compressed. The compression will firstly lead to an output with \acs{BAM} format, from there it can be compressed further into a \acs{CRAM} file. For \acs{CRAM} compression, the time needed for each step, from converting to two compressions, is summed up and displayed as one. For the compression time into the \acs{BAM} format, just the conversion and the single compression time is summed up. The conversion from \acs{FASTA} to \acs{SAM} is not displayed in the results. This is due to the fact that this is no compression process, and therefor has no value to this work.\\
-Even though \acs{SAM} files are not compressed, there can be a small but noticeable difference in size between the files in each format. Since \acs{FASTA} should store less information, by leaving out quality scores, this observation was counterintuitive. Comparing the first few lines showed two things: the header line were altered and newlines were removed. The alteration of the header line would result in just a few more bytes. To verify, no information was lost while converting, both files were temporary stripped from metadata and formatting, so the raw data of both files can be compared. Using \texttt{diff} showed no differences between the stored characters in each file. 
+Even though \acs{SAM} files are not compressed, there can be a small but noticeable difference in size between the files in each format. Since \acs{FASTA} should store less information, by leaving out quality scores, this observation was counterintuitive. Comparing the first few lines showed two things: the header line was altered and newlines were removed. The alteration of the header line would only account for a few bytes. To verify that no information was lost while converting, both files were temporarily stripped of metadata and formatting, so that the raw data of both files could be compared. Using \texttt{diff} showed no differences between the stored characters in each file.\\
+
 % user@debian data$\ ls -l --block-size=M raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa 
 % -r--r--r-- 1 user user 242M Jun  4 10:49 raw/Homo_sapiens.GRCh38.dna.chromosome.1.fa
 % user@debian data$\ ls -l --block-size=M samtools/files/Homo_sapiens.GRCh38.dna.chromosome.1.sam

+ 52 - 21
latex/tex/kapitel/k6_results.tex

@@ -96,9 +96,9 @@ Since \acs{CRAM} requires a file in \acs{BAM} format, the third row is calculate
 While \acs{SAM} format is required for compressing a \acs{FASTA} into \acs{BAM} and further into \acs{CRAM}, in itself it does not features no compression. However, the conversion from \acs{SAM} to \acs{FASTA} can result in a decrease in size. At first this might be contra intuitive since, as described in \ref{k2:sam} \acs{SAM} stores more information than \acs{FASTA}. This can be explained by comparing the sequence storing mechanism. A \acs{FASTA} sequence section can be spread over multiple lines whereas \acs{SAM} files store a sequence in just one line, converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
 % (hi)storytime
 Before interpreting this data further, a quick view into development processes: \acs{GeCo} stopped development in the year 2016 while Samtools is being developed since 2015, to this day, with over 70 people contributing.\\
-% todo interpret bit files and compare
 
 % big tables
+For the second set of test data, the file identifiers were set to follow the scheme \texttt{File 2.x}, where x is a number between zero and seven. While the names in the first set of test data matched the file identifiers, considering their numbering, the second set had more varied names. The mapping between identifier and file can be found in \ref{}. % todo add testset tables
 Reviewing \ref{t:recal-time} one will notice, that \acs{GeCo} reached a runtime over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an m indicates how many minutes each run took.
 
 \label{t:recal-size}
@@ -152,18 +152,18 @@ Reviewing \ref{t:recal-time} one will notice, that \acs{GeCo} reached a runtime
 \rmfamily
 
 In both tables \ref{t:recal-time} and \ref{t:recal-size} the already identified pattern can be observed. Looking at the compression ratio in \ref{t:recal-size} a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven were the one with the greatest size (\~1.3 Gigabyte). Closely folled by file one and two (\~1.2 Gigabyte). 
-% todo greater filesize means better compression
 
 \section{View on Possible Improvements}
-S. Petukhov described new findings about the distribution of nucleotides. With the probability of one nucleotide, in a sequence of sufficient length, information about the direct neighbours is revealed. For example, with the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N}, including \texttt{C} can be determined:\\
+S.\,V. Petoukhov described his findings about the distribution of nucleotides \cite{pet21}. The probability of one nucleotide, in a sequence of sufficient length, reveals information about its direct neighbours. For example, from the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N} that include \texttt{C} can be determined without counting them \cite{pet21}.\\
 %\%C ≈ Σ\%CN ≈ Σ\%NС ≈ Σ\%CNN ≈ Σ\%NCN ≈ Σ\%NNC ≈ Σ\%CNNN ≈ Σ\%NCNN ≈ Σ\%NNCN ≈ Σ\%NNNC\\
 
 % begin optimization 
 Considering this and the meassured results, an improvement in the arithmetic coding process and therefore in \acs{GeCo}s efficiency, would be a good start to equalize the great gap in the compression duration. Combined with a tool that is developed with todays standards, there is a possibility that even greater improvements could be archived.\\
 % simple theoretical approach
 How would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires to determine the probabilies of each symbol in the alphabet. The simplest way to do that, is done by parsing the whole sequence from start to end and increasing a counter for each nucleotide that got parsed. 
-With new findings discovered by S. Petukhov in cosideration, the goal would be to create an entropy coding implementation that beats current implementation in the time needed to determine probabilities. A possible approach would be that the probability of one nucleotide can be used to determine the probability of other nucelotides, by a calculation rather than the process of counting each one.
-This approach throws a few questions that need to be answered in order to plan a implementation:  
+Taking the findings of Petoukhov into consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine these probabilities. A possible approach would be to use the probability of one nucleotide to determine the probabilities of the other nucleotides by calculation, rather than by counting each one.
+This approach raises a few questions that need to be answered in order to plan an implementation \cite{pet21}:\\
+
 \begin{itemize}
 	\item How many probabilities are needed to calculate the others?
 	\item Is there space for improvement in the parsing/counting process?
@@ -171,25 +171,56 @@ This approach throws a few questions that need to be answered in order to plan a
 	\item How can the variation between probabilities be determined?
 \end{itemize}
 
-Second point must be asked, because the improvement in counting only one nucleotide in comparison to counting three, would be to little to be called relevant.
-%todo compare time needed: to store a variable <-> parsing the sequence
-To compare parts of a programm and their complexity, the Big-O notation is used. Unfortunally this is only covering loops and coditions as a whole. Therefore a more detailed view on operations must be created: 
-Considering a single threaded loop with the purpose to count every nucleotide in a sequence, the process of counting can be split into several operations, defined by this pseudocode.
+% first bulletpoint
+The question of how many probabilities are needed has to be answered before any kind of implementation can be started. It can only be answered by theoretical proof, for example in the form of a mathematical equation which proves that counting all occurrences of one nucleotide can be used to determine all probabilities. Since this task is time and resource consuming, and there is more to discuss, finding an answer is postponed to future work. 
+%One should keep in mind that this is only one of many approaches. Any proove of other approaches which reduces the probability determination, can be taken in instead. 
+
+% second bullet point (mutlithreading aspect=
+The second point must be asked because the improvement of counting only one nucleotide in comparison to counting three would be too little to be called relevant, especially if multithreading is an option. Since the static code analysis in \ref{k4:geco} revealed no multithreading, the potential improvement from splitting the workload onto several threads should be considered before working on an improvement based on Petoukhov's findings. This is relevant because some improvements, like the one described above, will lose efficiency if only subsections of a genome are processed. A tool like OpenMP for multithreading C programs would possibly supply the required functionality to develop a proof of concept \cite{cthreading, pet21}.
+% theoretical improvement with pseudocode
+But what could an improvement look like, leaving aside the possible difficulties multithreading would bring?
+To answer this, a way to measure a possible improvement must first be chosen. To compare parts of a program and their complexity, the Big-O notation is used. Unfortunately, it only covers loops and conditions as a whole. Therefore a more detailed view of the individual operations must be created:
+Considering a single-threaded loop whose purpose is to count every nucleotide in a sequence, the counting process can be split into several operations, described by the following pseudocode.\\
 
 %todo use GeCo arith function with bigO
-while (sequence not end):\\
-	next\_nucleotide = read\_next\_nucleotide(sequence)\\
-	for (element in alphabet\_probabilities):\\
-		if (element equals next\_nucleotide)\\
-			element = element + 1\\
-		fi\\
-	rof\\
-elihw\\
-
-This loop will itterate over a whole sequence, counting each nucleotide. In line three, a inner loop can be found which itterates over the alphabet, to determine which symbol should be increased. Considering the findings, described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. The Big-O notation for this code, with any sequence with the length of n, would be decreseased from O($n^2$) to O($n\cdot 1)$) or simply O(N) \cite{big-o}. Which is clearly an improvement in complexety and therefor also in runtime.\\
-The runtime for calculations of the other symbols probabilities must be considered as well and compared against the nested loop to be certain, that the overall was improved.
+while (sequence not end)\\
+do\\
+\-\hspace{0.5cm} next\_nucleotide = read\_next\_nucleotide(sequence)\\
+\-\hspace{0.5cm} for (element in alphabet\_probabilities)\\
+\-\hspace{0.5cm} do\\
+\-\hspace{1cm} 	 if (element equals next\_nucleotide)\\
+\-\hspace{1.5cm} element = element + 1\\
+\-\hspace{1cm}   fi\\
+\-\hspace{0.5cm} done\\
+done\\
+
+This loop will iterate over a whole sequence, counting each nucleotide. In line four, an inner loop can be found which iterates over the alphabet to determine which counter should be increased. Considering the findings described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. The Big-O notation for this code, for any sequence of length n, would thus decrease from O($n^2$) to O($n\cdot 1$), or simply O($n$) \cite{big-o}, which is clearly an improvement in complexity and therefore also in runtime.\\
+The runtime of the calculations for the other symbols' probabilities must be considered as well and compared against the nested loop, to be certain that the overall runtime was improved.\\
 % more realistic view on parsing todo need cites
-In practice, obviously smarter ways are used, to determine probabilities. Like splitting the sequence in multiple parts and parse each subsequence asynchronous. This results can either sumed up for global probabilities or get used individually on each associated subsequence. Either way, the presented improvement approach should be appliable to both parsing methods.\\
+%In practice, obviously smarter ways are used, to determine probabilities. Like splitting the sequence in multiple parts and parse each subsequence asynchronous. 
+Getting back to the question of how multithreading would impact such improvements: an implementation like the one described above could also work with multithreading, because the benefit of removing the inner loop does not depend on how much of the sequence a single thread processes. Multiple threads, each processing a part of a sequence of length n, would also benefit, since for every part the nested variant performs more operations than the single-comparison variant. The results can either be summed up into global probabilities or be used individually on each associated subsequence. Either way, the presented improvement approach should be applicable to both parsing methods.\\
+This leaves a list of problems which need to be regarded when developing such an improvement.
+If there is space for improvement in the parsing/counting process, the following problems need to be addressed:
+
+\begin{itemize}
+	\item The gain of reducing one process by adding additional code must be estimated and set into relation to the added effort.
+	\item For a tool that does not feature multithreading, how would adding multithreading affect the improvement results?
+\end{itemize}
+
+% todo petoukhov just said T = AT+GT+CT+TT = %NT and %T = %TN
+% if %C = %T and %A = %G 
+% C = ?
+
+% bulletpoint 3
+An important question that needs to be answered is: if Petoukhov's findings show that, through similarities in the distribution of the nucleotides, one probability can lead to an approximation of the other three, how does that affect the coding mechanism, given that entropy codings work with probabilities?
+With an equal probability for each nucleotide, entropy coding can not be treated as a whole. This is due to the fact that Huffman coding makes use of differing probabilities: an equal distribution means every character will be encoded with the same length, which would make the encoding process unnecessary. Arithmetic coding, on the other hand, is able to handle equal probabilities.
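+A quick calculation illustrates why equal probabilities leave no room for Huffman coding: with $p(\texttt{A})=p(\texttt{C})=p(\texttt{G})=p(\texttt{T})=0.25$, the entropy of the four-symbol alphabet is
+\[
+H = -\sum_{s \in \{A,C,G,T\}} p(s)\log_2 p(s) = 4 \cdot 0.25 \cdot \log_2 4 = 2
+\]
+bits per symbol, which is exactly the length of a fixed two-bit code, so a variable-length code can not save anything here.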
+However, there are obviously chains of repeating nucleotides in genomes. For example, \texttt{File 2.2} contains the following subsequence at line 90:
+
+\texttt{AAAAAAAAAAAAAAAAAAAAAATAAATATTTTATTT} 
+
+Without determining probabilities, one can see that the \texttt{A}s outnumber the \texttt{T}s and that neither \texttt{C} nor \texttt{G} is present. Over the whole 1.2 gigabytes the distribution will even out, but cutting out a subsection of relevant size with such an unequal distribution will have an impact on the probabilities of the whole sequence. If a longer sequence leads to a more equal distribution, this knowledge could be used to help determine the distributions of subsequences of a sequence with equally distributed probabilities.
+% length cutting
+
 
 
 % how is data interpreted

+ 23 - 0
latex/tex/literatur.bib

@@ -409,4 +409,27 @@
   url   = {https://ftp.ensembl.org},
 }
 
+@Book{cthreading,
+  author    = {Quinn, Michael J.},
+  title     = {Parallel Programming in C with MPI and OpenMP},
+  isbn      = {0071232656},
+  publisher = {McGraw-Hill Education Group},
+  year      = {2003},
+}
+
+@Online{gecoRepo,
+  author = {Cobilab},
+  date   = {2022-11-19},
+  title  = {Repositories for the three versions of GeCo},
+  url    = {https://github.com/cobilab},
+}
+
+@Article{pet21,
+  author    = {Sergey V. Petoukhov},
+  date      = {2021-10},
+  title     = {Tensor Rules in the Stochastic Organization of Genomes and Genetic Stochastic Resonance in Algebraic Biology},
+  doi       = {10.20944/preprints202110.0093.v1},
+  publisher = {{MDPI} {AG}},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}