
added static code analysis for geco

u 3 years ago
parent
commit
e8426e00d1
2 changed files with 27 additions and 4 deletions
  1. 25 4
      latex/tex/kapitel/k4_algorithms.tex
  2. 2 0
      latex/tex/kapitel/k6_results.tex

+ 25 - 4
latex/tex/kapitel/k4_algorithms.tex

@@ -37,6 +37,9 @@ In contrast to lossless compression, lossy compression might excludes parts of d
 For \acs{DNA}, lossless compression is needed. To be precise, lossy compression is not possible because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression, two major approaches are known: dictionary coding and entropy coding. Established methods from both fields are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
 
 \subsection{Dictionary coding}
+\textbf{Disclaimer}
+Unfortunately, known implementations such as those of the LZ family do not use probabilities to compress and are therefore outside the main scope of this work. This section remains in order to strengthen the understanding of compression algorithms. In addition, a hybrid implementation described later uses both dictionary and entropy coding.\\
+
 \label{k4:dict}
 Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of it. A short example illustrates this:
 % demo substrings
@@ -55,7 +58,6 @@ The computer scientist Abraham Lempel and the electrical engineere Jacob Ziv cre
  % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
 \ac{LZ77} works by removing repetitions of a string or substring and replacing each of them with information on where to find the first occurrence and how long it is. Typically this information is stored in two bytes, whereby more than one byte can be used to point to the first occurrence because usually less than one byte is required to store the length.\\
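The back-reference scheme described for \ac{LZ77} can be sketched as follows. This is a minimal illustration and not the code of any specific LZ implementation: the function names, the split of the two bytes into a 12-bit offset and a 4-bit length, and the minimum match length of 3 are assumptions chosen for this example.

```python
def lz77_compress(data, window=4095, max_len=15, min_len=3):
    """Return tokens: either a literal character or an (offset, length) pair.

    window, max_len and min_len are illustrative assumptions.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the already seen text for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= min_len:
            tokens.append((best_off, best_len))  # back-reference
            i += best_len
        else:
            tokens.append(data[i])  # literal
            i += 1
    return tokens


def lz77_decompress(tokens):
    """Rebuild the text; copying char by char supports overlapping matches."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length = tok
            for _ in range(length):
                out.append(out[-off])
        else:
            out.append(tok)
    return "".join(out)


def pack_pair(offset, length):
    """Pack a back-reference into two bytes: 12 bits offset, 4 bits length."""
    assert 0 < offset < 4096 and 0 <= length < 16
    return (offset << 4) | length


def unpack_pair(word):
    """Inverse of pack_pair."""
    return word >> 4, word & 0xF
```

For the input `ACGTACGTACGT`, the first four nucleotides are stored as literals and the remaining eight are replaced by a single pair (offset 4, length 8), which fits into two bytes under the assumed bit split.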
 
-Unfortunally, known implementations like the ones out of LZ Family, do not use probabilities to compress and are therefore out of scope for this work. Since finding repeting sections and their location might also be improved, this chapter will remain.\\
 
 \subsection{Shannon's Entropy}
 The founder of information theory, Claude Elwood Shannon, described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are useful beyond information transmission. 
@@ -226,7 +228,26 @@ With this simple rules, the alphabet can be compressed too. Instead of storing c
 % example header, alphabet, data block?
 
 \section{Implementations in Relevant Tools}
-\subsection{} % geco
-\subsection{} % genie
-\subsection{} % samtools 
+This section gives the reader a quick overview of how a small selection of compression tools implement the described compression algorithms. 
+
+\subsection{\ac{GeCo}} % geco
+% geco.c: analyze data/open files, parse to determine fileformat, create alphabet
+
+% explain header files
+The header files that this tool includes in \texttt{geco.c} can be split into three categories: basic operations, custom operations and compression algorithms. 
+The basic operations include header files for general-purpose functions that can be found in almost any C++ project. The provided functionality includes operations for text output on the command line interface, memory management, random number generation and various calculations on numbers up to and including real numbers.\\
+Custom operations also include general-purpose functions, with the difference that they were written, altered or extended by \acs{GeCo}'s developer. The last category consists of several C files containing two arithmetic coding implementations: \textbf{first} \texttt{bitio.c} and \texttt{arith.c}, \textbf{second} \texttt{arith\_aux.c}.\\
+The first two were developed by John Carpinelli, Wayne Salamonsen, Lang Stuiver and Radford Neal (who is mentioned only in the latter). Comparing the two files, \texttt{bitio.c} has less code, shorter comments and many more non-functional code sections. The likely conclusion is that \texttt{arith.c} is some kind of official release, whereas \texttt{bitio.c} serves as an experimental file in which the developers create proofs of concept. The described files adapt code from Armando J. Pinho, licensed by the University of Aveiro DETI/IEETA and written in 1999.\\
+The second implementation was also licensed by the University of Aveiro DETI/IEETA, but no author is mentioned. Judging from the function names and the length of the function bodies, \texttt{arith\_aux.c} could serve as a wrapper for basic functions that are often used in arithmetic coding.\\
+Since the original versions of the files licensed by the University of Aveiro could not be found, there is no way to determine whether the files match their originals or whether changes have been made. This should be kept in mind when following the static analysis.
+
+Following the function calls in all three files led to the conclusion that the most important function is defined as \texttt{arithmetic\_encode} in \texttt{arith.c}. This function executes the actual arithmetic encoding. It does not redirect to other files and contains only one function call, \texttt{ENCODE\_RENORMALISE}; the remaining code consists of arithmetic operations only.
+% if there is a chance for improvement, this function should be considered as an entry point to test improving changes.
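To illustrate the kind of arithmetic such an encoding function performs, the following sketch narrows an interval once per symbol. It is not \acs{GeCo}'s code and does not reproduce \texttt{arith.c}: all names are hypothetical, the symbol frequencies are made up, and exact fractions are used instead of fixed-width integers, so the renormalisation step is deliberately omitted.

```python
from fractions import Fraction


def build_intervals(freqs):
    """Map each symbol to its slice [low, high) of the unit interval."""
    total = sum(freqs.values())
    intervals, cum = {}, 0
    for sym in sorted(freqs):
        intervals[sym] = (Fraction(cum, total),
                          Fraction(cum + freqs[sym], total))
        cum += freqs[sym]
    return intervals


def arithmetic_encode_sketch(message, intervals):
    """Narrow [low, low + width) once per symbol; return the final interval."""
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = intervals[sym]
        low = low + width * s_low          # uses the width before narrowing
        width = width * (s_high - s_low)
    return low, low + width


def arithmetic_decode_sketch(value, length, intervals):
    """Recover `length` symbols from any value inside the final interval."""
    out = []
    low, width = Fraction(0), Fraction(1)
    for _ in range(length):
        scaled = (value - low) / width
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= scaled < s_high:
                out.append(sym)
                low = low + width * s_low
                width = width * (s_high - s_low)
                break
    return "".join(out)
```

With the assumed frequencies A:2, C:1, G:1, the message `AC` narrows the interval to [1/4, 3/8), and any value in that range decodes back to `AC`. A practical coder works on fixed-width integers and therefore has to rescale the interval regularly, which is presumably the role of the renormalisation call mentioned above.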
+
+%useless? -> Both \texttt{bitio.c} and \texttt{arith.c} are pretty similar. They were developed by the same authors, except for Radford Neal, who is only mentioned in \texttt{arith.c}; both are based on the work of A. Moffat \cite{moffat_arith}.
+%\subsection{genie} % genie
+\subsection{Samtools} % samtools 
+% mention SAM structure 
+\subsubsection{BAM}
+\subsubsection{CRAM}
 

+ 2 - 0
latex/tex/kapitel/k6_results.tex

@@ -85,3 +85,5 @@ Overall, Samtools \acs{BAM} resulted in 71.76\% size reduction, the \acs{CRAM} m
 
 As \ref{t:time} shows, the average compression duration for \acs{GeCo} is 42.57s. That is a little over 33s longer than the average runtime of Samtools for compressing into the \acs{CRAM} format; in other words, Samtools needs roughly 78\% less time.\\
 Before interpreting this data further, a quick look at the development processes: development of \acs{GeCo} stopped in 2016, while Samtools has been under development since 2015 to this day, with over 70 people contributing. With that in mind, an improvement in \acs{GeCo}'s efficiency would be a start towards closing the large gap in compression duration.\\
+
+