@@ -1,5 +1,5 @@

\chapter{Results and Discussion}

-The tables \ref{a6:testsets-size} and \ref{a6:testsets-time} contain raw measurement values for the two goals, described in \ref{k5:goals}. The table \ref{a6:testsets-time} lists how long each compression procedure took, in milliseconds. \ref{a6:testsets-size} contains file sizes in bytes. In these tables, as well as in the other ones associated with tests in the scope of this work, the a name scheme is used, to improve readability. The filenames were replaced by \texttt{File} followed by two numbers seperated by a point. For the first test set, the number prefix \texttt{1.} was used, the second set is marked with a \texttt{2.}. For example, the fourth file of each test, in tables are named like this \texttt{File 1.4} and \texttt{File 2.4}. The name of the associated source file for the first set is:
+Tables \ref{a6:compr-size} and \ref{a6:compr-time} contain the raw measurement values for the two goals described in \ref{k5:goals}. Table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds; table \ref{a6:compr-size} contains the file sizes in bytes. In these tables, as well as in the other tables associated with the tests in the scope of this work, a naming scheme is used to improve readability. The filenames were replaced by \texttt{File} followed by two numbers separated by a point. For the first test set, the number prefix \texttt{1.} was used; the second set is marked with a \texttt{2.}. For example, the fourth file of each test set is named \texttt{File 1.4} or \texttt{File 2.4} in the tables. The name of the associated source file for the first set is:

\texttt{Homo\_sapiens.GRCh38.dna.chromosome.\textbf{4}.fa}

@@ -215,13 +215,13 @@ In both tables \ref{k6:recal-time} and \ref{k6:recal-size} the already identifie

\section{View on Possible Improvements}

So far, this work has gone over formats for storing genomes and methods to compress files in those formats, and through tests in which implementations of the named algorithms compressed several files, followed by an analysis of the results. The test results show that \acs{GeCo} provides a better compression ratio than Samtools but takes more time to run. In this test run, implementations of arithmetic coding therefore resulted in a better compression ratio than Samtools \acs{BAM}, with its mix of Huffman coding and \acs{LZ77}, or Samtools' custom compression format \acs{CRAM}. The results in \autocite{survey} support this statement. That study used \acs{FASTA}/Multi-FASTA files from 71 MB to 166 MB and found that \acs{GeCo} reached a compression ratio varying from 12.34 to 91.68 times smaller than the input reference, but also resulted in long runtimes of up to over 600 minutes \cite{survey}. Since that study focused on a different goal than this work and therefore used different test variables and environments, the results cannot be compared directly. What can be taken from this, however, is that arithmetic coding, at least as implemented in \acs{GeCo}, is in need of a runtime improvement.\\
-The actual mathematical proove of such an improvemnt and its implementation can not be covered because it would to beyond scope. But in order to set up a foundation for this task, the rest of this work will consist of considerations and problem analysis, which should be thought about and dealt with to develop a improvement.
+The actual mathematical proof of such an improvement, the planning of an implementation and the development of a proof of concept would be a rewarding but time- and resource-consuming project. Dealing with those tasks would go beyond the scope of this work. But in order to widen the foundation for them, the rest of this work consists of considerations and problem analysis that should be thought about and dealt with in order to develop an improvement.

S.V. Petoukhov described his findings about the distribution of nucleotides \cite{pet21}. In a sequence of sufficient length, the probability of one nucleotide reveals information about its direct neighbours. For example, from the probability of \texttt{C}, the probabilities for sets (n-plets) of any nucleotide \texttt{N} that include \texttt{C} can be determined without counting them \cite{pet21}.\\

\[
\%C \approx \sum\%CN \approx \sum\%NC \approx \sum\%CNN \approx \sum\%NCN \approx \sum\%NNC \approx \sum\%CNNN \approx \sum\%NCNN \approx \sum\%NNCN \approx \sum\%NNNC
\]

% begin optimization
-Considering this and the meassured results, an improvement in the arithmetic coding process and therefore in \acs{GeCo}s efficiency, would be a good start to equalize the great gap in the compression duration. Combined with a tool that is developed with todays standards, there is a possibility that even greater improvements could be archived.\\
+Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good starting point to close the large gap in compression duration. Combined with a tool that is developed to today's standards, there is a possibility that even greater improvements could be achieved.\\

% simple theoretical approach

What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probabilities of each symbol in the alphabet. The simplest way to do this is to parse the whole sequence from start to end, increasing a counter for each nucleotide that is parsed.
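
As an illustration, such a counting pass could look like the following Python sketch (a simplified stand-in, not code from any of the tested tools; the filename is a placeholder):

\begin{verbatim}
from collections import Counter

def nucleotide_probabilities(path):
    # One pass over the sequence: count every nucleotide.
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):  # skip FASTA header lines
                continue
            counts.update(s for s in line.strip().upper() if s in "ACGT")
    total = sum(counts.values())
    # The relative frequency of a symbol serves as its probability.
    return {s: counts[s] / total for s in "ACGT"}

print(nucleotide_probabilities("chromosome.fa"))
\end{verbatim}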

With the new findings discovered by Petoukhov in consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine the probabilities. A possible approach would be to use the probability of one nucleotide to determine the probabilities of other nucleotides by a calculation, rather than by the process of counting each one.
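
As a first step towards such an implementation, the following hedged Python sketch checks Petoukhov's relation by brute force on a toy sequence; under the relation, the second counting pass over the pairs is exactly the work a calculation-based implementation could save (the sequence and function names are illustrative):

\begin{verbatim}
from collections import Counter

def check_relation(seq):
    # Brute-force comparison: %C versus the summed probabilities of
    # all di-nucleotides ending in C (%AC + %CC + %GC + %TC).
    mono = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    p_c = mono["C"] / len(seq)
    sum_nc = sum(pairs[n + "C"] for n in "ACGT") / (len(seq) - 1)
    return p_c, sum_nc  # approximately equal for long sequences

print(check_relation("ACGGTCATTGCA" * 1000))
\end{verbatim}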
@@ -267,7 +267,7 @@ If there space for improvement in the parsing/counting process, what problems ne

\begin{itemize}
    \item reducing one process by adding additional code must be estimated and set into relation.
- \item for a tool that does not feature multithreading, how would multithreading affect the improvement reulst?
+ \item for a tool that does not feature multithreading, how would multithreading affect the improvement results? (see the sketch after this list)

\end{itemize}
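
Regarding the multithreading question above, the following hedged Python sketch (chunking strategy and worker count are arbitrary illustration choices, not part of any tested tool) shows how trivially the counting step itself parallelizes. Since a multithreaded baseline could shrink exactly the runtime share that a calculation-based shortcut targets, this has to be factored into any measurement of the improvement:

\begin{verbatim}
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_chunk(chunk):
    # Each worker counts the nucleotides of its own chunk.
    return Counter(s for s in chunk if s in "ACGT")

def parallel_counts(seq, workers=4):
    # Split the sequence into roughly equal chunks, one per worker.
    size = len(seq) // workers + 1
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_chunk, chunks)
    return sum(partials, Counter())  # merge the partial counts

if __name__ == "__main__":
    print(parallel_counts("ACGTACGGTC" * 100000))
\end{verbatim}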

% todo petoukhov just said T = AT+GT+CT+TT = %NT and %T = %TN
@@ -284,6 +284,7 @@ The fact that there are obviously chains of repeating nucleotides in genomes. Fo

Without determining probabilities, one can see that the \texttt{A}s outnumber the \texttt{T}s and that neither \texttt{C} nor \texttt{G} is present. Over the whole 1.2 gigabytes, the distribution will even out more, but cutting out a subsection of relevant size with an unequal distribution will have an impact on the probabilities of the whole sequence. If a longer sequence leads to a more equal distribution, this knowledge could be used to help determine the distributions of subsequences of a sequence with equally distributed probabilities.
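
The following hedged Python sketch illustrates this effect on an artificial sequence (the sequence and the window size are made up for demonstration): a skewed subsection yields probabilities that deviate from those of the whole sequence, which is exactly the problem any subsequence-based shortcut has to deal with:

\begin{verbatim}
from collections import Counter

def window_vs_whole(seq, window):
    # Compare the distribution of a leading subsection with the
    # distribution of the whole sequence.
    part = Counter(seq[:window])
    whole = Counter(seq)
    return {s: (part[s] / window, whole[s] / len(seq)) for s in "ACGT"}

# Skewed start (only As and Ts), followed by an evenly mixed tail.
seq = "A" * 500 + "T" * 100 + "ACGT" * 1000
print(window_vs_whole(seq, 600))
\end{verbatim}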

% length cutting
+% todo: extend with comparisons to the survey work

% how is data interpreted
% why did the tools result in this, what can we learn
% improvements