\chapter{Results and Discussion}
The tables \ref{a6:compr-size} and \ref{a6:compr-time} contain the raw measurement values for the two goals described in the earlier section \ref{k5:goals}. Table \ref{a6:compr-time} lists how long each compression procedure took, in milliseconds; table \ref{a6:compr-size} contains the file sizes in bytes. In these tables, as well as in the other ones associated with the tests, a consistent naming scheme is used to improve readability. The filenames were replaced by \texttt{File} followed by two numbers separated by a point. For the first test set, the prefix \texttt{1.} was used; the second set is marked with \texttt{2.}. For example, the fourth file of each test set is named \texttt{File 1.4} or \texttt{File 2.4}, respectively. The name of the associated source file for the first set is:
\texttt{Homo\_sapiens.GRCh38.dna.chromosome.\textbf{4}.fa}
Since the source files of the second set are not named as consistently as those of the first one, a third column was added to \ref{k6:set2size}, which maps the table identifier (ID.) to the source file name.\\
The files contained in each test set, as well as their sizes, can be found in the tables \ref{k6:set1size} and \ref{k6:set2size}.
The first test set contained a total of 2.8~\acs{GB} unevenly spread over 21 files, while the second test set contained 7~\acs{GB} in total, spread over seven files.\\
\sffamily
\begin{footnotesize}
\begin{longtable}[h]{ p{.4\textwidth} p{.4\textwidth}}
\caption[First Test Set Files and their Sizes in \acs{MB}] % caption for the list of tables
{Files contained in the First Test Set and their Sizes in \acs{MB}} % caption for the table itself
\\
\toprule
\textbf{ID.} & \textbf{Size in \acs{MB}} \\
\midrule
File 1.1& 241.38\\
File 1.2& 234.823\\
File 1.3& 192.261\\
File 1.4& 184.426\\
File 1.5& 176.014\\
File 1.6& 165.608\\
File 1.7& 154.497\\
File 1.8& 140.722\\
File 1.9& 134.183\\
File 1.10& 129.726\\
File 1.11& 130.976\\
File 1.12& 129.22\\
File 1.13& 110.884\\
File 1.14& 103.786\\
File 1.15& 98.888\\
File 1.16& 87.589\\
File 1.17& 80.724\\
File 1.18& 77.927\\
File 1.19& 56.834\\
File 1.20& 62.483\\
File 1.21& 45.289\\
\bottomrule
\label{k6:set1size}
\end{longtable}
\end{footnotesize}
\rmfamily
\sffamily
\begin{footnotesize}
\begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.4\textwidth}}
\caption[Second Test Set Files and their Sizes in \acs{MB}] % caption for the list of tables
{Files contained in the Second Test Set, their Sizes in \acs{MB} and Source File Names} % caption for the table itself
\\
\toprule
\textbf{ID.} & \textbf{Size in \acs{MB}} & \textbf{Source File Name}\\
\midrule
File 2.1& 1188.976& SRR002905.recal.fastq\\
File 2.2& 1203.314& SRR002906.recal.fastq\\
File 2.3& 627.467& SRR002815.recal.fastq\\
File 2.4& 676.0& SRR002816.recal.fastq\\
File 2.5& 1066.431& SRR002817.recal.fastq\\
File 2.6& 1071.095& SRR002818.recal.fastq\\
File 2.7& 1240.564& SRR002819.recal.fastq\\
\bottomrule
\label{k6:set2size}
\end{longtable}
\end{footnotesize}
\rmfamily
\section{Interpretation of Results}
The units milliseconds and bytes preserve a high precision. Unfortunately, they are hard to read and compare by eye alone, so the data was converted. Sizes in \ref{k6:sizepercent} are displayed as percentages relative to the respective source file. This means the compression with \acs{GeCo} on:
\texttt{Homo\_sapiens.GRCh38.dna.chromosome.11.fa}
resulted in a compressed file that was only 17.6\% as big.
Runtimes in \ref{k6:time} were converted into seconds and rounded. Also, a row was added to the bottom of each table that displays the average value of the respective column.\\
\sffamily
\begin{footnotesize}
\begin{longtable}[h]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Effectivity for the First Test Set] % caption for the list of tables
{File sizes in different compression formats in \textbf{percent}} % caption for the table itself
\\
\toprule
\textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM} \%}& \textbf{Samtools \acs{CRAM} \%} \\
\midrule
%geco bam and cram in percent
File 1.1& 18.32& 24.51& 22.03\\
File 1.2& 20.28& 26.56& 23.57\\
File 1.3& 20.4& 26.58& 23.66\\
File 1.4& 20.3& 26.61& 23.56\\
File 1.5& 20.12& 26.46& 23.65\\
File 1.6& 20.36& 26.61& 23.6\\
File 1.7& 19.64& 26.15& 23.71\\
File 1.8& 20.4& 26.5& 23.67\\
File 1.9& 17.01& 23.25& 20.94\\
File 1.10& 20.15& 26.36& 23.7\\
File 1.11& 19.96& 26.14& 23.69\\
File 1.12& 20.1& 26.26& 23.74\\
File 1.13& 17.8& 22.76& 20.27\\
File 1.14& 17.16& 22.31& 20.11\\
File 1.15& 16.21& 21.69& 19.76\\
File 1.16& 17.43& 23.48& 21.66\\
File 1.17& 18.76& 25.16& 23.84\\
File 1.18& 20.0& 25.31& 23.63\\
File 1.19& 17.6& 24.53& 23.91\\
File 1.20& 19.96& 25.6& 23.67\\
File 1.21& 16.64& 22.06& 20.44\\
&&&\\
\textbf{Average}& 18.98& 24.99& 22.71\\
\bottomrule
\label{k6:sizepercent}
\end{longtable}
\end{footnotesize}
\rmfamily
Overall, Samtools \acs{BAM} resulted in a little over 75\% size reduction; the \acs{CRAM} method improved this by roughly 2\%. \acs{GeCo} provided the greatest reduction with over 80\%. This difference of about 6\% comes with a comparatively great sacrifice in time.\\
\sffamily
\begin{footnotesize}
\begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Efficiency for the First Test Set] % caption for the list of tables
{Compression duration in seconds} % caption for the table itself
\\
\toprule
\textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
\midrule
File 1.1 & 23.5& 3.786& 16.926\\
File 1.2 & 24.65& 3.784& 17.043\\
File 1.3 & 20.16& 3.123& 13.999\\
File 1.4 & 19.408& 3.011& 13.445\\
File 1.5 & 18.387& 2.862& 12.802\\
File 1.6 & 17.364& 2.685& 12.015\\
File 1.7 & 15.999& 2.503& 11.198\\
File 1.8 & 14.828& 2.286& 10.244\\
File 1.9 & 12.304& 2.078& 9.21\\
File 1.10 & 13.493& 2.127& 9.461\\
File 1.11 & 13.629& 2.132& 9.508\\
File 1.12 & 13.493& 2.115& 9.456\\
File 1.13 & 99.902& 1.695& 7.533\\
File 1.14 & 92.475& 1.592& 7.011\\
File 1.15 & 85.255& 1.507& 6.598\\
File 1.16 & 82.765& 1.39& 6.089\\
File 1.17 & 82.081& 1.306& 5.791\\
File 1.18 & 79.842& 1.277& 5.603\\
File 1.19 & 58.605& 0.96& 4.106\\
File 1.20 & 64.588& 1.026& 4.507\\
File 1.21 & 41.198& 0.721& 3.096\\
&&&\\
\textbf{Average}&42.57&2.09&9.32\\
\bottomrule
\label{k6:time}
\end{longtable}
\end{footnotesize}
\rmfamily
As \ref{k6:time} shows, the average compression duration for \acs{GeCo} is 42.57~s. That is a little over 33~s more than the average runtime of Samtools for compressing into the \acs{CRAM} format, a difference that accounts for about 78\% of \acs{GeCo}'s runtime.\\
Since \acs{CRAM} requires a file in \acs{BAM} format, the values in the \acs{CRAM} column are calculated by adding the time needed to compress into \acs{BAM} to the time needed to convert from \acs{BAM} into \acs{CRAM}.
While the \acs{SAM} format is required for compressing a source file into \acs{BAM} and further into \acs{CRAM}, it is in itself an uncompressed format. However, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might seem counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} is able to contain more information about a sequence than \acs{FASTA}. It can be explained by comparing the sequence storing mechanisms: a \acs{FASTA} sequence section can be spread over multiple lines, whereas \acs{SAM} files store a sequence in just one line, so converting can result in a \acs{SAM} file that is smaller than the original \acs{FASTA} file.
% (hi)storytime -> taken out because it sounds like a justification
%Before interpreting the second test set, a quick note on the development processes: \acs{GeCo} stopped development in the year 2016 while Samtools is being developed since 2015, to this day, with over 70 people contributing.\\
% big tables
%For the second set of test data, the file identifier was set to follow the scheme \texttt{File 2.x} where x is a number between zero and seven. While the first set of test data had names that matched the file identifiers, considering its numbering, the second set had more variating names. The mapping between identifier and file can be found in \ref{}. % todo add test set tables
Reviewing the second test set in \ref{k6:recal-time}, one will notice that \acs{GeCo} reached a runtime of over 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an \texttt{m} indicates how many minutes each run took.
\sffamily
\begin{footnotesize}
\begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Effectivity for the Second Test Set] % caption for the list of tables
{File sizes in different compression formats in \textbf{percent}} % caption for the table itself
\\
\toprule
\textbf{ID.} & \textbf{\acs{GeCo} \%}& \textbf{Samtools \acs{BAM} \%}& \textbf{Samtools \acs{CRAM} \%} \\
\midrule
%geco bam and cram in percent
File 2.1& 1.00& 6.28& 5.38\\
File 2.2& 0.98& 6.41& 5.52\\
File 2.3& 1.21& 8.09& 7.17\\
File 2.4& 1.20& 7.70& 6.85\\
File 2.5& 1.08& 7.58& 6.72\\
File 2.6& 1.09& 7.85& 6.93\\
File 2.7& 0.96& 5.83& 4.63\\
&&&\\
\textbf{Average}& 1.07& 7.11& 6.17\\
\bottomrule
\label{k6:recal-size}
\end{longtable}
\end{footnotesize}
\rmfamily
\sffamily
\begin{footnotesize}
\begin{longtable}[ht]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
\caption[Compression Efficiency for the Second Test Set] % caption for the list of tables
{Compression duration in seconds} % caption for the table itself
\\
\toprule
\textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM}} \\
\midrule
%compress time for geco, bam and cram in seconds
File 2.1 & 1m58.427& 16.248& 23.016\\
File 2.2 & 1m57.905& 15.770& 22.892\\
File 2.3 & 1m09.725& 07.732& 12.858\\
File 2.4 & 1m13.694& 08.291& 13.649\\
File 2.5 & 1m51.001& 14.754& 23.713\\
File 2.6 & 1m51.315& 15.142& 24.358\\
File 2.7 & 2m02.065& 16.379& 23.484\\
&&&\\
\textbf{Average}& 1m43.447& 13.474& 20.567\\
\bottomrule
\label{k6:recal-time}
\end{longtable}
\end{footnotesize}
\rmfamily
In both tables \ref{k6:recal-time} and \ref{k6:recal-size}, the already identified pattern can be observed: Samtools compression is faster but less effective. Looking at the compression ratios in \ref{k6:recal-size}, a maximum compression of 99.04\% was reached with \acs{GeCo}. In this set of test files, \texttt{File 2.7} was the one with the greatest size ($\approx$1.3~\acs{GB}), closely followed by \texttt{File 2.1} and \texttt{File 2.2} ($\approx$1.2~\acs{GB}).
\section{View on Possible Improvements}
So far, this work went over formats for storing genomes, algorithms that are used to compress those genomes, and tests that compared the efficiency and effectivity of the mentioned algorithms. The test results show that \acs{GeCo} provides a better compression ratio than Samtools but takes more time to run. So in this test run, implementations of arithmetic coding resulted in a better compression ratio than Samtools \acs{BAM} with its mix of Huffman coding and \acs{LZ77}, or Samtools' custom format called \acs{CRAM}. Comparing this against the results in \autocite{survey} supports this statement. That study used sets of \acs{FASTA}/Multi-FASTA files from 71~\acs{MB} to 166~\acs{MB} per file and found that \acs{GeCo} had a varying compression ratio, from 12.34 to 91.68 times smaller than the input reference, and also resulted in long runtimes of up to over 600 minutes \cite{survey}. Since that study focused on another goal and therefore used different test variables and environments, the results cannot be compared directly. But what can be taken from this is that arithmetic coding, at least in \acs{GeCo}, is in need of a runtime improvement.\\
The actual mathematical proof of such an optimization, the planning of the implementation and the development of a proof of concept will be a rewarding but time- and resource-consuming project. In order to widen the foundation for these tasks, the rest of this work will consist of considerations and problem analyses, which should be thought about and dealt with to develop an improvement.
S.V. Petoukhov described his prepublished findings, which are under ongoing research, about the distribution of nucleotides in \autocite{pet21}. There he found that the probabilities of nucleotides in certain chromosomes align. Also, the probability of one nucleotide reveals estimations about the direct neighbours of each occurrence of that nucleotide. This can be illustrated by the following formula, in which the probability of \texttt{C} is known and \texttt{N} is a placeholder for any of the four nucleotides \cite{pet21}:\\
\texttt{\% C $\approx$ $\sum$\%CN $\approx$ $\sum$\%NC $\approx$ $\sum$\%CNN $\approx$ $\sum$\%NCN $\approx$ $\sum$\%NNC $\approx$ $\sum$\%CNNN $\approx$ $\sum$\%NCNN $\approx$ $\sum$\%NNCN $\approx$ $\sum$\%NNNC ...}
With the probability of \texttt{C}, the probabilities for sets (n-plets) containing \texttt{N} as a placeholder for any nucleotide of \texttt{A, C, G or T} and including at least one \texttt{C} might be determinable without counting them \cite{pet21}. So, \texttt{$\sum$\%CN} consists of \texttt{\%CC + \%CA + \%CG + \%CT}. The number of elements in each sum grows with an increasing n in the n-plet; to be precise, $4^{|N|}$ describes the growth of combinations, with $|N|$ being the number of placeholders.
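As a short illustration of this growth, the sums for the first n-plets starting with \texttt{C} expand as follows:\\
\texttt{$\sum$\%CN = \%CA + \%CC + \%CG + \%CT} ($4^{1} = 4$ terms)\\
\texttt{$\sum$\%CNN = \%CAA + \%CAC + ... + \%CTT} ($4^{2} = 16$ terms)\\
\texttt{$\sum$\%CNNN = \%CAAA + \%CAAC + ... + \%CTTT} ($4^{3} = 64$ terms)\\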
\begin{figure}[H]
\centering
\includegraphics[width=15cm]{k6/pet-prob.png}
\caption{Probabilities for \texttt{A, C, G and T} in \texttt{Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly} \cite{pet21, ftp-ncbi}.}
\label{k6:pet-prob}
\end{figure}
The exemplary probabilities Petoukhov displayed are reprinted in \ref{k6:pet-prob}. Notable are the similarities in the distribution of \%A and \%T as well as of \%G and \%C; they align until the third digit after the decimal point. According to him, this regularity is found in the genomes of humans, some animals, plants, bacteria and more \cite{pet21}.\\
% begin optimization
Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good start to close the great gap in compression duration. Combined with a more current tool, it is possible that even greater improvements could be achieved.\\
% simple theoretical approach
What would a theoretical improvement approach look like? As described in \ref{k4:arith}, entropy coding requires determining the probabilities of each symbol in the alphabet. The simplest way to do that would be parsing the whole sequence from start to end and increasing a counter for each nucleotide that got parsed.
With Petoukhov's assumptions in consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine the probabilities. A possible approach for that could lie in determining the probabilities of all nucleotides from a single one by calculation, rather than counting each one.
This approach raises a few questions that need to be answered in order to plan an implementation \cite{pet21}:\\
\begin{itemize}
\item Is there space for improvement in the parsing/counting process?
\item How many probabilities are needed to calculate the others?
\item How can the variation between probabilities be determined?
\end{itemize}
% first bulletpoint
The first question must be answered before considering the others. Since counting one instead of four elements is not a significant difference, the possibility that the change would be too small to be relevant must be taken into consideration. Additionally, since the static code analysis in \ref{k4:geco} revealed no multithreading, an analysis of the improvement gained by splitting the workload onto several threads should be considered before working on an improvement based on Petoukhov's assumptions. This is relevant because some improvements, like the one described above, will lose efficiency if only subsections of a genome are processed. A tool like OpenMP for multithreading C programs would possibly supply the required functionality to develop a proof of concept \cite{cthreading, pet21}.
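To make this more concrete, the following minimal sketch shows how such a parallel counting pass over an in-memory sequence could look with OpenMP. It is only an illustration under the assumption that the sequence is already loaded as a plain character array; it is not taken from \acs{GeCo}'s code base.
\begin{verbatim}
#include <stdio.h>
#include <omp.h>

/* Count the occurrences of one nucleotide (e.g. 'C') in a sequence.
   The loop iterations are split across threads and the partial
   counts are summed by the OpenMP reduction clause.              */
static long count_symbol(const char *seq, long n, char symbol)
{
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++) {
        if (seq[i] == symbol)
            count++;
    }
    return count;
}

int main(void)
{
    const char seq[] = "AACCCTAACCCTAACCCTAACCCT";
    long n = (long)(sizeof(seq) - 1);
    long c = count_symbol(seq, n, 'C');
    printf("|C| = %ld of %ld symbols (%.4f)\n", c, n, (double)c / n);
    return 0;
}
\end{verbatim}
Compiled with an OpenMP-capable compiler (for example \texttt{gcc -fopenmp}), the \texttt{reduction} clause merges the per-thread counts, which corresponds to summing the thread results into global probabilities, as discussed later in this section.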
% second bullet point
The question of how many probabilities must be determined can only be answered by a theoretical proof. It could happen in the form of a mathematical equation which proves that counting all occurrences of one nucleotide type can be used to determine the probabilities of the other nucleotides.
% theoretical improvement with pseudocode
But what could an improvement look like, not considering the possible difficulties multithreading would bring?
To answer this, first a mechanism to measure a possible improvement must be chosen. To compare parts of a program and their complexity, the Big-O notation is used. Considering a single-threaded loop with the purpose of counting every nucleotide in a sequence, the process of counting can be split into several operations, defined by this pseudocode.\\
while (sequence not end)\\
do\\
\-\hspace{0.5cm} next\_nucleotide = read\_next\_nucleotide(sequence)\\
\-\hspace{0.5cm} for (element in alphabet\_probabilities)\\
\-\hspace{0.5cm} do\\
\-\hspace{1cm} if (element equals next\_nucleotide)\\
\-\hspace{1.5cm} element = element + 1\\
\-\hspace{1cm} fi\\
\-\hspace{0.5cm} done\\
done\\
This loop will iterate over a whole sequence, counting each nucleotide. The inner \texttt{for} loop iterates over the alphabet to determine which counter should be increased. Considering the findings described above, the inner loop can be left out, because there is no need to compare the read nucleotide against more than one symbol. The Big-O notation for this code, with any sequence of length n and an alphabet length of four, would be decreased from O($n\cdot 4$) to O($n\cdot 1$), or simply O(n) \cite{big-o}, which is clearly an improvement in complexity and therefore might result in a decreased runtime.\\
The runtime for calculating the other symbols' probabilities must be considered as well and compared against the nested loop to be certain that the overall runtime was improved.\\
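Translated into C, the difference between the two loop variants could look like the following sketch; the array layout and function names are merely illustrative assumptions and not taken from any of the tested tools.
\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Variant 1: count every nucleotide, comparing each read symbol
   against the whole alphabet -- roughly O(n * 4) comparisons.   */
static void count_all(const char *seq, long n, long counts[4])
{
    const char alphabet[4] = {'A', 'C', 'G', 'T'};
    memset(counts, 0, 4 * sizeof(long));
    for (long i = 0; i < n; i++)        /* outer loop over the sequence */
        for (int a = 0; a < 4; a++)     /* inner loop over the alphabet */
            if (seq[i] == alphabet[a])
                counts[a]++;
}

/* Variant 2: count only one nucleotide -- O(n) comparisons.
   The other counts would be derived by calculation afterwards. */
static long count_one(const char *seq, long n, char symbol)
{
    long count = 0;
    for (long i = 0; i < n; i++)
        if (seq[i] == symbol)
            count++;
    return count;
}

int main(void)
{
    const char seq[] = "AACCCTAACCCTAACCCT";
    long n = (long)(sizeof(seq) - 1);
    long counts[4];
    count_all(seq, n, counts);
    printf("A=%ld C=%ld G=%ld T=%ld, |C| alone: %ld\n",
           counts[0], counts[1], counts[2], counts[3],
           count_one(seq, n, 'C'));
    return 0;
}
\end{verbatim}
The first variant performs up to four comparisons per read symbol, the second exactly one, which mirrors the reduction from O($n\cdot 4$) to O($n$) described above.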
Should Petoukhov's rules and the regularity shown in \ref{k6:pet-prob} happen to be universal, three approaches could be used to determine the probability distribution of a genome:
\begin{itemize}
\item No counting of any nucleotide and using a fixed set of probabilities.
\item Counting only one nucleotide and determining the others by calculation.
\item Counting either A and T or G and C and determining the other two by mirroring the results.
\end{itemize}
The calculation mentioned in the second bullet point would look like the following if, for example, the probability for \texttt{C} got determined by parsing the whole sequence:
$\%G \approx \%C$\\
$\%A+\%T \approx 100\%-(\%G+\%C)$\\
$\%A \approx \frac{100\%-(\%G+\%C)}{2}$\\
$\%T \approx \%A$\\
The mirroring mentioned in the last bullet point would consist of the first and last line of the example above.\\
% actual programming approach
Working with probabilities, in fractions as well as in percent, would probably mean rounding values. To increase the accuracy, the actual count of the parsed symbol could be used instead. For this to work, the total amount of symbols has to be determined for $\%A+\%T$ to be calculable. Since counting all symbols while parsing the one needed nucleotide could have an impact on the runtime, the full length could also be calculated: with s being the size of the parsed sequence in bytes and c being the bytes per character, $\frac{s}{c}$ would result in the amount of symbols in the sequence.\\
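A small sketch of how these two steps could be combined is shown below. It assumes one byte per character, ignores \acs{FASTA} headers and line breaks, relies on the symmetries $\%G \approx \%C$ and $\%T \approx \%A$, and uses the probability of \texttt{C} reported for \texttt{File 1.10} further below as an illustrative input; the function names are hypothetical.
\begin{verbatim}
#include <stdio.h>

/* Approximate number of symbols from the file size:
   s bytes overall, c bytes per character (assumption). */
static long total_symbols(long s, long c)
{
    return s / c;
}

/* Derive approximate probabilities for all four nucleotides from the
   exact count of 'C' alone, using %G ~ %C and %T ~ %A (Petoukhov).  */
static void derive_probs(long count_c, long total,
                         double probs[4] /* A, C, G, T */)
{
    double p_c = (double)count_c / total;
    double p_g = p_c;                        /* %G ~ %C                  */
    double p_a = (1.0 - (p_g + p_c)) / 2.0;  /* %A ~ (100% - %G - %C)/2  */
    probs[0] = p_a;
    probs[1] = p_c;
    probs[2] = p_g;
    probs[3] = p_a;                          /* %T ~ %A                  */
}

int main(void)
{
    /* assumption: one byte per symbol, headers and newlines ignored */
    long total = total_symbols(133258320, 1);
    /* illustrative |C|, derived from the probability given in the text */
    long count_c = (long)(0.207406 * total);
    double p[4];
    derive_probs(count_c, total, p);
    printf("A=%.4f C=%.4f G=%.4f T=%.4f\n", p[0], p[1], p[2], p[3]);
    return 0;
}
\end{verbatim}
Only one pass over the sequence is needed; the remaining three probabilities are obtained by the calculation from the example above, at the price of the approximation error the symmetry assumptions introduce.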
%tf? -> They obviously differ in several categories: runtime, since each is parsing more of the sequence, and the grade of heuristic which is taken into account.
% more realistic view on parsing
The described implementations could even work with a multithreaded parsing process. The improvement would not be impacted, since the ratio between O($n\cdot 4$) and O($n$) does not change when n is reduced. Multiple threads, each processing a part of a sequence of length n, would also benefit, because counting four symbols on any fraction of the sequence will always require more comparisons than counting one. The results can either be summed up for global probabilities or be used individually on the associated subsequence a thread worked on. Either way, the presented improvement approach should be applicable to both parsing methods.\\
After that, some problems are left which need to be regarded in the approach of developing an improvement.
% bulletpoint 3
The last bullet point refers to the possibility that Petoukhov's findings will show that the similarities in the probability distribution are universal. Entropy codings work with probabilities, so how would such a similarity affect the coding mechanism? With an equal probability for each nucleotide, entropy coding cannot be treated as a whole. This is due to the fact that Huffman coding makes use of differing probabilities: an equal distribution means every character will be encoded with the same length, which would make the encoding process less useful. Arithmetic coding, on the other hand, is able to result in compression even with equal probabilities.
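As a short worked example of what an equal distribution would mean for entropy coding: with all four nucleotides equally probable, the entropy per symbol is\\
$H = -\sum_{i=1}^{4} \frac{1}{4}\log_{2}\frac{1}{4} = 2$ bits,\\
compared to the 8 bits an uncompressed text character occupies, so neither Huffman nor arithmetic coding can get below an average of 2 bits per nucleotide in this case.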
But even though the whole chromosome might show a certain pattern, its subsequences mostly do not. Take, for example, \texttt{File 1.10}, which contains this subsequence:
\texttt{AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCC}
Without determining probabilities, one can see that the amount of \texttt{C}s outnumbers the \texttt{T}s and \texttt{A}s. Over the whole file of 133258320 symbols (130~\acs{MB}), the probability distribution aligns more. The following values have been rounded down: \texttt{A $\approx$ 0.291723, C $\approx$ 0.207406, G $\approx$ 0.208009, T $\approx$ 0.2928609}. The pattern described by S. Petoukhov is recognizable. But cutting out a subsection of relevant size with an unequal distribution will have an impact on the probabilities of the whole sequence.
If a greater sequence leads to a more equal distribution, this knowledge could be used to help determine distributions on subsequences of a sequence with equally distributed probabilities.\\
There are some rules that apply to any whole chromosome sequence as well as to subsequences, referenced by \texttt{S}. With knowledge about the length \texttt{|S|} and the frequency and position of one symbol, e.g. \texttt{C}, represented as \texttt{|C|}, rules about the enveloping sequence can be derived. The arithmetic operations on symbols $\cdot$ for consecutive repetitions and $+$ for concatenation are used. For x and y as the amount of nucleotides before the first and after the last \texttt{C}, the following applies:
\begin{itemize}
\item $\frac{|S|}{x/y-1}\cdot (|C| -1)$ determines the amount of $(x \cdot N) + C$ and $C + (y \cdot N)$ sequences $\in S$.
\item The longest chain starting with \texttt{C} is $C + N \cdot (|S| - x - 1)$.
\item The longest chain ending with \texttt{C} is $(|S| - y -1) \cdot N + C$.
\item There are $(|C| - 1)$ occurrences of $(x + 1) \cdot N + C$ and an equal amount of $C + N \cdot (y + 1)$.
\end{itemize}
Those statements might seem trivial to some, but they possibly help others to clarify the boundaries of Petoukhov's rules. They also represent the end of the thought process of this work's last section.\\
\mycomment{
Besides multithreading, there are other methods that could impact improvement approaches. Like the ussage of entropy coding in reference free compression, in combination with other compression solutions. This methods use structural properties, like repetitions or palindromes to apply a dictionary coding algorithm like \acs{LZ77} on the sequence. The parts that do not show any sign of forward or backward repetition get compressed using airhtmetic coding \cite{survey}. When this method is used, working with the probabilities of the whole genome is not purposeful. In the example subsequence out of \texttt{File 1.10}, no \texttt{G} is present. Compressing the subsequence with a additional interval of >0.2 for a symbol that would never get encoded, would be a waste of resources.\\
}
\mycomment{
Summarizing relevant points to end this work in a final conclusion and the view in a possible future:
- coding algorithms did not change drastically, in the last deccades
- improvements are achived by additions to existing algorithms and combining multiple algorithms for specific tasks
- tests and comparings shown that arithmetic coding lacks in efficiency
possible future events:
best case
- improvement through exact determination of whole porb distribution
- petoukov is right
=> improvements of probability determination for universal species whole chromosom sequences
=> possible further goal: optimization of prob determination for chromosome sections
bad case
- exact determination of all probabilities are not feasible
-> using A$\approx$T$\approx$0.28 and G=C=0.22 to estimate the probability and gather additional information to aproximate the real distibution
- petoukov was wrong about universality of his rules
-> this still might work for a variety of genomes: all human chromosomes, mice, plants...
}
Before coming to a final conclusion, a quick summary of important points:
\begin{itemize}
\item Coding algorithms did not change drastically in the last decades.
\item Improvements are achieved by additions to existing algorithms and the combination of several algorithms for specific tasks.
\item Tests and comparisons have shown that arithmetic coding lacks efficiency.
\end{itemize}
The goal for this new optimization approach is clearly defined. Also, a possible test environment and measurement techniques that would indicate a success have been tested, in this work as well as in cited works \cite{survey}. Considering how other improvements were implemented in the past shows that an approach like the one described above is feasible \cite{moffat_arith}. This, combined with the last point, leads to the assumption that there is a realistic chance to optimize entropy coding, specifically the arithmetic coding algorithm.\\
This assumption can be consolidated by viewing best- and worst-case scenarios that could result from further research. Two important future events are taken into consideration: one is the theoretical proof of a working optimization approach and the other is whether Petoukhov's findings develop favorably:
In the best case, an optimization through exact determination of the whole probability distribution is possible and Petoukhov's findings prove that his rules are universal for the genomes of living organisms. This would result in faster compression in entropy coding. Depending on the dimension, either a tool that implements entropy coding only or a hybrid tool with improved efficiency in its entropy coding algorithms would define a new \texttt{state of the art}.\\
In a worst-case scenario, the exact determination of probability distributions would not be possible. This would mean more research should be done on approximating probability distributions, and additionally on how the use of $A\approx T \approx 0.2914$ and $G\approx C\approx 0.2086$ could provide efficiency improvements in the reference-free compression of whole chromosomes and general improvements in the compression of a reference genome in reference-based compression solutions \cite{survey}.\\
In this scenario, Petoukhov would also be wrong about the universality of the defined rules. Considering the exemplary calculation of the probability distribution of \texttt{File 1.10}, the concern that his rules do not apply to any genome at all, or that he made a miscalculation, is out of the way. Being wrong about universality would, however, limit the range of the impact an improvement would create. The combination of knowing which genomes follow Petoukhov's rules and a list of tools that specialize in the compression of those would set the new goal for an optimization approach.\\
%From this perspective, how favorable research turns out does not determine if there will be an impact but just how far it will reach.
So, how favorably the research turns out does not determine whether there will be an impact, but only how far it will reach.