Browse Source

worked on lib

u 3 years ago
parent
commit
d0a6e965a0
4 changed files with 34 additions and 235 deletions
  1. latex/tex/&!make (+0 -200)
  2. latex/tex/kapitel/k1_introduction.tex (+1 -1)
  3. latex/tex/literatur.bib (+32 -33)
  4. latex/tex/thesis.tex (+1 -1)

+ 0 - 200
latex/tex/&!make

@@ -1,200 +0,0 @@
-\chapter{Results and Discussion}
-The two tables \ref{t:effectivity} and \ref{t:efficiency} contain the raw measurement values for the two goals described in \ref{k5:goals}. The first table shows how long each compression procedure took, in milliseconds; the second contains the file sizes in bytes. Each row holds the information for one of the files, which follow this naming scheme:
-
-\texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa}
-
-To improve readability, the filenames in all tables were replaced by \texttt{File}. To determine which file was compressed, simply substitute the placeholder x in the naming scheme with the number following \texttt{File}.\\
-
-\section{Interpretation of Results}
-The units milliseconds and bytes preserve a high precision, but they are hard to read and compare by eye. Therefore the data was converted. Sizes in \ref{t:sizepercent} are displayed as percentages of the respective source file. For example, compressing
-
-\texttt{Homo\_sapiens.GRCh38.dna.chromosome.11.fa}
-
-with \acs{GeCo} resulted in a compressed file that was only 17.6\% as large as the original.
-Runtimes in \ref{t:time} were converted into seconds and rounded to two decimal places.
-In addition, a line was added to the bottom of each table, showing the average percentage or runtime for each process.\\
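-
-The conversion itself is simple arithmetic; a minimal sketch (Python; the file paths in the example are purely illustrative and not part of the measurement setup) shows both transformations:
-
-\begin{verbatim}
-import os
-
-def size_percent(compressed_path, original_path):
-    """Compressed size as a percentage of the original file size."""
-    return os.path.getsize(compressed_path) / os.path.getsize(original_path) * 100
-
-def runtime_seconds(milliseconds):
-    """Runtime converted from milliseconds to seconds, rounded to two decimals."""
-    return round(milliseconds / 1000, 2)
-\end{verbatim}
-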
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity]                       % Caption for the list of tables
-        {File sizes in different compression formats in \textbf{percent}} % Caption for the table itself
-        \label{t:sizepercent}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
-    \midrule
-			File 1& 18.32& 24.51& 22.03\\
-			File 2& 20.15& 26.36& 23.7\\
-			File 3& 19.96& 26.14& 23.69\\
-			File 4& 20.1& 26.26& 23.74\\
-			File 5& 17.8& 22.76& 20.27\\
-			File 6& 17.16& 22.31& 20.11\\
-			File 7& 16.21& 21.69& 19.76\\
-			File 8& 17.43& 23.48& 21.66\\
-			File 9& 18.76& 25.16& 23.84\\
-			File 10& 20.0& 25.31& 23.63\\
-			File 11& 17.6& 24.53& 23.91\\
-			File 12& 20.28& 26.56& 23.57\\
-			File 13& 19.96& 25.6& 23.67\\
-			File 14& 16.64& 22.06& 20.44\\
-			File 15& 79.58& 103.72& 92.34\\
-			File 16& 19.47& 25.52& 22.6\\
-			File 17& 19.2& 25.25& 22.57\\
-			File 18& 19.16& 25.04& 22.2\\
-			File 19& 18.32& 24.4& 22.12\\
-			File 20& 18.58& 24.14& 21.56\\
-			File 21& 16.22& 22.17& 19.96\\
-      &&&\\
-			\textbf{Average}& 21.47& 28.24& 25.59\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-On average, Samtools \acs{BAM} achieved a size reduction of 71.76\%; the \acs{CRAM} method improved this by roughly 2.5 percentage points. \acs{GeCo} provided the greatest reduction with 78.53\%. This gap of about 4 percentage points over \acs{CRAM} comes with a comparatively great sacrifice in time.\\
-
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Efficiency]                       % Caption for the list of tables
-        {Compression duration in seconds} % Caption for the table itself
-        \label{t:time}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
-    \midrule
-			File 1 & 23.5& 3.786& 16.926\\
-			File 2 & 24.65& 3.784& 17.043\\
-			File 3 & 2.016& 3.123& 13.999\\
-			File 4 & 19.408& 3.011& 13.445\\
-			File 5 & 18.387& 2.862& 12.802\\
-			File 6 & 17.364& 2.685& 12.015\\
-			File 7 & 15.999& 2.503& 11.198\\
-			File 8 & 14.828& 2.286& 10.244\\
-      File 9 & 12.304& 2.078& 9.21\\
-			File 10 & 13.493& 2.127& 9.461\\
-			File 11 & 13.629& 2.132& 9.508\\
-			File 12 & 13.493& 2.115& 9.456\\
-			File 13 & 99.902& 1.695& 7.533\\
-			File 14 & 92.475& 1.592& 7.011\\
-			File 15 & 85.255& 1.507& 6.598\\
-			File 16 & 82.765& 1.39& 6.089\\
-			File 17 & 82.081& 1.306& 5.791\\
-			File 18 & 79.842& 1.277& 5.603\\
-			File 19 & 58.605& 0.96& 4.106\\
-			File 20 & 64.588& 1.026& 4.507\\
-			File 21 & 41.198& 0.721& 3.096\\
-      &&&\\
-      \textbf{Average}& 42.57& 2.09& 9.32\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-As \ref{t:time} shows, the average compression duration for \acs{GeCo} is 42.57s. That is a little over 33s more than the average runtime of Samtools for compressing into the \acs{CRAM} format; Samtools needs only about 22\% of \acs{GeCo}'s time.\\
-Since \acs{CRAM} requires a file in \acs{BAM} format, the \acs{CRAM} column is calculated by adding the time needed to compress into \acs{BAM} to the time needed to convert the result into \acs{CRAM}. 
-While the \acs{SAM} format is required for compressing a \acs{FASTA} file into \acs{BAM} and further into \acs{CRAM}, it does not apply any compression itself. Nevertheless, the conversion from \acs{FASTA} to \acs{SAM} can result in a decrease in size. At first this might seem counterintuitive since, as described in \ref{k2:sam}, \acs{SAM} stores more information than \acs{FASTA}. It can be explained by comparing how the two formats store the sequence: a \acs{FASTA} sequence is usually wrapped over many short lines, whereas a \acs{SAM} record stores the sequence in a single line, so the conversion can remove enough line breaks to make the resulting \acs{SAM} file smaller than the original \acs{FASTA} file.
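-
-The effect of the line wrapping alone can be estimated with a short sketch (Python; the numbers are purely illustrative and do not refer to the measured files):
-
-\begin{verbatim}
-def wrapping_overhead(sequence_length, line_width=60):
-    """Extra newline bytes a FASTA file carries when its sequence
-    is wrapped at line_width characters per line."""
-    return sequence_length // line_width
-
-# A sequence of 100 million bases wrapped at 60 characters per line
-# carries about 1.7 million newline bytes that a single-line record avoids.
-print(wrapping_overhead(100_000_000))  # -> 1666666
-\end{verbatim}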
-% (hi)storytime
-Before interpreting this data further, a quick look at the development history: development of \acs{GeCo} stopped in 2016, while Samtools has been under active development since 2015, to this day, with over 70 people contributing.\\
-% todo interpret bit files and compare
-
-% big tables
-Reviewing \ref{t:recal-time}, one will notice that \acs{GeCo} exceeded a runtime of 60 seconds on every run. Instead of displaying the runtime solely in seconds, a leading number followed by an \texttt{m} indicates how many minutes each run took.
-
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Effectivity for larger files]                       % Caption for the list of tables
-        {File sizes in different compression formats in \textbf{percent}} % Caption for the table itself
-        \label{t:recal-size}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} \%} & \textbf{Samtools \acs{BAM}\%}& \textbf{Samtools \acs{CRAM} \%} \\
-    \midrule
-			%geco bam and cram in percent
-			File 1& 1.00& 6.28& 5.38\\
-			File 2& 0.98& 6.41& 5.52\\
-			File 3& 1.21& 8.09& 7.17\\
-			File 4& 1.20& 7.70& 6.85\\
-			File 5& 1.08& 7.58& 6.72\\
-			File 6& 1.09& 7.85& 6.93\\
-			File 7& 0.96& 5.83& 4.63\\
-      &&&\\
-			\textbf{Average}& 1.07& 7.11& 6.17\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-\sffamily
-\begin{footnotesize}
-  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
-    \caption[Compression Efficiency for larger files]                       % Caption for the list of tables
-        {Compression duration in seconds} % Caption for the table itself
-        \label{t:recal-time}\\
-    \toprule
-     \textbf{ID.} & \textbf{\acs{GeCo} } & \textbf{Samtools \acs{BAM}}& \textbf{Samtools \acs{CRAM} } \\
-    \midrule
-			%compress time for geco, bam and cram in seconds
-			File 1 & 1m58.427& 16.248& 23.016\\
-			File 2 & 1m57.905& 15.770& 22.892\\
-			File 3 & 1m09.725& 07.732& 12.858\\
-			File 4 & 1m13.694& 08.291& 13.649\\
-			File 5 & 1m51.001& 14.754& 23.713\\
-			File 6 & 1m51.315& 15.142& 24.358\\
-			File 7 & 2m02.065& 16.379& 23.484\\
-      &&&\\
-			\textbf{Average} & 1m43.447& 13.474& 20.567\\
-    \bottomrule
-  \end{longtable}
-\end{footnotesize}
-\rmfamily
-
-In both tables \ref{t:recal-time} and \ref{t:recal-size} the already identified pattern can be observed. Looking at the compression ratios in \ref{t:recal-size}, a maximum size reduction of 99.04\% was reached with \acs{GeCo}. In this set of test files, file seven was the largest ($\sim$1.3 Gigabyte), closely followed by files one and two ($\sim$1.2 Gigabyte).
-% todo greater filesize means better compression
-
-\section{View on Possible Improvements}
-S. Petukhov described new findings about the distribution of nucleotides. In a sequence of sufficient length, the probability of one nucleotide reveals information about its direct neighbours. For example, from the probability of \texttt{C}, the probabilities of sets (n-plets) of any nucleotide \texttt{N} that include \texttt{C} can be determined:
-\begin{align*}
-\%C &\approx \sum\%CN \approx \sum\%NC \approx \sum\%CNN \approx \sum\%NCN \approx \sum\%NNC\\
-    &\approx \sum\%CNNN \approx \sum\%NCNN \approx \sum\%NNCN \approx \sum\%NNNC
-\end{align*}
-
-% begin optimization 
-Considering this and the measured results, an improvement in the arithmetic coding process, and therefore in \acs{GeCo}'s efficiency, would be a good starting point to close the large gap in compression duration. Combined with a tool that is developed to today's standards, even greater improvements could possibly be achieved.\\
-% simple theoretical approach
-What would a theoretical improvement look like? As described in \ref{k4:arith}, entropy coding requires determining the probability of each symbol in the alphabet. The simplest way to do this is to parse the whole sequence from start to end and increase a counter for each nucleotide that is read. 
-Taking the findings of S. Petukhov into consideration, the goal would be to create an entropy coding implementation that beats current implementations in the time needed to determine these probabilities. A possible approach is to use the probability of one nucleotide to determine the probabilities of the other nucleotides by calculation, rather than by counting each of them.
-This approach raises a few questions that need to be answered in order to plan an implementation:  
-\begin{itemize}
-	\item How many probabilities are needed to calculate the others?
-	\item Is there space for improvement in the parsing/counting process?
-	%\item Is there space for visible improvements, when only counting one nucleotide?
-	\item How can the variation between probabilities be determined?
-\end{itemize}
-
-The second point must be asked because the improvement gained by counting only one nucleotide, compared to counting three, might be too small to be relevant.
-%todo compare time needed: to store a variable <-> parsing the sequence
-To compare parts of a program and their complexity, Big-O notation is used. Unfortunately, it only covers loops and conditions as a whole, so a more detailed view of the individual operations is needed: 
-Considering a single-threaded loop whose purpose is to count every nucleotide in a sequence, the counting process can be split into several operations, as defined by the following pseudocode.
-
-%todo use GeCo arith function with bigO
-while (sequence not end):\\
-	\-\hspace{1cm} next\_nucleotide = read\_next\_nucleotide(sequence)\\
-	\-\hspace{1cm} for (element in alphabet\_probabilities):\\
-	\-\hspace{2cm} if (element equals next\_nucleotide):\\
-	\-\hspace{3cm} count(element) = count(element) + 1\\
-	\-\hspace{2cm} fi\\
-	\-\hspace{1cm} rof\\
-elihw\\
-
-This loop iterates over the whole sequence, counting each nucleotide. In line three, an inner loop can be found which iterates over the alphabet to determine which counter should be increased. Considering the findings described above, the inner loop can be left out, because the read nucleotide no longer has to be compared against more than one symbol. For a sequence of length n and an alphabet of size m, the complexity of this code decreases from O($n\cdot m$) to O($n\cdot 1$), or simply O($n$) \cite{big-o}, which is clearly an improvement in complexity and therefore also in runtime.\\
-The runtime for calculating the probabilities of the other symbols must be considered as well and compared against the nested loop, to be certain that the overall runtime was actually improved.
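-
-A minimal sketch of such a counting step without the inner alphabet loop is given below (Python; how the remaining probabilities would then be derived from this single count is deliberately left open, since it depends on the answers to the questions above):
-
-\begin{verbatim}
-def count_single_symbol(sequence, symbol="C"):
-    """One pass over the sequence, touching only a single counter."""
-    count = 0
-    for nucleotide in sequence:   # no comparison against the whole alphabet
-        if nucleotide == symbol:
-            count += 1
-    return count
-
-sequence = "ACGTCCGA"
-probability_c = count_single_symbol(sequence) / len(sequence)  # -> 0.375
-\end{verbatim}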
-% more realistic view on parsing todo need cites
-In practice, smarter ways of determining probabilities are of course used, for example splitting the sequence into multiple parts and parsing each subsequence asynchronously. The partial results can either be summed up into global probabilities or be used individually for each associated subsequence. Either way, the presented improvement approach should be applicable to both parsing methods.\\
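-
-One possible realisation of such chunked counting is sketched below (Python, using the standard \texttt{concurrent.futures} module; this is only an illustration and not how \acs{GeCo} or Samtools actually parse their input):
-
-\begin{verbatim}
-from collections import Counter
-from concurrent.futures import ProcessPoolExecutor
-
-def count_chunk(chunk):
-    """Local nucleotide counts for one subsequence."""
-    return Counter(chunk)
-
-def global_counts(sequence, parts=4):
-    """Split the sequence, count each part in a separate process,
-    then sum the partial results into global counts."""
-    step = -(-len(sequence) // parts)   # ceiling division
-    chunks = [sequence[i:i + step] for i in range(0, len(sequence), step)]
-    with ProcessPoolExecutor() as pool:
-        partial = pool.map(count_chunk, chunks)
-    return sum(partial, Counter())
-\end{verbatim}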
-
-
-% how is data interpreted
-% why did the tools result in this, what can we learn
-% improvements
-% - goal: less time to compress
-% 	- approach: optimize probability determination
-% 	-> how?

+ 1 - 1
latex/tex/kapitel/k1_introduction.tex

@@ -1,7 +1,7 @@
 \chapter{Introduction}
 % general information and intro
 %Understanding how things in our cosmos work, was and still is a pleasure, that the human being always wants to fulfill. 
-Understanding the biological code of living things is an ever-developing task which is important for multiple aspects of our lives. The results of research in this area provide knowledge that helps development in the medical sector, agriculture and more \cite{ju_21, wang-22, mo_83}.
+Understanding the biological code of living things is an ever-developing task which is important for multiple aspects of our lives. The results of research in this area provide knowledge that helps development in the medical sector, agriculture and more \cite{ju_21, wang_22, mo_83}.
 Getting insights into this biological code is possible through storing and studying the information embedded in genomes \cite{dna_structure}. Since life is complex, there is a lot of information, which requires a lot of memory \cite{alok17, survey}.\\
 % ...Communication with other researchers means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to errors.\\
 % compression values and goals

+ 32 - 33
latex/tex/literatur.bib

@@ -1,3 +1,35 @@
+@Article{ju_21,
+  author       = {Philomin Juliana and Ravi Prakash Singh and Jesse Poland and Sandesh Shrestha and Julio Huerta-Espino and Velu Govindan and Suchismita Mondal and Leonardo Abdiel Crespo-Herrera and Uttam Kumar and Arun Kumar Joshi and Thomas Payne and Pradeep Kumar Bhati and Vipin Tomar and Franjel Consolacion and Jaime Amador Campos Serna},
+  date         = {2021-03},
+  journaltitle = {Scientific Reports},
+  title        = {Elucidating the genetics of grain yield and stress-resilience in bread wheat using a large-scale genome-wide association mapping study with 55,568 lines},
+  doi          = {10.1038/s41598-021-84308-4},
+  volume       = {11},
+  publisher    = {Springer Science and Business Media {LLC}},
+}
+
+@Article{mo_83,
+  author       = {Arno G. Motulsky},
+  date         = {1983-01},
+  journaltitle = {Science},
+  title        = {Impact of Genetic Manipulation on Society and Medicine},
+  doi          = {10.1126/science.6336852},
+  pages        = {135--140},
+  volume       = {219},
+  publisher    = {American Association for the Advancement of Science ({AAAS})},
+}
+
+@Article{wang_22,
+  author       = {Si-Wei Wang and Chao Gao and Yi-Min Zheng and Li Yi and Jia-Cheng Lu and Xiao-Yong Huang and Jia-Bin Cai and Peng-Fei Zhang and Yue-Hong Cui and Ai-Wu Ke},
+  date         = {2022-02},
+  journaltitle = {Molecular Cancer},
+  title        = {Current applications and future perspective of {CRISPR}/Cas9 gene editing in cancer},
+  doi          = {10.1186/s12943-022-01518-8},
+  number       = {1},
+  volume       = {21},
+  publisher    = {Springer Science and Business Media {LLC}},
+}
+
 @Article{alok17,
   author       = {Anas Al-Okaily and Badar Almarri and Sultan Al Yami and Chun-Hsi Huang},
   date         = {2017-04-01},
@@ -412,39 +444,6 @@
   year    = {1977},
 }
 
-@Article{wang_22,
-  author       = {Si-Wei Wang and Chao Gao and Yi-Min Zheng and Li Yi and Jia-Cheng Lu and Xiao-Yong Huang and Jia-Bin Cai and Peng-Fei Zhang and Yue-Hong Cui and Ai-Wu Ke},
-  date         = {2022-02},
-  journaltitle = {Molecular Cancer},
-  title        = {Current applications and future perspective of {CRISPR}/Cas9 gene editing in cancer},
-  doi          = {10.1186/s12943-022-01518-8},
-  number       = {1},
-  volume       = {21},
-  publisher    = {Springer Science and Business Media {LLC}},
-}
-
-@Article{ju_21,
-  author       = {Philomin Juliana and Ravi Prakash Singh and Jesse Poland and Sandesh Shrestha and Julio Huerta-Espino and Velu Govindan and Suchismita Mondal and Leonardo Abdiel Crespo-Herrera and Uttam Kumar and Arun Kumar Joshi and Thomas Payne and Pradeep Kumar Bhati and Vipin Tomar and Franjel Consolacion and Jaime Amador Campos Serna},
-  date         = {2021-03},
-  journaltitle = {Scientific Reports},
-  title        = {Elucidating the genetics of grain yield and stress-resilience in bread wheat using a large-scale genome-wide association mapping study with 55,568 lines},
-  doi          = {10.1038/s41598-021-84308-4},
-  number       = {1},
-  volume       = {11},
-  publisher    = {Springer Science and Business Media {LLC}},
-}
-
-@Article{mo_83,
-  author       = {Arno G. Motulsky},
-  date         = {1983-01},
-  journaltitle = {Science},
-  title        = {Impact of Genetic Manipulation on Society and Medicine},
-  doi          = {10.1126/science.6336852},
-  number       = {4581},
-  pages        = {135--140},
-  volume       = {219},
-  publisher    = {American Association for the Advancement of Science ({AAAS})},
-}
 
 @Online{bam,
   title   = {Sequence Alignment/Map Format Specification},

+ 1 - 1
latex/tex/thesis.tex

@@ -182,7 +182,7 @@
 \cleardoublepage
 \begin{flushleft}
 \let\clearpage\relax % Fix for empty pages (issue #25)
-\printbibliography
+\printbibliography[nottype=online]
 \printbibliography[type=online, sorting=nud, title=Online Sources]
 \end{flushleft}
 \endgroup