
fixed tables. order of files in size table didn't match id. added some sources. started improving k4 algos

u 3 years ago
parent
commit
0adaeabc89
5 changed files with 160 additions and 79 deletions
  1. latex/tex/kapitel/a6_results.tex (+73 −0)
  2. latex/tex/kapitel/k4_algorithms.tex (+25 −13)
  3. latex/tex/kapitel/k6_results.tex (+37 −65)
  4. latex/tex/literatur.bib (+24 −0)
  5. latex/tex/thesis.tex (+1 −1)

+ 73 - 0
latex/tex/kapitel/a6_results.tex

@@ -0,0 +1,73 @@
+\chapter{First Appendix: Long Tables}
+
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Efficiency]                     % caption for the list of tables
+        {Compression duration measured in milliseconds}  % caption for the table itself
+        \label{t:efficiency}\\
+    \toprule
+     \textbf{ID.} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{BAM}} & \textbf{Samtools \acs{CRAM}} \\
+    \midrule
+     File 1 & 235005& 3786& 16926\\
+     File 2 & 246503& 3784& 17043\\
+     File 3 & 20169& 3123& 13999\\
+     File 4 & 194081& 3011& 13445\\
+     File 5 & 183878& 2862& 12802\\
+     File 6 & 173646& 2685& 12015\\
+     File 7 & 159999& 2503& 11198\\
+     File 8 & 148288& 2286& 10244\\
+     File 9 & 12304& 2078& 9210\\
+     File 10 & 134937& 2127& 9461\\
+     File 11 & 136299& 2132& 9508\\
+     File 12 & 134932& 2115& 9456\\
+     File 13 & 999022& 1695& 7533\\
+     File 14 & 924753& 1592& 7011\\
+     File 15 & 852555& 1507& 6598\\
+     File 16 & 827651& 1390& 6089\\
+     File 17 & 820814& 1306& 5791\\
+     File 18 & 798429& 1277& 5603\\
+     File 19 & 586058& 960& 4106\\
+     File 20 & 645884& 1026& 4507\\
+     File 21 & 411984& 721& 3096\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
+
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectiveness]                  % caption for the list of tables
+        {File sizes in bytes for different compression formats}  % caption for the table itself
+        \label{t:effectivity}\\
+    \toprule
+     \textbf{ID.} & \textbf{Source File} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{CRAM}} \\
+    \midrule
+     File 1& 253105752& 46364770& 55769827\\
+     File 2& 136027438& 27411806& 32238052\\
+     File 3& 137338124& 27408185& 32529673\\
+     File 4& 135496623& 27231126& 32166751\\
+     File 5& 116270459& 20696778& 23568321\\
+     File 6& 108827838& 18676723& 21887811\\
+     File 7& 103691101& 16804782& 20493276\\
+     File 8& 91844042& 16005173& 19895937\\
+     File 9& 84645123& 15877526& 20177456\\
+     File 10& 81712897& 16344067& 19310998\\
+     File 11& 59594634& 10488207& 14251243\\
+     File 12& 246230144& 49938168& 58026123\\
+     File 13& 65518294& 13074402& 15510100\\
+     File 14& 47488540& 7900773& 9708258\\
+     File 15& 51665500& 41117340& 47707954\\
+     File 16& 201600541& 39248276& 45564837\\
+     File 17& 193384854& 37133480& 43655371\\
+     File 18& 184563953& 35355184& 40980906\\
+     File 19& 173652802& 31813760& 38417108\\
+     File 20& 162001796& 30104816& 34926945\\
+     File 21& 147557670& 23932541& 29459829\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily

+ 25 - 13
latex/tex/kapitel/k4_algorithms.tex

@@ -25,17 +25,27 @@
 
 
 \section{Compression approaches}
-The process of compressing data serves the goal to generate an output that is smaller than its input data. In many cases, like in gene compressing, the compression is idealy lossless. This means it is possible for every compressed data, to receive the full information that were available in the origin data, by decompressing it. Lossy compression on the other hand, might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typicaly not necessary to transmit the origin information. This works with certain audio and pictures files or network protocols that are used to transmit video/audio streams live.
-For \acs{DNA} a lossless compression is needed. To be preceice a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two mayor approaches are known: the dictionary coding and the entropy coding. Both are described in detail below.\\
+The process of compressing data serves the goal of generating an output that is smaller than its input data.\\
+In many cases, as in genome compression, the compression is ideally lossless. This means that for every piece of compressed data, the whole information that was available in the original data can be recovered by decompressing it.\\
+Before going on, the difference between data and information should be emphasized.\\
+% excurs data vs information
+Data contains information. In digital data, clear physical limitations delimit what and how much of something can be stored. A bit can only store 0 or 1, eleven bits can store up to $2^{11}$ combinations of bits, and a 1\acs{GB} drive can store no more than 1\acs{GB} of data. Information, on the other hand, is limited by the way it is stored. In some cases the knowledge received at an earlier point in time must be considered too, but this can be neglected for the reasons described in subsection \ref{k4:dict}.\\
+% excurs information vs data
+The boundaries of information, when it comes to storage capabilities, can be illustrated with the example mentioned above. A drive with a capacity of 1\acs{GB} could contain a book in the form of images, where the content of each page is stored in a single image. Another, more resourceful way would be to store just the text of every page in \acs{UTF-16}. The information the text would provide to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even more, without losing any information.\\
+% excurs end
+In contrast to lossless compression, lossy compression might exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to preserve the original information. This works with certain audio and picture formats, and in network protocols \cite{cnet13}.
+For \acs{DNA} a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two major approaches are known: dictionary coding and entropy coding. Both are described in detail below \cite{cc14}.\\
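
To make the data-versus-information bookkeeping above concrete, here is a minimal Python sketch (an illustration with an arbitrary sample text, not part of any tool discussed): the same information occupies very different amounts of data depending on encoding and compression.

    import zlib

    # The same information stored three ways; only the data size differs.
    text = "the quick brown fox jumps over the lazy dog " * 100

    utf16 = text.encode("utf-16-le")   # two bytes per character here
    ascii_ = text.encode("ascii")      # one byte per character
    packed = zlib.compress(ascii_)     # lossless DEFLATE compression

    print(len(utf16), len(ascii_), len(packed))
    # prints 9000, 4500, and a much smaller third number;
    # decompressing `packed` restores the text exactly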
 
 \subsection{Dictionary coding}
-Dictionary coding, as the name suggest, uses a dictionary to eliminate redundand occurences of strings. Strings are a chain of characters representing a full word or just a part of it. This is explained shortly for a better understanding of dictionary coding but is of no great relevance to the focus of this work:
+\label{k4:dict}
+Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of it. For a better understanding, this is illustrated by a short example:
 % exkurs
 Looking at the string 'stationary' it might be smart to store 'station' and 'ary' as separate dictionary entries. Which way is more efficient depends on the text that should get compressed.
 % end exkurs
 The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is made out of the input data itself. Each first occurrence is left uncompressed, and every occurrence of a string after the first one points to its first occurrence. Since this 'pointer' needs less space than the string it points to, a decrease in size is achieved.\\
 
-Unfortunally, known implementations like the ones out of LZ Family, do not use probabilities to compress and are therefore out of scope for this work. Since finding repetations and their location might also be improved, this chapter will remain.
+Unfortunately, known implementations like the ones of the LZ family do not use probabilities to compress and are therefore out of scope for this work. Since finding repeating sections and their locations might also be improved, this chapter remains.
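
The pointer idea can be sketched in a few lines of Python (a toy illustration with word-level entries, not the scheme of any concrete tool):

    # Toy dictionary coder: the first occurrence of a word stays literal,
    # every later occurrence becomes a pointer to that first position.
    def dict_encode(words):
        first_seen = {}                 # word -> index of first occurrence
        out = []
        for i, w in enumerate(words):
            if w in first_seen:
                out.append(("ref", first_seen[w]))
            else:
                first_seen[w] = i
                out.append(("lit", w))
        return out

    def dict_decode(tokens):
        words = []
        for kind, v in tokens:
            words.append(words[v] if kind == "ref" else v)
        return words

    encoded = dict_encode("to be or not to be".split())
    assert dict_decode(encoded) == "to be or not to be".split()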
+
 
 % unuseable due to the lack of probability
 \mycomment{
@@ -46,14 +56,19 @@ The computer scientist Abraham Lempel and the electrical engineere Jacob Ziv cre
 % example 
 }
 
+ % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
+\subsection{\ac{LZ77}}
 \ac{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information about where to find its first occurrence and how long it is. Typically this is stored in two bytes, whereby more than one byte can be used to point to the first occurrence, because usually less than one byte is required to store the length.
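
A naive Python sketch of this idea (illustrative only; real \ac{LZ77} implementations bound the search window and pack the tokens into bytes as described above):

    # Naive LZ77-style coder: every repetition is replaced by a token
    # (offset, length, next_char) pointing back at an earlier occurrence.
    def lz77_encode(data):
        tokens, i = [], 0
        while i < len(data):
            best_off = best_len = 0
            for j in range(i):                     # scan already-seen text
                k = 0
                while i + k < len(data) and data[j + k] == data[i + k]:
                    k += 1
                    if j + k == i:                 # keep matches non-overlapping
                        break
                if k > best_len:
                    best_off, best_len = i - j, k
            nxt = data[i + best_len] if i + best_len < len(data) else ""
            tokens.append((best_off, best_len, nxt))
            i += best_len + 1
        return tokens

    def lz77_decode(tokens):
        out = ""
        for off, length, nxt in tokens:
            start = len(out) - off
            out += out[start:start + length] + nxt
        return out

    assert lz77_decode(lz77_encode("abcabcabcd")) == "abcabcabcd"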
+
+
 \subsection{Shannons Entropy}
-The founder of information theory Claude Elwood Shannon described entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition. 
+The founder of information theory, Claude Elwood Shannon, described entropy and published it in 1948 \cite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal, and his findings are not only useful for forms of information transmission.
 
 % todo insert Fig. 1 shannon_1948
 
 Altering this figure shows how it can be used for other technologies like compression.\\
 The information source and destination are left unchanged; one has to keep in mind that both may be represented by the same physical actor.
-transmitter and receiver are changed to compression/encoding and decompression/decoding and inbetween ther is no signal but any period of time \autocite{Shannon_1948}.\\
+Transmitter and receiver are changed to compression/encoding and decompression/decoding, and in between there is no signal but an arbitrary period of time \cite{Shannon_1948}.\\
 
 Shannon's entropy provides a formula to determine the 'uncertainty of a probability distribution' over a finite probability space.
 
@@ -65,7 +80,7 @@ Shannons Entropy provides a formular to determine the 'uncertainty of a probabil
   \label{k4:entropy}
 \end{figure}
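
Since the definition is only included as an image, the standard form of the entropy formula (presumably what the figure shows, written in the document's prob(x) notation) is:

    % Presumed content of the entropy figure, reconstructed in LaTeX:
    \begin{equation*}
      H(X) = \sum_{x \in X} prob(x) \cdot \log_2 \frac{1}{prob(x)}
           = -\sum_{x \in X} prob(x) \cdot \log_2 prob(x)
    \end{equation*}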
 
-He defined entropy as shown in figure \ref{k4:entropy}. Let X be a finite probability space. Then x in X are possible final states of an probability experimen over X. Every state that actually occours, while executing the experiment generates infromation which is meassured in \textit{Bits} with the part of the formular displayed in \ref{k4:info-in-bits}\autocite{delfs_knebl,Shannon_1948}:
+He defined entropy as shown in figure \ref{k4:entropy}. Let X be a finite probability space. Then the x in X are the possible final states of a probability experiment over X. Every state that actually occurs while executing the experiment generates information, which is measured in \textit{bits} with the part of the formula displayed in \ref{f4:info-in-bits} \cite{delfs_knebl,Shannon_1948}:
 
 %\begin{math}
 %  \log_2\left(\frac{1}{prob(x)}\right) \equiv -\log_2(prob(x)).
 %\end{math}
@@ -74,7 +89,7 @@ He defined entropy as shown in figure \ref{k4:entropy}. Let X be a finite probab
   \centering
   \includegraphics[width=8cm]{k4/information_bits.png}
   \caption{The amount of information measured in bits, in case x is the end state of a probability experiment.}
-  \label{k4:info-in-bits}
+  \label{f4:info-in-bits}
 \end{figure}
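
To connect the two formulas, a short Python sketch (with hypothetical, skewed probabilities over the \acs{DNA} alphabet) computes both the per-outcome information and the entropy:

    from math import log2

    # Hypothetical probabilities for the four-letter DNA alphabet
    prob = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}

    # Information of a single outcome x: log2(1/prob(x)) bits
    for x, p in prob.items():
        print(x, log2(1 / p), "bits")   # A: 1.0, C: 2.0, G: 3.0, T: 3.0

    # Entropy: the expected information over all outcomes
    H = sum(p * log2(1 / p) for p in prob.values())
    print(H, "bits per symbol")         # 1.75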
 
 %todo explain 2.2 second bulletpoint of delfs_knebl. Maybe read gumbl book
@@ -92,10 +107,6 @@ This means the intervall start of the character is noted, its intervall is split
 To encode in binary, the binary floating-point representation of a number inside the interval for the last character is calculated, using a similar process to the one described above, called subdividing.
 % its finite subdividing because processors bottleneck floatingpoints 
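
The subdividing step can be sketched as follows (a float-based illustration with hypothetical symbol intervals; production coders use integer renormalisation for exactly the reason in the comment above):

    # Illustrative interval subdividing for arithmetic coding (floats only).
    # Cumulative intervals for a hypothetical distribution over A, C, G, T.
    ranges = {"A": (0.0, 0.5), "C": (0.5, 0.75),
              "G": (0.75, 0.875), "T": (0.875, 1.0)}

    def subdivide(message):
        low, high = 0.0, 1.0
        for sym in message:
            s_low, s_high = ranges[sym]
            width = high - low
            low, high = low + width * s_low, low + width * s_high
        return low, high   # any number in [low, high) identifies the message

    print(subdivide("ACG"))   # a narrow interval encoding the whole string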
 
- % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
-\subsection{\ac{LZ77}}
- \ac{LZ77} basically works, by removing all repetition of a string or substring and replacing them with information where to find the first occurence and how long it is. Typically it is stored in two bytes, whereby more than one one byte can be used to point to the first occurence because usually less than one byte is required to store the length.
-
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
 The well-known Huffman coding is used in several tools for genome compression. This subsection should give the reader a general impression of how this algorithm works, without going into detail. To use Huffman coding one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree. With that, a few simple rules apply to the structure:
@@ -109,7 +120,8 @@ The well known Huffman coding, is used in several Tools for genome compression.
   \item the less weight a node has, the higher the probability that this node is read next in the symbol sequence
 \end{itemize}
 The process of compressing starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to a tree where the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, which therefore occurs the most in the input data.
-Following one path results in the binary representation for one symbol. For an alphabet like the one described above, the binary representation encoded in ASCI is shown here \texttt{A -> 01000001, C -> 01000011, G -> 01010100, T -> 00001010}. An imaginary sequence, that has this distribution of characters \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weighting would be calculated for each character by dividing one by the characters occurence. With a corresponding tree, created from with the weights, the binary data for each symbol would change to this \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree msut be saved for the decompression process.
+Following one path results in the binary representation of one symbol. For an alphabet like the one described above, the binary representation encoded in ASCII would be \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weight would be calculated for each character by dividing one by the character's occurrence count. With a corresponding tree created from these weights, the binary code for each symbol would change to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.\\
+% todo shannon fano mention. SF might be older than huffman and inspired it?
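
The construction just described can be sketched in Python, using the example counts above directly as weights (merging the two rarest symbols first); it reproduces the codes given in the text and is an illustration, not the implementation of any of the tools discussed:

    import heapq
    from itertools import count

    # Bottom-up Huffman construction for the counts from the text above.
    counts = {"A": 10, "C": 8, "G": 4, "T": 2}
    tiebreak = count()   # keeps the heap from comparing tree tuples
    heap = [(n, next(tiebreak), sym) for sym, n in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # the two lowest-count nodes...
        n2, _, right = heapq.heappop(heap)   # ...are merged first
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))

    def codes(node, prefix=""):
        if isinstance(node, str):            # leaf: a symbol gets its path
            return {node: prefix or "0"}
        left, right = node
        return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

    print(codes(heap[0][2]))
    # A -> 0, C -> 11, T -> 100, G -> 101, as in the paragraph above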
 
 \section{DEFLATE}
 % mix of huffman and lz77

+ 37 - 65
latex/tex/kapitel/k6_results.tex

@@ -1,72 +1,44 @@
 \chapter{Results and Discussion}
 
-
-\begin{tabular}{ |p{2cm}||p{3cm}|p{3,5cm}|p{3,5cm}|  }
- \hline
- \multicolumn{4}{|c|}{Compression time in milliseconds} \\
- \hline
-   & \acs{GeCo}& Samtools \acs{BAM}& Samtools \acs{CRAM}\\
- \hline
-   File 1 & 235005& 3786& 16926\\
-   File 2 & 246503& 3784& 17043\\
-   File 3 & 20169& 3123& 13999\\
-   File 4 & 194081& 3011& 13445\\
-   File 5 & 183878& 2862& 12802\\
-   File 6 & 173646& 2685& 12015\\
-   File 7 & 159999& 2503& 11198\\
-   File 8 & 148288& 2286& 10244\\
-   File 9 & 12304& 2078& 9210\\
-   File 10 & 134937& 2127& 9461\\
-   File 11 & 136299& 2132& 9508\\
-   File 12 & 134932& 2115& 9456\\
-   File 13 & 999022& 1695& 7533\\
-   File 14 & 924753& 1592& 7011\\
-   File 15 & 852555& 1507& 6598\\
-   File 16 & 827651& 1390& 6089\\
-   File 17 & 820814& 1306& 5791\\
-   File 18 & 798429& 1277& 5603\\
-   File 19 & 586058& 960& 4106\\
-   File 20 & 645884& 1026& 4507\\
-   File 21 & 411984& 721& 3096\\
- \hline
-\end{tabular}
-
-
-\begin{tabular}{ |p{3cm}||p{3cm}|p{3cm}|p{3cm}|  }
- \hline
- \multicolumn{4}{|c|}{File sizes in bytes} \\
- \hline
-   & Source file& \acs{GeCo}& Samtools \acs{CRAM}\\
- \hline
-  File 1& 253105752& 46364770& 55769827\\
-  File 2& 136027438& 27411806& 32238052\\
-  File 3& 137338124& 27408185& 32529673\\
-  File 4& 135496623& 27231126& 32166751\\
-  File 5& 116270459& 20696778& 23568321\\
-  File 6& 108827838& 18676723& 21887811\\
-  File 7& 103691101& 16804782& 20493276\\
-  File 8& 91844042& 16005173& 19895937\\
-  File 9& 84645123& 15877526& 20177456\\
-  File 10& 81712897& 16344067& 19310998\\
-  File 11& 59594634& 10488207& 14251243\\
-  File 12& 246230144& 49938168& 58026123\\
-  File 13& 65518294& 13074402& 15510100\\
-  File 14& 47488540& 7900773& 9708258\\
-  File 15& 51665500& 41117340& 47707954\\
-  File 16& 201600541& 39248276& 45564837\\
-  File 17& 193384854& 37133480& 43655371\\
-  File 18& 184563953& 35355184& 40980906\\
-  File 19& 173652802& 31813760& 38417108\\
-  File 20& 162001796& 30104816& 34926945\\
-  File 21& 147557670& 23932541& 29459829\\
- \hline
-\end{tabular}
-
+\sffamily
+\begin{footnotesize}
+  \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
+    \caption[Compression Effectiveness]                  % caption for the list of tables
+        {File sizes in bytes for different compression formats}  % caption for the table itself
+        \\
+    \toprule
+     \textbf{ID.} & \textbf{Source File} & \textbf{\acs{GeCo}} & \textbf{Samtools \acs{CRAM}} \\
+    \midrule
+     File 1& 253105752& 46364770& 55769827\\
+     File 2& 136027438& 27411806& 32238052\\
+     File 3& 137338124& 27408185& 32529673\\
+     File 4& 135496623& 27231126& 32166751\\
+     File 5& 116270459& 20696778& 23568321\\
+     File 6& 108827838& 18676723& 21887811\\
+     File 7& 103691101& 16804782& 20493276\\
+     File 8& 91844042& 16005173& 19895937\\
+     File 9& 84645123& 15877526& 20177456\\
+     File 10& 81712897& 16344067& 19310998\\
+     File 11& 59594634& 10488207& 14251243\\
+     File 12& 246230144& 49938168& 58026123\\
+     File 13& 65518294& 13074402& 15510100\\
+     File 14& 47488540& 7900773& 9708258\\
+     File 15& 51665500& 41117340& 47707954\\
+     File 16& 201600541& 39248276& 45564837\\
+     File 17& 193384854& 37133480& 43655371\\
+     File 18& 184563953& 35355184& 40980906\\
+     File 19& 173652802& 31813760& 38417108\\
+     File 20& 162001796& 30104816& 34926945\\
+     File 21& 147557670& 23932541& 29459829\\
+    \bottomrule
+  \end{longtable}
+\end{footnotesize}
+\rmfamily
 % raw data and charts
-% differences in used algos/ algos in tools
+% differences in used algos/ algos in tools <- k5?
 % optimization approach
-% further research focus
-% (how optimization would be recognizable in testdata)
+% further research focus <- ask if wanted
 
 % todo ms to minutes and bytes to mb. Those tables move to the appendix
 The table above and the tables in the appendix contain rather raw measurement values for the two goals described in \ref{k5:goals}. Table \ref{t:efficiency} in the appendix shows how long each compression procedure took. Each row contains information about one of the \texttt{Homo\_sapiens.GRCh38.dna.chromosome.}x\texttt{.fa} files. To improve readability, the filenames were replaced by \texttt{File}. To determine which file was compressed, simply replace the placeholder with the number following \texttt{File}.\\

+ 24 - 0
latex/tex/literatur.bib

@@ -153,4 +153,28 @@
   subtitle  = {Principles and Applications (Information Security and Cryptography)},
 }
 
+@Article{cc14,
+  author       = {Kashfia Sailunaz and Mohammed Rokibul Alam Kotwal and Mohammad Nurul Huda},
+  date         = {2014-03},
+  journaltitle = {International Journal of Computer Applications},
+  title        = {Data Compression Considering Text Files},
+  doi          = {10.5120/15765-4456},
+  number       = {11},
+  pages        = {27--32},
+  volume       = {90},
+  publisher    = {Foundation of Computer Science},
+}
+
+@Article{cnet13,
+  author       = {Manish RajShivare and Yogendra P. S. Maravi and Sanjeev Sharma},
+  date         = {2013-10},
+  journaltitle = {International Journal of Computer Applications},
+  title        = {Analysis of Header Compression Techniques for Networks: A Review},
+  doi          = {10.5120/13856-1701},
+  number       = {5},
+  pages        = {13--20},
+  volume       = {80},
+  publisher    = {Foundation of Computer Science},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}

+ 1 - 1
latex/tex/thesis.tex

@@ -196,6 +196,6 @@
 % Appendix. If you have no appendix, simply remove
 % this part.
 \appendix
-\input{kapitel/anhang-a}
+\input{kapitel/a6_results}
 
 \end{document}