
+ 22 - 9
latex/tex/kapitel/k4_algorithms.tex

@@ -97,6 +97,7 @@ He defined entropy as shown in figure \ref{f4:entropy}. Let X be a finite probab
 %This can be used to find the maximum amount of bits needed to store information.\\ 
 % alphabet, chain of symbols, kurz entropy erklären
 
 \subsection{Arithmetic coding}
+\label{k4:arith}
 Arithmetic coding is an approach to solve the problem of wasted memory due to the overhead created by encoding alphabets of certain lengths in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations of two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if it were possible to encode two letters of the alphabet with one bit each and the remaining one with a two-bit combination. This approach is not possible because the letters would no longer be clearly distinguishable: the two-bit letter could be interpreted either as the letter it is supposed to represent or as two one-bit letters.
 % check this wording 'simulating' with sources 
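+The interval idea behind arithmetic coding can be illustrated with a minimal
+Python sketch; the function name and the example probabilities below are
+illustrative assumptions, not taken from a concrete tool:
+\begin{verbatim}
+def arith_interval(text, probs):
+    """Narrow [low, high) down to the subinterval identifying `text`.
+
+    `probs` maps each symbol to its probability; the cumulative
+    boundaries partition the current interval, one slice per symbol.
+    """
+    low, high = 0.0, 1.0
+    for sym in text:
+        width = high - low
+        cum = 0.0
+        for s, p in probs.items():
+            if s == sym:
+                high = low + (cum + p) * width
+                low = low + cum * width
+                break
+            cum += p
+    return low, high  # any number in [low, high) encodes `text`
+
+# e.g. arith_interval("AAC", {"A": 0.5, "C": 0.3, "G": 0.2})
+# returns (0.125, 0.2); a single short binary fraction in this
+# range represents all three symbols at once.
+\end{verbatim}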
@@ -109,25 +110,37 @@ To encode in binary, the binary floating point representation of a number inside
 
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
-The well known Huffman coding, is used in several Tools for genome compression. This subsection should give the reader a general impression how this algorithm works, without going into detail. To use Huffman coding one must first define an alphabet, in our case a four letter alphabet, containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree. With that, a few simple rules apply to the structure:
-% binary view for alphabet
-% length n of sequence to compromize
-% greedy algo
+D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a similar coding procedure developed earlier by Shannon and Fano and named after its developers. The \ac{SF} coding is rarely used today, since Huffman coding is superior to it in both efficiency and effectiveness. % todo any source for last sentence. Rethink the use of finite in the following text
+Even though his work was published in 1952, the method he developed is still in use today, not only in tools for genome compression but also in compression tools with a more general usage \cite{rfcgzip}.\\ 
+Compression with the Huffman algorithm works only on finite alphabets. For certain alphabet lengths, it also provides a solution to the problem of waste through unused bits described at the beginning of \ref{k4:arith}. Huffman did not store more than one symbol in a single bit, as arithmetic coding does, but he decreased the average number of bits used per symbol in a message. This is possible by assigning individual bit lengths to the symbols used in the text to be compressed \cite{huf52}. 
+As with other codings, a set of symbols must be defined first. For any text constructed from symbols of this alphabet, a binary tree is built which determines how the symbols are encoded. As in arithmetic coding, the probability of each letter in the given text is calculated. The binary tree is constructed according to the following guidelines:
+% greedy algo?
 \begin{itemize}
-  \item every symbol of the alphabet is one leaf
-  \item the right branch from every not is marked as a 1, the left one is marked as a 0
-  \item every symbol got a weight, the weight is defined by the frequency the symbol occours in the input text
-  \item the less weight a node has, the higher the probability is, that this node is read next in the symbol sequence
+  \item Every symbol of the alphabet is one leaf.
+  \item The right branch from every node is marked with a 1, the left one with a 0.
+  \item Every symbol has a weight, defined by the frequency with which the symbol occurs in the input text.
+  \item The more weight a leaf has, the higher the probability that its symbol is read next in the symbol sequence.
+  \item The leaf with the highest probability is leftmost and the one with the lowest probability is rightmost in the tree. 
 \end{itemize}
+%todo tree building explanation
+Constructing the tree begins with as many nodes as there are symbols in the alphabet. 
+% storytime might need to be rearranged
+An often mentioned difference between \acs{SF} coding and Huffman coding is that the former works top down while the latter works bottom up. This means the tree is built starting with the lowest weights. Nodes that are not leaves have no symbol ascribed to them; they only need their weight, which is defined by the weights of their individual child nodes.\\
+So, starting with the two symbols of lowest weight, a node is added to connect both. From there on, the two leaves are only rearranged through the rearrangement of their temporary root node. With the added blank node, the count of available nodes goes down by one. The new node weighs as much as the sum of the weights of its child nodes. The two lowest weights are then paired as described until only two subtrees are left, which can be combined by a root.\\
+Keeping in mind that left branches are assigned a 0 and right branches a 1, following a path from the root until a leaf is reached reveals the encoding of that particular leaf. Since heavily weighted and therefore frequently occurring leaves are positioned to the left, short paths lead to them, so only a few bits are needed to encode them. Towards the other side of the tree, symbols occur more rarely, paths get longer and so do the bit lengths of the resulting codes.
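+A minimal Python sketch of this bottom-up construction, assuming raw
+occurrence counts as weights and using a binary heap to repeatedly find the
+two lightest subtrees (function name and representation are illustrative):
+\begin{verbatim}
+import heapq
+import itertools
+
+def huffman_codes(freqs):
+    """Build Huffman codes for a dict mapping symbol -> weight."""
+    tie = itertools.count()  # breaks ties without comparing subtrees
+    heap = [(w, next(tie), (sym,)) for sym, w in freqs.items()]
+    heapq.heapify(heap)
+    codes = {sym: "" for sym in freqs}
+    while len(heap) > 1:
+        # Merge the two subtrees of lowest weight under a blank node.
+        w1, _, left = heapq.heappop(heap)
+        w2, _, right = heapq.heappop(heap)
+        for sym in left:   # the left branch contributes a 0
+            codes[sym] = "0" + codes[sym]
+        for sym in right:  # the right branch contributes a 1
+            codes[sym] = "1" + codes[sym]
+        heapq.heappush(heap, (w1 + w2, next(tie), left + right))
+    return codes
+\end{verbatim}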
+
+
+In our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient.
 The process of compressing starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to a tree in which the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, which therefore occurs most often in the input data.
 Following one path results in the binary representation of one symbol. For an alphabet like the one described above, the binary representation encoded in ASCII is \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weight is derived for each character from its number of occurrences. With a corresponding tree created from these weights, the binary data for each symbol changes to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.\\
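+Running the sketch from above on this distribution reproduces exactly these
+codes (four symbols encoded in one to three bits each instead of eight):
+\begin{verbatim}
+>>> huffman_codes({"A": 10, "C": 8, "G": 4, "T": 2})
+{'A': '0', 'C': '11', 'G': '101', 'T': '100'}
+\end{verbatim}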
+
 
 \section{DEFLATE}
 % mix of huffman and lz77
 The DEFLATE compression algorithm combines \ac{lz77} and Huffman coding. It is used in well-known tools like gzip.
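+As a brief illustration, DEFLATE is what Python's zlib module emits; this
+sketch only demonstrates the round trip and is not part of any tool discussed
+here (a negative wbits requests a raw DEFLATE stream without zlib headers):
+\begin{verbatim}
+import zlib
+
+data = b"ACGT" * 1000
+compressor = zlib.compressobj(level=9, wbits=-15)
+deflated = compressor.compress(data) + compressor.flush()
+print(len(data), "->", len(deflated))  # repetitive input shrinks a lot
+assert zlib.decompress(deflated, wbits=-15) == data
+\end{verbatim}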
 
-
 \subsubsection{misc}
 
 %check if (small) text coding is done with this:

+ 46 - 7
latex/tex/literatur.bib

@@ -109,13 +109,6 @@
   url   = {https://github.com/samtools/hts-specs.},
 }
 
-@Online{cram,
-  author = {Markus Hsi-Yang Fritz and Vadim Zalunin},
-  date   = {2022-10-28},
-  title  = {CRAM Compressed Reference-oriented Alignment Map},
-  url    = {https://samtools.github.io/hts-specs/CRAMv3.pdf},
-}
-
 @Online{bed,
   author = {Sanger Institute, Genome Research Limited},
   date   = {2022-10-20},
@@ -177,4 +170,50 @@
   publisher    = {Foundation of Computer Science},
 }
 
+@TechReport{rfcgzip,
+  author      = {L. Peter Deutsch and Jean-Loup Gailly and Mark Adler and Glenn Randers-Pehrson},
+  institution = {RFC Editor},
+  title       = {GZIP file format specification version 4.3},
+  number      = {1952},
+  type        = {RFC},
+  url         = {http://www.rfc-editor.org/rfc/rfc1952.txt},
+  issn        = {2070-1721},
+  month       = {May},
+  publisher   = {RFC Editor},
+  year        = {1996},
+}
+
+@Article{huf52,
+  author  = {Huffman, David A.},
+  title   = {A Method for the Construction of Minimum-Redundancy Codes},
+  journal = {Proceedings of the IRE},
+  volume  = {40},
+  number  = {9},
+  pages   = {1098--1101},
+  month   = {September},
+  year    = {1952},
+  url     = {http://compression.graphicon.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf},
+}
+
+@Article{Moffat_2020,
+  author       = {Alistair Moffat},
+  date         = {2020-07},
+  journaltitle = {{ACM} Computing Surveys},
+  title        = {Huffman Coding},
+  doi          = {10.1145/3342555},
+  number       = {4},
+  pages        = {1--35},
+  volume       = {52},
+  publisher    = {Association for Computing Machinery ({ACM})},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}