u 3 years ago
parent
commit
12f692c827
2 changed files with 23 additions and 2 deletions

BIN
latex/result/thesis.pdf


+ 23 - 2
latex/tex/kapitel/k4_algorithms.tex

@@ -23,6 +23,7 @@
 % file structure <-> datatypes. describe at greater length: e.g. file formats to store DNA
 % remove 3.2.1
 
 \chapter{Compression approaches}
 The goal of compressing data is to generate an output that is smaller than the input. In many cases, such as gene compression, the compression is ideally lossless. This means that, for every compressed dataset, decompressing it recovers the full information that was available in the original data. Lossy compression, on the other hand, may discard parts of the data during compression in order to increase the compression rate. The discarded parts are typically not necessary to convey the original information. This works for certain audio and image files, or for network protocols that are used to transmit live video/audio streams.
 For \acs{DNA}, a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression, two major approaches are known: dictionary coding and entropy coding. Both are described in detail below.\\
@@ -43,8 +44,28 @@ The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv cre
 % example 
 }
 
-\section{Entropy coding}
-\subsection{Huffman coding}
+\section{Shannon's Entropy}
+The mathematician, electrical engineer and cryptographer Claude Elwood Shannon developed information entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on the transmission of information, and his theorem is applicable to almost any form of communication signal. His findings are, however, not only useful for information transmission.
+
+% todo insert Fig. 1 shannon_1948
+
+Altering this figure shows how the model can be applied to other technology such as compression.\\
+The information source and destination are left unchanged; one has to keep in mind that both may be represented by the same physical actor.
+The transmitter and the receiver are replaced by compression/encoding and decompression/decoding, and in between there is no signal but an arbitrary period of time.
+
+Shannon's entropy provides a formula to determine the `(un)certainty of a probability distribution'. Today it is used to determine the minimum average number of bits needed to store information.
+% alphabet, chain of symbols, briefly explain entropy
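+As a brief sketch of the step hinted at in the comment above (the formulation here is the standard textbook form and not quoted from \autocite{Shannon_1948}): for an alphabet whose symbols occur with probabilities $p_1, \dots, p_n$, the entropy
+% standard form of Shannon's entropy, added as an illustrative sketch
+\begin{equation}
+    H = -\sum_{i=1}^{n} p_i \log_2 p_i
+\end{equation}
+gives, in bits, a lower bound on the average code length per symbol that any lossless encoding of this source can achieve.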
+
+\section{Arithmetic coding}
+Arithmetic coding is an approach to reduce the memory wasted by the overhead that arises when alphabets of certain lengths are encoded in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations of two bits, one combination is never used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if it were possible to encode two of the letters with one bit each and the remaining letter with a two-bit combination. This approach does not work on its own, because the letters would no longer be clearly distinguishable: the two-bit letter could be interpreted either as the letter it is meant to represent or as two one-bit letters.
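+As a rough illustration of this overhead (assuming, purely for illustration, three equally likely letters): a fixed-length code spends two bits per letter although the information content is only $\log_2 3$ bits, so roughly
+% illustrative calculation, not taken from the sources cited in this chapter
+\begin{equation}
+    2 - \log_2 3 \approx 2 - 1.585 = 0.415
+\end{equation}
+bits per letter are wasted compared to an ideal encoding.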
+% check this wording 'simulating' with sources 
+% this is called subdividing
+Arithmetic coding works by simulating an n-letter binary encoding for an n-letter alphabet. This is possible by projecting the input text onto a floating-point number. Every character of the alphabet is represented by an interval between two floating-point numbers in the range between 0.0 and 1.0 (exclusively), determined by the character's distribution in the input text (interval start) and the start of the next character (interval end). To encode a sequence of characters, the interval of the first character is noted and then split into smaller intervals with the same ratios as the initial intervals between 0.0 and 1.0. Within this subdivision, the interval of the second character is chosen. This process is repeated until an interval for the last character has been chosen.\\
+To obtain the binary encoding, the binary floating-point representation of a number inside the interval of the last character is calculated, using a process similar to the one described above, called subdividing.
+% the subdividing is finite because processors limit floating-point precision
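+A small worked example may make the subdividing concrete (the probabilities and the input sequence are invented here purely for illustration): assume the characters \texttt{A, C, G, T} occur with the probabilities $0.5$, $0.25$, $0.125$ and $0.125$, and the input \texttt{AC} is to be encoded.
+% worked example with invented probabilities, added for illustration
+\begin{align*}
+    \text{initial intervals:}\quad & A = [0,\,0.5),\ C = [0.5,\,0.75),\ G = [0.75,\,0.875),\ T = [0.875,\,1.0)\\
+    \text{after reading } A:\quad & [0,\,0.5)\\
+    \text{after reading } AC:\quad & [0.25,\,0.375)
+\end{align*}
+Any number inside the final interval identifies the whole sequence; here $0.25 = 0.01_2$ lies in $[0.25,\,0.375)$, so the two bits \texttt{01} are sufficient.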
+
+\section{Huffman encoding}
+% list of algos and the tools that use them
 The well-known Huffman coding is used in several tools for genome compression. This section should give the reader a general impression of how the algorithm works, without going into detail. To use Huffman coding, one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree, and a few simple rules apply to that structure:
 % binary view for alphabet
 % length n of sequence to compress