Browse Source

working on arithmetic coding, fixed some acronyms

3 years ago
parent
commit
d0653d4e29

+ 3 - 1
latex/tex/kapitel/abkuerzungen.tex

@@ -7,6 +7,7 @@
 \begin{acronym}[IEEE]
   \acro{ANS}{Asymmetric Numeral System}
   \acro{ASCII}{American Standard Code for Information Interchange}
+  \acro{BAM}{Binary Alignment Map}
   \acro{CABAC}{Context-Adaptive Binary Arithmetic Coding}
   \acro{CRAM}{Compressed Reference-oriented Alignment Map}
   \acro{DNA}{Deoxyribonucleic Acid}
@@ -15,11 +16,12 @@
   \acro{FASTq}{File Format Based on FASTA}
   \acro{FTP}{File Transfer Protocol}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
+  \acro{GB}{Gigabyte}
   \acro{GeCo}{Genome Compressor}
   \acro{IUPAC}{International Union of Pure and Applied Chemistry}
   \acro{LZ77}{Lempel Ziv 1977}
   \acro{LZ78}{Lempel Ziv 1978}
   \acro{RAM}{Random Access Memory} 
   \acro{SAM}{Sequence Alignment Map}
-  \acro{BAM}{Binary Alignment Map}
+  \acro{UTF}{Unicode Transformation Format}
 \end{acronym}

+ 28 - 15
latex/tex/kapitel/k4_algorithms.tex

@@ -44,22 +44,18 @@ Looking at the string 'stationary' it might be smart to store 'station' and 'ary
 % end demo
 The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is built from the input data itself: each first occurrence is left uncompressed, and every later occurrence of a string points back to its first occurrence. Since this 'pointer' needs less space than the string it points to, the size decreases.\\
 
-Unfortunally, known implementations like the ones out of LZ Family, do not use probabilities to compress and are therefore out of scope for this work. Since finding repeting sections and their location might also be improved, this chapter will remain.
-
 
 % unusable due to the lack of probability
-\mycomment{
 % - known algo
 \subsubsection{The LZ Family}
 The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their names, like \texttt{LZ77 and LZ78}, which are short for Lempel Ziv 1977 and 1978; the number indicates when the algorithm was published. Today LZ77-based compression is widely used in Unix tools like gzip, and such tools are also used for compressing \ac{DNA}.\\
 \ac{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. Typically this reference is stored in two bytes, whereby more than one byte can be used to point to the first occurrence, because usually less than one byte is required to store the length.\\
 % example 
-}
 
 % (genomic squeeze <- official | unofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
-\subsection{\ac{LZ77}}
- \ac{LZ77} basically works, by removing all repetition of a string or substring and replacing them with information where to find the first occurence and how long it is. Typically it is stored in two bytes, whereby more than one one byte can be used to point to the first occurence because usually less than one byte is required to store the length.
+\ac{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. Typically this reference is stored in two bytes, whereby more than one byte can be used to point to the first occurrence, because usually less than one byte is required to store the length.\\
 
+Unfortunately, known implementations like those of the LZ family do not use probabilities to compress and are therefore out of scope for this work. Since finding repeating sections and their locations might also be improved, this chapter will remain.\\
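For illustration, a minimal greedy Python sketch of this pointer idea (the token format, window size, and match threshold are simplifying assumptions; a real \ac{LZ77} coder packs offset and length into the two bytes described above):

def lz77_tokens(data, window=255):
    """Greedy LZ77-style tokenizer: emit an (offset, length) pointer for a
    repeated substring, or a literal character when no match pays off."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):   # scan the back-window
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len > 2:                          # pointer is shorter than the text
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

print(lz77_tokens("stationary station"))  # ends with the pointer (11, 7)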
 
 \subsection{Shannon's Entropy}
 The founder of information theory, Claude Elwood Shannon, described entropy in a publication from 1948 \cite{Shannon_1948}. In this work he focused on transmitting information, yet his theorem is applicable to almost any form of communication signal, and his findings are useful beyond information transmission alone.
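For illustration, a small Python sketch of Shannon's formula $H = -\sum_i p_i \log_2 p_i$ applied to DNA-like strings (the demo inputs are arbitrary):

from collections import Counter
from math import log2

def shannon_entropy(text):
    """Average information content per symbol, in bits."""
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in Counter(text).values())

print(shannon_entropy("ACGT" * 25))    # uniform 4-symbol alphabet: 2.0 bits
print(shannon_entropy("AAAAAAAACG"))   # skewed distribution: about 0.92 bits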
@@ -111,6 +107,8 @@ This coding method is an approach to solve the problem of wasting memory due to
 Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \autocite{ris76}. % Besides information theory and math, he also published stuff about dna
 This work's goal was to define an algorithm that requires no blocking, meaning the input text can be encoded as a whole instead of being split into smaller texts or single symbols that are encoded separately. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.
 
+% unusable because equation is only half correct
+\mycomment{
 The coding algorithm works with probabilities for symbols in an alphabet. For any text, the alphabet is defined by the set of individual symbols used in the text. The probability of a single symbol is given by its distribution: in an \texttt{n} symbol long text, with a letter of the alphabet occurring \texttt{o} times, its probability is $\frac{o}{n}$.\\
 
 % todo rethink this equation stuff (and compare it to the original work <-compl.)
@@ -126,12 +124,27 @@ The coding algorithm works with probabilities for symbols in an alphabet. From a
 \begin{equation}
   R_{i} = R_{i-1} \cdot p_{j}
 \end{equation}
+}
+
+This is possible by projecting the input text onto a binary encoded fraction between 0 and 1. To get there, each character in the alphabet is represented by an interval between two floating point numbers in the space between 0.0 and 1.0 (exclusive). This interval is determined by the symbol's distribution in the input text (interval start) and the start of the next character's interval (interval end). The lengths of all intervals sum to one.\\
+To encode a text, subdividing is applied step by step to the text's symbols from start to end. The interval that represents the current character is subdivided, meaning the chosen interval is divided into subintervals proportional in size to the intervals calculated in the beginning.\\
+To store as little information as possible, and because binary fractions have limited accuracy, only a single number that lies between the upper and lower end of the last interval is stored. To encode in binary, the binary floating point representation of any number inside the interval for the last character is calculated, using a process similar to the one described above.
+To summarize the encoding process (a sketch follows the list):
+\begin{itemize}
+	\item The interval representing the first character is noted.
+	\item This interval is split into smaller intervals, preserving the ratios of the initial intervals between 0.0 and 1.0.
+	\item The subinterval representing the second character is chosen.
+	\item This process is repeated until an interval for the last character is determined.
+	\item A binary floating point number is determined which lies within the interval that represents the last symbol.
+\end{itemize}
+% its finite subdividing because of the limitation that comes with processor architecture
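A minimal Python sketch of these steps (the three-symbol alphabet and the exact Fraction arithmetic are demo assumptions; exact fractions sidestep the precision issues discussed below):

from fractions import Fraction

def encode(text, probs):
    """Idealized arithmetic encoder: shrink the interval [low, low + width)
    once per symbol, then return one number inside the final interval."""
    starts, acc = {}, Fraction(0)
    for sym, p in probs.items():        # interval start = cumulative probability
        starts[sym] = acc
        acc += p
    low, width = Fraction(0), Fraction(1)
    for ch in text:
        low += width * starts[ch]       # step into the symbol's subinterval
        width *= probs[ch]              # shrink to the subinterval's width
    return low + width / 2              # any value inside the interval works

probs = {'A': Fraction(1, 2), 'C': Fraction(1, 4), 'G': Fraction(1, 4)}
print(encode("ACG", probs))             # -> 23/64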
 
-This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an intervall between two floating point numbers in the space between 0.0 and 1.0 (exclusively). This intervall is determined by its distribution in the input text (intervall start) and the the start of the next character (intervall end). The whole intervall is a sum of all subintervalls that reference symbols from the alphabet. It extends from 0.0 to 1.0. To encode a sequence of characters subdividing is used.
-% exkurs on subdividing?
-This means the intervall start of the character is noted, its intervall is split into smaller intervalls with the ratios of the initial intervalls between 0.0 and 1.0. With this, the second character is choosen. This process is repeated for until a intervall for the last character is choosen.\\
-To encode in binary, the binary floating point representation of a number inside the intervall, for the last character is calculated, by using a similar process, described above, called subdividing.
-% its finite subdividing because of processor architecture
+For the decoding process to work, the \ac{EOF} symbol must be present as the last symbol in the text. The compressed file stores the probabilities of each alphabet symbol as well as the floating point number. Decoding follows a procedure similar to encoding: the stored probabilities determine intervals, which are subdivided, using the encoded floating point number as guidance, until the \ac{EOF} symbol is found. By noting which interval contains the floating point number at every subdivision, and projecting the probabilities associated with those intervals back onto the alphabet, the original text can be read.\\
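A matching decoder sketch (simplified assumption: the text length is passed in instead of detecting an \ac{EOF} symbol):

from fractions import Fraction

def decode(value, probs, length):
    """Mirror of the encode() sketch: find the interval containing value,
    emit its symbol, rescale value into [0, 1), and repeat."""
    starts, acc = {}, Fraction(0)
    for sym, p in probs.items():
        starts[sym] = acc
        acc += p
    out = []
    for _ in range(length):                  # a real coder stops at EOF instead
        for sym in reversed(list(probs)):    # check highest interval start first
            if value >= starts[sym]:
                out.append(sym)
                value = (value - starts[sym]) / probs[sym]
                break
    return "".join(out)

probs = {'A': Fraction(1, 2), 'C': Fraction(1, 4), 'G': Fraction(1, 4)}
print(decode(Fraction(23, 64), probs, 3))    # -> "ACG"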
+% scaling
+In computers, arithmetic operations on floating point numbers are processed with integer representations of the given floating point number \cite{ieee-float}. The number 0.4, for example, can be represented as $4\cdot 10^{-1}$.\\
+Intervals for the first symbol would then be represented by natural numbers between 0 and 100, scaled by $10^{-x}$. \texttt{x} starts with the value 2 and grows as the integers grow in length, which happens only when an odd number is divided. For example, dividing an odd value like $5\cdot 10^{-1}$ by two results in $25\cdot 10^{-2}$. Dividing $4\cdot 10^{y}$ by two, on the other hand, with any negative integer as \texttt{y}, does not result in a greater \texttt{x}: the length required to display the result matches the length required to display the input number.\\
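A small sketch of that halving rule, with a value held as an integer mantissa and a decimal scale x (the names and the decimal base are illustrative; IEEE floats use base 2):

def halve(mantissa, x):
    """Halve the value mantissa * 10**(-x) while keeping an integer mantissa.
    An odd mantissa forces the scale x to grow by one digit; an even one does not."""
    if mantissa % 2:
        return mantissa * 10 // 2, x + 1   # 5 * 10^-1  ->  25 * 10^-2
    return mantissa // 2, x                # 4 * 10^-1  ->   2 * 10^-1

print(halve(5, 1))   # (25, 2)
print(halve(4, 1))   # (2, 1)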
+% finite precision
+The described coding is only feasible on machines with infinite precision. As soon as finite precision comes into play, the algorithm must be extended so that a certain length of the resulting number is not exceeded. Digital datatypes are limited in their capacity: an unsigned 64-bit integer, for example, holds 64 bits and can represent any number between 0 and $2^{64}-1$, i.e. 18,446,744,073,709,551,615. That might seem like a great amount at first, but considering an unfavorable alphabet that extends the result's length by one bit for each symbol that is read, only texts with a length of 63 can be encoded (62 if \acs{EOF} is excluded).
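A rough check of that bound: the width of the final interval is the product of the probabilities of all symbols read, so assuming the unfavorable case of one extra bit per symbol ($p_{i} = \frac{1}{2}$),

\begin{equation}
  w_{k} = \prod_{i=1}^{k} p_{i} = 2^{-k}
\end{equation}

and about $-\log_2 w_k = k$ bits are needed to pin down a number inside that interval, so a 64-bit integer is exhausted after roughly 64 symbols, in line with the limit stated above.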
 
 \subsection{Huffman encoding}
 % list of algos and the tools that use them
@@ -175,8 +188,8 @@ Leaving the theory and entering the practice, brings some details that lessen th
 most formats used for persisting \acs{DNA} store more than just nucleotides and therefore require more characters. What compression ratios implementations of Huffman coding provide will be discussed in \ref{k5:results}.\\
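For illustration, a minimal Python sketch of building such a prefix code with a heap (the demo input is arbitrary; rarer symbols receive longer codewords):

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman prefix code by repeatedly merging the two rarest nodes."""
    heap = [[count, i, sym] for i, (sym, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)                          # tie-breaker so node trees never compare
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, [lo[0] + hi[0], tie, (lo[2], hi[2])])
        tie += 1
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, str):            # leaf: record the codeword
            codes[node] = prefix or "0"
        else:                                # inner node: descend both branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2])
    return codes

print(huffman_codes("AACGTAAA"))  # e.g. {'T': '00', 'C': '010', 'G': '011', 'A': '1'}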
 
 \section{DEFLATE}
-% mix of huffman and lz77
-The DEFLATE compression algorithm combines \ac{lz77} and huffman coding. It is used in well known tools like gzip.
+% mix of Huffman and LZ77
+The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding \cite{rfc1951}. It is used in well-known tools like gzip.
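Since DEFLATE is available through Python's standard zlib module, the combination is easy to try on repetitive, LZ77-friendly input (the sample data is arbitrary):

import zlib

data = b"ACGTACGTACGTACGT" * 64          # highly repetitive input
compressed = zlib.compress(data, 9)      # zlib wraps DEFLATE (RFC 1951)
print(len(data), "->", len(compressed))  # e.g. 1024 -> a few dozen bytes
assert zlib.decompress(compressed) == data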
 
 \subsubsection{misc}
 
@@ -203,8 +216,8 @@ Modified -> used in cram
 
 % following text is irrelevant. Just describe used algorithms in comparison chapter and refer to their base algo
 
-% mix of Huffman and lz77
+% mix of Huffman and LZ77
 The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. To be more specific, the raw data is compressed with \ac{LZ77} and the remaining data is shortened using Huffman coding.
 % huffman - little endian
-% lz77 compressed - big endian (least significant byte first/most left)
+% LZ77 compressed - little endian (least significant byte first)
 }

+ 23 - 0
latex/tex/literatur.bib

@@ -240,4 +240,27 @@
   publisher    = {{IBM}},
 }
 
+@TechReport{rfc1951,
+  author       = {L. Peter Deutsch},
+  institution  = {RFC Editor},
+  title        = {DEFLATE Compressed Data Format Specification version 1.3},
+  note         = {\url{http://www.rfc-editor.org/rfc/rfc1951.txt}},
+  number       = {1951},
+  type         = {RFC},
+  url          = {http://www.rfc-editor.org/rfc/rfc1951.txt},
+  howpublished = {Internet Requests for Comments},
+  issn         = {2070-1721},
+  month        = {May},
+  publisher    = {RFC Editor},
+  year         = {1996},
+}
+
+@Article{ieee-float,
+  title   = {IEEE Standard for Floating-Point Arithmetic},
+  doi     = {10.1109/IEEESTD.2019.8766229},
+  pages   = {1-84},
+  journal = {IEEE Std 754-2019 (Revision of IEEE 754-2008)},
+  year    = {2019},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}