- %SUMMARY
- %- ABSTRACT
- %- INTRODUCTION
- %# BASICS
- %- \acs{DNA} STRUCTURE
- %- DATA TYPES
- % - BAM/FASTQ
- % - NON STANDARD
- %- COMPRESSION APPROACHES
- % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
- % - HUFFMAN ENCODING
- % - PROBABILITY APPROACHES (WITH BASE?)
- %
- %# COMPARING TOOLS
- %-
- %# POSSIBLE IMPROVEMENT
- %- \acs{DNA}S STOCHASTICAL ATTRIBUTES
- %- IMPACT ON COMPRESSION
- \newcommand{\mycomment}[1]{}
- % entropie fim doku grundlagen2
- % merge dna nucleotide into one chapter -> structure of dna; or drop the chapter entirely (too generic)
- % file structure/format <-> datatypes; describe in more detail, e.g. file formats used to store dna
- \section{Compression Approaches}
- The process of compressing data serves the goal of generating an output that is smaller than its input \cite{dict}.\\
- In many cases, as in genome compression, the compression is ideally lossless. This means that the full information contained in the original data can be recovered by decompressing the compressed data. Lossy compression, on the other hand, may discard parts of the data during compression in order to increase the compression rate. The discarded parts are typically not necessary to convey the original information. This works for certain audio and picture files, or for network protocols like \ac{UDP}, which are used to transmit live video/audio streams \cite{rfc-udp, cnet13}.\\
- For storing \acs{DNA}, lossless compression is required. To be precise, lossy compression is not an option, because there is no unnecessary data: every nucleotide and its exact position are needed for the sequence to be complete and useful.\\
- Before going on, the difference between information and data should be emphasized.\\
- % excurs data vs information
- Data contains information. For digital data, clear physical limitations define what and how much can be stored: a bit can only store 0 or 1, eleven bits can represent up to $2^{11}$ combinations, and a 1~\acs{GB} drive can store no more than 1~\acs{GB} of data. Information, on the other hand, is limited by the way it is represented. What exactly constitutes information depends on multiple factors: the context in which it is transmitted, and the source and destination of the information. It can take the form of a signal transferred from one entity to another, or of persisted data that can be retrieved at a later point in time.\\
- % excurs information vs data
- For the scope of this work, information will be seen as the type and position of nucleotides sequenced from \acs{DNA}. To be even more precise, it is a chain of characters over the alphabet \texttt{A, C, G, T}, since this is the \textit{de facto} standard for digital persistence of \acs{DNA} \cite{isompeg}.
- When it comes to storage, the boundaries of information can be illustrated with the example mentioned above. A drive with a capacity of 1~\acs{GB} could contain a book in the form of images, where the content of each page is stored as a single image. Another, more economical way would be to store just the text of every page in \acs{UTF}-16 \cite{isoutf}. The information the text provides to a potential reader would not differ. Changing the text encoding to \acs{ASCII} and/or using compression techniques would reduce the required space even further, without losing any information.\\
- % excurs end
- For lossless compression, two major approaches are established: dictionary coding and entropy coding. Methods from both fields that have gained a reputation are described in detail below \cite{cc14, moffat20, moffat_arith, alok17}.\\
- \subsection{Dictionary Coding}
- \label{k4:dict}
- Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of one. For a better understanding, this should be illustrated by a short example:
- % demo substrings
- Looking at the string 'stationary', it might be smart to store 'station' and 'ary' as separate dictionary entries. Which split is more efficient depends on the text that is to be compressed.
- % end demo
- The dictionary should only store strings that actually occur in the input data. Moreover, storing a dictionary in addition to the (compressed) input data would be a waste of resources; therefore, the dictionary is part of the text itself. Each first occurrence is left uncompressed. Every occurrence of a string after the first one points either to its first occurrence or to the most recent replacement of that occurrence, depending on the algorithm.\\
- \ref{k4:dict-fig} illustrates how this process is executed. The bar on top of the figure, which extends over the full width, symbolizes any text. The squares inside the text are repeating occurrences of text segments.
- In the dictionary coding process, the square annotated as \texttt{first occ.} is added to the dictionary. \texttt{Second} and \texttt{third occ.} get replaced by a structure \texttt{<pos, len>} consisting of a pointer to the position of the first occurrence \texttt{pos} and the length of that occurrence \texttt{len}.
- The bar at the bottom of the figure shows how the compressed text for this example would be structured. The dotted sections would consist of only two bytes each, storing position and length and thereby pointing to \texttt{first occ.}. Decompressing this text only requires parsing it from left to right and replacing every \texttt{<pos, len>} with the already parsed section from the dictionary: jumping back to the stored position, reading as many characters as the length dictates, copying that section, jumping back and pasting it at the current position.\\
- % offsets are volatile when replacing
- \begin{figure}[H]
- \centering
- \includegraphics[width=15cm]{k4/dict-coding.png}
- \caption{Schematic sketch, illustrating the replacement of multiple occurrences done in dictionary coding.}
- \label{k4:dict-fig}
- \end{figure}
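To make the decompression step concrete, the following sketch (purely illustrative) decodes a token stream in which every element is either a literal string or a \texttt{(pos, len)} back-reference into the already decoded text. The token layout is an assumption made for this example, not the format of any particular tool.
\begin{verbatim}
# Minimal sketch: decoding a dictionary-coded token stream.
# Tokens are either literal strings or (pos, len) back-references
# into the text decoded so far (a simplifying assumption).
def decode(tokens):
    out = []
    for token in tokens:
        if isinstance(token, tuple):             # back-reference <pos, len>
            pos, length = token
            text = "".join(out)
            out.append(text[pos:pos + length])   # copy the earlier occurrence
        else:                                    # first occurrence: literal
            out.append(token)
    return "".join(out)

# "station" and "ary" repeat; later occurrences are stored as references.
tokens = ["stationary text about a station", (0, 7), (7, 3)]
print(decode(tokens))
# -> "stationary text about a stationstationary"
\end{verbatim}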
- \label{k4:lz}
- \subsubsection{The LZ Family}
- The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their names, like \texttt{LZ77} and \texttt{LZ78}, which are short for Lempel-Ziv 1977 and 1978 \cite{lz77}; the number at the end indicates when the algorithm was published. Today, members of the LZ family are widely used in compression implementations like rar, zip and gzip \cite{rfcgzip}. Some of those are also used to compress \ac{DNA}.\\
- \acs{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. The distance between the first occurrence and a replacement is limited, because each pointer has only a fixed amount of storage available. A pointer--length pair is typically stored in two bytes: one bit is reserved to indicate that the following 15 bit are a position--length pair, more than 8 bit are available to store the pointer, and the rest is reserved for storing the length. Exact amounts depend on the implementation \cite{rfc1951, lz77}.
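As an illustration of how such a pair can fit into two bytes, the sketch below packs a position and a length with an assumed split of one flag bit, nine position bits and six length bits; real implementations choose their own field widths, as noted above.
\begin{verbatim}
# Illustrative packing of a <pos, len> pair into two bytes.
# Assumed layout: 1 flag bit | 9 bits position | 6 bits length.
def pack(pos, length):
    assert 0 <= pos < 2**9 and 0 <= length < 2**6
    word = (1 << 15) | (pos << 6) | length    # flag bit marks a back-reference
    return word.to_bytes(2, "big")

def unpack(two_bytes):
    word = int.from_bytes(two_bytes, "big")
    assert word >> 15 == 1                    # flag bit must be set
    return (word >> 6) & 0x1FF, word & 0x3F   # (position, length)

print(unpack(pack(300, 12)))                  # -> (300, 12)
\end{verbatim}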
- % rewrite and implement this:
- %This method is limited by the space a pointer is allowed to take. Other variants let the replacement store the offset to the last replaced occurrence, therefore it is harder to reach a position where the space for a pointer runs out.
- Unfortunately, implementations like the ones from the LZ family do not use probabilities to compress and are therefore not in the main scope of this work. This section remains nevertheless, to strengthen the general understanding of compression algorithms and because dictionary coding is a part of hybrid coding implementations.\\
- \subsection{Shannon's Entropy}
- The founder of information theory, Claude Elwood Shannon, described entropy in his work published in 1948 \cite{Shannon_1948}. There, he focused on the transmission of information. His theorem is applicable to almost any form of communication signal, and his findings are useful beyond information transmission alone.
- % todo insert Fig. 1 shannon_1948
- \begin{figure}[H]
- \centering
- \includegraphics[width=15cm]{k4/com-sys.png}
- \caption{Schematic diagram of a general communication system by Shannon's definition. \cite{Shannon_1948}}
- \label{k4:comsys}
- \end{figure}
- Altering \ref{k4:comsys} slightly shows how this can be applied to other technology such as compression. The information source and destination are left unchanged; one has to keep in mind that both may be represented by the same physical actor.\\
- Transmitter and receiver would be changed to compression/encoding and decompression/decoding. In between the two there is no signal, but instead an arbitrary period of time \cite{Shannon_1948}.\\
- Shannon's Entropy provides a formula to determine the ``uncertainty of a probability distribution'' in a finite field.
- \begin{equation}\label{eq:entropy}
- %\resizebox{.9 \textwidth}{!}
- %{$
- H(X):=\sum\limits_{x\in X, prob(x)\neq 0} prob(x) \cdot \log_2\left(\frac{1}{prob(x)}\right)
- \equiv-\sum\limits_{x\in X, prob(x)\neq 0} prob(x) \cdot \log_2 (prob(x)).
- %$}
- \end{equation}
- %\begin{figure}[H]
- % \centering
- % \includegraphics[width=12cm]{k4/shannon_entropy.png}
- % \caption{Shannons definition of entropy.}
- % \label{k4:entropy}
- %\end{figure}
- He defined entropy as shown in \eqref{eq:entropy}. Let $X$ be a finite probability space. Then $x\in X$ are the possible final states of a probability experiment over $X$. Every state that actually occurs while executing the experiment generates information, which is measured in binary digits (\textit{bits} for short) by the part of the equation displayed in \eqref{eq:info-in-bit} \cite{delfs_knebl,Shannon_1948}:
- \begin{equation}\label{eq:info-in-bit}
- \log_2\left(\frac{1}{prob(x)}\right) \equiv - \log_2(prob(x)).
- \end{equation}
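As an illustration, the following sketch computes the empirical entropy of a nucleotide string, taking the relative symbol frequencies as probabilities (an assumption made for this demonstration).
\begin{verbatim}
from collections import Counter
from math import log2

# Empirical Shannon entropy of a symbol sequence, in bit per symbol.
def entropy(text):
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * log2(n / c) for c in counts.values())

print(entropy("AGAC"))        # 1.5 bit per symbol
print(entropy("ACGT" * 25))   # 2.0 bit: uniform four-letter alphabet
print(entropy("AAAAAAAAAA"))  # 0.0 bit: no uncertainty at all
\end{verbatim}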
- %\begin{figure}[H]
- % \centering
- % \includegraphics[width=8cm]{k4/information_bit.png}
- % \caption{The amount of information measured in bit, in case x is the end state of a probability experiment.}
- % \label{f4:info-in-bit}
- %\end{figure}
- %Noteable here is, that with \textit{Bits} a unit for the information entropy is meant. Even though they store the same form of data, no indications could be found, that there is a direct connection to the binary digit (\,bit)\, that describes the physical unit to store iformation in computer science.
- %todo explain 2.2 second bulletpoint of delfs_knebl. Maybe read gumbl book
- %This can be used to find the maximum amount of bit needed to store information.\\
- % alphabet, chain of symbols, kurz entropy erklären
- \label{k4:arith}
- \subsection{Arithmetic Coding}
- This coding method is an approach to solve the problem of wasted memory due to the overhead that is created when encoding alphabets of certain sizes in binary \cite{ris76, moffat_arith}. For example: encoding a three-letter alphabet requires at least two bit per letter. Since there are four possible combinations of two bit, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective and thinking a step further: less storage would be required if there was a possibility to encode more than one letter in two bit.\\
- Dr. Jorma Rissanen described arithmetic coding in a publication in 1976 \cite{ris76}. % Besides information theory and math, he also published stuff about dna
- This work's goal was to define an algorithm that requires no blocking, meaning the input text can be encoded as a whole instead of splitting it and encoding the smaller parts or single symbols. He stated that the coding speed of arithmetic coding is comparable to that of conventional coding methods \cite{ris76}.
- % unusable because equation is only half correct
- \mycomment{
- The coding algorithm works with probabilities for symbols in an alphabet. From any text, the alphabet is defined by the set of individual symbols, used in the text. The probability for a single symbol is defined as its distribution. In a \texttt{n} symbol long text, with the first letter in the alphabet occuring \texttt{o} times, its probability is $\frac{o}{n}$.\\
- \begin{itemize}
- \item $p_{i}$ represents the probability of the symbol, at position \texttt{i} in the alphabet.
- \item L will contain the bit sequence with which the text is encoded. This sequence can be seen as a single value. A fraction between 0 and 1 which gets more precise with every processed symbol.
- \item R represents the product of probabilities of processed symbols.
- \end{itemize}
- \begin{equation}
- L_{i} = L_{i-1} + R \cdot \sum{i=1}^{j-1} \cdot p_{i}
- \end{equation}
- \begin{equation}
- R_{i} = R_{i-1} \cdot p_{j}
- \end{equation}
- }
- % math and computers
- Before getting into the arithmetic coding algorithm, the following section will go over some details on how digital fractions are handled by computers. This knowledge will be helpful in understanding how arithmetic coding works.\\
- In computers, arithmetic operations on floating point numbers are processed with integer representations \cite{ieee-float}. The number 0.4 for example would be represented by $4\cdot 10^{-1}$.\\
- An interval would then be represented by natural numbers between 0 and 100, scaled by $10^{-x}$. Here \texttt{x} starts with the value 2 and grows as the integers grow in length, which happens only when an odd number is divided. For example: dividing an odd multiple of the scale, such as $5\cdot 10^{-1}$, by two results in $25\cdot 10^{-2}$. Subdividing $4\cdot 10^{y}$ by two, with any negative integer as $y$, on the other hand, does not increase \texttt{x}: the length required to represent the result matches the length required to represent the input number \cite{witten87, moffat_arith}.\\
- Binary fractions are limited in their ability to represent decimal fractions. This is due to the fact that each further digit adds either zero or half of the value of the previous digit. In other terms: the digit $b \in \{0,1\}$ at position $n$ behind the binary point contributes the value $b \cdot 2^{-n}$.\\
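A small sketch makes this limitation visible: it expands a fraction from $[0,1)$ into its leading binary digits by repeated doubling. The dyadic fraction 0.4140625 terminates after seven digits, while 0.4 does not terminate at all.
\begin{verbatim}
# Expand a fraction in [0, 1) into its first `digits` binary digits
# by repeated doubling (the standard conversion procedure).
def to_binary_fraction(x, digits=12):
    bits = []
    for _ in range(digits):
        x *= 2
        bit, x = int(x), x - int(x)
        bits.append(str(bit))
        if x == 0:
            break
    return "0." + "".join(bits)

print(to_binary_fraction(0.4140625))  # 0.0110101 (terminates)
print(to_binary_fraction(0.4))        # 0.011001100110 (12 digits; never terminates)
\end{verbatim}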
- %todo example including figure
- % example unscaled
- \begin{figure}[H]
- \centering
- \includegraphics[width=15cm]{k4/arith-unscaled.png}
- \caption{Illustrative example of arithmetic coding.}
- \label{k4:arith-unscaled}
- \end{figure}
- Encoding an input text, or a sequence, is possible by projecting it onto a binary encoded fraction between 0 and 1. To get there, each character of the alphabet is represented by an interval between two fractions within the space between 0.0 and 1.0. In \ref{k4:arith-unscaled} this space is illustrated by the line in the upper center, with a scale from 0.0 on the left to 1.0 on the right. The interval for each symbol is determined by its distribution in the input text: it extends from the end of the previous symbol's interval (interval start) to the start of the next symbol's interval (interval end). The lengths of all intervals sum to one \cite{moffat_arith}.\\
- In order to remain in a presentable range, the example in \ref{k4:arith-unscaled} uses an alphabet of only three characters: \texttt{A, C and G}. For the sequence \texttt{AGAC}, the probability distribution shown in the upper left corner and listed in \ref{t:arith-prob} was calculated. The intervals resulting from these probabilities are visualized by the three sections marked by outward-pointing arrows at the top. The interval for \texttt{A} extends from 0.0 to the start of \texttt{C} at 0.5, which extends to the start of \texttt{G} at 0.75, and so on.\\
- \label{t:arith-prob}
- \sffamily
- \begin{footnotesize}
- \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.5\textwidth}}
- \caption[Probability distribution for arithmetic encoding example shown in \ref{k4:arith-unscaled}] % Caption für das Tabellenverzeichnis
- {Probabilities for \texttt{A,C and G} as shown in example \ref{k4:arith-unscaled}} % Caption für die Tabelle selbst
- \\
- \toprule
- \textbf{Symbol} & \textbf{Probability} & \textbf{Interval}\\
- \midrule
- A & $\frac{2}{4}=0.50$ & $[ 0.0,$ $ 0.5 ) $ \\ %${x\in \mathbb{Q} | 0.0 <= x < 0.5}$\\
- C & $\frac{1}{4}=0.25$ & $[ 0.5,$ $ 0.75 ) $ \\ %${x\in \mathbb{Q} | 0.5 <= x < 0.75}$\\
- G & $\frac{1}{4}=0.25$ & $[ 0.75,$ $ 1.0 ) $ \\ %${x\in \mathbb{Q} | 0.75 <= x < 1.0}$\\
- \bottomrule
- \end{longtable}
- \end{footnotesize}
- \rmfamily
- In the encoding process, the first symbol read from the sequence determines the interval that this symbol is associated with. Every following symbol determines a subinterval, which is formed by subdividing the previous interval into sections proportional to the probabilities from \ref{t:arith-prob}.
- Starting with \texttt{A}, the leftmost interval in \ref{k4:arith-unscaled} is subdivided into the intervals visualized below it, leaving an available space of $[0.0, 0.5)$. From there, the interval representing \texttt{G} is subdivided, and so on, until the last symbol \texttt{C} is processed. This leaves an interval of $[0.40625, 0.421875)$, marked in \ref{k4:arith-unscaled} with a red line. Since the interval is comparably small, the illustration appears to mark a single point; this is not the case, the red line shows the position of the last mentioned interval.\\
- %To encode a text, subdividing is used, step by step on the text symbols from start to the end
- To store the encoding result in as few bits as possible, only a single number between the lower and the upper end of the final interval is stored. To encode in binary, the binary floating point representation of a number inside the interval of the last character is calculated.\\
- In this example, the number \texttt{0.4140625} in decimal, or \texttt{0.0110101} in binary, could be chosen.\\
- %todo compression ratio
- To summarize the encoding process in short \cite{moffat_arith, witten87} (a sketch in code follows the list):\\
-
- \begin{itemize}
- \item The interval representing the first character is noted.
- \item This interval is split into smaller intervals, using the same ratios as the initial intervals between 0.0 and 1.0.
- \item The interval representing the second character is chosen.
- \item This process is repeated until an interval for the last character is determined.
- \item A binary floating point number is determined which lies within the interval that represents the last symbol.\\
- \end{itemize}
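The following sketch (illustrative only) reproduces the unscaled encoding of \texttt{AGAC} with exact fractions, following the interval subdivision described above and ignoring the finite-precision issues discussed below.
\begin{verbatim}
from fractions import Fraction as F

# Cumulative intervals for the example alphabet (see the probability table).
intervals = {"A": (F(0), F(1, 2)),      # [0.0, 0.5)
             "C": (F(1, 2), F(3, 4)),   # [0.5, 0.75)
             "G": (F(3, 4), F(1))}      # [0.75, 1.0)

def encode(text):
    low, high = F(0), F(1)
    for symbol in text:
        s_low, s_high = intervals[symbol]
        width = high - low
        # Subdivide the current interval proportionally to the symbol's interval.
        low, high = low + width * s_low, low + width * s_high
    return low, high

low, high = encode("AGAC")
print(low, high)                    # 13/32 27/64
print(float(low), float(high))      # 0.40625 0.421875
\end{verbatim}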
- % its finite subdividing because of the limitation that comes with processor architecture
- For the decoding process to work, the \ac{EOF} symbol must be present as the last symbol of the text. The compressed file stores the probabilities of each alphabet symbol as well as the floating point number. The decoding process follows a procedure similar to the encoding: the stored probabilities determine intervals, which are subdivided, using the encoded floating point number as guidance, until the \ac{EOF} symbol is found. By noting in which interval the floating point number falls for every new subdivision, and projecting the probabilities associated with the intervals back onto the alphabet, the original text can be read \cite{witten87, moffat_arith, ris76}.\\
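A matching decoder for the sketch above could look as follows; for simplicity, the number of symbols is passed explicitly instead of relying on an \ac{EOF} symbol.
\begin{verbatim}
from fractions import Fraction as F

intervals = {"A": (F(0), F(1, 2)),
             "C": (F(1, 2), F(3, 4)),
             "G": (F(3, 4), F(1))}

def decode(value, n_symbols):
    out = []
    for _ in range(n_symbols):
        for symbol, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:   # which interval contains the value?
                out.append(symbol)
                # Rescale the value into [0, 1) relative to that interval.
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)

print(decode(F(53, 128), 4))   # 53/128 = 0.4140625  ->  "AGAC"
\end{verbatim}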
- % rescaling
- \begin{figure}[H]
- \centering
- %\includegraphics[width=15cm]{k4/arith-resize.png}
- \includegraphics[width=15cm]{k4/arith-scaled.png}
- \caption{Illustrative rescaling in arithmetic coding process.}
- \label{k4:arith-scaled}
- \end{figure}
- % finite percission
- The described coding is only feasible on machines with infinite precision \cite{witten87}. As soon as finite precision comes into play, the algorithm must be extended so that a certain length of the resulting number is not exceeded. This is due to the fact that digital data types are limited in their capacity; an unsigned 64-bit integer, for example, can store any number between 0 and $2^{64}-1$, i.e. 18,446,744,073,709,551,615. That might seem like a large amount at first, but considering an unfavorable alphabet that extends the result's length by one bit for each symbol that is read, only sequences with a length of 63 symbols could be encoded (62 if \acs{EOF} is excluded) \cite{moffat_arith}. For compression with finite precision, rescaling is used. This method works by scaling up the intervals which result from subdividing. The rescaling process is illustrated in \ref{k4:arith-scaled}. The vertical lines illustrate the interval of each step, the smaller black lines between them indicate which previous section was scaled up, the red lines indicate the final interval, and the letters at the bottom indicate which symbol is encoded in each step.
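As a rough sketch of the rescaling idea, loosely following the integer-based scheme of \cite{witten87} and simplified here (a 16-bit range is assumed, underflow handling is omitted): whenever the current interval lies entirely in the lower or upper half of the coding range, one output bit is already determined and the interval can be doubled.
\begin{verbatim}
# Simplified renormalisation step of an integer arithmetic coder.
TOP = 1 << 16
HALF = TOP // 2

def renormalise(low, high, output_bits):
    while high < HALF or low >= HALF:
        if high < HALF:                  # interval in lower half -> emit 0
            output_bits.append(0)
            low, high = 2 * low, 2 * high
        else:                            # interval in upper half -> emit 1
            output_bits.append(1)
            low, high = 2 * (low - HALF), 2 * (high - HALF)
    return low, high

bits = []
print(renormalise(20000, 24000, bits), bits)
# -> (28928, 60928) [0, 1, 0]: the interval grows back, three bits are emitted
\end{verbatim}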
- \label{k4:huff}
- \subsection{Huffman Encoding}
- % list of algos and the tools that use them
- D. A. Huffman's work focused on finding a method to encode messages with a minimum of redundancy. He referenced a similar coding procedure developed by Shannon and Fano and named after its developers. The Shannon-Fano coding is not used today due to the superiority of Huffman's algorithm in both efficiency and effectiveness \cite{moffat_arith}.\\
- Even though his work was released in 1952, the method he developed is still in use today, not only in tools for genome compression but also in compression tools with a more general usage \cite{rfcgzip}.\\
- Compression with the Huffman algorithm also provides a solution to the problem described at the beginning of \ref{k4:arith}: waste through unused bit combinations for certain alphabet lengths. Huffman did not pack more than one symbol into a single bit, as is effectively done in arithmetic coding, but he decreased the average number of bit used per symbol in a message. This is possible by assigning individual bit lengths to the symbols used in the text that is to be compressed \cite{huf52}.
- As with other codings, a set of symbols must be defined. For any text constructed with symbols from the mentioned alphabet, a binary tree is constructed, which determines how each individual symbol is encoded. The binary tree is built according to the following guidelines (a sketch in code follows the list) \cite{alok17}:
- % greedy algo?
- \begin{itemize}
- \item Every symbol of the alphabet is one leaf.
- \item The right branch from every node is marked as a 1, the left one is marked as a 0.
- \item Every symbol is assigned a weight. The weight is defined by the frequency with which the symbol occurs in the input text. This can be a fraction between 0 and 1 or an integer; in this scenario it will be described as the former.
- \item The weight of a leaf corresponds to the probability that its symbol is read next in the symbol sequence.
- \item Pairs of the two lowest-weighted nodes are formed. Such a pair is from then on represented by a parent node whose weight is equal to the sum of the weights of its child nodes.
- \item Nodes with higher weight are positioned to the left, nodes with lower weight to the right.
- \end{itemize}
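A minimal sketch of this bottom-up construction, using a min-heap and the occurrence counts of the worked example below, could look like this; it reproduces the codes listed later in \ref{t:huff-post}.
\begin{verbatim}
import heapq

# Build Huffman codes from symbol weights by repeatedly pairing the two
# lowest-weighted subtrees (bottom-up construction with a min-heap).
def huffman_codes(weights):
    # Heap entries: (weight, tie_breaker, subtree); a subtree is a symbol or a pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, light = heapq.heappop(heap)
        w2, _, heavy = heapq.heappop(heap)
        count += 1
        # Heavier subtree to the left (bit 0), lighter to the right (bit 1).
        heapq.heappush(heap, (w1 + w2, count, (heavy, light)))
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2])

print(huffman_codes({"A": 11, "C": 71, "G": 13, "T": 5}))
# -> {'C': '0', 'A': '100', 'T': '101', 'G': '11'}
\end{verbatim}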
- %todo tree building explanation
- An often-mentioned difference between Shannon-Fano and Huffman coding is that the former works top down while the latter works bottom up, meaning Shannon-Fano starts with the highest probabilities while Huffman starts with the lowest \cite{moffat20, alok17}.\\
- Given \texttt{K(W,L)} as a node structure, with the weight or probability \texttt{$w_{i}$} and the codeword length \texttt{$l_{i}$} of the node \texttt{$K_{i}$}, \texttt{$L_{av}$} denotes the average codeword length for a finite chain of symbols whose distribution is mapped onto \texttt{W} \cite{huf52}.
- \begin{equation}\label{eq:huf}
- L_{av}=\sum_{i=0}^{n-1}w_{i}\cdot l_{i}
- \end{equation}
- The equation \eqref{eq:huf} describes the quantity the tree construction works to minimize. The upper bound \texttt{n} is the number of symbols in the alphabet. The tuple in any node \texttt{K} consists of a weight \texttt{$w_{i}$}, which also references a symbol, and the length of a codeword \texttt{$l_{i}$}. This codeword will later encode a single symbol from the alphabet. Working with digital codewords, each codeword is a sequence of zeros and ones. Since there is no fixed length for codewords in this coding method, the premise of a \texttt{prefix-free code} must be adhered to. This means no codeword may match any prefix of another codeword. To illustrate this: 0, 10, 11 would be a set of valid codewords, but adding a codeword like 01 or 00 would make the set invalid, because the prefix 0 is already a codeword on its own.\\
- With all important elements described: the sum resulting from this equation is the average length, in bit, that a symbol of the encoded input text will require to be stored \cite{huf52, moffat20}.
- % example
- % todo illustrate
- For this example, a four-letter alphabet containing \texttt{A, C, G and T} will be used. The binary representation of this alphabet encoded in \acs{ASCII} is listed in the second column of \ref{t:huff-pre}.
- The codeword length of any symbol encoded in \acs{ASCII} is eight bit, while only four of the available $2^8$ symbols are used, leaving an overhead of 252 unused bit combinations. For this example it is more illustrative to use an imaginary encoding format without overhead, which would result in a codeword length of two, because four symbols need a minimum of $\log_2 4 = 2$ bit each.\\
- \label{t:huff-pre}
- \sffamily
- \begin{footnotesize}
- \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
- \caption[Huffman example table no. 1 (pre encoding)] % Caption für das Tabellenverzeichnis
- {ASCII Codes and probabilities for \texttt{A,C,G and T}} % Caption für die Tabelle selbst
- \\
- \toprule
- \textbf{Symbol} & \textbf{\acs{ASCII} Code} & \textbf{Probability} & \textbf{Occurrences}\\
- \midrule
- A & 0100 0001 & $\frac{11}{100}=0.11$ & 11\\
- C & 0100 0011 & $\frac{71}{100}=0.71$ & 71\\
- G & 0100 0111 & $\frac{13}{100}=0.13$ & 13\\
- T & 0101 0100 & $\frac{5}{100}=0.05$ & 5\\
- \bottomrule
- \end{longtable}
- \end{footnotesize}
- \rmfamily
- The exact input text is not relevant, since only the resulting probabilities are needed. To make this example more illustrative, possible occurrence counts are listed in the rightmost column of \ref{t:huff-pre}. The probability of each symbol is calculated by dividing the number of times the symbol occurs by the message length. The resulting probabilities, on a scale between 0.0 and 1.0, are also shown in \ref{t:huff-pre} \cite{huf52}.\\
- Creating the tree is done bottom-up. In the first step, a node without any connections is formed for each symbol of the alphabet.\\
- \texttt{<A>, <T>, <C>, <G>}\\
- Starting with the two symbols of lowest weight, a node is added to connect both. With the added blank node, the count of available top-level nodes is reduced by one. The new node weighs as much as the sum of the weights of its child nodes, so the probability of 0.16 is assigned to \texttt{<A,T>}.\\
- \texttt{<A, T>, <C>, <G>}\\
- From there on, the two leaves will only get rearranged through the rearrangement of their temporary root node. Now the two lowest weights are paired as described, until only two subtrees or nodes are left, which can be combined under a root.\\
- \texttt{<G, <A, T> >, <C>}\\
- The subtree \texttt{<G, <A, T> >} has a probability of 0.29. Adding the last and highest-weighted node \texttt{C} results in a root node with a probability of 1.0.
- For a better understanding of this example, and to help further explanations, the resulting tree is illustrated in \ref{k4:huff-tree}.\\
- \begin{figure}[H]
- \centering
- \includegraphics[width=8cm]{k4/Huffman-tree.png}
- \caption{Final version of the Huffman tree for described example.}
- \label{k4:huff-tree}
- \end{figure}
- As illustrated in \ref{k4:huff-tree}, the left branches are assigned 0 and the right branches 1; following a path until a leaf is reached reveals the encoding of that particular leaf. With the corresponding tree created from the weights, the binary sequences encoding the alphabet can be seen in the second column of \ref{t:huff-post}.\\
- \label{t:huff-post}
- \sffamily
- \begin{footnotesize}
- \begin{longtable}[c]{ p{.2\textwidth} p{.2\textwidth} p{.2\textwidth}}
- \caption[Huffman example table no. 2 (post encoding)] % Caption für das Tabellenverzeichnis
- {Huffman codes for \texttt{A,C,G and T}} % Caption für die Tabelle selbst
- \\
- \toprule
- \textbf{Symbol} & \textbf{Huffman Code} & \textbf{Occurrences}\\
- \midrule
- A & 100 & 11\\
- C & 0 & 71\\
- G & 11 & 13\\
- T & 101 & 5\\
- \bottomrule
- \end{longtable}
- \end{footnotesize}
- \rmfamily
- Since leaves with high weight, and therefore frequent occurrence, are positioned to the left, short paths lead to them and only few bit are needed to encode them. Following the tree to the other side, the symbols occur more rarely, paths get longer and so do the codewords. Applying \eqref{eq:huf} to this example results in 1.45 bit per encoded symbol; compared to a fixed two-bit encoding, the text would thus require more than one bit less storage for every two symbols \cite{huf52}.\\
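For completeness, substituting the probabilities from \ref{t:huff-pre} and the codeword lengths from \ref{t:huff-post} into \eqref{eq:huf} gives:
\[
L_{av} = 0.11 \cdot 3 + 0.71 \cdot 1 + 0.13 \cdot 2 + 0.05 \cdot 3 = 1.45.
\]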
- % impacting the dark ground called reality
- Leaving theory and entering practice brings some details that lessen this improvement a bit. A few bytes are added by the need to store the information contained in the tree. Also, as described in \ref{chap:file formats}, most formats used for persisting \acs{DNA} store more than just nucleotides and therefore require more characters \cite{Cock_2009, sam12}.\\
- \section{Implementations in Relevant Tools}
- This section should give the reader an overview of how a small variety of compression tools implement the described compression algorithms. It is written with the goal of compensating for a problem that occurs in scientific papers, and sometimes in technical specifications of programs: they often lack information on the implementation to a satisfying extent \cite{sam12, geco, bam}.\\
- The information on the following pages was obtained through static code analysis, meaning the comprehension of a program's behaviour and its interactions through analysis of its source code. This is possible because the analysed tools are openly published and licensed under \ac{GPL} v3 \cite{geco} and \ac{MIT}/Expat \cite{bam}, which permit free use for scientific purposes \cite{gpl, mitlic}.\\
- \label{k4:geco}
- \subsection{\ac{GeCo}} % geco
- % differences between geco geco2 and geco3
- This tool has three development stages. The first, \acs{GeCo}, was released in 2016 and happens to have the smallest codebase, with only eleven C files. The two following extensions, \acs{GeCo}2, released in 2020, and the latest version, \acs{GeCo}3, have bigger codebases \cite{geco-repo}. They also provide features like the usage of a neural network, which are of no help for this work. Since the files providing the arithmetic coding functionality do not differ between the three versions, the first release was analyzed.\\
- % explain header files
- The header files, that this tool includes in \texttt{geco.c}, can be split into three categories: basic operations, custom operations and compression algorithms.
- The basic operations include header files for general purpose functions that can be found in almost any C or C++ project. The provided functionality includes operations for text output on the command line interface, memory management, random number generation and several calculations on numbers from natural to real.\\
- The custom operations happen to include general purpose functions too, with the difference that they were written, altered or extended by \acs{GeCo}'s developer. The last category consists of several C files containing implementations of two arithmetic coding variants: \textbf{first} \texttt{bitio.c} and \texttt{arith.c}, \textbf{second} \texttt{arith\_aux.c}.\\
- The first two were developed by John Carpinelli, Wayne Salamonsen, Lang Stuiver and Radford Neal. Comparing the two files, \texttt{bitio.c} has less code, shorter comments and many more non-functioning code sections. Overall, the likely conclusion is that \texttt{arith.c} is some kind of official release, whereas \texttt{bitio.c} serves as an experimental file for the developers to create proofs of concept. The described files adapt code from Armando J. Pinho, licensed by University of Aveiro DETI/IEETA and written in 1999.\\
- The second implementation was also licensed by University of Aveiro DETI/IEETA, but no author is mentioned. Judging from the function names and the length of the function bodies, \texttt{arith\_aux.c} could serve as a wrapper for basic functions that are often used in arithmetic coding.\\
- Since the original versions of the files licensed by University of Aveiro could not be found, there is no way to determine whether the files comply with their originals or whether changes have been made. This should be kept in mind while following the static analysis.
- Following the function calls in all three files led to the conclusion that the most important function is \texttt{arithmetic\_encode}, defined in \texttt{arith.c}. In this function the actual arithmetic encoding is executed. It has no redirects to other files and only one function call, \texttt{ENCODE\_RENORMALISE}; the remaining code consists of arithmetic operations only \cite{geco-repo}.
- % if there is a chance for improvement, this function should be consindered as a entry point to test improving changes.
- While following the function calls in the \texttt{compressor} section of \texttt{geco.c}, to find the locations where \texttt{arith.c} gets executed, no sign of multithreading could be identified. This fact leaves additional optimization possibilities.
- %useless? -> Both, \texttt{bitio.c} and \texttt{arith.c} are pretty simliar. They were developed by the same authors, execpt for Radford Neal who is only mentioned in \texttt{arith.c}, both are based on the work of A. Moffat \cite{moffat_arith}.
- %\subsection{genie} % genie
- \subsection{Samtools} % samtools
- \subsubsection{BAM}
- Compression in this format is done by an implementation called BGZF, which is a block compression on top of a widely used algorithm called DEFLATE.
- \label{k4:deflate}
- \paragraph{DEFLATE}
- % mix of Huffman and LZ77
- The DEFLATE compression algorithm combines \acs{LZ77} and Huffman coding. It is used in well-known tools like gzip.
- Data is split into blocks. Each block stores a header consisting of three bit. A single block can be stored in one of three forms, each of which is represented by an identifier that is stored in the last two bit of the header.
- \begin{itemize}
- \item \texttt{00} No compression.
- \item \texttt{01} Compressed with a fixed set of Huffman codes.
- \item \texttt{10} Compressed with dynamic Huffman codes.
- \end{itemize}
- The last combination, \texttt{11}, is reserved to mark a faulty block. The third, leading bit is set to flag the last data block \cite{rfc1951}, as sketched below.\\
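A small sketch of how these header bits could be interpreted (illustrative only, not taken from any tool's source):
\begin{verbatim}
# Interpret the three header bits of a DEFLATE block: one final-block
# flag plus a two-bit block type.
BLOCK_TYPES = {0b00: "no compression",
               0b01: "fixed Huffman codes",
               0b10: "dynamic Huffman codes",
               0b11: "reserved (faulty block)"}

def parse_block_header(final_flag, block_type):
    return {"last block": bool(final_flag),
            "form": BLOCK_TYPES[block_type]}

print(parse_block_header(1, 0b10))
# -> {'last block': True, 'form': 'dynamic Huffman codes'}
\end{verbatim}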
- % lz77 part
- %As described in \ref{k4:lz} the compression with \acs{LZ77} results in literals, a length for each literal and pointers that are represented by the distance between pointer and the literal it points to.
- The \acs{LZ77} algorithm is executed before the Huffman algorithm. The further compression steps differ from the algorithms already described and are covered in the remainder of this section.\\
- % Huffman part
- Besides the header bit and a data block, two Huffman code trees are stored: one encodes literals and lengths, the other distances. They are kept in a compact form. This is achieved by adding two rules on top of the rules described in \ref{k4:huff}: codes of identical length are ordered lexicographically, directed by the characters they represent, and shorter codes precede longer codes.
- To illustrate this with an example:
- For a text consisting of \texttt{C} and \texttt{G}, the following codes would be set for an encoding of two bit per character: \texttt{C}: 00, \texttt{G}: 01. With another character \texttt{A} in the alphabet, occurring more often than the other two characters, the codes would change to a representation like this:
- \sffamily
- \begin{footnotesize}
- \begin{longtable}[c]{ p{.3\textwidth} p{.3\textwidth}}
- \toprule
- \textbf{Symbol} & \textbf{Huffman code}\\
- \midrule
- A & 0\\
- C & 10\\
- G & 11\\
- \bottomrule
- \end{longtable}
- \end{footnotesize}
- \rmfamily
- Since \texttt{A} precedes \texttt{C} and \texttt{G}, it is represented by a 0. To maintain prefix-free codes, the two remaining codes must not begin with a 0. \texttt{C} precedes \texttt{G} lexicographically, therefore the (in a numerical sense) smaller code is set to represent \texttt{C}.\\
- With these simple rules, the alphabet itself can be compressed too: instead of storing the codes themselves, only the code lengths are stored \cite{rfc1951}. This might seem unnecessary when looking at a single compressed bulk of data, but when compressing many blocks of data, a smaller alphabet description can make a relevant difference.\\
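The following sketch reconstructs such canonical codes from code lengths only, applying the two ordering rules above; it is a simplified rendering of the procedure described in RFC 1951 \cite{rfc1951}.
\begin{verbatim}
# Rebuild canonical Huffman codes from code lengths alone.
# Symbols are sorted by (length, symbol); codes are assigned by counting up.
def canonical_codes(lengths):
    code = 0
    prev_len = 0
    codes = {}
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)    # step to the next code length
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"A": 1, "C": 2, "G": 2}))
# -> {'A': '0', 'C': '10', 'G': '11'}, the codes from the table above
\end{verbatim}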
- % example header, alphabet, data block?
- BGZF extends this by creating a series of blocks, each of which must not exceed a limit of 64 kilobyte. Each block contains a standard gzip file header, followed by compressed data.\\
- \subsubsection{CRAM}
- The improvement of \acs{BAM} \cite{cram-origin}, called \acs{CRAM}, also features a block structure \cite{bam}. The whole file can be separated into four sections, stored in ascending order: a file definition, a CRAM header container, multiple data containers and a final CRAM \acs{EOF} container.\\
- The complete structure is displayed in \ref{k4:cram-struct}. The following paragraph gives a brief description of the high-level view of a \acs{CRAM} file, illustrated as the uppermost bar, followed by a closer look at the data container, whose components are listed in the bar at the center of \ref{k4:cram-struct}. The most in-depth explanation is given for the bottom bar, which shows the structure of so-called slices.\\
- \begin{figure}[H]
- \centering
- \includegraphics[width=15cm]{k4/cram-structure.png}
- \caption{File Format Structure of Samtools \acs{CRAM} \cite{bam}.}
- \label{k4:cram-struct}
- \end{figure}
- The file definition, illustrated on the left side of the first bar in \ref{k4:cram-struct}, consists of 26 uncompressed bytes storing formatting information and an identifier. The CRAM header contains meta information about the data containers and is optionally compressed with gzip. This container can also contain an uncompressed, zero-padded section reserved for \acs{SAM} header information \cite{bam}. This saves time in case the compressed file is altered and its compression needs to be updated. The last container in a \acs{CRAM} file serves as an indicator that the \acs{EOF} is reached. Since information about the file and its structure is stored in addition, it can reach a maximum of 38 uncompressed bytes.\\
- A data container can be split into three sections, listed below; the section storing the actual sequence itself consists of blocks, displayed in \ref{k4:cram-struct} as the bottom row (a schematic sketch of this nesting follows the list).\\
- \begin{itemize}
- \item Container Header.
- \item Compression Header.
- \item A variable amount of Slices.
- \begin{itemize}
- \item Slice Header.
- \item Core Data Block.
- \item A variable amount of External Data Blocks.
- \end{itemize}
- \end{itemize}
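The nesting described above can be summarized schematically as follows; the field names are illustrative, and the byte-level layout defined in the specification is omitted.
\begin{verbatim}
from dataclasses import dataclass, field
from typing import List

# Schematic sketch of the CRAM nesting described above (illustrative names).
@dataclass
class Slice:
    slice_header: bytes
    core_data_block: bytes
    external_data_blocks: List[bytes] = field(default_factory=list)

@dataclass
class DataContainer:
    container_header: bytes      # how to decompress the following blocks
    compression_header: bytes    # what is stored, SAM-flag encoding details
    slices: List[Slice] = field(default_factory=list)

@dataclass
class CramFile:
    file_definition: bytes       # 26 uncompressed bytes
    cram_header_container: bytes
    data_containers: List[DataContainer] = field(default_factory=list)
    eof_container: bytes = b""
\end{verbatim}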
- The Container Header stores information on how to decompress the data stored in the following block sections. The Compression Header contains information about what kind of data is stored and some encoding information for \acs{SAM} specific flags \cite{bam}.
- The actual data is stored in the data blocks, which consist of encoded bit streams. According to the Samtools specification, the encoding can be one of the following: external, Huffman, or two other methods, which happen to be either a form of Huffman coding or a shortened binary representation of integers \cite{bam}. The external option allows the use of gzip; of bzip2, which combines multiple coding methods including run-length encoding and Huffman; of an encoding from the LZ family called LZMA; or of a combination of arithmetic and Huffman coding called rANS \cite{sam12}.
- % possible encodings:
- % external: no encoding or gzip, bzip2, lzma
- % Huffman
- % Byte array coding -> Huffman or external...
- % Beta coding -> binary representation
|