
created feasibility chapter. Working on algorithms

u 3 years ago
Commit
6a901ec7f3

+ 1 - 0
latex/tex/kapitel/abkuerzungen.tex

@@ -7,6 +7,7 @@
 \begin{acronym}[IEEE]
   \acro{ANS}{Arithmetic Numeral System}
   \acro{ASCII}{American Standard Code for Information Interchange}
+  \acro{CABAC}{Context-Adaptive Binary Arithmetic Coding}
   \acro{CRAM}{Compressed Reference-oriented Alignment Map}
   \acro{DNA}{Deoxyribonucleic Acid}
   \acro{EOF}{End of File}

+ 1 - 0
latex/tex/kapitel/k3_datatypes.tex

@@ -24,6 +24,7 @@
 % what is our focus (and maybe 'why')
 
 \chapter{Datatypes}
+\label{chap:datatypes}
 As described in previous chapters, \ac{DNA} can be represented by a string over the building blocks A, T, G and C. Using a common file format for storing text would be impractical, because the number of characters or symbols in the alphabet defines how many bits are used to store each single symbol.\\
 Storing a single \textit{A} with \ac{ascii} encoding requires 8 bit (\,excluding magic bytes and the bytes used to mark \ac{EOF})\,, since \ac{ascii} defines $2^7$ or 128 symbols. The \ac{DNA} building blocks require only four letters, so two bits suffice, e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. Depending on the sequencing method, more than four letters are used: the complex process of sequencing \ac{DNA} is not 100\% precise, so additional letters are used to mark nucleotides that could not, or could only partly, be determined.\\
 More common everyday text encodings like UTF-16 require at least 16 bits per letter. So settling on \ac{ascii} leaves room for improvement but is, on the other hand, more efficient than bulkier alternatives.\\
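 To make the two-bit idea concrete, the following minimal Python sketch (illustrative only, using the mapping \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C} from above) packs a nucleotide string into two bits per symbol:
 \begin{verbatim}
 # Minimal sketch: pack a DNA string into 2 bits per nucleotide,
 # using the mapping from the text: A->00, T->01, G->10, C->11.
 CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}

 def pack(seq: str) -> bytes:
     out = bytearray()
     buf, nbits = 0, 0
     for base in seq:
         buf = (buf << 2) | CODE[base]
         nbits += 2
         if nbits == 8:          # a full byte holds four nucleotides
             out.append(buf)
             buf, nbits = 0, 0
     if nbits:                   # pad the last partial byte with zeros
         out.append(buf << (8 - nbits))
     return bytes(out)

 print(pack("ATGC").hex())       # '1b' = 00 01 10 11
 \end{verbatim}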

+ 39 - 19
latex/tex/kapitel/k4_algorithms.tex

@@ -18,41 +18,61 @@
 %- IMPACT ON COMPRESSION
 
 \chapter{Compression aproaches}
+The process of compressing data serves the goal of generating an output that is smaller than the input data. In many cases, as in gene compression, the compression is ideally lossless. This means that, for any compressed data, decompressing it recovers the full information that was available in the original data. Lossy compression, on the other hand, may exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to transmit the original information. This works with certain audio and picture files or network protocols that are used to transmit video/audio streams live.
+For \acs{DNA}, a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete.\\
+
 % begin with entropy encoding/shannons source coding theorem
+\section{Shannon's Entropy}
+The great mathematician, electrical engineer and cryptographer Claude Elwood Shannon developed information entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on the transmission of information. His theorem is applicable to almost any form of communication signal, and his findings are useful beyond information transmission alone.
+
+% todo insert Fig. 1 shannon_1948
+
+Altering this figure shows how it can be applied to other technology, such as compression.\\
+The information source and destination are left unchanged; one has to keep in mind that both may be represented by the same physical actor.
+Transmitter and receiver are changed to compression/encoding and decompression/decoding, and in between there is no signal but an arbitrary period of time.
 
-The process of compressing data serves the goal to generate an output, that is smaller than its input. In many cases, like in gene compressing, the compression is idealy lossless. This means it is possible with any compressed data, to receive the full information that were available in the origin data, by decompressing it. Lossy compression on the other hand, might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typicaly not necessary to transmit the origin information. This works with certain audio and pictures files or with network protocols which are used to transmit video/audio streams live.\\
-For storing \acs{DNA} a lossless compression is needed. To be preceice a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its exact position is needed for the sequence to be complete and usefull.\\
+Shannon's entropy provides a formula to determine the '(un)certainty of a probability distribution'. This is used today to find the minimum average number of bits needed to store information losslessly.
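+For a source that emits symbols $x_i$ with probabilities $p_i$, the entropy in bits per symbol is
+\begin{equation}
+  H = -\sum_{i} p_i \log_2 p_i .
+\end{equation}
+For four equally likely \acs{DNA} letters this gives $H = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2$ bits per symbol, matching the two-bit encoding from \autoref{chap:datatypes}.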
 % alphabet, chain of symbols, briefly explain entropy
 
 \section{Arithmetic coding}
-Arithmetic coding is an approach to solve the problem of waste of memory, due to the overhead which is created by encoding certain lenghts of alphabets in binary. Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bits, one combination is left unused. So the full potential is not exhausted. Looking at it from another perspective, less storage would be required, if it would be possible to encode two letters with one bit and the other one with a combination of two bits. This approache is not possible because the letters would not be clearly distinguishable. The two bit letter could be interpreted as two one bit letters rather than the letter it should represent.
+Arithmetic coding is an approach to solve the waste of memory caused by the overhead of encoding alphabets of certain lengths in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations of two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if it were possible to encode two of the letters with one bit each and the remaining one with a two-bit combination. This approach does not work, because the letters would no longer be clearly distinguishable: the two-bit letter could be interpreted either as the letter it should represent or as two one-bit letters.
 % check this wording 'simulating' with sources 
 % this is called subdividing
-Arithmetic coding works by simulating a n-letter binary encoding for a n-letter alphabet. This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an interval between two floating point numbers between 0.0 and 1.0 (exclusively). This interval is determined by its distribution in the input text (interval start) and the the start of the next character (interval end).\\
-To encode a sequence of characters, the interval start of the first character is noted, its interval is split into smaller intervals, mapping the ratios of the initial intervals between 0.0 and 1.0. In this smaller distribution the interval representing the second character is choosen. This process is repeated for until a interval for the last character is determined.\\
-% explain abstract ussage to show the goal of splitting intervals
-To encode in binary, the floating point representation of a number inside the interval, for the last character is calculated. This is done by using a similar process to the one described above, called subdividing.
+Arithmetic coding works by simulating an n-letter binary encoding for an n-letter alphabet. This is possible by projecting the input text onto a floating-point number. Every character in the alphabet is represented by an interval between two floating-point numbers in the space between 0.0 and 1.0 (exclusively), determined by the character's distribution in the input text (interval start) and the start of the next character (interval end). To encode a sequence of characters, the interval of the first character is noted and split into smaller intervals, mapping the ratios of the initial intervals between 0.0 and 1.0. Within this smaller distribution, the interval representing the second character is chosen. This process is repeated until an interval for the last character is chosen.\\
+To encode in binary, the binary floating-point representation of a number inside the last character's interval is calculated, using a similar process to the one described above, called subdividing.
 % its finite subdividing because processors bottleneck floatingpoints 
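+A minimal Python sketch of this interval narrowing, assuming a fixed (hypothetical) probability table and plain double-precision floats, which limits it to very short inputs since a real coder renormalises instead:
+\begin{verbatim}
+# Sketch of arithmetic coding's interval narrowing.
+# Illustrative only: plain floats restrict this to short inputs.
+def narrow(text, probs):
+    low, high = 0.0, 1.0
+    for sym in text:
+        span = high - low
+        cum = 0.0
+        for s, p in probs.items():      # probs must sum to 1.0
+            if s == sym:
+                high = low + span * (cum + p)
+                low = low + span * cum
+                break
+            cum += p
+    return low, high    # any number in [low, high) encodes the text
+
+print(narrow("AAT", {"A": 0.5, "T": 0.25, "G": 0.125, "C": 0.125}))
+# (0.125, 0.1875)
+\end{verbatim}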
 
+\subsection{\ac{CABAC}}
+% a form of entropy coding
+% https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
+
 \section{Huffman encoding}
 % list of algos and the tools that use them
-The well known Huffman coding, is used in several Tools for genome compression. This subsection should give the reader a general impression how this algorithm works, without going into to much details. To use Huffman coding one must first define an alphabet, in our case a four letter alphabet, containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree. With that, a few simple rules apply to the structure:
+The well-known Huffman coding is used in several tools for genome compression. This subsection should give the reader a general impression of how this algorithm works, without going into detail. To use Huffman coding one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree. With that, a few simple rules apply to the structure:
 % binary view for alphabet
 % length n of sequence to compromize
 % greedy algo
 \begin{itemize}
-  \item every symbol of the alphabet is one leaf.
-  \item the right branch from every node is marked as a 1, the left one is marked as a 0.
-  \item every symbol got a weight, the weight is defined by the frequency the symbol occours in the input text just like in the section Arithmetic coding.
-  \item the less weight a node has, the higher the probability is, that this node is read next in the symbol sequence.
+  \item every symbol of the alphabet is one leaf
+  \item the right branch from every node is marked as a 1, the left one is marked as a 0
+  \item every symbol gets a weight, defined by the frequency with which the symbol occurs in the input text
+  \item the less weight a node has, the lower the probability that its symbol is read next in the symbol sequence, and the longer its resulting code
 \end{itemize}
-The process of compressing starts with the nodes that has the lowest weight and stepwise builds up to the hightest. Each step adds nodes to a tree where the most left branch should be the shortest and the most right the longest. The most left branch ends with the symbol that has the highest weight, therefore occours the most in the input data.\\
-Following one path results in the binary representation for one symbol. For an alphabet like the one described above, the binary representation encoded in \ac{ascii} is shown here \texttt{A -> 01000001, C -> 01000011, G -> 01010100, T -> 00001010}. An imaginary sequence, that contains a distribution of characters like the following \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weighting would be calculated for each character by dividing one by the characters occurence. With a corresponding tree, build from the weights, the binary data for each symbol would change to this \texttt{A -> 0, C -> 11, T -> 100, G -> 101}.\\
-Besides the compressed data, the information contained in the tree msut be saved for the decompression process.
+The process of compressing starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to the tree, where the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, which therefore occurs most often in the input data.
+Following one path results in the binary representation of one symbol. For an alphabet like the one described above, the binary representation encoded in \ac{ascii} is \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information, each character's weight is simply its number of occurrences. With a corresponding tree built from these weights, the binary code for each symbol changes to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.
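+Using the frequencies from the example above, the greedy construction can be sketched in a few lines of Python (the textbook procedure, not the implementation of any of the tools discussed later; ties are broken arbitrarily, so only the code lengths are guaranteed):
+\begin{verbatim}
+# Sketch: build Huffman codes greedily from symbol frequencies.
+import heapq
+
+def huffman(freqs):
+    # heap entries: (weight, tiebreaker, subtree); leaves are symbols
+    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(freqs.items()))]
+    heapq.heapify(heap)
+    tiebreak = len(heap)
+    while len(heap) > 1:
+        w1, _, t1 = heapq.heappop(heap)  # merge the two lightest nodes
+        w2, _, t2 = heapq.heappop(heap)
+        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
+        tiebreak += 1
+    codes = {}
+    def walk(node, prefix):
+        if isinstance(node, str):        # leaf: a symbol
+            codes[node] = prefix or "0"
+        else:
+            walk(node[0], prefix + "0")  # left branch -> 0
+            walk(node[1], prefix + "1")  # right branch -> 1
+    walk(heap[0][2], "")
+    return codes
+
+print(huffman({"A": 10, "C": 8, "G": 4, "T": 2}))
+# {'A': '0', 'T': '100', 'G': '101', 'C': '11'}
+\end{verbatim}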
 
 % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
-\subsection{LZ77}
+\subsection{\ac{LZ77}}
 \ac{LZ77} basically works by removing repetitions of a string or substring and replacing them with a reference to where an earlier occurrence can be found and how long it is. Typically such a reference is stored in two bytes, whereby more than one byte can be used to point to the earlier occurrence, because usually less than one byte is required to store the length.
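 A naive Python sketch of this factorisation into (offset, length, next character) triples (illustrative only; real implementations use hash chains instead of this quadratic window search):
 \begin{verbatim}
 # Sketch: naive LZ77 factorisation into (offset, length, next) triples.
 def lz77(data, window=255):
     i, out = 0, []
     while i < len(data):
         best_off, best_len = 0, 0
         for j in range(max(0, i - window), i):  # scan the window
             k = 0
             while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                 k += 1
             if k > best_len:
                 best_off, best_len = i - j, k
         out.append((best_off, best_len, data[i + best_len]))
         i += best_len + 1
     return out

 print(lz77("GATTATTA"))
 # [(0,0,'G'), (0,0,'A'), (0,0,'T'), (1,1,'A'), (3,2,'A')]
 \end{verbatim}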
 
-\section{DEFLATE}
-% mix of huffman and lz77
-The DEFLATE compression algorithm combines \ac{lz77} and huffman coding. It is used in well known tools like gzip.
+\section{Implementations}
+% SAM - LZ4 src: https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
+% GeCo - arithmetic coding
+% Genie - CABAC
+
+% following text is irrelevant. Just describe used algorithms in the comparison chapter and refer to their base algorithm
+
+% mix of Huffman and lz77
+The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding. To be more specific, the raw data is first compressed with \ac{LZ77}, and the resulting data is then shortened using Huffman coding. It is used in well-known tools like gzip.
+% huffman - little endian
+% lz77 compressed - big endian (least significant byte first/most left)
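+Since DEFLATE is exposed through zlib, its effect is easy to demonstrate in Python (a quick illustration on repetitive input, not a benchmark):
+\begin{verbatim}
+# Quick illustration: DEFLATE (LZ77 + Huffman) via Python's zlib.
+import zlib
+
+raw = ("ATGC" * 64).encode("ascii")    # 256 bytes of repetitive "DNA"
+packed = zlib.compress(raw, level=9)
+print(len(raw), "->", len(packed), "bytes")
+assert zlib.decompress(packed) == raw  # lossless round trip
+\end{verbatim}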

+ 0 - 32
latex/tex/kapitel/k5_.tex

@@ -1,32 +0,0 @@
-%SUMMARY
-%- ABSTRACT
-%- INTRODUCTION
-%# BASICS
-%- \acs{DNA} STRUCTURE
-%- DATA TYPES
-% - BAM/FASTQ
-% - NON STANDARD
-%- COMPRESSION APPROACHES
-% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
-% - HUFFMAN ENCODING
-% - PROBABILITY APPROACHES (WITH BASE?)
-%
-%# COMPARING TOOLS
-%- 
-%# POSSIBLE IMPROVEMENT
-%- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
-%- IMPACT ON COMPRESSION
-
-%\chapter{Analysis for Possible Compression Improvements}
-\chapter{Feasibillity Analysis for New Algorithm Considering Stochastic Organisation of Genomes}
-
-% first thoughts:
-% - just save one nuceleotide every n bits
-% - save checksum for whole genome
-
-% - use algorithms (from new discoveries) to recreate genome
-% - check checksum -> finished : retry
-
-% - can run recursively and threaded
-
-% - im falle von testdata: hetzer, dedizierter hardware, auf server compilen, specs aufschreiben -> 'lscpu' || 'cat /proc/cpuinfo'

+ 91 - 0
latex/tex/kapitel/k5_feasability.tex

@@ -0,0 +1,91 @@
+%SUMMARY
+%- ABSTRACT
+%- INTRODUCTION
+%# BASICS
+%- \acs{DNA} STRUCTURE
+%- DATA TYPES
+% - BAM/FASTQ
+% - NON STANDARD
+%- COMPRESSION APPROACHES
+% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
+% - HUFFMAN ENCODING
+% - PROBABILITY APPROACHES (WITH BASE?)
+%
+%# COMPARING TOOLS
+%- 
+%# POSSIBLE IMPROVEMENT
+%- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
+%- IMPACT ON COMPRESSION
+
+%\chapter{Analysis for Possible Compression Improvements}
+\chapter{Feasibility Analysis for a New Algorithm Considering Stochastic Organisation of Genomes}
+
+% first thoughts:
+% - just save one nuceleotide every n bits
+% - save checksum for whole genome
+
+% - use algorithms (from new discoveries) to recreate genome
+% - check checksum -> finished : retry
+
+% - can run recursively and threaded
+
+% - for test data: hetzer, dedicated hardware, compile on the server, write down specs -> 'lscpu' || 'cat /proc/cpuinfo'
+
+The first attempt to determine the feasibility of this project consists of establishing baseline values against which further improvements can be measured. For the results to be reproducible, a few specifications must be known:\\
+CPU core information: \texttt{cat /proc/cpuinfo}\\
+
+Output for the last core:
+\begin{verbatim}
+processor       : 15
+vendor_id       : AuthenticAMD
+cpu family      : 23
+model           : 1
+model name      : AMD EPYC Processor (with IBPB)
+stepping        : 2
+microcode       : 0x1000065
+cpu MHz         : 2400.000
+cache size      : 512 KB
+physical id     : 15
+siblings        : 1
+core id         : 0
+cpu cores       : 1
+apicid          : 15
+initial apicid  : 15
+fpu             : yes
+fpu_exception   : yes
+cpuid level     : 13
+wp              : yes
+flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat arch_capabilities
+bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
+bogomips        : 4800.00
+TLB size        : 1024 4K pages
+clflush size    : 64
+cache_alignment : 64
+address sizes   : 48 bits physical, 48 bits virtual
+power management:
+\end{verbatim}
+Memory capacity (and more, to do): \texttt{dmidecode --type memory} or \texttt{dmidecode --type 17}
+\section{Pool of Tools}
+For an initial test, a small pool of three tools was chosen.
+\begin{itemize}
+  \item Samtools
+  \item GeCo
+  \item genie
+\end{itemize}
+Each of these tools complies with the criteria chosen in \autoref{chap:datatypes}.\\
+To test each tool, the same set of data was used. The genome of a homo sapiens, id: GRCh38, was chosen due to its size. TODO: find more exact criteria for test data.
+The test data is available via an open FTP server, hosted by Ensembl. Source: \url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
+Test parameters that were focused on:
+\begin{itemize}
+  \item Efficiency: the \textbf{duration} the process ran for
+  \item Effectiveness: the difference in \textbf{size} between input and compressed data
+  \item todo: error rate!
+\end{itemize}
+The first was captured by one of the following (TODO: choose):
+\begin{itemize}
+  \item a Linux tool that outputs the exact runtime (\texttt{time <cmd>})
+  \item an alteration in the C code that outputs the time at the start and end of the process run
+\end{itemize}
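+Either way, a small Python wrapper can capture the wall-clock runtime without modifying the tools. A sketch, where the command line and file name are placeholders rather than the actual invocations used later:
+\begin{verbatim}
+# Sketch: time an external compressor run.
+# The command and file name below are placeholders.
+import subprocess, time
+
+def timed_run(cmd):
+    start = time.perf_counter()
+    subprocess.run(cmd, check=True)
+    return time.perf_counter() - start
+
+seconds = timed_run(["gzip", "-k", "-9", "input.fa"])
+print(f"{seconds:.2f} s")
+\end{verbatim}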
+\section{Installation}
+\section{Alteration of Code to Determine Runtime}
+\section{Execution}
+\section{Data analysis}

+ 12 - 0
latex/tex/literatur.bib

@@ -76,4 +76,16 @@
   publisher = {{RFC} Editor},
 }
 
+@Article{Shannon_1948,
+  author       = {C. E. Shannon},
+  date         = {1948-07},
+  journaltitle = {Bell System Technical Journal},
+  title        = {A Mathematical Theory of Communication},
+  doi          = {10.1002/j.1538-7305.1948.tb01338.x},
+  number       = {3},
+  pages        = {379--423},
+  volume       = {27},
+  publisher    = {Institute of Electrical and Electronics Engineers ({IEEE})},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}

+ 2 - 1
latex/tex/thesis.tex

@@ -137,7 +137,8 @@
 \input{kapitel/k1_introduction} 
 \input{kapitel/k2_dna_structure}
 \input{kapitel/k3_datatypes} 
-\input{kapitel/k4_algorithms} % Externe Datei einbinden
+\input{kapitel/k4_algorithms}
+\input{kapitel/k5_feasability} 
 % ------------------------------------------------------------------
 
 \label{lastpage}