
reread before first prof presentation

u · 3 years ago
commit c9e24005b7

+ 2 - 2
latex/tex/docinfo.tex

@@ -39,7 +39,7 @@
 %          erkannt.
 
 
 % Kurze (maximal halbseitige) Beschreibung, worum es in der Arbeit geht auf Deutsch
-\newcommand{\hsmaabstractde}{Jemand musste Josef K. verleumdet haben, denn ohne dass er etwas Böses getan hätte, wurde er eines Morgens verhaftet. Wie ein Hund! sagte er, es war, als sollte die Scham ihn überleben. Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in seinem Bett zu einem ungeheueren Ungeziefer verwandelt. Und es war ihnen wie eine Bestätigung ihrer neuen Träume und guten Absichten, als am Ziele ihrer Fahrt die Tochter als erste sich erhob und ihren jungen Körper dehnte. Es ist ein eigentümlicher Apparat, sagte der Offizier zu dem Forschungsreisenden und überblickte mit einem gewissermaßen bewundernden Blick den ihm doch wohl bekannten Apparat. Sie hätten noch ins Boot springen können, aber der Reisende hob ein schweres, geknotetes Tau vom Boden, drohte ihnen damit und hielt sie dadurch von dem Sprunge ab. In den letzten Jahrzehnten ist das Interesse an Künstlern sehr zurückgegangen. Aber sie überwanden sich, umdrängten den Käfig und wollten sich gar nicht fortrühren.}
+\newcommand{\hsmaabstractde}{TBD.}
 
 
 % Kurze (maximal halbseitige) Beschreibung, worum es in der Arbeit geht auf Englisch
-\newcommand{\hsmaabstracten}{The European languages are members of the same family. Their separate existence is a myth. For science, music, sport, etc, Europe uses the same vocabulary. The languages only differ in their grammar, their pronunciation and their most common words. Everyone realizes why a new common language would be desirable: one could refuse to pay expensive translators. To achieve this, it would be necessary to have uniform grammar, pronunciation and more common words. If several languages coalesce, the grammar of the resulting language is more simple and regular than that of the individual languages. The new common language will be more simple and regular than the existing European languages. It will be as simple as Occidental; in fact, it will be Occidental. To an English person, it will seem like simplified English, as a skeptical Cambridge friend of mine told me what Occidental is.}
+\newcommand{\hsmaabstracten}{TBD.}

+ 9 - 2
latex/tex/kapitel/abkuerzungen.tex

@@ -5,7 +5,14 @@
 % ACHTUNG: Sie müssen die Abkürzungen von Hand alphabetisch
 %          sortieren. Das passiert nicht automatisch.
 \begin{acronym}[IEEE]
-  \acro{DNA}{Deoxyribonucleic acid}
-  \acro{ANS}{Arithmetic numeral system}
+  \acro{ANS}{Asymmetric Numeral Systems}
+  \acro{ASCII}{American Standard Code for Information Interchange}
+  \acro{BAM}{Binary Alignment Map}
+  \acro{CRAM}{Compressed Reference-oriented Alignment Map}
+  \acro{DNA}{Deoxyribonucleic Acid}
+  \acro{EOF}{End of File}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
+  \acro{IUPAC}{International Union of Pure and Applied Chemistry}
+  \acro{LZ77}{Lempel-Ziv 77}
+  \acro{SAM}{Sequence Alignment Map}
 \end{acronym}

+ 4 - 2
latex/tex/kapitel/k1_introduction.tex

@@ -1,7 +1,9 @@
 \chapter{Introduction}
 % general information and intro
-Understanding how things in our cosmos work was and still ist a pleasure the human being always wants to fullfill. Gettings insights into the rawest form of organic live is possible through reading and storing information embeded in genecode. Since live is complex, this information requires a lot of memory to be stored digitally. Communication with other researches means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to erorrs.\\
+Understanding how things in our cosmos work was and still is a desire humankind has always sought to fulfill. Getting insights into the rawest form of organic life is possible through storing and studying the information embedded in genetic code. Since life is complex, there is a lot of information, which requires a lot of memory.\\
+% [...] Communication with other researchers means sending huge chunks of data through cables or through waves over the air, which costs time and makes raw data vulnerable to errors.\\
 % compression values and goals
-With compression tools this problem is reduced. Compressed data requires less space, requires less time to be send over networks and do to its smaller size, is statistically a little less vulnerable to errors. This advantage is scaleable and since there is much to discover about genomes, new findings in this field are nothing unusuall. From some of this findings, new tools can be developed which optimally increase two factors: the speed at which data is compressed and the compresseion ratio, meaning the difference between uncompressed and compressed data.\\
+With compression tools, this problem is reduced. Compressed data requires less space and less time to be transported over networks. This advantage is scalable, and since there is much to discover about genomes, new findings in this field are nothing unusual. From some of these findings, new tools can be developed which optimally improve two factors: the speed at which data is compressed and the compression ratio, meaning the ratio between uncompressed and compressed data size.\\
+% [...]
 % more exact explanation
 New discoveries in the universal rules of the stochastic organisation of genomes might provide a base for new algorithms and therefore new tools for genome compression. The aim of this work is to analyze the current state of the art of probabilistic compression tools and their algorithms, and ultimately to determine whether the mentioned discoveries are already used. If this is not the case, there will be an analysis of how this new approach could improve compression methods.\\

+ 4 - 3
latex/tex/kapitel/k2_dna_structure.tex

@@ -18,10 +18,11 @@
 %- IMPACT ON COMPRESSION
 
 
 \chapter{Structure Of Biological Data}
-To strengthen the understanding of how and where biological information is stored, this section starts with a quick and general rundown on the structure of any living organism.
+To strengthen the understanding of how and where biological information is stored, this section starts with a quick and general rundown on the structure of any living organism.\\
 % todo add picture
 All living organisms, like plants and animals, are made of cells (a human body can consist of several trillion cells) \cite{cells}.
 A cell is itself a living organism; the smallest one possible. It has two layers, of which the inner one is called the nucleus. The nucleus contains chromosomes, and those chromosomes hold the genetic information in the form of \ac{DNA}. 
+% nucleosome and histone?
  
  
 \section{DNA}
 \ac{DNA} is often seen in the form of a double helix. A double helix consists, as the name suggests, of two single helices. 
@@ -33,9 +34,9 @@ A cell in itself is a living organism; The smalles one possible. It has two laye
   \label{k2:dna-struct}
 \end{figure}
 
 
-Each of them consists of two main components: the Suggar Phosphat backbone, which is irelavant for this Paper and the Bases. The arrangement of Bases represents the Information stored in the \ac{DNA}. A base is an organic molecule, they are called Nucleotides \cite{dna_structure}. %Nucleotides have special attributes and influence other Nucleotides in the \acs{DNA} Sequence
+Each of them consists of two main components: the sugar-phosphate backbone, which is irrelevant for this work, and the bases. The arrangement of bases represents the information stored in the \ac{DNA}. A base is an organic molecule; these molecules are called nucleotides \cite{dna_structure}. %Nucleotides have special attributes and influence other Nucleotides in the \acs{DNA} Sequence
 % describe Genomes?
 
 
 \section{Nucleotides}
-For this work, nucleotides are the most important parts of the \acs{DNA}. A Nucleotide can have one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them got a Counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine. For someone who whishes to persist this information it means the content of one helix can be determined by ``inverting'' the other one, in other words: the nucleotides of only one helix needs to be stored physically to save the information of the whole \ac{DNA}. The counterpart for e.g.: \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide but only use its initial: \texttt{AGA} in one Helix, \texttt{TCT} in the other.
+For this work, nucleotides are the most important parts of the \acs{DNA}. A nucleotide can have one of four forms: it can be either adenine, thymine, guanine or cytosine. Each of them has a counterpart with which a bond can be established: adenine can bond with thymine, guanine can bond with cytosine. For someone who wishes to persist this information, it means the content of one helix can be determined by ``inverting'' the other one; in other words, only the nucleotides of one (entire) helix need to be stored physically to save the information of the whole \ac{DNA}. The counterpart of e.g. an \texttt{adenine, guanine, adenine} chain would be a chain of \texttt{thymine, cytosine, thymine}. For the sake of simplicity, one does not write out the full name of each nucleotide, but only its initial: \texttt{AGA} in one helix, \texttt{TCT} in the other.
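To make the ``inverting'' step concrete, a minimal sketch in Python (illustration only; the dictionary and the function name are my own, not part of any tool discussed in this work):

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the complementary strand, e.g. 'AGA' -> 'TCT'."""
    return "".join(COMPLEMENT[base] for base in strand)

assert complement("AGA") == "TCT"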
 
 

+ 44 - 39
latex/tex/kapitel/k3_datatypes.tex

@@ -24,42 +24,50 @@
 % what is our focus (and maybe 'why')
 
 
 \chapter{Datatypes}
-% \section{}
-As described in previous chapters \ac{DNA} can be represented by a String with the buildingblocks A,T,G and C. Using a common fileformat for saving text would be impractical because the ammount of characters or symbols in the used alphabet, defines how many bits are used to store any symbol. 
-Storing a single \textit{A} with ASCII encoding requires 8 bit (excluding magic bytes and the \ac{EOF}) since there are at least 2 \times 8 or 128 displayable symbols.Since the \ac{DNA} buildingblocks only require a minimum of four letters, two bits are needed e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. More common Text encodings like unicode require 16 bits per letter. So settling with ASCII has improvement capabilitie but is, on the other side, more efficient than using bulkier alternatives like unicode.
-\\
-Several people and groups have developed different fileformats to store genomes. Unfortunally for this work, there is no defined standard filetype or set of filetypes therefor one has to gather information on which types exist and how they function by themself. In order to not go beyond scope, this work will focus only on fileformats that fullfill two factors:\\
-1. it has reputation, either through a scientific paper that proove its superiority by comparison with other, relevant tools or through a broad ussage of the format.//
-2. 
+As described in previous chapters, \ac{DNA} can be represented by a string over the building blocks A, T, G and C. Using a common file format for saving text would be impractical, because the number of characters or symbols in the used alphabet defines how many bits are used to store each single symbol.\\
+Storing a single \textit{A} with \ac{ASCII} encoding requires 8 bits (excluding magic bytes and the bytes used to mark the \ac{EOF}), since \ac{ASCII} defines $2^7$ or 128 displayable symbols. Since the \ac{DNA} building blocks only require a minimum of four letters, two bits are enough, e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. Depending on the sequencing method, more than four letters are used: the complex process of sequencing \ac{DNA} is not 100\% precise, so additional letters mark nucleotides that could not, or could only partly, be determined.\\
+More common everyday text encodings like Unicode in its UTF-16 form require 16 bits per letter. So settling for \ac{ASCII} leaves room for improvement but is, on the other hand, more efficient than using bulkier alternatives.\\
+
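The saving described above can be sketched directly (Python; the concrete bit assignments and the zero-padding of the last byte are assumptions for illustration, not prescribed by any of the formats discussed below):

CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}

def pack(seq: str) -> bytes:
    """Pack four nucleotides per byte instead of one per byte as in ASCII."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(group))  # zero-pad an incomplete tail group
        out.append(byte)
    return bytes(out)

assert pack("ATGC") == bytes([0b00011011])  # 8 bits instead of 32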
+Several people and groups have developed different file formats to store genomes. Unfortunately for this work, there is no defined standard file type or set of file types; therefore, one has to gather information on which types exist and how they function by oneself. In order to not go beyond scope, this work will focus only on file formats that fulfill the following factors:\\
+\begin{itemize}
+  \item{the format has a reputation, either through a scientific paper that proves its superiority to other relevant tools, or through broad usage of the format.}
+  \item{the format does not include \ac{IUPAC} codes besides A, C, G and T \autocite{iupac}.}
+  \item{the format is open source.}
+  \item{the compression method used in the format is based on probabilities.}
+\end{itemize}
+
+Some common file formats are:
 \begin{itemize}
 % which is relevant? 
   \item{FASTA}
-  \item{twoBit}
   \item{FASTQ}
+  \item{twoBit}
   \item{SAM/BAM}
   \item{VCF}
   \item{BED}
 \end{itemize}
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
 
 
-Since methods to store this kind of Data are still in development, there are many more filetypes. The few, mentioned above are used by different organisations and researchers and backed by a scientific publication. % todo find sources to both points in last sentence
+Since methods to store this kind of data are still in development, there are many more file types. The few mentioned above are used by different organisations and researchers and are backed by scientific publications. % todo find sources to both points in last sentence
+%rewrite:
 In order to not go beyond the scope, this paper will only focus on compression tools which use standard formats.
 
 
-\subsection{FASTQ}
+\section{FASTQ}
 FASTQ is a text-based format for storing sequenced data. It saves nucleotides as letters and, in addition, the corresponding quality values.
 FASTQ files are split into multiples of four lines; each group of four lines contains the information for one sequence. The exact structure of the FASTQ format is as follows:
 \texttt{
-Line 1: Sequence identifier aka. Title, starting with an @ and an optional description.
-Line 2: The seuqence consisting of nucleoids, symbolized by A, T, G and C.
-Line 3: A '+' that functions as a seperator. Optionally followed by content of Line 1.
-Line 4: quality line(s). consisting of letters and special characters in the ASCII scope.}
+Line 1: Sequence identifier aka. title, starting with an @ and an optional description.\\
+Line 2: The sequence consisting of nucleotides, symbolized by A, T, G and C.\\
+Line 3: A '+' that functions as a separator, optionally followed by the content of Line 1.\\
+Line 4: Quality line(s), consisting of letters and special characters in the \ac{ASCII} scope.}\\
 
 
-The quality values have no fixed type, to name a few there is the sanger format, the solexa format introduced by Solexa Inc., the Illumina and the QUAL format, generated by the PHRED software. 
+The quality values have no fixed type; to name a few, there is the Sanger format, the Solexa format introduced by Solexa Inc., the Illumina format, and the QUAL format, which is generated by the PHRED software. 
 The quality value shows the estimated probability of error in the sequencing process.
+[...]
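How the four-line layout is consumed can be sketched briefly (Python; a minimal reader that assumes well-formed records, while real parsers also handle wrapped sequence and quality lines):

from itertools import islice

def read_fastq(handle):
    """Yield (title, sequence, quality) triples from a FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        title, sequence, separator, quality = record  # separator line is '+'
        yield title.lstrip("@"), sequence, quality

# usage: with open("reads.fastq") as fh: first = next(read_fastq(fh))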
 
 
-\subsection{SAM/BAM}
+\section{SAM/BAM}
 % src https://github.com/samtools/samtools
-SAM Sequence Alignment/Map format, often just called BAM like its fileextension, is part of the SAMtools package, a uitlity tool for processing SAM/BAM and CRAM files. The SAM/BAM file is a text based format delimited by TABs. It uses 7-bit US-ASCII, to be precise Charset ANSI X3.4-1968 as defined in RFC1345. The structure is more complex than the one in FASTQ and described best, accompanied by an example:
+\ac{SAM}, often seen in its compressed, binary representation \ac{BAM} with the file extension \texttt{.bam}, is part of the SAMtools package, a utility for processing SAM/BAM and CRAM files. The SAM/BAM file is a text-based format delimited by tabs. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 as defined in RFC 1345. The structure is more complex than the one in FASTQ and is best described accompanied by an example:
 
 
 \begin{figure}[ht]
   \centering
@@ -68,35 +76,32 @@ SAM Sequence Alignment/Map format, often just called BAM like its fileextension,
   \label{k_datatypes:bam-struct}
 \end{figure}
 
 
-\subsubsection{Valid Symbols}
-%- char set restrictions: ASCII in range ! to ~ apart from '@'
+\subsection{Valid Symbols}
+%- char set restrictions: \ac{ascii} in range ! to ~ apart from '@'
 %- ... as regex (todo: describe in words): 
 % This RegEx -> \[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
 The regular expression shown above matches tuples of characters from a to z in lower and upper case, numbers, and several special characters. There are minor differences between the allowed special characters of the first and the following elements of the tuple; in the latter, '*' and '=' are allowed, which are not allowed in the first.
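The quoted pattern can be exercised directly (Python; the example read names are made up):

import re

# the QNAME pattern from the comment above; the first character class is stricter
QNAME = re.compile(r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*")

assert QNAME.fullmatch("read_42/1")
assert not QNAME.fullmatch("*read")  # '*' and '=' only allowed after the first character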
- 
-
-\subsubsection{Index File}
 
 
+%\subsection{Index File}
+ 
 % notes
-- may contian metadata: magic string, ref seq, bins, offsets, unmapped reads
-- allows viewing BAM data (localy and remote via ftp/http)
-- file extention: <filename>.bam.bai
+%- may contian metadata: magic string, ref seq, bins, offsets, unmapped reads
+%- allows viewing BAM data (localy and remote via ftp/http)
+%- file extention: <filename>.bam.bai
 
 
-- stores more data than FASTQ
+%- stores more data than FASTQ
  
  
 % src: https://support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm
-- allignment section includes
- - RG Read Group
- - BC Barcode Tag
- - SM Single-end alignment quality
- - AS Paired-end alignment quality
- - NM Edit distance tag
- - XN amplicon name tag
+%- allignment section includes
+% - RG Read Group
+% - BC Barcode Tag
+% - SM Single-end alignment quality
+% - AS Paired-end alignment quality
+% - NM Edit distance tag
+% - XN amplicon name tag
 
 
-- BAM index files nameschema: <filename>.bam.bai 
+%- BAM index files nameschema: <filename>.bam.bai 
 
 
-\subsection{CRAM - Compressed Reference-oriented Ailgnment Map}
-% src https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html#cram-format
-% ga4ah https://www.ga4gh.org/cram/
-A highly space efficient file format for sequenced data, maintained by \ac{GA4GH}. It features both lossy and lossless compression mode. Even though it is part of \ac{GA4GH} suite, the file format can be used independently.\\
-The basic idea behind this format, is to split data into smaller sections 
+\section{CRAM - Compressed Reference-oriented Alignment Map}
+\ac{CRAM} was developed as an alternative to the \ac{SAM} and \ac{BAM} formats. Its specification is maintained by the \ac{GA4GH}. It features both a lossy and a lossless compression mode. Since it is not relevant to this work, the lossy compression is ignored from here on. Even though it is part of the \ac{GA4GH} suite, the file format can be used independently.\\
+The format saves data in containers which consist of slices. Each slice is represented by a line in the file. Containers and slices each store metadata in a header. Data is stored as compressed blocks within slices.

+ 25 - 20
latex/tex/kapitel/k4_algorithms.tex

@@ -20,34 +20,39 @@
 \chapter{Compression Approaches}
 % begin with entropy encoding/shannons source coding theorem
 
 
-The process of compressing data serves the goal to generate an output that is smaller than its input data. In many cases, like in gene compressing, the compression is idealy lossless. This means it is possible for every compressed data, to receive the full information that were available in the origin data, by decompressing it. Lossy compression on the other hand, might excludes parts of data in the compression process, in order to increase the compression rate. The excluded parts are typicaly not necessary to transmit the origin information. This works with certain audio and pictures files or network protocols that are used to transmit video/audio streams live.
-For \acs{DNA} a lossless compression is needed. To be preceice a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete.
+The process of compressing data serves the goal of generating an output that is smaller than its input. In many cases, like in gene compression, the compression is ideally lossless. This means it is possible, for any compressed data, to recover by decompression the full information that was available in the original data. Lossy compression, on the other hand, may exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to transmit the original information. This works with certain audio and picture files or with network protocols which are used to transmit video/audio streams live.\\
+For storing \acs{DNA}, a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data: every nucleotide and its exact position is needed for the sequence to be complete and useful.\\
+
+\section{Arithmetic coding}
+Arithmetic coding is an approach to solve the problem of waste of memory, due to the overhead which is created by encoding certain lenghts of alphabets in binary. Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bits, one combination is left unused. So the full potential is not exhausted. Looking at it from another perspective, less storage would be required, if it would be possible to encode two letters with one bit and the other one with a combination of two bits. This approache is not possible because the letters would not be clearly distinguishable. The two bit letter could be interpreted as two one bit letters rather than the letter it should represent.
+% check this wording 'simulating' with sources 
+% this is called subdividing
+Arithmetic coding works by simulating a n-letter binary encoding for a n-letter alphabet. This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an interval between two floating point numbers between 0.0 and 1.0 (exclusively). This interval is determined by its distribution in the input text (interval start) and the the start of the next character (interval end).\\
+To encode a sequence of characters, the interval start of the first character is noted, its interval is split into smaller intervals, mapping the ratios of the initial intervals between 0.0 and 1.0. In this smaller distribution the interval representing the second character is choosen. This process is repeated for until a interval for the last character is determined.\\
+% explain abstract ussage to show the goal of splitting intervals
+To encode in binary, the floating point representation of a number inside the interval, for the last character is calculated. This is done by using a similar process to the one described above, called subdividing.
+% its finite subdividing because processors bottleneck floatingpoints 
 
 
 \section{Huffman encoding}
 % list of algos and the tools that use them
-The well known Huffman coding, is used in several Tools for genome compression. This subsection should give the reader a general impression how this algorithm works, without going into detail. To use Huffman coding one must first define an alphabet, in our case a four letter alphabet, containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree. With that, a few simple rules apply to the structure:
+The well-known Huffman coding is used in several tools for genome compression. This section should give the reader a general impression of how this algorithm works, without going into too much detail. To use Huffman coding, one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree. With that, a few simple rules apply to the structure:
 % binary view for alphabet
 % length n of sequence to compromize
 % greedy algo
 \begin{itemize}
-  \item every symbol of the alphabet is one leaf
-  \item the right branch from every not is marked as a 1, the left one is marked as a 0
-  \item every symbol got a weight, the weight is defined by the frequency the symbol occours in the input text
-  \item the less weight a node has, the higher the probability is, that this node is read next in the symbol sequence
+  \item every symbol of the alphabet is one leaf.
+  \item the right branch from every node is marked as a 1, the left one is marked as a 0.
+  \item every symbol has a weight; the weight is defined by the frequency with which the symbol occurs in the input text, just like the probabilities in the section on arithmetic coding.
+  \item the more weight a node has, the higher the probability that its symbol is read next in the symbol sequence.
 \end{itemize}
-The process of compressing starts with the nodes with the lowest weight and buids up to the hightest. Each step adds nodes to a tree where the most left branch should be the shortest and the most right the longest. The most left branch ends with the symbol with the highest weight, therefore occours the most in the input data.
-Following one path results in the binary representation for one symbol. For an alphabet like the one described above, the binary representation encoded in ASCI is shown here \texttt{A -> 01000001, C -> 01000011, G -> 01010100, T -> 00001010}. An imaginary sequence, that has this distribution of characters \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weighting would be calculated for each character by dividing one by the characters occurence. With a corresponding tree, created from with the weights, the binary data for each symbol would change to this \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree msut be saved for the decompression process.
+The process of compressing starts with the nodes that have the lowest weight and stepwise builds up to the highest. Each step adds nodes to a tree where the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol that has the highest weight and therefore occurs most often in the input data.\\
+Following one path results in the binary representation of one symbol. For an alphabet like the one described above, the binary representation encoded in \ac{ASCII} is \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with the following distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information, a weight would be derived for each character from its number of occurrences. With a corresponding tree, built from the weights, the binary code for each symbol would change to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}.\\
+Besides the compressed data, the information contained in the tree must be saved for the decompression process.
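The construction can be sketched in a few lines (Python; the heap-based merge is the textbook formulation and reproduces the example codes above, while the bit assignment per branch is my own choice):

import heapq

def huffman_codes(freqs: dict) -> dict:
    """Build codes by repeatedly merging the two lightest nodes."""
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)   # lightest subtree: prefix 0
        w1, _, right = heapq.heappop(heap)  # second lightest: prefix 1
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, count, merged))
        count += 1
    return heap[0][2]

assert huffman_codes({"A": 10, "C": 8, "G": 4, "T": 2}) == {"A": "0", "C": "11", "T": "100", "G": "101"}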
 
 
 % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
+\section{LZ77}
+\ac{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. Typically a repetition is replaced by an offset/length pair, whereby more than one byte can be used for the offset pointing back to the first occurrence, while usually less than one byte is required to store the length.
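A naive sketch of this matching (Python; the offset/length/next-character triple follows the classic formulation, the window size is an arbitrary assumption, and real coders use hashing instead of this linear search):

def lz77(data: str, window: int = 255):
    """Replace repetitions by (offset, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_off = best_len = 0
        for j in range(max(0, i - window), i):  # scan the sliding window
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

assert lz77("ACACACACAC") == [(0, 0, "A"), (0, 0, "C"), (2, 8, "")]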
 
 
-\section{Arithmetic coding}
-Arithmetic coding is an approach to solve the problem of waste of memory, due to the overhead which is created by encoding certain lenghts of alphabets in binary. Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required, if there would be a possibility to encode two letters in the alphabet with one bit and the other one with a two byte combination. This approache is not possible because the letters would not be clearly distinguishable. The two bit letter could be interpreted either as the letter it should represent or as two one bit letters.
-% check this wording 'simulating' with sources 
-% this is called subdividing
-Arithmetic coding works by simulating a n-letter binary encoding for a n-letter alphabet. This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an intervall between two floating point number in the space between 0.0 and 1.0 (exclusively), which is determined by its distribution in the input text (intervall start) and the the start of the next character (intervall end). To encode a sequence of characters, the intervall start of the character is noted, its intervall is split into smaller intervalls with the ratios of the initial intervalls between 0.0 and 1.0. With this, the second character is choosen. This process is repeated for until a intervall for the last character is choosen.\\
-To encode in binary, the binary floating point representation of a number inside the intervall, for the last character is calculated, by using a similar process, described above, called subdividing.
-% its finite subdividing because processors bottleneck floatingpoints 
-
-
-\section{Probability aproaches}
-
+\section{DEFLATE}
+% mix of huffman and lz77
+The DEFLATE compression algorithm combines \ac{LZ77} and Huffman coding \autocite{deflate}. It is used in well-known tools like gzip.
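Since DEFLATE is available in Python's standard library as zlib, its effect on repetitive nucleotide data can be illustrated directly (the resulting size is indicative only, not a measurement from this work):

import zlib

sequence = b"ACGT" * 1000
packed = zlib.compress(sequence, 9)         # DEFLATE: LZ77 matching + Huffman coding
assert zlib.decompress(packed) == sequence  # lossless round trip
print(len(sequence), "->", len(packed))     # 4000 -> roughly 30 bytes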

+ 32 - 0
latex/tex/kapitel/k5_.tex

@@ -0,0 +1,32 @@
+%SUMMARY
+%- ABSTRACT
+%- INTRODUCTION
+%# BASICS
+%- \acs{DNA} STRUCTURE
+%- DATA TYPES
+% - BAM/FASTQ
+% - NON STANDARD
+%- COMPRESSION APPROACHES
+% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
+% - HUFFMAN ENCODING
+% - PROBABILITY APPROACHES (WITH BASE?)
+%
+%# COMPARING TOOLS
+%- 
+%# POSSIBLE IMPROVEMENT
+%- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
+%- IMPACT ON COMPRESSION
+
+%\chapter{Analysis for Possible Compression Improvements}
+\chapter{Feasibility Analysis for a New Algorithm Considering the Stochastic Organisation of Genomes}
+
+% first thoughts:
+% - just save one nuceleotide every n bits
+% - save checksum for whole genome
+
+% - use algorithms (from new discoveries) to recreate genome
+% - check checksum -> finished : retry
+
+% - can run recursively and threaded
+
+% - im falle von testdata: hetzer, dedizierter hardware, auf server compilen, specs aufschreiben -> 'lscpu' || 'cat /proc/cpuinfo'

+ 21 - 151
latex/tex/literatur.bib

@@ -1,154 +1,3 @@
-@Online{Gao2017,
-  author        = {Gao, Liangcai and Yi, Xiaohan and Hao, Leipeng and Jiang, Zhuoren and Tang, Zhi},
-  title         = {{ICDAR 2017 POD Competition: Evaluation}},
-  url           = {http://www.icst.pku.edu.cn/cpdp/ICDAR2017_PODCompetition/evaluation.html},
-  urldate       = {2017-05-30},
-  bdsk-url-1    = {http://www.icst.pku.edu.cn/cpdp/ICDAR2017_PODCompetition/evaluation.html},
-  date-added    = {2017-06-19 19:21:12 +0000},
-  date-modified = {2017-06-19 19:21:12 +0000},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2017},
-}
-
-@Book{Kornmeier2011,
-  author        = {Martin Kornmeier},
-  title         = {Wissenschaftlich schreiben leicht gemacht},
-  edition       = {4. Auflage},
-  publisher     = {UTB},
-  date-added    = {2012-04-04 12:07:45 +0000},
-  date-modified = {2012-04-04 12:09:25 +0000},
-  keywords      = {Writing},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2011},
-}
-
-@Book{Kramer2009,
-  author        = {Walter Kr{\"a}mer},
-  title         = {Wie schreibe ich eine Seminar- oder Examensarbeit?},
-  edition       = {3. Auflage},
-  publisher     = {Campus Verlag},
-  date-added    = {2011-10-27 13:55:22 +0000},
-  date-modified = {2011-10-27 14:01:55 +0000},
-  keywords      = {Writing},
-  month         = {9},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2009},
-}
-
-@Book{Willberg1999,
-  author        = {Hans Peter Willberg and Friedrich Forssmann},
-  title         = {Erste Hilfe in Typographie},
-  publisher     = {Verlag Hermann Schmidt},
-  date-added    = {2011-11-10 08:58:09 +0000},
-  date-modified = {2012-01-24 19:24:12 +0000},
-  keywords      = {Writing},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {1999},
-}
-
-@Book{Forssman2002,
-  author        = {Friedrich Forssman and Ralf de Jong},
-  title         = {Detailtypografie},
-  publisher     = {Verlag Hermann Schmidt},
-  date-added    = {2012-01-24 19:20:46 +0000},
-  date-modified = {2012-01-24 19:21:56 +0000},
-  keywords      = {Writing},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2002},
-}
-
-@Online{Weber2006,
-  author        = {Stefan Weber},
-  title         = {Wissenschaft als Web-Sampling},
-  url           = {http://www.heise.de/tp/druck/mb/artikel/24/24221/1.html},
-  urldate       = {2011-10-27},
-  bdsk-url-1    = {http://www.heise.de/tp/druck/mb/artikel/24/24221/1.html},
-  date-added    = {2011-10-27 14:30:30 +0000},
-  date-modified = {2011-10-27 14:32:34 +0000},
-  journal       = {Telepolis},
-  keywords      = {Writing},
-  lastchecked   = {2011-10-27},
-  month         = {12},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2006},
-}
-
-@Online{Wikipedia_HarveyBalls,
-  author        = {{Harvey Balls}},
-  title         = {Harvey Balls -- Wikipedia},
-  url           = {https://de.wikipedia.org/w/index.php?title=Harvey_Balls&oldid=116517396},
-  urldate       = {2018-02-07},
-  date-added    = {2011-10-27 14:30:30 +0000},
-  date-modified = {2011-10-27 14:32:34 +0000},
-  lastchecked   = {2018-02-07},
-  month         = {4},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2013},
-}
-
-@Online{Volere,
-  author        = {{Volere Template}},
-  title         = {Snowcards -- Volere},
-  url           = {http://www.volere.co.uk},
-  urldate       = {2019-01-31},
-  date-added    = {2011-10-27 14:30:30 +0000},
-  date-modified = {2011-10-27 14:32:34 +0000},
-  lastchecked   = {2019-01-31},
-  month         = {1},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2018},
-}
-
-@TechReport{Barbacci2003,
-  author          = {Barbacci, Mario R. and Ellison, Robert and Lattanze, Anthony J. and Stafford, Judith A. and Weinstock, Charles B. and Wood, William G.},
-  institution     = {Software Engineering Institue - Carnegie Mellon},
-  title           = {{Quality Attribute Workshops (QAWs), Third Edition}},
-  number          = {August},
-  abstract        = {The Quality Attribute Workshop (QAW) is a facilitated method that engages system stake- holders early in the life cycle to discover the driving quality attributes of a software-intensive system. The QAW was developed to complement the Architecture Tradeoff Analysis Meth- odSM (ATAMSM) and provides a way to identify important quality attributes and clarify system requirements before the software architecture has been created. This is the third edition of a technical report describing the QAW. We have narrowed the scope of a QAW to the creation of prioritized and refined scenarios. This report describes the newly revised QAW and describes potential uses of the refined scenarios generated during it.},
-  address         = {Pttsburgh},
-  booktitle       = {Quality},
-  file            = {::},
-  keywords        = {QAW, Quality Attribute Workshop, attribute requirements, attribute tradeoffs, quality attributes, scenarios},
-  mendeley-groups = {SEI,Architecture},
-  ranking         = {rank1},
-  relevance       = {relevant},
-  year            = {2003},
-}
-
-@Book{Bass2003,
-  author    = {Bass, Len and Clements, Paul and Kazman, Rick},
-  title     = {{Software Architecture in Practice}},
-  edition   = {2nd editio},
-  publisher = {Addison-Wesley},
-  series    = {SEI Series in Software Engineering},
-  keywords  = {Architecture},
-  ranking   = {rank1},
-  relevance = {relevant},
-  year      = {2003},
-}
-
-@TechReport{ISO25010,
-  author      = {{International Organization for Standardization}},
-  institution = {International Organization for Standardization},
-  title       = {{Systems and software engineering -- Systems and software Quality Requirements -- and Evaluation (SQuaRE) -- System and software quality models}},
-  type        = {Standard},
-  address     = {Case postale 56, CH-1211 Geneva 20},
-  key         = {ISO/IEC 25010:2011(E)},
-  month       = mar,
-  ranking     = {rank1},
-  relevance   = {relevant},
-  volume      = {2011},
-  year        = {2011},
-}
-
 @Article{Al_Okaily_2017,
   author       = {Anas Al-Okaily and Badar Almarri and Sultan Al Yami and Chun-Hsi Huang},
   date         = {2017-04-01},
@@ -206,4 +55,25 @@
   publisher    = {Springer Science and Business Media {LLC}},
 }
 
 
+@Article{iupac,
+  author       = {Andrew D. Johnson},
+  date         = {2010-03},
+  journaltitle = {Bioinformatics},
+  title        = {An extended {IUPAC} nomenclature code for polymorphic nucleic acids},
+  doi          = {10.1093/bioinformatics/btq098},
+  number       = {10},
+  pages        = {1386--1389},
+  volume       = {26},
+  publisher    = {Oxford University Press ({OUP})},
+}
+
+@TechReport{deflate,
+  author    = {L Peter Deutsch},
+  date      = {1996-05},
+  title     = {{DEFLATE} Compressed Data Format Specification version 1.3},
+  doi       = {10.17487/rfc1951},
+  url       = {https://www.rfc-editor.org/rfc/rfc1951},
+  publisher = {{RFC} Editor},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}

+ 0 - 206
latex/tex/literatur.bib.sav.tmp

@@ -1,206 +0,0 @@
-@Online{Gao2017,
-  author        = {Gao, Liangcai and Yi, Xiaohan and Hao, Leipeng and Jiang, Zhuoren and Tang, Zhi},
-  title         = {{ICDAR 2017 POD Competition: Evaluation}},
-  url           = {http://www.icst.pku.edu.cn/cpdp/ICDAR2017_PODCompetition/evaluation.html},
-  urldate       = {2017-05-30},
-  bdsk-url-1    = {http://www.icst.pku.edu.cn/cpdp/ICDAR2017_PODCompetition/evaluation.html},
-  date-added    = {2017-06-19 19:21:12 +0000},
-  date-modified = {2017-06-19 19:21:12 +0000},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2017},
-}
-
-@Book{Kornmeier2011,
-  author        = {Martin Kornmeier},
-  title         = {Wissenschaftlich schreiben leicht gemacht},
-  edition       = {4. Auflage},
-  publisher     = {UTB},
-  date-added    = {2012-04-04 12:07:45 +0000},
-  date-modified = {2012-04-04 12:09:25 +0000},
-  keywords      = {Writing},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2011},
-}
-
-@Book{Kramer2009,
-  author        = {Walter Kr{\"a}mer},
-  title         = {Wie schreibe ich eine Seminar- oder Examensarbeit?},
-  edition       = {3. Auflage},
-  publisher     = {Campus Verlag},
-  date-added    = {2011-10-27 13:55:22 +0000},
-  date-modified = {2011-10-27 14:01:55 +0000},
-  keywords      = {Writing},
-  month         = {9},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2009},
-}
-
-@Book{Willberg1999,
-  author        = {Hans Peter Willberg and Friedrich Forssmann},
-  title         = {Erste Hilfe in Typographie},
-  publisher     = {Verlag Hermann Schmidt},
-  date-added    = {2011-11-10 08:58:09 +0000},
-  date-modified = {2012-01-24 19:24:12 +0000},
-  keywords      = {Writing},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {1999},
-}
-
-@Book{Forssman2002,
-  author        = {Friedrich Forssman and Ralf de Jong},
-  title         = {Detailtypografie},
-  publisher     = {Verlag Hermann Schmidt},
-  date-added    = {2012-01-24 19:20:46 +0000},
-  date-modified = {2012-01-24 19:21:56 +0000},
-  keywords      = {Writing},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2002},
-}
-
-@Online{Weber2006,
-  author        = {Stefan Weber},
-  title         = {Wissenschaft als Web-Sampling},
-  url           = {http://www.heise.de/tp/druck/mb/artikel/24/24221/1.html},
-  urldate       = {2011-10-27},
-  bdsk-url-1    = {http://www.heise.de/tp/druck/mb/artikel/24/24221/1.html},
-  date-added    = {2011-10-27 14:30:30 +0000},
-  date-modified = {2011-10-27 14:32:34 +0000},
-  journal       = {Telepolis},
-  keywords      = {Writing},
-  lastchecked   = {2011-10-27},
-  month         = {12},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2006},
-}
-
-@Online{Wikipedia_HarveyBalls,
-  author        = {{Harvey Balls}},
-  title         = {Harvey Balls -- Wikipedia},
-  url           = {https://de.wikipedia.org/w/index.php?title=Harvey_Balls&oldid=116517396},
-  urldate       = {2018-02-07},
-  date-added    = {2011-10-27 14:30:30 +0000},
-  date-modified = {2011-10-27 14:32:34 +0000},
-  lastchecked   = {2018-02-07},
-  month         = {4},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2013},
-}
-
-@Online{Volere,
-  author        = {{Volere Template}},
-  title         = {Snowcards -- Volere},
-  url           = {http://www.volere.co.uk},
-  urldate       = {2019-01-31},
-  date-added    = {2011-10-27 14:30:30 +0000},
-  date-modified = {2011-10-27 14:32:34 +0000},
-  lastchecked   = {2019-01-31},
-  month         = {1},
-  ranking       = {rank1},
-  relevance     = {relevant},
-  year          = {2018},
-}
-
-@TechReport{Barbacci2003,
-  author          = {Barbacci, Mario R. and Ellison, Robert and Lattanze, Anthony J. and Stafford, Judith A. and Weinstock, Charles B. and Wood, William G.},
-  institution     = {Software Engineering Institue - Carnegie Mellon},
-  title           = {{Quality Attribute Workshops (QAWs), Third Edition}},
-  number          = {August},
-  abstract        = {The Quality Attribute Workshop (QAW) is a facilitated method that engages system stake- holders early in the life cycle to discover the driving quality attributes of a software-intensive system. The QAW was developed to complement the Architecture Tradeoff Analysis Meth- odSM (ATAMSM) and provides a way to identify important quality attributes and clarify system requirements before the software architecture has been created. This is the third edition of a technical report describing the QAW. We have narrowed the scope of a QAW to the creation of prioritized and refined scenarios. This report describes the newly revised QAW and describes potential uses of the refined scenarios generated during it.},
-  address         = {Pttsburgh},
-  booktitle       = {Quality},
-  file            = {::},
-  keywords        = {QAW, Quality Attribute Workshop, attribute requirements, attribute tradeoffs, quality attributes, scenarios},
-  mendeley-groups = {SEI,Architecture},
-  ranking         = {rank1},
-  relevance       = {relevant},
-  year            = {2003},
-}
-
-@Book{Bass2003,
-  author    = {Bass, Len and Clements, Paul and Kazman, Rick},
-  title     = {{Software Architecture in Practice}},
-  edition   = {2nd editio},
-  publisher = {Addison-Wesley},
-  series    = {SEI Series in Software Engineering},
-  keywords  = {Architecture},
-  ranking   = {rank1},
-  relevance = {relevant},
-  year      = {2003},
-}
-
-@TechReport{ISO25010,
-  author      = {{International Organization for Standardization}},
-  institution = {International Organization for Standardization},
-  title       = {{Systems and software engineering -- Systems and software Quality Requirements -- and Evaluation (SQuaRE) -- System and software quality models}},
-  type        = {Standard},
-  address     = {Case postale 56, CH-1211 Geneva 20},
-  key         = {ISO/IEC 25010:2011(E)},
-  month       = mar,
-  ranking     = {rank1},
-  relevance   = {relevant},
-  volume      = {2011},
-  year        = {2011},
-}
-
-@Article{Al_Okaily_2017,
-  author       = {Anas Al-Okaily and Badar Almarri and Sultan Al Yami and Chun-Hsi Huang},
-  date         = {2017-04-01},
-  journaltitle = {Journal of Computational Biology},
-  title        = {Toward a Better Compression for {DNA} Sequences Using Huffman Encoding},
-  doi          = {10.1089/cmb.2016.0151},
-  number       = {4},
-  pages        = {280--288},
-  volume       = {24},
-  publisher    = {Mary Ann Liebert Inc},
-}
-
-@Online{bam,
-  author  = {The SAM/BAM Format Specification Working Group},
-  date    = {2022-08-22},
-  title   = {Sequence Alignment/Map Format Specification},
-  url     = {https://github.com/samtools/hts-specs},
-  urldate = {2022-09-12},
-  version = {44b4167},
-}
-
-@Article{Cock_2009,
-  author       = {Peter J. A. Cock and Christopher J. Fields and Naohisa Goto and Michael L. Heuer and Peter M. Rice},
-  date         = {2009-12},
-  journaltitle = {Nucleic Acids Research},
-  title        = {The Sanger {FASTQ} file format for sequences with quality scores, and the Solexa/Illumina {FASTQ} variants},
-  doi          = {10.1093/nar/gkp1137},
-  number       = {6},
-  pages        = {1767--1771},
-  volume       = {38},
-  publisher    = {Oxford University Press ({OUP})},
-}
-
-@Article{cells,
-  author       = {Eva Bianconi and Allison Piovesan and Federica Facchin and Alina Beraudi and Raffaella Casadei and Flavia Frabetti and Lorenza Vitale and Maria Chiara Pelleri and Simone Tassani and Francesco Piva and Soledad Perez-Amodio and Pierluigi Strippoli and Silvia Canaider},
-  date         = {2013-07},
-  journaltitle = {Annals of Human Biology},
-  title        = {An estimation of the number of cells in the human body},
-  doi          = {10.3109/03014460.2013.807878},
-  number       = {6},
-  pages        = {463--471},
-  volume       = {40},
-  publisher    = {Informa {UK} Limited},
-}
-
-@Article{dna_structure,
-  author       = {J. D. WATSON and F. H. C. CRICK},
-  date         = {1953-04},
-  journaltitle = {Nature},
-  title        = {Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid},
-  doi          = {10.1038/171737a0},
-  number       = {4356},
-  pages        = {737--738},
-  volume       = {171},
-  publisher    = {Springer Science and Business Media {LLC}},
-}

+ 1 - 1
latex/tex/thesis.tex

@@ -88,7 +88,7 @@
 % Abgabeform festlegen
 % Bei einer digitalen Abgabe, wird das Dokument einseitig erzeugt und der Titel wird
 % zentriert.
-\newcommand{\hsmaabgabe}{papier} % Wie erfolgt die Abgabe: "papier" oder "digital"?
+\newcommand{\hsmaabgabe}{digital} % Wie erfolgt die Abgabe: "papier" oder "digital"?
 
 
 % Preambel mit Einstellungen importieren
 \input{preambel}