
raw finish k5. starting with result analysis

3 years ago
parent
commit
258394aee8

BIN
latex/result/thesis.pdf


BIN
latex/tex/bilder/k4/information_bits.png


BIN
latex/tex/bilder/k4/shannon_entropy.png


+ 4 - 0
latex/tex/kapitel/abkuerzungen.tex

@@ -11,10 +11,14 @@
   \acro{CRAM}{Compressed Reference-oriented Alignment Map}
   \acro{DNA}{Deoxyribonucleic Acid}
   \acro{EOF}{End of File}
+  \acro{FASTA}{File Format for Storing Genomic Data}
+  \acro{FASTQ}{File Format Based on FASTA}
+  \acro{FTP}{File Transfer Protocol}
   \acro{GA4GH}{Global Alliance for Genomics and Health}
   \acro{IUPAC}{International Union of Pure and Applied Chemistry}
   \acro{LZ77}{Lempel Ziv 1977}
   \acro{LZ78}{Lempel Ziv 1978}
+  \acro{RAM}{Random Access Memory} 
   \acro{SAM}{Sequence Alignment Map}
   \acro{BAM}{Binary Alignment Map}
 \end{acronym}

+ 2 - 1
latex/tex/kapitel/k2_dna_structure.tex

@@ -17,7 +17,8 @@
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
 
-\chapter{Structure of Human Genetic Data}
+\chapter{The Structure of the Human Genome and How Its Digital Form Is Compressed}
+\section{Structure of Human \ac{DNA}}
 To strengthen the understanding of how and where biological information is stored, this section starts with a quick and general rundown on the structure of any living organism.\\
 % todo add picture
 All living organisms, like plants and animals, are made of cells (a human body can consist of several trillion cells) \cite{cells}.

+ 6 - 3
latex/tex/kapitel/k3_datatypes.tex

@@ -18,12 +18,13 @@
 %- IMPACT ON COMPRESSION
 
 % todo: use this https://www.reddit.com/r/bioinformatics/comments/7wfdra/eli5_what_are_the_differences_between_fastq_and/
+
 % bigger picture - structure chapters like this:
 % what is it/how does it work
 % where are limits (e.g. BAM)
 % what is our focus (and maybe 'why')
 
-\chapter{File Types Used to Store DNA}
+\section{File Formats Used to Store DNA}
 \label{chap:filetypes}
 As described in previous chapters, \ac{DNA} can be represented by a string over the building blocks A, T, G and C. Using a common file format for saving text would be impractical, because the amount of characters or symbols in the used alphabet defines how many bits are used to store each single symbol.\\
 The \ac{ascii} table is a character set, registered in 1975 and to this day still in use to encode text digitally. For the purpose of communication, bigger character sets have replaced \ac{ascii}. It is still used in situations where storage is short.
@@ -63,7 +64,7 @@ Since methods to store this kind of Data are still in development, there are man
 are backed by scientific papers.\\
 Considering the first criterion, a search through anonymously accessible \ac{ftp} servers shows that only two formats are commonly used: FASTA or its extension \ac{FASTQ}, and the \ac{BAM} format. %todo <- add ftp servers to cite
 
-\section{\ac{FASTQ}}
+\subsection{\ac{FASTQ}}
 % todo add some fasta knowledge
 It is a text-based format for storing sequenced data. It saves nucleotides as letters and, in addition to that, the quality values.
 \ac{FASTQ} files are split into multiples of four lines; each group of four lines contains the information for one sequence. The exact structure of the \ac{FASTQ} format is as follows:
@@ -77,7 +78,7 @@ The quality values have no fixed type, to name a few there is the sanger format,
 The quality value shows the estimated probability of error in the sequencing process.
 [...]
 
-\section{Sequence Alignment Map}
+\subsection{Sequence Alignment Map}
 % src https://github.com/samtools/samtools
 \ac{SAM}, often seen in its compressed, binary representation \ac{BAM} with the file extension \texttt{.bam}, is part of the SAMtools package, a utility for processing SAM/BAM and CRAM files. The \ac{SAM} file is a text-based format delimited by tabs. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 as defined in RFC1345. The structure is more complex than the one in \ac{FASTQ} and is best described accompanied by an example:
 
@@ -87,3 +88,5 @@ The quality value shows the estimated probability of error in the sequencing pro
   \caption{SAM/BAM file structure example}
   \label{k_datatypes:bam-struct}
 \end{figure}
+
+

+ 64 - 14
latex/tex/kapitel/k4_algorithms.tex

@@ -20,51 +20,79 @@
 
 % entropy: fim doc grundlagen2 
 % dna nucleotide into one chapter -> structure of dna. also drop the chapter level (too generic)
-% file structure <-> datatypes. describe in more detail: e.g. File formats to store dna
+% file structure/format <-> datatypes. describe in more detail: e.g. File formats to store dna
 % remove 3.2.1
 
-\chapter{Compression aproaches}
-\chapter{Compression aproaches}
+
+\section{Compression Approaches}
 The process of compressing data serves the goal of generating an output that is smaller than its input data. In many cases, like in genome compression, the compression is ideally lossless. This means that for every compressed data it is possible to recover, by decompressing it, the full information that was available in the original data. Lossy compression, on the other hand, may exclude parts of the data in the compression process in order to increase the compression rate. The excluded parts are typically not necessary to transmit the original information. This works with certain audio and picture files or network protocols that are used to transmit video/audio streams live.
 For \acs{DNA} a lossless compression is needed. To be precise, a lossy compression is not possible, because there is no unnecessary data. Every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete. For lossless compression two major approaches are known: dictionary coding and entropy coding. Both are described in detail below.\\
 
-\section{Dictionary coding}
+\subsection{Dictionary Coding}
 Dictionary coding, as the name suggests, uses a dictionary to eliminate redundant occurrences of strings. A string is a chain of characters representing a full word or just a part of it. This is explained shortly for a better understanding of dictionary coding but is of no great relevance to the focus of this work:
 % digression
 Looking at the string 'stationary' it might be smart to store 'station' and 'ary' as separate dictionary entries. Which way is more efficient depends on the text that should get compressed. 
 % end digression
 The dictionary should only store strings that occur in the input data. Also, storing a dictionary in addition to the (compressed) input data would be a waste of resources. Therefore the dictionary is built from the input data. Each first occurrence is left uncompressed. Every occurrence of a string after the first one points to its first occurrence. Since this 'pointer' needs less space than the string it points to, a decrease in size is achieved.\\
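+
+As a constructed example: in the string \texttt{ACGTACGTAC}, the first four characters \texttt{ACGT} are stored literally, while the remaining six characters could be replaced by a single back-reference of the form (offset 4, length 6), meaning: go four positions back and copy six characters.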
 
+Unfortunately, known implementations like the ones from the LZ family do not use probabilities to compress and are therefore out of scope for this work. Since finding repetitions and their locations might also be improved, this section will remain.
+
 % unuseable due to the lack of probability
 \mycomment{
 % - known algo
-\subsection{The LZ Family}
+\subsubsection{The LZ Family}
 The computer scientist Abraham Lempel and the electrical engineer Jacob Ziv created multiple algorithms that are based on dictionary coding. They can be recognized by the substring \texttt{LZ} in their name, like \texttt{LZ77 and LZ78}, which are short for Lempel Ziv 1977 and 1978. The number at the end indicates when the algorithm was published. Today \ac{LZ77}-based methods are widely used in Unix compression solutions like gzip. Such tools, including gzip and bz2, are also used in compressing \ac{DNA}.\\
 \ac{LZ77} basically works by removing all repetitions of a string or substring and replacing them with information on where to find the first occurrence and how long it is. Typically this is stored in two bytes, whereby more than one byte can be used to point to the first occurrence, because usually less than one byte is required to store the length.\\
 % example 
 }
 
-\section{Shannons Entropy}
-The great mathematician, electrical engineer and cryptographer Claude Elwood Shannon developed information entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only usefull for forms of information transmition. 
+\subsection{Shannon's Entropy}
+The founder of information theory, Claude Elwood Shannon, described entropy and published it in 1948 \autocite{Shannon_1948}. In this work he focused on transmitting information. His theorem is applicable to almost any form of communication signal. His findings are not only useful for forms of information transmission. 
 
 % todo insert Fig. 1 shannon_1948
 
 Altering this figure shows how the model can be used for other technology like compression.\\
 The information source and destination are left unchanged; one has to keep in mind that both may be represented by the same physical actor. 
-transmitter and receiver are changed to compression/encoding and decompression/decoding and inbetween ther is no signal but any period of time.
+Transmitter and receiver are changed to compression/encoding and decompression/decoding, and in between there is no signal but an arbitrary period of time \autocite{Shannon_1948}.\\
+
+Shannon's entropy provides a formula to determine the 'uncertainty of a probability distribution' over a finite probability space.
+
+% H(X) := \sum_{x \in X,\ prob(x) \neq 0} prob(x) \cdot \log_2 \frac{1}{prob(x)} \equiv - \sum_{x \in X,\ prob(x) \neq 0} prob(x) \cdot \log_2 prob(x).
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=12cm]{k4/shannon_entropy.png}
+  \caption{Shannon's definition of entropy.}
+  \label{k4:entropy}
+\end{figure}
 
-Shannons Entropy provides a formular to determine the '(un)certainty of a probability distribution'. This is used today to find the maximum amount of bits needed to store information. 
+He defined entropy as shown in figure \ref{k4:entropy}. Let $X$ be a finite probability space. Then the $x \in X$ are the possible final states of a probability experiment over $X$. Every state that actually occurs while executing the experiment generates information, which is measured in \textit{Bits} with the part of the formula displayed in figure \ref{k4:info-in-bits} \autocite{delfs_knebl,Shannon_1948}:
+
+%\begin{math}
+%  \log_2 \frac{1}{prob(x)} \equiv - \log_2 prob(x).
+%\end{math} 
+\begin{figure}[H]
+  \centering
+  \includegraphics[width=8cm]{k4/information_bits.png}
+  \caption{The amount of information measured in bits, in case x is the end state of a probability experiment.}
+  \label{k4:info-in-bits}
+\end{figure}
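+
+As a short worked example with assumed, purely illustrative probabilities: for a four-symbol alphabet with $prob(A)=\frac{1}{2}$, $prob(C)=\frac{1}{4}$ and $prob(G)=prob(T)=\frac{1}{8}$, the entropy evaluates to
+\[
+  H(X) = \frac{1}{2}\log_2 2 + \frac{1}{4}\log_2 4 + 2 \cdot \frac{1}{8}\log_2 8 = 0.5 + 0.5 + 0.75 = 1.75
+\]
+bits per symbol, which is below the 2 bits per symbol that a fixed-length binary encoding of four symbols would require.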
+
+%todo explain 2.2 second bulletpoint of delfs_knebl. Maybe read gumbl book
+
+%This can be used to find the maximum amount of bits needed to store information.\\ 
 % alphabet, chain of symbols, briefly explain entropy
 
-\section{Arithmetic coding}
-Arithmetic coding is an approach to solve the problem of waste of memory, due to the overhead which is created by encoding certain lenghts of alphabets in binary. Encoding a three-letter alphabet requires at least two bit per letter. Since there are four possilbe combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required, if there would be a possibility to encode two letters in the alphabet with one bit and the other one with a two byte combination. This approache is not possible because the letters would not be clearly distinguishable. The two bit letter could be interpreted either as the letter it should represent or as two one bit letters.
+\subsection{Arithmetic Coding}
+Arithmetic coding is an approach to solve the problem of wasting memory due to the overhead created by encoding alphabets of certain lengths in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations with two bits, one combination is not used, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if there were a possibility to encode two letters of the alphabet with one bit and the remaining one with a two-bit combination. This approach is not possible because the letters would not be clearly distinguishable. The two-bit letter could be interpreted either as the letter it should represent or as two one-bit letters.
 % check this wording 'simulating' with sources 
 % this is called subdividing
-Arithmetic coding works by simulating a n-letter binary encoding for a n-letter alphabet. This is possible by projecting the input text on a floatingpoint number. Every character in the alphabet is represented by an intervall between two floating point number in the space between 0.0 and 1.0 (exclusively), which is determined by its distribution in the input text (intervall start) and the the start of the next character (intervall end). To encode a sequence of characters, the intervall start of the character is noted, its intervall is split into smaller intervalls with the ratios of the initial intervalls between 0.0 and 1.0. With this, the second character is choosen. This process is repeated for until a intervall for the last character is choosen.\\
+Arithmetic coding works by translating an n-letter alphabet into an n-letter binary encoding. This is possible by projecting the input text onto a floating-point number. Every character in the alphabet is represented by an interval between two floating-point numbers in the space between 0.0 and 1.0 (exclusively). This interval is determined by the character's distribution in the input text (interval start) and the start of the next character (interval end). To encode a sequence of characters, subdividing is used.
+% digression on subdividing?
+This means the interval start of the character is noted, and its interval is split into smaller intervals with the same ratios as the initial intervals between 0.0 and 1.0. With this, the second character is chosen. This process is repeated until an interval for the last character is chosen.\\
 To encode in binary, the binary floating-point representation of a number inside the interval of the last character is calculated, using a similar process to the one described above, called subdividing.
 % its finite subdividing because processors bottleneck floatingpoints 
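+
+A small constructed example (the probabilities are assumed for illustration only): let the alphabet be $\{A, C, G, T\}$ with $prob(A)=0.5$, $prob(C)=0.25$ and $prob(G)=prob(T)=0.125$, which yields the initial intervals
+\[
+  A \rightarrow [0.0, 0.5),\quad C \rightarrow [0.5, 0.75),\quad G \rightarrow [0.75, 0.875),\quad T \rightarrow [0.875, 1.0).
+\]
+To encode the sequence \texttt{CA}, the interval of \texttt{C}, $[0.5, 0.75)$, is subdivided with the same ratios, and the sub-interval belonging to \texttt{A}, $[0.5, 0.625)$, is selected. Any number inside $[0.5, 0.625)$, for example the binary fraction $0.1_2 = 0.5$, identifies this sequence, provided the length of the sequence (or an end symbol) is transmitted as well.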
 
-\section{Huffman encoding}
+\subsection{Huffman Encoding}
 % list of algos and the tools that use them
 The well-known Huffman coding is used in several tools for genome compression. This subsection should give the reader a general impression of how this algorithm works, without going into detail. To use Huffman coding one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree, to which a few simple rules apply:
 % binary view for alphabet
@@ -79,7 +107,29 @@ The well known Huffman coding, is used in several Tools for genome compression.
 The process of compressing starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to a tree where the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, which therefore occurs the most in the input data.
 Following one path results in the binary representation for one symbol. For an alphabet like the one described above, the binary representation encoded in ASCII is \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this distribution of characters: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weight would be calculated for each character based on its number of occurrences. With a corresponding tree, created from these weights, the binary data for each symbol would change to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.
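+
+Using the example numbers above, the gain can be quantified: the $10 + 8 + 4 + 2 = 24$ symbols would occupy $24 \cdot 8 = 192$ bits if every symbol were stored as an 8-bit character, while the Huffman codes only require $10 \cdot 1 + 8 \cdot 2 + 4 \cdot 3 + 2 \cdot 3 = 44$ bits, plus the space needed to store the tree.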
 
-\section{Implementations}
+\subsubsection{misc}
+
+%check if (small) text coding is done with this:
+\begin{itemize}
+  \item Arithmetic
+  \item Asymmetric numeral systems
+  \item Golomb
+  \item Huffman
+  \item Adaptive
+  \item Canonical
+  \item Modified
+  \item Range
+  \item Shannon
+  \item Shannon–Fano
+  \item Shannon–Fano–Elias
+  \item Tunstall
+  \item Unary
+  \item Universal
+  \item Exp-Golomb
+  \item Fibonacci
+  \item Gamma
+  \item Levenshtein
+\end{itemize}
+
+
+\section{Implementations in Relevant Tools}
 \subsection{} % geco
 \subsection{} % genie
 \subsection{} % samtools 

+ 167 - 59
latex/tex/kapitel/k5_feasability.tex

@@ -17,8 +17,174 @@
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
 
+% Structure:
+% - Focus/Goal  (why and what)
+% - Procedure   (what and how)
+% . Specs and used tools
+
 %\chapter{Analysis for Possible Compression Improvements}
-\chapter{Feasibillity Analysis for New Algorithm Considering Stochastic Organisation of Genomes}
+\chapter{Environment and Procedure to Determine the State-of-the-Art Efficiency and Compression Ratio of Relevant Tools}
+% goal define
+Since improvements must be measured, it is necessary to define beforehand a baseline which would need to be beaten. Others have dealt with this task several times with common algorithms and tools, and published their results. But since the test case that needs to be built for this work is rather uncommon in its compilation, the available data are not very useful. Therefore new test data must be created.\\
+The goal of this is to determine a baseline for the efficiency and effectivity of state-of-the-art tools used to compress \ac{DNA}. This baseline is set by two important factors:
+
+\begin{itemize}
+  \item Efficiency: the \textbf{Duration} the process ran for
+  \item Effectivity: the difference in \textbf{Size} between input and compressed data
+\end{itemize}
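+
+One way such values can be captured (a minimal sketch; \texttt{<compression command>}, \texttt{<input file>} and \texttt{<output file>} are placeholders for the tool actually under test):
+\begin{lstlisting}[language=bash]
+  time <compression command>          # real/user/sys runtime of the process
+  ls -l <input file> <output file>    # sizes in bytes before and after compression
+\end{lstlisting}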
+
+As a third point, the compliance that files were compressed losslessly should be verified. This is done by comparing the source file to a copy that was compressed and then decompressed again. If one of the two processes operates lossily, a difference between the source file and the copy should be recognizable. 
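+
+A minimal sketch of such a check (the decompression command depends on the tool under test; file names are placeholders):
+\begin{lstlisting}[language=bash]
+  <decompression command>             # restore a copy from the compressed file
+  cmp <source file> <restored copy> && echo "identical" || echo "files differ"
+\end{lstlisting}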
+
+%environment, test setup, raw results
+\section{Server Specifications and Test Environment}
+To be able to recreate this in the future, the relevant specifications and the commands that revealed this information are listed in this section.\\
+
+Reading from \texttt{/proc/cpuinfo} reveals the processor specifications. Since the information displayed in the other seven entries is largely redundant, only the last entry is shown. The relevant specifications are listed below:
+
+\noindent
+\begin{lstlisting}[language=bash]
+  cat /proc/cpuinfo
+\end{lstlisting}
+\begin{itemize}
+  \item available logical processors: 0 - 7
+  \item vendor: GenuineIntel
+  \item cpu family: 6
+  \item model nr, name: 58, Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz 
+  \item microcode: 0x15
+  \item MHz: 2280.874
+  \item cache size: 8192 KB
+  \item cpu cores: 4
+  \item fpu and fpu exception: yes
+  \item address sizes: 36 bits physical, 48 bits virtual
+\end{itemize}
+
+The full CPU specification can be found in the appendix. %todo finish
+
+% explanation on some entry: https://linuxwiki.de/proc/cpuinfo
+%\begin{em}
+%processor	: 7
+%vendor\_id	: GenuineIntel
+%cpu family	: 6
+%model		: 58
+%model name	: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
+%stepping	: 9
+%microcode	: 0x15
+%cpu MHz		: 2280.874
+%cache size	: 8192 KB
+%physical id	: 0
+%siblings	: 8
+%core id		: 3
+%cpu cores	: 4
+%apicid		: 7
+%initial apicid	: 7
+%fpu		: yes
+%fpu\_exception	: yes
+%cpuid level	: 13
+%wp		: yes
+%flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant\_tsc arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds\_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4\_1 sse4\_2 x2apic popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand lahf\_lm cpuid\_fault epb pti tpr\_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
+%vmx flags	: vnmi preemption\_timer invvpid ept\_x\_only flexpriority tsc\_offset vtpr mtf vapic ept vpid unrestricted\_guest
+%bugs		: cpu\_meltdown spectre\_v1 spectre\_v2 spec\_store\_bypass l1tf mds swapgs itlb\_multihit srbds mmio\_unknown
+%bogomips	: 6784.88
+%clflush size	: 64
+%cache\_alignment	: 64
+%address sizes	: 36 bits physical, 48 bits virtual
+%power management:
+%\end{em}
+
+The installed \ac{RAM} offered a total of 16GB, provided by four 4GB modules. 
+The specifications relevant for this paper are listed below:
+\noindent Command used to list the memory devices:
+\begin{lstlisting}[language=bash]
+   dmidecode --type 17
+\end{lstlisting}
+
+\begin{itemize}
+  \item{Total/Data Width: 64 bits}
+  \item{Size: 4GB}
+  \item{Type: DDR3}
+  \item{Type Detail: Synchronous}
+  \item{Speed/Configured Memory Speed: 1600 Megatransfers/s}
+\end{itemize}
+
+%dmidecode --type 17
+% ...
+%Handle 0x0062, DMI type 17, 34 bytes
+%Memory Device
+%	Array Handle: 0x0056
+%	Error Information Handle: Not Provided
+%	Total Width: 64 bits
+%	Data Width: 64 bits
+%	Size: 4 GB
+%	Form Factor: DIMM
+%	Set: None
+%	Locator: DIMM B2
+%	Bank Locator: BANK 3
+%	Type: DDR3
+%	Type Detail: Synchronous
+%	Speed: 1600 MT/s
+%	Manufacturer: Samsung
+%	Serial Number: 148A8133
+%	Asset Tag: 9876543210
+%	Part Number: M378B5273CH0-CK0  
+%	Rank: 2
+%	Configured Memory Speed: 1600 MT/s
+%
+
+\section{Operating System and Additionally Installed Packages}
+To keep the testing environment in a consistent state, processes running in the background that are not project-specific should be avoided. 
+Due to the following considerations, a current Linux distribution was chosen as a suitable operating system:
+\begin{itemize}
+  \item{factors that interfere with a consistent efficiency value should be avoided}
+  \item{packages, support and user experience should be present to a reasonable extent}
+\end{itemize}
+Some background processes will run while the compression analysis is done. This is owed to the demands of an increasingly complex operating system needed to execute complex programs. Considering that different tools will be executed in this environment, minimizing the background processes would require building a custom operating system or configuring an existing one to fit this specific use case. The time limitation of this work rules out these alternatives. 
+%By comparing the values of explaied factors, a sweet spot can be determined:
+% todo: add preinstalled package/programm count and other specs
+The chosen \textbf{Debian GNU/Linux} version \textbf{11} provides enough packages to run every tool without spending too much time on the setup.\\
+The graphical user interface and most other optional packages were omitted. The only additional package added during the installation process is the ssh server package. Furthermore, the packages required by the compression tools were installed. Finally, some additional packages were installed for the purpose of simplifying work processes and increasing the security of the environment.
+\begin{itemize}
+  \item{installation process: ssh-server}
+  \item{tool requirements: git, libhts-dev, autoconf, automake, cmake, make, gcc, perl, zlib1g-dev, libbz2-dev, liblzma-dev, libcurl4-gnutls-dev, libssl-dev, libncurses5-dev, libomp-dev}
+  \item{additional packages: ufw, rsync, screen, sudo} 
+\end{itemize}
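+
+On Debian, the listed packages could be installed with a command along these lines (a sketch; the package names are taken from the lists above, and the ssh server was already selected during the installation process):
+\begin{lstlisting}[language=bash]
+  apt-get update
+  apt-get install git libhts-dev autoconf automake cmake make gcc perl \
+    zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev \
+    libncurses5-dev libomp-dev ufw rsync screen sudo
+\end{lstlisting}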
+
+A complete list of installed packages as well as their individual versions can be found in the appendix. % todo appendix
+
+%user@debian raw$\ cat /etc/os-release 
+%PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
+%NAME="Debian GNU/Linux"
+%VERSION_ID="11"
+%VERSION="11 (bullseye)"
+%VERSION_CODENAME=bullseye
+%ID=debian
+%HOME_URL="https://www.debian.org/"
+%SUPPORT_URL="https://www.debian.org/support"
+%BUG_REPORT_URL="https://bugs.debian.org/"
+
+\section{Selection, Retrieval, and Preparation of Test Data}
+The following criteria are required for test data to be appropriate:
+\begin{itemize}
+  \item{The test file is in a format that all or at least most of the tools can work with.}
+  \item{The file is publicly available and free to use.}
+\end{itemize}
+Since there are multiple open \ac{FTP} servers which distribute a variety of files, finding a suitable one is rather easy. The Ensembl database met the defined criteria, so the first suitable files were chosen: Homo\_sapiens.GRCh38.dna.chromosome. This sample includes over 20 chromosomes, where, judging by the file names, each chromosome is contained in its own file. After retrieving and unpacking the files, write privileges on them were withdrawn, so no tool could alter any file contents.\\
+\noindent The following tools and parameters were used in this process:
+\begin{lstlisting}[language=bash]
+  $ wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
+  $ gzip -d ./*
+  $ chmod -w ./*
+\end{lstlisting}
+
+The chosen tools are able to handle the \ac{FASTA} format. However some, like samtools, require converting \ac{FASTA} into another format like \ac{SAM}.\\ Simply comparing the sizes is not sufficient; therefore both files are temporarily stripped of metadata and formatting, so the raw data of both files can be compared.
+
+% remove metadata: grep -E 'A|C|G|N' <sourcefile> > <destfile>
+% remove newlines: tr -d '\n' 
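+One possible way to script this stripping and comparison (a sketch; file names are placeholders, and the header filter assumes that \ac{FASTA} header lines start with the character \texttt{>}):
+\begin{lstlisting}[language=bash]
+  grep -v '^>' <source file>   | tr -d '\n' > source.raw    # drop headers and line breaks
+  grep -v '^>' <restored file> | tr -d '\n' > restored.raw
+  cmp source.raw restored.raw && echo "raw sequence data identical"
+\end{lstlisting}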
+
+% convert just once. test for losslessness?
+% get testdata: wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
+% unzip it: gzip -d ./*
+% withdraw write priv: chmod -w ./*
+
 
 % first thoughts:
 % - just save one nuceleotide every n bits
@@ -31,61 +197,3 @@
 
% - in the case of testdata: hetzer, dedicated hardware, compile on the server, write down specs -> 'lscpu' || 'cat /proc/cpuinfo'
 
-The first attempt to determine feasability of this project consists of setting basevalues, a further improvement can be meassured by. For this to be recreateable, a few specifications must be known:\\
-CPU Core information `cat /proc/cpuinfo`\\
-
-Output for the last core:\\
-
-processor	: 15
-vendor\_id	: AuthenticAMD
-cpu family	: 23
-model		: 1
-model name	: AMD EPYC Processor (with IBPB)
-stepping	: 2
-microcode	: 0x1000065
-cpu MHz		: 2400.000
-cache size	: 512 KB
-physical id	: 15
-siblings	: 1
-core id		: 0
-cpu cores	: 1
-apicid		: 15
-initial apicid	: 15
-fpu		: yes
-fpu\_exception	: yes
-cpuid level	: 13
-wp		: yes
-flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr\_opt pdpe1gb rdtscp lm rep\_good nopl cpuid extd\_apicid tsc\_known\_freq pni pclmulqdq ssse3 fma cx16 sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand hypervisor lahf\_lm cmp\_legacy cr8\_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr\_core ssbd ibpb vmmcall fsgsbase tsc\_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha\_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt\_ssbd arat arch\_capabilities
-bugs		: sysret\_ss\_attrs null\_seg spectre\_v1 spectre\_v2 spec\_store\_bypass
-bogomips	: 4800.00
-TLB size	: 1024 4K pages
-clflush size	: 64
-cache\_alignment	: 64
-address sizes	: 48 bits physical, 48 bits virtual
-power management:\\
-Memory capacity (and more todo list):
-`dmidecode --type memory` || `dmidecode --type 17`
-\section{Pool of Tools}
-For an initial test, a small pool of three tools was choosen. 
-\begin{itemize}
-  \item Samtools
-  \item GeCo
-  \item genie
-\end{itemize}
-Each of this tools comply with the criteria choosen in \autoref{chap:filetypes}.\\
-To test each tool, the same set of data were used. The genome of a homo sapien id: GRCh38 were chosen due to its size TODO: find more exact criteria for testdata.
-The Testdata is available via an open FTP Server, hotsed by ensembl. Source:\url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
-Testparameters that were focused on:
-\begin{itemize}
-  \item Efficiency: \textbf{Duration} the Process had run for
-  \item Effectivity: The difference in \textbf{Size} between input and compressed data
-  \item todo: fehlerquote!
-\end{itemize}
-First was captured by 
-TODO choose:
-- a linux tool to output the exact runtime (time <cmd>)
-- a alteration in the c code that outputs the time at start and end of the process runtime.
-\section{Installation}
-\section{Alteration of Code to Determine Runtime}
-\section{Execution}
-\section{Data analysis}

+ 8 - 0
latex/tex/kapitel/k6_results.tex

@@ -0,0 +1,8 @@
+% raw data and charts
+% differences in used algos/ algos in tools
+% optimization approach
+% further research focus
+% (how optimization would be recognizable in testdata)
+
+\chapter{Results and Discussion}
+

+ 30 - 0
latex/tex/literatur.bib

@@ -123,4 +123,34 @@
   url    = {https://samtools.github.io/hts-specs/BEDv1.pdf},
 }
 
+@InProceedings{compr-visual,
+  author    = {Sami Khuri and Hsiu-Chin Hsu},
+  booktitle = {Proceedings of the 2000 {ACM} symposium on Applied computing - {SAC} {\textquotesingle}00},
+  date      = {2000},
+  title     = {Tools for visualizing text compression algorithms},
+  doi       = {10.1145/335603.335716},
+  publisher = {{ACM} Press},
+}
+
+@Article{lcqs,
+  author       = {Jiabing Fu and Bixin Ke and Shoubin Dong},
+  date         = {2020-03},
+  journaltitle = {{BMC} Bioinformatics},
+  title        = {{LCQS}: an efficient lossless compression tool of quality scores with random access functionality},
+  doi          = {10.1186/s12859-020-3428-7},
+  number       = {1},
+  volume       = {21},
+  publisher    = {Springer Science and Business Media {LLC}},
+}
+
+@Book{delfs_knebl,
+  author    = {Delfs, Hans and Knebl, Helmut},
+  date      = {2007},
+  title     = {Introduction to Cryptography},
+  isbn      = {9783540492436},
+  pages     = {368},
+  publisher = {Springer},
+  subtitle  = {Principles and Applications (Information Security and Cryptography)},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}

+ 1 - 0
latex/tex/thesis.tex

@@ -139,6 +139,7 @@
 \input{kapitel/k3_datatypes} 
 \input{kapitel/k4_algorithms}
 \input{kapitel/k5_feasability} 
+\input{kapitel/k6_results} 
 % ------------------------------------------------------------------
 
 \label{lastpage}

+ 147 - 0
results/geco/compression.txt

@@ -0,0 +1,147 @@
+compressing file 1
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 230481012] ...
+Done!                          
+Total bytes: 46364770 (44.2 MB), 1.609 bpb, 1.609 bps w/ no header, Normalized Dissimilarity Rate: 0.804661
+Spent 235.005 sec.
+compressing file 2
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 240548228] ...
+Done!                          
+Total bytes: 49938168 (47.6 MB), 1.661 bpb, 1.661 bps w/ no header, Normalized Dissimilarity Rate: 0.830406
+Spent 246.503 sec.
+compressing file 3
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 198100135] ...
+Done!                          
+Total bytes: 41117340 (39.2 MB), 1.66 bpb, 1.66 bps w/ no header, Normalized Dissimilarity Rate: 0.830233
+Spent 201.69 sec.
+compressing file 4
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 189752667] ...
+Done!                          
+Total bytes: 39248276 (37.4 MB), 1.655 bpb, 1.655 bps w/ no header, Normalized Dissimilarity Rate: 0.827357
+Spent 194.081 sec.
+compressing file 5
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 181265378] ...
+Done!                          
+Total bytes: 37133480 (35.4 MB), 1.639 bpb, 1.639 bps w/ no header, Normalized Dissimilarity Rate: 0.819428
+Spent 183.878 sec.
+compressing file 6
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 170078522] ...
+Done!                          
+Total bytes: 35355184 (33.7 MB), 1.663 bpb, 1.663 bps w/ no header, Normalized Dissimilarity Rate: 0.831503
+Spent 173.646 sec.
+compressing file 7
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 158970131] ...
+Done!                          
+Total bytes: 31813760 (30.3 MB), 1.601 bpb, 1.601 bps w/ no header, Normalized Dissimilarity Rate: 0.800497
+Spent 159.999 sec.
+compressing file 8
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 144768136] ...
+Done!                          
+Total bytes: 30104816 (28.7 MB), 1.664 bpb, 1.664 bps w/ no header, Normalized Dissimilarity Rate: 0.831808
+Spent 148.288 sec.
+compressing file 9
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 121790550] ...
+Done!                          
+Total bytes: 23932541 (22.8 MB), 1.572 bpb, 1.572 bps w/ no header, Normalized Dissimilarity Rate: 0.786023
+Spent 123.04 sec.
+compressing file 10
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 133262962] ...
+Done!                          
+Total bytes: 27411806 (26.1 MB), 1.646 bpb, 1.646 bps w/ no header, Normalized Dissimilarity Rate: 0.822788
+Spent 134.937 sec.
+compressing file 11
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 134533742] ...
+Done!                          
+Total bytes: 27408185 (26.1 MB), 1.63 bpb, 1.63 bps w/ no header, Normalized Dissimilarity Rate: 0.814909
+Spent 136.299 sec.
+compressing file 12
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 133137816] ...
+Done!                          
+Total bytes: 27231126 (26.0 MB), 1.636 bpb, 1.636 bps w/ no header, Normalized Dissimilarity Rate: 0.818133
+Spent 134.932 sec.
+compressing file 13
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 97983125] ...
+Done!                          
+Total bytes: 20696778 (19.7 MB), 1.69 bpb, 1.69 bps w/ no header, Normalized Dissimilarity Rate: 0.844912
+Spent 99.9022 sec.
+compressing file 14
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 90568149] ...
+Done!                          
+Total bytes: 18676723 (17.8 MB), 1.65 bpb, 1.65 bps w/ no header, Normalized Dissimilarity Rate: 0.824869
+Spent 92.4753 sec.
+compressing file 15
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 84641325] ...
+Done!                          
+Total bytes: 16804782 (16.0 MB), 1.588 bpb, 1.588 bps w/ no header, Normalized Dissimilarity Rate: 0.794164
+Spent 85.2555 sec.
+compressing file 16
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 81805943] ...
+Done!                          
+Total bytes: 16005173 (15.3 MB), 1.565 bpb, 1.565 bps w/ no header, Normalized Dissimilarity Rate: 0.782592
+Spent 82.7651 sec.
+compressing file 17
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 82920204] ...
+Done!                          
+Total bytes: 15877526 (15.1 MB), 1.532 bpb, 1.532 bps w/ no header, Normalized Dissimilarity Rate: 0.765918
+Spent 82.0814 sec.
+compressing file 18
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 80089605] ...
+Done!                          
+Total bytes: 16344067 (15.6 MB), 1.633 bpb, 1.633 bps w/ no header, Normalized Dissimilarity Rate: 0.816289
+Spent 79.8429 sec.
+compressing file 19
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 58440758] ...
+Done!                          
+Total bytes: 10488207 (10.0 MB), 1.436 bpb, 1.436 bps w/ no header, Normalized Dissimilarity Rate: 0.717869
+Spent 58.6058 sec.
+compressing file 20
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 63944257] ...
+Done!                          
+Total bytes: 13074402 (12.5 MB), 1.636 bpb, 1.636 bps w/ no header, Normalized Dissimilarity Rate: 0.817862
+Spent 64.5884 sec.
+compressing file 21
+Analyzing data and creating models ...
+Done!
+Compressing target sequence 1 [bases: 40088619] ...
+Done!                          
+Total bytes: 7900773 (7.5 MB), 1.577 bpb, 1.577 bps w/ no header, Normalized Dissimilarity Rate: 0.788331
+Spent 41.1984 sec.

+ 23 - 0
results/geco/size.txt

@@ -0,0 +1,23 @@
+total 540004
+-rw-r--r-- 1 user user 46364770 Nov  2 20:59 Homo_sapiens.GRCh38.dna.chromosome.1.fa.co
+-rw-r--r-- 1 user user 27411806 Nov  2 21:25 Homo_sapiens.GRCh38.dna.chromosome.10.fa.co
+-rw-r--r-- 1 user user 27408185 Nov  2 21:27 Homo_sapiens.GRCh38.dna.chromosome.11.fa.co
+-rw-r--r-- 1 user user 27231126 Nov  2 21:29 Homo_sapiens.GRCh38.dna.chromosome.12.fa.co
+-rw-r--r-- 1 user user 20696778 Nov  2 21:31 Homo_sapiens.GRCh38.dna.chromosome.13.fa.co
+-rw-r--r-- 1 user user 18676723 Nov  2 21:32 Homo_sapiens.GRCh38.dna.chromosome.14.fa.co
+-rw-r--r-- 1 user user 16804782 Nov  2 21:34 Homo_sapiens.GRCh38.dna.chromosome.15.fa.co
+-rw-r--r-- 1 user user 16005173 Nov  2 21:35 Homo_sapiens.GRCh38.dna.chromosome.16.fa.co
+-rw-r--r-- 1 user user 15877526 Nov  2 21:37 Homo_sapiens.GRCh38.dna.chromosome.17.fa.co
+-rw-r--r-- 1 user user 16344067 Nov  2 21:38 Homo_sapiens.GRCh38.dna.chromosome.18.fa.co
+-rw-r--r-- 1 user user 10488207 Nov  2 21:39 Homo_sapiens.GRCh38.dna.chromosome.19.fa.co
+-rw-r--r-- 1 user user 49938168 Nov  2 21:03 Homo_sapiens.GRCh38.dna.chromosome.2.fa.co
+-rw-r--r-- 1 user user 13074402 Nov  2 21:40 Homo_sapiens.GRCh38.dna.chromosome.20.fa.co
+-rw-r--r-- 1 user user  7900773 Nov  2 21:41 Homo_sapiens.GRCh38.dna.chromosome.21.fa.co
+-rw-r--r-- 1 user user 41117340 Nov  2 21:06 Homo_sapiens.GRCh38.dna.chromosome.3.fa.co
+-rw-r--r-- 1 user user 39248276 Nov  2 21:09 Homo_sapiens.GRCh38.dna.chromosome.4.fa.co
+-rw-r--r-- 1 user user 37133480 Nov  2 21:12 Homo_sapiens.GRCh38.dna.chromosome.5.fa.co
+-rw-r--r-- 1 user user 35355184 Nov  2 21:15 Homo_sapiens.GRCh38.dna.chromosome.6.fa.co
+-rw-r--r-- 1 user user 31813760 Nov  2 21:18 Homo_sapiens.GRCh38.dna.chromosome.7.fa.co
+-rw-r--r-- 1 user user 30104816 Nov  2 21:20 Homo_sapiens.GRCh38.dna.chromosome.8.fa.co
+-rw-r--r-- 1 user user 23932541 Nov  2 21:22 Homo_sapiens.GRCh38.dna.chromosome.9.fa.co
+

+ 105 - 0
results/samtools/compression.txt

@@ -0,0 +1,105 @@
+compressing file 1
+
+real	0m2.767s
+user	0m2.492s
+sys	0m0.160s
+compressing file 2
+
+real	0m2.808s
+user	0m2.550s
+sys	0m0.136s
+compressing file 3
+
+real	0m2.298s
+user	0m2.099s
+sys	0m0.100s
+compressing file 4
+
+real	0m2.235s
+user	0m2.008s
+sys	0m0.133s
+compressing file 5
+
+real	0m2.114s
+user	0m1.891s
+sys	0m0.132s
+compressing file 6
+
+real	0m1.981s
+user	0m1.790s
+sys	0m0.104s
+compressing file 7
+
+real	0m1.853s
+user	0m1.669s
+sys	0m0.104s
+compressing file 8
+
+real	0m1.694s
+user	0m1.539s
+sys	0m0.080s
+compressing file 9
+
+real	0m1.506s
+user	0m1.343s
+sys	0m0.100s
+compressing file 10
+
+real	0m1.564s
+user	0m1.412s
+sys	0m0.084s
+compressing file 11
+
+real	0m1.566s
+user	0m1.397s
+sys	0m0.101s
+compressing file 12
+
+real	0m1.552s
+user	0m1.405s
+sys	0m0.080s
+compressing file 13
+
+real	0m1.229s
+user	0m1.110s
+sys	0m0.068s
+compressing file 14
+
+real	0m1.147s
+user	0m1.047s
+sys	0m0.052s
+compressing file 15
+
+real	0m1.073s
+user	0m0.957s
+sys	0m0.072s
+compressing file 16
+
+real	0m0.999s
+user	0m0.888s
+sys	0m0.068s
+compressing file 17
+
+real	0m0.973s
+user	0m0.874s
+sys	0m0.056s
+compressing file 18
+
+real	0m0.934s
+user	0m0.828s
+sys	0m0.064s
+compressing file 19
+
+real	0m0.702s
+user	0m0.627s
+sys	0m0.044s
+compressing file 20
+
+real	0m0.754s
+user	0m0.688s
+sys	0m0.033s
+compressing file 21
+
+real	0m0.517s
+user	0m0.454s
+sys	0m0.040s

+ 105 - 0
results/samtools/conversion.txt

@@ -0,0 +1,105 @@
+converting Homo_sapiens.GRCh38.dna.chromosome.1.fa
+
+real	0m1.019s
+user	0m0.284s
+sys	0m0.314s
+converting Homo_sapiens.GRCh38.dna.chromosome.2.fa
+
+real	0m0.976s
+user	0m0.311s
+sys	0m0.256s
+converting Homo_sapiens.GRCh38.dna.chromosome.3.fa
+
+real	0m0.825s
+user	0m0.220s
+sys	0m0.267s
+converting Homo_sapiens.GRCh38.dna.chromosome.4.fa
+
+real	0m0.776s
+user	0m0.235s
+sys	0m0.227s
+converting Homo_sapiens.GRCh38.dna.chromosome.5.fa
+
+real	0m0.748s
+user	0m0.214s
+sys	0m0.230s
+converting Homo_sapiens.GRCh38.dna.chromosome.6.fa
+
+real	0m0.704s
+user	0m0.204s
+sys	0m0.215s
+converting Homo_sapiens.GRCh38.dna.chromosome.7.fa
+
+real	0m0.650s
+user	0m0.203s
+sys	0m0.191s
+converting Homo_sapiens.GRCh38.dna.chromosome.8.fa
+
+real	0m0.592s
+user	0m0.178s
+sys	0m0.182s
+converting Homo_sapiens.GRCh38.dna.chromosome.9.fa
+
+real	0m0.572s
+user	0m0.168s
+sys	0m0.172s
+converting Homo_sapiens.GRCh38.dna.chromosome.10.fa
+
+real	0m0.563s
+user	0m0.141s
+sys	0m0.194s
+converting Homo_sapiens.GRCh38.dna.chromosome.11.fa
+
+real	0m0.566s
+user	0m0.141s
+sys	0m0.192s
+converting Homo_sapiens.GRCh38.dna.chromosome.12.fa
+
+real	0m0.563s
+user	0m0.154s
+sys	0m0.173s
+converting Homo_sapiens.GRCh38.dna.chromosome.13.fa
+
+real	0m0.466s
+user	0m0.128s
+sys	0m0.155s
+converting Homo_sapiens.GRCh38.dna.chromosome.14.fa
+
+real	0m0.445s
+user	0m0.109s
+sys	0m0.161s
+converting Homo_sapiens.GRCh38.dna.chromosome.15.fa
+
+real	0m0.434s
+user	0m0.139s
+sys	0m0.115s
+converting Homo_sapiens.GRCh38.dna.chromosome.16.fa
+
+real	0m0.391s
+user	0m0.101s
+sys	0m0.128s
+converting Homo_sapiens.GRCh38.dna.chromosome.17.fa
+
+real	0m0.333s
+user	0m0.101s
+sys	0m0.110s
+converting Homo_sapiens.GRCh38.dna.chromosome.18.fa
+
+real	0m0.343s
+user	0m0.097s
+sys	0m0.109s
+converting Homo_sapiens.GRCh38.dna.chromosome.19.fa
+
+real	0m0.258s
+user	0m0.074s
+sys	0m0.079s
+converting Homo_sapiens.GRCh38.dna.chromosome.20.fa
+
+real	0m0.272s
+user	0m0.082s
+sys	0m0.085s
+converting Homo_sapiens.GRCh38.dna.chromosome.21.fa
+
+real	0m0.204s
+user	0m0.046s
+sys	0m0.079s

+ 105 - 0
results/samtools/cram_compression.txt

@@ -0,0 +1,105 @@
+compressing file 1
+
+real    0m13.140s
+user    0m12.825s
+sys     0m0.209s
+compressing file 2
+
+real    0m13.259s
+user    0m12.952s
+sys     0m0.196s
+compressing file 3
+
+real    0m10.876s
+user    0m10.664s
+sys     0m0.120s
+compressing file 4
+
+real    0m10.434s
+user    0m10.206s
+sys     0m0.140s
+compressing file 5
+
+real    0m9.940s
+user    0m9.722s
+sys     0m0.136s
+compressing file 6
+
+real    0m9.330s
+user    0m9.110s
+sys     0m0.140s
+compressing file 7
+
+real    0m8.695s
+user    0m8.488s
+sys     0m0.132s
+compressing file 8
+
+real    0m7.958s
+user    0m7.758s
+sys     0m0.132s
+compressing file 9
+
+real    0m7.132s
+user    0m6.962s
+sys     0m0.112s
+compressing file 10
+
+real    0m7.334s
+user    0m7.146s
+sys     0m0.124s
+compressing file 11
+
+real    0m7.376s
+user    0m7.163s
+sys     0m0.148s
+compressing file 12
+
+real    0m7.341s
+user    0m7.161s
+sys     0m0.116s
+compressing file 13
+
+real    0m5.838s
+user    0m5.701s
+sys     0m0.088s
+compressing file 14
+
+real    0m5.419s
+user    0m5.289s
+sys     0m0.084s
+compressing file 15
+
+real    0m5.091s
+user    0m4.986s
+sys     0m0.064s
+compressing file 16
+
+real    0m4.699s
+user    0m4.569s
+sys     0m0.088s
+compressing file 17
+
+real    0m4.485s
+user    0m4.376s
+sys     0m0.068s
+compressing file 18
+
+real    0m4.326s
+user    0m4.217s
+sys     0m0.068s
+compressing file 19
+
+real    0m3.146s
+user    0m3.051s
+sys     0m0.064s
+compressing file 20
+
+real    0m3.481s
+user    0m3.395s
+sys     0m0.052s
+compressing file 21
+
+real    0m2.375s
+user    0m2.307s
+sys     0m0.044s

+ 22 - 0
results/samtools/size_bam.txt

@@ -0,0 +1,22 @@
+total 710684
+-rw-r--r-- 1 user user 62048289 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.1.bam
+-rw-r--r-- 1 user user 35855955 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.10.bam
+-rw-r--r-- 1 user user 35894133 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.11.bam
+-rw-r--r-- 1 user user 35580843 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.12.bam
+-rw-r--r-- 1 user user 26467775 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.13.bam
+-rw-r--r-- 1 user user 24284901 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.14.bam
+-rw-r--r-- 1 user user 22486646 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.15.bam
+-rw-r--r-- 1 user user 21568790 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.16.bam
+-rw-r--r-- 1 user user 21294270 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.17.bam
+-rw-r--r-- 1 user user 20684650 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.18.bam
+-rw-r--r-- 1 user user 14616042 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.19.bam
+-rw-r--r-- 1 user user 65391181 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.2.bam
+-rw-r--r-- 1 user user 16769658 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.20.bam
+-rw-r--r-- 1 user user 10477999 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.21.bam
+-rw-r--r-- 1 user user 53586949 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.3.bam
+-rw-r--r-- 1 user user 51457814 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.4.bam
+-rw-r--r-- 1 user user 48838053 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.5.bam
+-rw-r--r-- 1 user user 46216304 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.6.bam
+-rw-r--r-- 1 user user 42371043 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.7.bam
+-rw-r--r-- 1 user user 39107538 Nov  2 14:50 Homo_sapiens.GRCh38.dna.chromosome.8.bam
+-rw-r--r-- 1 user user 32708272 Nov  2 14:51 Homo_sapiens.GRCh38.dna.chromosome.9.bam

+ 22 - 0
results/samtools/size_cram.txt

@@ -0,0 +1,22 @@
+total 640916
+-rw-r--r-- 1 user user 55769827 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.1.cram
+-rw-r--r-- 1 user user 32238052 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.10.cram
+-rw-r--r-- 1 user user 32529673 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.11.cram
+-rw-r--r-- 1 user user 32166751 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.12.cram
+-rw-r--r-- 1 user user 23568321 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.13.cram
+-rw-r--r-- 1 user user 21887811 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.14.cram
+-rw-r--r-- 1 user user 20493276 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.15.cram
+-rw-r--r-- 1 user user 19895937 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.16.cram
+-rw-r--r-- 1 user user 20177456 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.17.cram
+-rw-r--r-- 1 user user 19310998 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.18.cram
+-rw-r--r-- 1 user user 14251243 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.19.cram
+-rw-r--r-- 1 user user 58026123 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.2.cram
+-rw-r--r-- 1 user user 15510100 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.20.cram
+-rw-r--r-- 1 user user  9708258 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.21.cram
+-rw-r--r-- 1 user user 47707954 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.3.cram
+-rw-r--r-- 1 user user 45564837 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.4.cram
+-rw-r--r-- 1 user user 43655371 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.5.cram
+-rw-r--r-- 1 user user 40980906 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.6.cram
+-rw-r--r-- 1 user user 38417108 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.7.cram
+-rw-r--r-- 1 user user 34926945 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.8.cram
+-rw-r--r-- 1 user user 29459829 Nov  3 13:05 Homo_sapiens.GRCh38.dna.chromosome.9.cram

+ 22 - 0
results/samtools/size_sam.txt

@@ -0,0 +1,22 @@
+total 2758040
+-rw-r--r-- 1 user user 248956528 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.1.sam
+-rw-r--r-- 1 user user 133797529 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.10.sam
+-rw-r--r-- 1 user user 135086729 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.11.sam
+-rw-r--r-- 1 user user 133275416 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.12.sam
+-rw-r--r-- 1 user user 114364435 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.13.sam
+-rw-r--r-- 1 user user 107043825 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.14.sam
+-rw-r--r-- 1 user user 101991296 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.15.sam
+-rw-r--r-- 1 user user  90338452 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.16.sam
+-rw-r--r-- 1 user user  83257548 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.17.sam
+-rw-r--r-- 1 user user  80373392 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.18.sam
+-rw-r--r-- 1 user user  58617723 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.19.sam
+-rw-r--r-- 1 user user 242193635 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.2.sam
+-rw-r--r-- 1 user user  64444274 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.20.sam
+-rw-r--r-- 1 user user  46710090 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.21.sam
+-rw-r--r-- 1 user user 198295665 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.3.sam
+-rw-r--r-- 1 user user 190214661 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.4.sam
+-rw-r--r-- 1 user user 181538365 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.5.sam
+-rw-r--r-- 1 user user 170806085 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.6.sam
+-rw-r--r-- 1 user user 159346079 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.7.sam
+-rw-r--r-- 1 user user 145138742 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.8.sam
+-rw-r--r-- 1 user user 138394823 Nov  2 14:32 Homo_sapiens.GRCh38.dna.chromosome.9.sam

+ 23 - 0
results/src_size.txt

@@ -0,0 +1,23 @@
+total 2854464
+-r--r--r-- 1 user user 253105752 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.1.fa
+-r--r--r-- 1 user user 136027438 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.10.fa
+-r--r--r-- 1 user user 137338124 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.11.fa
+-r--r--r-- 1 user user 135496623 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.12.fa
+-r--r--r-- 1 user user 116270459 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.13.fa
+-r--r--r-- 1 user user 108827838 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.14.fa
+-r--r--r-- 1 user user 103691101 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.15.fa
+-r--r--r-- 1 user user  91844042 Jun  4 10:51 Homo_sapiens.GRCh38.dna.chromosome.16.fa
+-r--r--r-- 1 user user  84645123 Jun  4 10:51 Homo_sapiens.GRCh38.dna.chromosome.17.fa
+-r--r--r-- 1 user user  81712897 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.18.fa
+-r--r--r-- 1 user user  59594634 Jun  4 10:52 Homo_sapiens.GRCh38.dna.chromosome.19.fa
+-r--r--r-- 1 user user 246230144 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.2.fa
+-r--r--r-- 1 user user  65518294 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.20.fa
+-r--r--r-- 1 user user  47488540 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.21.fa
+-r--r--r-- 1 user user  51665500 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.22.fa
+-r--r--r-- 1 user user 201600541 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.3.fa
+-r--r--r-- 1 user user 193384854 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.4.fa
+-r--r--r-- 1 user user 184563953 Jun  4 10:51 Homo_sapiens.GRCh38.dna.chromosome.5.fa
+-r--r--r-- 1 user user 173652802 Jun  4 10:51 Homo_sapiens.GRCh38.dna.chromosome.6.fa
+-r--r--r-- 1 user user 162001796 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.7.fa
+-r--r--r-- 1 user user 147557670 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.8.fa
+-r--r--r-- 1 user user 140701352 Jun  4 10:49 Homo_sapiens.GRCh38.dna.chromosome.9.fa