							%SUMMARY
%- ABSTRACT
%- INTRODUCTION
%# BASICS
%- \acs{DNA} STRUCTURE
%- DATA TYPES
% - BAM/FASTQ
% - NON STANDARD
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%# COMPARING TOOLS
%- 
%# POSSIBLE IMPROVEMENT
%- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
%- IMPACT ON COMPRESSION

%\chapter{Analysis for Possible Compression Improvements}
\chapter{Feasibility Analysis for a New Algorithm Considering Stochastic Organisation of Genomes}

% first thoughts:
% - just save one nucleotide every n bits
% - save checksum for whole genome

% - use algorithms (from new discoveries) to recreate genome
% - check checksum -> finished : retry

% - can run recursively and threaded

% - in the case of test data: hetzer, dedicated hardware, compile on the server, write down the specs -> 'lscpu' || 'cat /proc/cpuinfo'

The first step in determining the feasibility of this project is to establish baseline values against which any further improvement can be measured. For these measurements to be reproducible, the specifications of the test system must be known.\\
CPU core information was read with \verb|cat /proc/cpuinfo|.\\

Output for the last core:\\

\begin{verbatim}
processor       : 15
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC Processor (with IBPB)
stepping        : 2
microcode       : 0x1000065
cpu MHz         : 2400.000
cache size      : 512 KB
physical id     : 15
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
                  mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall
                  nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl
                  cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3
                  fma cx16 sse4_1 sse4_2 x2apic movbe popcnt
                  tsc_deadline_timer aes xsave avx f16c rdrand
                  hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a
                  misalignsse 3dnowprefetch osvw topoext perfctr_core
                  ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep
                  bmi2 rdseed adx smap clflushopt sha_ni xsaveopt
                  xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd
                  arat arch_capabilities
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2
                  spec_store_bypass
bogomips        : 4800.00
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
\end{verbatim}
Memory capacity (TODO: add further specifications) can be read with \verb|dmidecode --type memory| or \verb|dmidecode --type 17|.
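
To store these specifications alongside each measurement, the \verb|/proc/cpuinfo| output can be reduced to a few fields. The following is a minimal sketch in Python. The function names \verb|parse_cpuinfo| and \verb|summarize| and the selection of fields are assumptions made for illustration and are not part of the test setup described here.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical processor."""
    processors = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates two processor blocks.
            if current:
                processors.append(current)
                current = {}
            continue
        # Split at the first colon only; values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        processors.append(current)
    return processors


def summarize(processors, keys=("model name", "cpu MHz", "cache size")):
    """Return the processor count and selected fields of the last block."""
    last = processors[-1] if processors else {}
    summary = {"logical processors": len(processors)}
    for key in keys:
        # Fields missing from the input are reported as None.
        summary[key] = last.get(key)
    return summary
```

On a Linux system, the text would come from reading the file \verb|/proc/cpuinfo| and passing its contents to \verb|parse_cpuinfo|; the result of \verb|summarize| can then be written next to the benchmark results.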
\section{Pool of Tools}
For an initial test, a small pool of three tools was chosen.
\begin{itemize}
  \item Samtools
  \item GeCo
  \item genie
\end{itemize}
Each of these tools complies with the criteria chosen in \autoref{chap:filetypes}.\\
To test each tool, the same data set was used. The human (\textit{Homo sapiens}) genome assembly GRCh38 was chosen because of its size. TODO: find more exact criteria for test data.
The test data is available from a public FTP server hosted by Ensembl. Source: \url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
The tests focused on the following parameters:
\begin{itemize}
  \item Efficiency: the \textbf{duration} for which the process ran
  \item Effectiveness: the difference in \textbf{size} between the input and the compressed data
  \item TODO: error rate
\end{itemize}
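The first parameter, duration, is to be captured by one of the following methods (TODO: choose):
\begin{itemize}
  \item a Linux tool that outputs the exact runtime (\verb|time <cmd>|)
  \item an alteration of the C code that outputs the time at the start and at the end of the process runtime
\end{itemize}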
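
As a third option next to the two above, both test parameters could be captured from outside the tools with a small wrapper script. The following is a hedged sketch in Python, not the method used in this work. The names \verb|time_command|, \verb|compression_metrics| and \verb|benchmark| are hypothetical, and it is assumed that each tool writes its compressed result to a file instead of standard output. The measured duration is wall-clock time and includes process start-up, comparable to the \texttt{real} value reported by \verb|time <cmd>|.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv as a child process; return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,  # assumption: the result goes to a file
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, completed.returncode


def compression_metrics(input_bytes, compressed_bytes):
    """Express the size difference between input and compressed data."""
    if input_bytes <= 0:
        raise ValueError("input size must be positive")
    saved = input_bytes - compressed_bytes
    return {
        "saved_bytes": saved,
        "ratio": compressed_bytes / input_bytes,       # smaller is better
        "saving_percent": 100.0 * saved / input_bytes,
    }


def benchmark(argv, input_path, compressed_path):
    """Time one compression run and compare input and output file sizes."""
    seconds, exit_code = time_command(argv)
    if exit_code != 0:
        raise RuntimeError(f"command failed with exit code {exit_code}")
    metrics = compression_metrics(
        os.path.getsize(input_path), os.path.getsize(compressed_path)
    )
    metrics["seconds"] = seconds
    return metrics
```

A run would pass the command line of the tool under test as a list of arguments, together with the paths of the input file and of the file the tool produces. Repeating each run several times and reporting the spread would make the duration values more robust against background load.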
\section{Installation}
\section{Alteration of Code to Determine Runtime}
\section{Execution}
\section{Data Analysis}