%SUMMARY
%- ABSTRACT
%- INTRODUCTION
%# BASICS
%- \acs{DNA} STRUCTURE
%- DATA TYPES
% - BAM/FASTQ
% - NON STANDARD
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%# COMPARING TOOLS
%-
%# POSSIBLE IMPROVEMENT
%- \acs{DNA}S STOCHASTICAL ATTRIBUTES
%- IMPACT ON COMPRESSION

%\chapter{Analysis for Possible Compression Improvements}
\chapter{Feasibility Analysis for New Algorithm Considering Stochastic Organisation of Genomes}

% first thoughts:
% - just save one nucleotide every n bits
% - save checksum for whole genome
% - use algorithms (from new discoveries) to recreate genome
% - check checksum -> finished : retry
% - can run recursively and threaded

% - for test data: hetzer, dedicated hardware, compile on server, write down specs -> 'lscpu' || 'cat /proc/cpuinfo'

The first step in determining the feasibility of this project is to establish baseline values against which any further improvement can be measured.
For these measurements to be reproducible, the specifications of the test system must be known.\\
CPU core information: \verb|cat /proc/cpuinfo|\\
Output for the last core:
\begin{verbatim}
processor        : 15
vendor_id        : AuthenticAMD
cpu family       : 23
model            : 1
model name       : AMD EPYC Processor (with IBPB)
stepping         : 2
microcode        : 0x1000065
cpu MHz          : 2400.000
cache size       : 512 KB
physical id      : 15
siblings         : 1
core id          : 0
cpu cores        : 1
apicid           : 15
initial apicid   : 15
fpu              : yes
fpu_exception    : yes
cpuid level      : 13
wp               : yes
flags            : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
                   mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall
                   nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl
                   cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3
                   fma cx16 sse4_1 sse4_2 x2apic movbe popcnt
                   tsc_deadline_timer aes xsave avx f16c rdrand
                   hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a
                   misalignsse 3dnowprefetch osvw topoext perfctr_core
                   ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep
                   bmi2 rdseed adx smap clflushopt sha_ni xsaveopt
                   xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd
                   arat arch_capabilities
bugs             : sysret_ss_attrs null_seg spectre_v1 spectre_v2
                   spec_store_bypass
bogomips         : 4800.00
TLB size         : 1024 4K pages
clflush size     : 64
cache_alignment  : 64
address sizes    : 48 bits physical, 48 bits virtual
power management :
\end{verbatim}
Memory capacity (TODO: extend this list): \verb|dmidecode --type memory| or \verb|dmidecode --type 17|

\section{Pool of Tools}
For an initial test, a small pool of three tools was chosen.
\begin{itemize}
  \item Samtools
  \item GeCo
  \item genie
\end{itemize}
Each of these tools complies with the criteria chosen in \autoref{chap:filetypes}.\\
To test each tool, the same data set was used. The \textit{Homo sapiens} genome assembly GRCh38 was chosen due to its size (TODO: find more exact criteria for the test data). The test data is available via an open FTP server hosted by Ensembl.
Source: \url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
The tests focused on the following parameters:
\begin{itemize}
  \item Efficiency: the \textbf{duration} the process ran for
  \item Effectiveness: the difference in \textbf{size} between input and compressed data
  \item TODO: error rate!
\end{itemize}
The duration is to be captured by one of the following (TODO: choose):
\begin{itemize}
  \item a Linux tool that outputs the exact runtime (\texttt{time})
  \item an alteration of the C code that outputs the time at the start and at the end of the process runtime.
\end{itemize}

\section{Installation}

\section{Alteration of Code to Determine Runtime}

\section{Execution}

\section{Data analysis}
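
The two test parameters named earlier in this chapter, duration and size difference, could be collected with a small wrapper around each tool. The following is only a minimal sketch under stated assumptions: the function name \texttt{measure}, its arguments and the returned keys are hypothetical and not part of any of the tested tools, and the wall-clock time measured here includes process start-up, which the C-level alternative would exclude.

```python
import os
import subprocess
import time


def measure(command, input_path, output_path):
    """Run one compression command and collect duration and file sizes.

    `command` is an argument list as accepted by subprocess.run;
    `input_path` and `output_path` are assumed to be the files the
    command reads and writes (hypothetical names for illustration).
    """
    start = time.perf_counter()              # monotonic wall-clock timer
    subprocess.run(command, check=True)      # raises if the tool fails
    duration = time.perf_counter() - start

    input_bytes = os.path.getsize(input_path)
    output_bytes = os.path.getsize(output_path)
    return {
        "duration_s": duration,
        "input_bytes": input_bytes,
        "output_bytes": output_bytes,
        # compressed size relative to input size; smaller is better
        "ratio": output_bytes / input_bytes if input_bytes else float("nan"),
    }
```

As an illustration only, a call such as \verb|measure(["gzip", "-k", "genome.fa"], "genome.fa", "genome.fa.gz")| would time a plain gzip run on a hypothetical file \texttt{genome.fa}; the invocations for the tools in the pool have to be filled in once they are fixed.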