- %SUMMARY
- %- ABSTRACT
- %- INTRODUCTION
- %# BASICS
- %- \acs{DNA} STRUCTURE
- %- DATA TYPES
- % - BAM/FASTQ
- % - NON STANDARD
- %- COMPRESSION APPROACHES
- % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
- % - HUFFMAN ENCODING
- % - PROBABILITY APPROACHES (WITH BASE?)
- %
- %# COMPARING TOOLS
- %-
- %# POSSIBLE IMPROVEMENT
- %- \acs{DNA}S STOCHASTICAL ATTRIBUTES
- %- IMPACT ON COMPRESSION
- %\chapter{Analysis for Possible Compression Improvements}
- \chapter{Feasibility Analysis for a New Algorithm Considering the Stochastic Organisation of Genomes}
- % first thoughts:
- % - just save one nucleotide every n bits
- % - save checksum for whole genome
- % - use algorithms (from new discoveries) to recreate genome
- % - check checksum -> finished : retry
- % - can run recursively and threaded
- % - in case of test data: hetzer, dedicated hardware, compile on the server, write down the specs -> 'lscpu' || 'cat /proc/cpuinfo'
- The first step in determining the feasibility of this project is to establish baseline values against which any further improvement can be measured. For these measurements to be reproducible, the specifications of the test system must be known:\\
- CPU core information, obtained with \texttt{cat /proc/cpuinfo}.\\
- Output for the last core:\\
- processor : 15
- vendor\_id : AuthenticAMD
- cpu family : 23
- model : 1
- model name : AMD EPYC Processor (with IBPB)
- stepping : 2
- microcode : 0x1000065
- cpu MHz : 2400.000
- cache size : 512 KB
- physical id : 15
- siblings : 1
- core id : 0
- cpu cores : 1
- apicid : 15
- initial apicid : 15
- fpu : yes
- fpu\_exception : yes
- cpuid level : 13
- wp : yes
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr\_opt pdpe1gb rdtscp lm rep\_good nopl cpuid extd\_apicid tsc\_known\_freq pni pclmulqdq ssse3 fma cx16 sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand hypervisor lahf\_lm cmp\_legacy cr8\_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr\_core ssbd ibpb vmmcall fsgsbase tsc\_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha\_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt\_ssbd arat arch\_capabilities
- bugs : sysret\_ss\_attrs null\_seg spectre\_v1 spectre\_v2 spec\_store\_bypass
- bogomips : 4800.00
- TLB size : 1024 4K pages
- clflush size : 64
- cache\_alignment : 64
- address sizes : 48 bits physical, 48 bits virtual
- power management:\\
- Memory capacity (TODO: list further specifications):\\
- \texttt{dmidecode --type memory} or \texttt{dmidecode --type 17}
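Recording the CPU information listed above by hand is error-prone, so it may help to extract the relevant fields programmatically. The following is a minimal sketch, assuming the usual \texttt{key : value} layout of \texttt{/proc/cpuinfo} with a blank line between processors. The function names \texttt{parse\_cpuinfo} and \texttt{summarize}, and the choice of summary fields, are illustrative assumptions and not part of any tool discussed here.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical processor."""
    processors = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates two processor entries.
            if current:
                processors.append(current)
                current = {}
            continue
        # Split on the first colon only; values may contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        processors.append(current)
    return processors


def summarize(processors, fields=("model name", "cpu MHz", "cache size")):
    """Keep only the fields chosen (by assumption) to describe the test system."""
    last = processors[-1] if processors else {}
    summary = {"logical processors": len(processors)}
    for field in fields:
        summary[field] = last.get(field)
    return summary
```

On a Linux host, the summary could be produced by reading \texttt{/proc/cpuinfo} into a string and passing it to \texttt{parse\_cpuinfo}, then to \texttt{summarize}. Memory information from \texttt{dmidecode} is not covered by this sketch.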
- \section{Pool of Tools}
- For an initial test, a small pool of three tools was chosen.
- \begin{itemize}
- \item Samtools
- \item GeCo
- \item genie
- \end{itemize}
- Each of these tools complies with the criteria chosen in \autoref{chap:filetypes}.\\
- To test each tool, the same data set was used: the \textit{Homo sapiens} genome assembly GRCh38, chosen for its size. TODO: find more exact criteria for test data.
- The test data is available from a public FTP server hosted by Ensembl. Source: \url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
- The tests focused on the following parameters:
- \begin{itemize}
- \item Efficiency: the \textbf{duration} for which the process ran
- \item Effectiveness: the difference in \textbf{size} between input and compressed data
- \item TODO: error rate!
- \end{itemize}
- The duration was captured by one of the following (TODO: choose):
- \begin{itemize}
- \item a Linux tool that reports the exact runtime (\texttt{time <cmd>})
- \item an alteration of the C code that prints the time at the start and at the end of the process
- \end{itemize}
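To make the two measured parameters concrete, the sketch below shows one possible way to capture them from outside the tools, comparable to wrapping a call in \texttt{time <cmd>}. It is an assumption-laden illustration and not the method finally chosen: \texttt{run\_timed} and \texttt{size\_reduction} are hypothetical helper names, and the measured wall-clock time includes process start-up, which an alteration of the C code itself would not.

```python
import subprocess
import sys
import time


def run_timed(command):
    """Run a command and return (exit_code, wall_clock_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


def size_reduction(input_bytes, compressed_bytes):
    """Return (compression ratio, fraction of space saved)."""
    if input_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    ratio = input_bytes / compressed_bytes
    saved = 1 - compressed_bytes / input_bytes
    return ratio, saved


if __name__ == "__main__":
    # Stand-in command (assumption): a real run would pass the command
    # line of the compression tool under test instead.
    code, seconds = run_timed([sys.executable, "-c", "pass"])
    print(f"exit code {code}, {seconds:.3f} s")
```

The file sizes passed to \texttt{size\_reduction} could be taken from \texttt{os.path.getsize} on the input and output files. Since a single timed run is sensitive to caching and system load, averaging several runs may give a more stable baseline.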
- \section{Installation}
- \section{Alteration of Code to Determine Runtime}
- \section{Execution}
- \section{Data Analysis}