more writing, added benchmark to compare log files

2017-11-30 18:48:23 +01:00
parent c1237406f8
commit 673464137e
8 changed files with 68 additions and 10 deletions

6
Thesis/.gitignore vendored

@@ -8,6 +8,12 @@
*.xdg
*.xdy
*.glsdefs
*.fls
*.glo
*.idx
*.ind
*.lof
*.lot
# vscode


@@ -1,14 +1,7 @@
\chapter{Introduction}
\label{cha:Introduction}
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
\lstinputlisting[language={java}, label=lst:sendImpliciteIntent,caption=Intent - Bild anzeigen]{res/src/sendImpliciteIntent.java}
Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.
Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.
The Domain Name System (\gls{dns}) has been one of the cornerstones of the Internet for a long time. It acts as a hierarchical, bidirectional translation device between mnemonic domain names and network addresses. It also provides service lookup or enrichment capabilities for a range of application protocols like HTTP, SMTP, and SSH. In the context of defensive IT security, investigating aspects of the \gls{dns} can facilitate protection efforts tremendously. Estimating the reputation of domains can help in identifying hostile activities. Such a score can, for example, consider features like quickly changing network blocks for a given domain or the clustering of already known malicious domains with newly observed ones.
\section{Motivation}
@@ -18,11 +11,21 @@ Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming
\section{Challenges}
\label{sec:challenges}
All of the investigated approaches use \gls{pdns} logs to generate a reputation score for a specific domain. These logs are generated on central \gls{dns} resolvers and capture the outgoing traffic of multiple users (see Section~\ref{subsec:passive_dns}), so one challenge of this work is handling huge volumes of data. With about seven gigabytes \todo{verify} of uncompressed \gls{pdns} logs for a single day, several general issues can occur. General-purpose computers nowadays usually have up to 16 GB of RAM (rarely 32 GB), which means that some tasks (e.g. building a training set) cannot always be performed purely in memory. The analysis time can also become a bottleneck: simply loading a single day of (compressed) logs from disk and iterating over it without any actual calculations takes roughly 148 seconds (see the benchmark in Listing~\ref{lst:load_and_iterate_one_day_of_compressed_pdns_logs}). To evaluate existing algorithms, certain requirements have to be met. Passive DNS logs usually contain sensitive data, which is one reason why most papers do not publish their test data. For a precise evaluation, however, the raw input data is needed. Some previously published classification approaches do not completely disclose the algorithms involved, so these have to be reconstructed as closely as possible, taking all available information into account.
\section{Goals}
\label{sec:goals}
The task of this work is to evaluate existing scoring mechanisms for domains in the specific context of IT security and to research the potential for combining different measurement approaches. Ultimately, it shall produce an improved and evaluated algorithm for determining the probability that a domain is related to hostile activities.
\section{Related Work}
\label{sec:related_work}
\todo{machine learning vs others}
\lstinputlisting[language={java}, label=lst:sendImpliciteIntent,caption=Intent - Bild anzeigen]{res/src/sendImpliciteIntent.java}


@@ -0,0 +1,22 @@
\section{Benchmarks}
\label{sec:benchmarks}
To get a better understanding of performance-related challenges, some benchmarks are performed and described in this section. Where not otherwise specified, all benchmarks are performed on the same machine with 16 GB of DDR3 RAM at 1600 MT/s in dual-channel mode, an Intel i7-3520M CPU @ 2900 MHz, and a 250 GB Samsung SSD 850 EVO. Linux 4.13.12-1 is used, and Python scripts are executed with version 3.6.3 of the Python interpreter. For consistency, no other software (e.g. a desktop environment or heavy background processes) is running while the benchmarks are executed \todo{list of what is running}. All benchmarks are run ten times, and outliers with a run time more than 10\% above the statistical median are ignored. Despite these precautions, it is not safe to assume completely equal initial conditions at the time of execution on non-real-time operating systems (like the one used). These figures therefore have to be treated with care and should only give a basic understanding of how long tasks take to run.
\begin{lstlisting}[language={Python}, caption={Benchmark: Load and iterate one day of compressed pdns logs}, label={lst:load_and_iterate_one_day_of_compressed_pdns_logs}]
import csv
import glob
import gzip
import time

start_z = time.time()
# all compressed log files of a single day
globbed = glob.glob('/home/felix/pdns/' + '*-2017-09-01*.csv.gz')
for f in globbed:
    with gzip.open(f, 'rt', newline='') as file:
        reader = csv.reader(file)
        # iterate over every row without processing it
        for row in reader:
            pass
print('iterating day took: ' + str(time.time() - start_z) + ' s')

# result:
# cleaned results: [155.0667760372162, 148.00951623916626, 147.8429672718048, 147.2554485797882, 147.1039183139801, 147.26967453956604, 147.13052105903625, 147.33162689208984, 147.20316672325134, 147.29751586914062]
# average: 148.15111315250397 seconds
\end{lstlisting}


@@ -239,4 +239,10 @@ QType & Type & Description \\
\caption{Address Resolution}
\label{fig:address_resolution}
\end{figure}
\todo{not referenced atm}
\todo{not referenced atm}
\subsection{Passive DNS}
\label{subsec:passive_dns}


@@ -2,4 +2,5 @@
\label{cha:technical_background}
\input{content/Technical_Background/DNS/DNS}
\input{content/Technical_Background/Detecting_Malicious_Domain_Names/Detecting_Malicious_Domain_Names}
\input{content/Technical_Background/Detecting_Malicious_Domain_Names/Detecting_Malicious_Domain_Names}
\input{content/Technical_Background/Benchmarks/Benchmarks}


@@ -43,6 +43,8 @@
\newacronym{tcp}{TCP}{Transmission Control Protocol}
\newacronym{pdns}{pDNS}{passive DNS}
\newacronym{os}{OS}{Operating System}
\newacronym{ftp}{FTP}{File Transfer Protocol}

5
src/benchmarks/compare_days.sh Executable file

@@ -0,0 +1,5 @@
#!/bin/bash
# Print the total size and number of pDNS log files for each day of October 2017.
cd /run/media/felix/AE7E01B77E01797B/pDNS || exit 1
for i in {01..31}; do
    echo -n -e "day $i \t size: "
    echo -n -e $(du -ch *"2017-10-$i"* | tail -1) " \t #files: "
    ls *"2017-10-$i"* | wc -l
done


@@ -0,0 +1,13 @@
iterating day took: 155.0667760372162 s
iterating day took: 148.00951623916626 s
iterating day took: 147.8429672718048 s
iterating day took: 147.2554485797882 s
iterating day took: 147.1039183139801 s
iterating day took: 147.26967453956604 s
iterating day took: 147.13052105903625 s
iterating day took: 147.33162689208984 s
iterating day took: 147.20316672325134 s
iterating day took: 147.29751586914062 s
all results: [155.0667760372162, 148.00951623916626, 147.8429672718048, 147.2554485797882, 147.1039183139801, 147.26967453956604, 147.13052105903625, 147.33162689208984, 147.20316672325134, 147.29751586914062]
cleaned results: [155.0667760372162, 148.00951623916626, 147.8429672718048, 147.2554485797882, 147.1039183139801, 147.26967453956604, 147.13052105903625, 147.33162689208984, 147.20316672325134, 147.29751586914062]
average: 148.15111315250397
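
The "cleaned results" and "average" lines above follow the rule stated in the Benchmarks section: runs more than 10% above the median are dropped and the remaining runs are averaged. The following is a minimal sketch of how that filter might be applied to the ten timings. The function name `clean_and_average` and its `tolerance` parameter are assumptions made for illustration; the script that actually produced the log is not part of this commit.

```python
from statistics import mean, median


def clean_and_average(timings, tolerance=0.10):
    """Drop runs slower than (1 + tolerance) * median, then average the rest.

    Hypothetical helper: the name and signature are not taken from the thesis.
    """
    limit = median(timings) * (1 + tolerance)
    cleaned = [t for t in timings if t <= limit]
    return cleaned, mean(cleaned)


# The ten run times (in seconds) from the log above.
timings = [
    155.0667760372162, 148.00951623916626, 147.8429672718048,
    147.2554485797882, 147.1039183139801, 147.26967453956604,
    147.13052105903625, 147.33162689208984, 147.20316672325134,
    147.29751586914062,
]

cleaned, average = clean_and_average(timings)
print('cleaned results: ' + str(cleaned))
print('average: ' + str(average))
```

With these timings the median is roughly 147.28 s, so the cut-off is roughly 162 s. The slowest run (about 155 s) therefore stays in the cleaned list, which is why "all results" and "cleaned results" are identical in the log.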