rush hour 2

2018-01-30 20:54:29 +01:00
parent ece9b4afcf
commit 95d35f2470
20 changed files with 171 additions and 146 deletions


@@ -1,26 +1,25 @@
\chapter{Introduction}
\label{cha:Introduction}
The domain name system (\gls{dns}) has long been one of the cornerstones of the internet. It acts as a hierarchical, bidirectional translation device between mnemonic domain names and network addresses. It also provides service lookup or enrichment capabilities for a range of application protocols like HTTP, SMTP, and SSH. In the context of defensive IT security, investigating aspects of the \gls{dns} can facilitate protection efforts tremendously. Estimating the reputation of domains can help in identifying hostile activities. Such a score can, for example, consider features like quickly changing network blocks for a given domain or the clustering of newly observed domains with already known malicious ones.
The domain name system (\gls{dns}) has long been one of the cornerstones of the internet. It acts as a hierarchical, bidirectional translation device between mnemonic domain names and network addresses. It also provides service lookup or enrichment capabilities for a range of application protocols like HTTP, SMTP, and SSH (e.g.\ verifying SSH host keys using DNS). In the context of defensive IT security, investigating aspects of the \gls{dns} can facilitate protection efforts tremendously. Estimating the reputation of domains can help in identifying hostile activities. Such a score can, for example, consider features like quickly changing network blocks for a given domain or the clustering of newly observed domains with already known malicious ones.
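To make this translation role concrete, the following minimal sketch uses only the Python standard library; the host name and address are placeholders, not part of the data set examined in this work. It performs a forward lookup of a domain name and attempts a reverse lookup of an address:
\begin{lstlisting}[language=Python]
import socket

# Forward lookup: translate a mnemonic domain name into network addresses.
for family, _, _, _, sockaddr in socket.getaddrinfo("example.org", None):
    print(family.name, sockaddr[0])

# Reverse lookup: translate an address back into a name.
# This fails if no PTR record is configured for the address.
try:
    print(socket.gethostbyaddr("192.0.2.1")[0])
except socket.herror:
    print("no reverse entry")
\end{lstlisting}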
\section{Motivation}
\label{sec:motivation}
\todo{also check papers for motivations}
Malicious operations like botnets, phishing sites, and spam campaigns rely heavily on the domain name system to either hide behind proxies or communicate with command and control servers. Malware authors are getting more and more creative in bypassing traditional countermeasures. Techniques like domain generation algorithms and fast-flux service networks make it hard to eliminate the roots of, for example, botnets. The ZeuS botnet family has existed since 2007 and its propagation could not be stopped until today (\fsCite{WhyDGOWinsOnline}). This leads to a situation where static filter lists cannot keep pace with evolving malware authors. To eliminate malware in the long run, it has to be stopped before it can spread widely across the internet. Three major systems have been proposed in the past as dynamic domain reputation systems using passive DNS data. With passive DNS databases becoming more common, setting up a domain reputation system on top of pDNS data promises a lightweight monitoring approach.
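To illustrate why domain generation algorithms defeat static filter lists, the following deliberately simplified sketch (not the algorithm of any real malware family) derives a fresh set of candidate domains from a shared seed and the current date; a defender would have to predict and block every such domain in advance, while the operator only needs to register one of them:
\begin{lstlisting}[language=Python]
import hashlib
from datetime import date

def toy_dga(seed, day, count=5):
    """Toy domain generation: hash a shared secret together with the date."""
    domains = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}/{day.isoformat()}/{i}".encode()).hexdigest()
        domains.append(digest[:12] + ".com")
    return domains

# Bot and operator compute the same daily list independently.
print(toy_dga("example-seed", date.today()))
\end{lstlisting}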
\section{Challenges}
\label{sec:challenges}
All of the investigated approaches use \gls{pdns} logs to generate a reputation score for a specific domain. These logs are generated on central \gls{dns} resolvers and capture outgoing traffic of multiple users (see Section~\ref{subsec:passive_dns}), so one challenge of this work is handling huge volumes of data. With about seven gigabytes \todo{verify} of uncompressed \gls{pdns} logs for a single day, various general issues might occur: general purpose computers nowadays usually have up to 16 gigabytes of RAM (rarely 32 GB), which means that several tasks (e.g.\ building a training set) cannot be performed purely in memory. The time of analysis might also become a bottleneck: simply loading one single day of (compressed) logs from disk and iterating over it without any actual calculations takes roughly 148 seconds (see benchmark example~\ref{lst:load_and_iterate_one_day_of_compressed_pdns_logs}). To evaluate existing algorithms, certain requirements have to be met. Passive DNS logs usually contain sensitive data, which is one reason why most papers do not publish their test data. For a precise evaluation, the raw input data is needed. Some previously developed classification approaches have not completely disclosed the involved algorithms, so these have to be reconstructed as closely as possible, taking all available information into account.
All of the investigated approaches use \gls{pdns} logs to generate a reputation score for a specific domain. These logs are monitored on central \gls{dns} resolvers and capture lookup results for users at arbitrary scale (see Section~\ref{subsec:passive_dns}), so one challenge of this work is handling huge volumes of data. With about seven gigabytes of uncompressed \gls{pdns} logs for a single day, various general issues might occur: general purpose computers nowadays usually have up to 16 gigabytes of RAM (rarely 32 GB), which means that several tasks (e.g.\ building a training set) cannot be performed purely in memory. The time of analysis might also become a bottleneck: simply loading one single day of (compressed) logs from disk and iterating over it without any actual calculations takes roughly 148 seconds (see benchmark example~\ref{lst:load_and_iterate_one_day_of_compressed_pdns_logs}). To evaluate existing algorithms, certain requirements have to be met. Passive DNS logs usually contain sensitive data, which is one reason why most papers do not publish their test data. For a precise evaluation, the raw input data is needed. Some previously developed classification approaches have not completely disclosed the involved algorithms, so these have to be reconstructed as closely as possible, taking all available information into account.
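A natural way to cope with these volumes is to stream the compressed logs instead of loading a whole day into memory. The following sketch assumes one directory per day containing gzip-compressed, line-based log files; the path and file layout are placeholders, not the actual format of the data set used here:
\begin{lstlisting}[language=Python]
import gzip
from pathlib import Path

def iter_pdns_records(log_dir):
    """Yield pDNS log lines one at a time, so that a full day of logs
    never has to be held in memory at once."""
    for log_file in sorted(log_dir.glob("*.gz")):
        with gzip.open(log_file, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")

# Example: count the records of a single (hypothetical) day.
total = sum(1 for _ in iter_pdns_records(Path("pdns/2018-01-30")))
print(total, "records")
\end{lstlisting}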
\section{Goals}
\label{sec:goals}
The task of this work is to evaluate existing scoring mechanisms for domains in the specific context of IT security and to research the potential of combining different measurement approaches. Ultimately, it shall produce an improved and evaluated algorithm for determining the probability of a domain being related to hostile activities.
The task of this work is to evaluate existing scoring mechanisms for domains in the specific context of IT security and to research the potential of combining different measurement approaches. Ultimately, it shall produce an improved algorithm, built by combining existing ones, for determining the probability of a domain being related to hostile activities.
\section{Related Work}