\chapter{Introduction}
\label{cha:Introduction}
The Domain Name System (\gls{dns}) has been one of the cornerstones of the Internet for a long time. It acts as a hierarchical, bidirectional translation service between mnemonic domain names and network addresses. It also provides service lookup and enrichment capabilities for a range of application protocols such as HTTP, SMTP, and SSH. In the context of defensive IT security, investigating aspects of the \gls{dns} can facilitate protection efforts considerably. Estimating the reputation of a domain can help to identify hostile activities. Such a score can, for example, take into account features like rapidly changing network blocks for a given domain, or the clustering of newly observed domains with already known malicious ones.
\section{Motivation}
\label{sec:motivation}
\todo{also check papers for motivations}
\section{Challenges}
\label{sec:challenges}
All of the investigated approaches use \gls{pdns} logs to generate a reputation score for a specific domain. These logs are collected at central \gls{dns} resolvers and capture the outgoing traffic of multiple users (see Section~\ref{subsec:passive_dns}). One challenge of this work is therefore handling huge volumes of data. With about seven Gigabytes \todo{verify} of uncompressed \gls{pdns} logs for a single day, several general issues arise: general purpose computers nowadays usually have up to 16~GB of RAM (rarely 32~GB), which means that several tasks (e.g. building a training set) cannot be performed purely in memory. The analysis time can also become a bottleneck: simply loading a single day of compressed logs from disk and iterating over it without performing any calculations takes roughly 148 seconds (see benchmark example~\ref{lst:load_and_iterate_one_day_of_compressed_pdns_logs}).

Furthermore, certain requirements have to be met in order to evaluate the existing algorithms. Passive DNS logs usually contain sensitive data, which is one reason why most papers do not publish their test data; for a precise evaluation, however, the raw input data is needed. In addition, some of the previously developed classifiers have not been disclosed completely, so the involved algorithms have to be reconstructed as closely as possible from all available information.
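To illustrate how the memory constraint can be handled, the following minimal sketch streams a compressed log file line by line instead of loading a full day into memory at once. It is only an illustrative Python example, assuming gzip-compressed, line-oriented logs and the \texttt{listings} package; the file name is a placeholder and does not correspond to the actual benchmark setup referenced above.

\begin{lstlisting}[language=Python]
import gzip
import time

# Placeholder file name; real pDNS logs differ in location and format.
LOG_FILE = "pdns-example-day.log.gz"

def iterate_pdns_log(path):
    # Stream the compressed log line by line so that a full day of
    # records never has to be held in memory at the same time.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")

if __name__ == "__main__":
    start = time.monotonic()
    record_count = sum(1 for _ in iterate_pdns_log(LOG_FILE))
    duration = time.monotonic() - start
    print(f"iterated {record_count} records in {duration:.1f} s")
\end{lstlisting}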
\section{Goals}
\label{sec:goals}
The goal of this work is to evaluate existing domain scoring mechanisms in the context of IT security and to investigate the potential of combining different measurement approaches. Ultimately, it shall produce an improved and evaluated algorithm for estimating the probability that a domain is related to hostile activities.
\section{Related Work}
\label{sec:related_work}
In the context of IT security, several approaches exist for assigning a reputation score to a domain. Before 2010, the general idea of protecting a network against malicious requests targeting other networks was to maintain static filter lists, which either explicitly allowed or explicitly blocked requests to certain IP addresses or domain names. For example, \fsAuthor{Jung:2004:ESS:1028788.1028838} introduced an approach that blocks requests to certain domains using a DNS blacklist. As shown by \fsCite{ramachandran2006can} in 2006, this approach cannot always keep up with the speed of malware authors. A different type of system emerged starting in 2010, when two algorithms, \textit{Notos} followed by \textit{Exposure}, were introduced that use machine learning to dynamically assign a reputation score to a domain. They exploit the characteristics of how benign and malicious domains are typically configured and used, e.g. in terms of DNS resource usage or the global distribution of the machines used for malicious purposes.