\chapter{Introduction}
\label{cha:Introduction}
The Domain Name System (DNS) has been one of the cornerstones of the internet for a long time. It acts as a hierarchical, bidirectional translation service between mnemonic domain names and network addresses. It also provides service lookup and enrichment capabilities for a range of application protocols like HTTP, SMTP, and SSH (e.g., verifying SSH host keys via DNS). In the context of defensive IT security, investigating aspects of the DNS can facilitate protection efforts tremendously. Estimating the reputation of domains can help in identifying hostile activities. Such a score can, for example, take into account features like quickly changing network blocks for a given domain, or the clustering of newly observed domains with already known malicious ones.
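To make the notion of such features more concrete, the following minimal Python sketch counts the distinct /24 network blocks observed for each domain in a set of pDNS records. The record format (timestamp, domain, IPv4 address) and all names are purely illustrative assumptions, not the layout of any particular pDNS feed.
\begin{verbatim}
from collections import defaultdict

def network_diversity(records):
    # Count distinct /24 blocks observed per domain; a high count can
    # hint at quickly changing hosting infrastructure.
    blocks = defaultdict(set)
    for timestamp, domain, ip in records:
        prefix = ".".join(ip.split(".")[:3])  # /24 prefix of an IPv4 address
        blocks[domain].add(prefix)
    return {domain: len(prefixes) for domain, prefixes in blocks.items()}

records = [  # hypothetical pDNS records
    ("2018-01-01T00:00:00", "example.com", "93.184.216.34"),
    ("2018-01-01T00:05:00", "bad.example", "10.0.1.2"),
    ("2018-01-01T00:10:00", "bad.example", "10.7.3.4"),
]
print(network_diversity(records))  # {'example.com': 1, 'bad.example': 2}
\end{verbatim}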
\section{Motivation}
\label{sec:motivation}
Malicious infrastructure such as botnets, phishing sites, and spam campaigns heavily relies on the Domain Name System to either hide behind proxies or communicate with command and control servers. Malware authors are becoming increasingly creative in bypassing traditional countermeasures. Techniques like domain generation algorithms and fast-flux service networks make it hard to eliminate the roots of botnets. The ZeuS botnet family has existed since 2007, and its further propagation could not be stopped to this day (\fsCite{WhyDGOWinsOnline}). This leads to a situation where static filter lists cannot keep pace with evolving malware authors. To eliminate malware in the long run, it has to be stopped before it can spread widely across the internet. Three major systems using passive DNS data as a dynamic domain reputation system have been proposed in the past. With passive DNS databases becoming more common, setting up a domain reputation system on top of pDNS data promises a lightweight monitoring approach.
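To illustrate why such domains are hard to block statically, the following sketch shows the basic principle behind a domain generation algorithm: infected hosts and their operator can independently derive the same pseudo-random domains from a shared, date-based seed, so blocking yesterday's domains does not help. This is a toy construction for illustration only and does not reproduce the algorithm of any real malware family.
\begin{verbatim}
import hashlib
from datetime import date

def generate_domains(day, count=3, tld=".com"):
    # Derive deterministic pseudo-random domains from a date-based seed,
    # so bots and operators rendezvous on fresh domains every day.
    seed = day.isoformat().encode()
    domains = []
    for i in range(count):
        digest = hashlib.md5(seed + bytes([i])).hexdigest()
        domains.append(digest[:12] + tld)
    return domains

print(generate_domains(date(2018, 2, 1)))
\end{verbatim}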
\section{Challenges}
\label{sec:challenges}
All of the investigated approaches use passive DNS (pDNS) logs to generate a reputation score for a specific domain. These logs are recorded on central DNS resolvers and capture the lookup results of a large and diverse user population (see Section~\ref{subsec:passive_dns}), so one challenge of this work is handling huge volumes of data. With about seven gigabytes of uncompressed pDNS logs for a single day, several general issues may occur: general-purpose computers nowadays usually have up to 32 gigabytes of RAM, which means that several tasks (e.g., building a training set) cannot be performed purely in memory. The analysis time might also become a bottleneck (see Section~\ref{sec:system_architecture}). To evaluate existing algorithms, certain requirements have to be met. Passive DNS logs usually contain sensitive data, which is one reason why most papers do not publish their test data; for a precise evaluation, however, the raw input data is needed. Moreover, some previously developed classifiers have not been completely disclosed, so their algorithms have to be reconstructed as closely as possible from all available information.
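A common way to cope with logs that do not fit into RAM is to process them as a stream instead of loading them at once. The sketch below iterates over a hypothetical tab-separated pDNS log file line by line; the column layout is an assumption for illustration.
\begin{verbatim}
from collections import Counter

def stream_records(path):
    # Yield records one by one so memory usage stays independent of the
    # total file size (only the current line is held in RAM).
    with open(path) as log:
        for line in log:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:           # e.g. timestamp, domain, answer
                yield fields[:3]

def lookups_per_domain(path):
    # The Counter itself must still fit in memory, i.e. the number of
    # distinct domains is the limiting factor, not the log size.
    counts = Counter()
    for timestamp, domain, answer in stream_records(path):
        counts[domain] += 1
    return counts
\end{verbatim}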
\section{Goals}
\label{sec:goals}
The task of this work is to evaluate existing mechanisms for scoring domains in the specific context of IT security, and to investigate the potential of combining different measurement approaches. Ultimately, it shall arrive at an improved algorithm, built from existing ones, for determining the probability that a domain is related to hostile activities.
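One straightforward way to combine the scores of several existing systems, assuming each yields a maliciousness probability in $[0, 1]$, is a weighted average. The system names and weights below are placeholders for illustration, not results of this work.
\begin{verbatim}
def combined_score(scores, weights):
    # Weighted average of per-system scores; the weights could later be
    # fitted on labeled data instead of being chosen by hand.
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total

scores = {"system_a": 0.9, "system_b": 0.4}   # hypothetical outputs
weights = {"system_a": 2.0, "system_b": 1.0}  # hypothetical weights
print(combined_score(scores, weights))        # 0.7333...
\end{verbatim}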
\section{Related Work}
\label{sec:related_work}
In the context of IT security, several approaches for assigning a reputation score to a domain exist. Before 2010, the general idea of protecting a network against malicious requests targeting other networks was to establish static filter lists. This included both explicitly allowing requests and explicitly blocking requests to certain IP addresses or domain names. For example, \fsAuthor{Jung:2004:ESS:1028788.1028838} introduced an approach to block requests to certain domains using a DNS blacklist. As shown by \fsCite{ramachandran2006can}, this approach is not always able to keep up with the speed of malware authors. A different type of system emerged from 2010 on, when two algorithms, \textit{Notos} followed by \textit{Exposure}, were introduced. They use machine learning to dynamically assign a reputation score to a domain based on characteristics of how benign and malicious domains are typically configured and used, e.g., their DNS resource usage or the global distribution of the machines involved in malicious activities.
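The following sketch illustrates the general pattern shared by such systems: a classifier is trained on per-domain features, and the predicted class probability serves as a reputation score. The feature set, the toy data, and the choice of a random forest are assumptions for illustration and do not correspond to the actual Notos or Exposure models.
\begin{verbatim}
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-domain features:
# [distinct /24 blocks, mean TTL in seconds, domain name length]
X = [[1, 86400, 11], [14, 300, 24], [2, 3600, 9], [22, 60, 30]]
y = [0, 1, 0, 1]  # toy labels: 0 = benign, 1 = malicious

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# The probability of the "malicious" class acts as a reputation score.
print(clf.predict_proba([[18, 120, 26]])[0][1])
\end{verbatim}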