
%-------------------
%Header
%-------------------

%We use a DIN A4 page and a font size of 12pt.
\documentclass[a4paper,12pt]{scrartcl}
\title{Exposé Master's thesis Felix Steghofer}


%These three packages are needed for umlauts, hyphenation, etc.
%Apple users should use \usepackage[applemac]{inputenc} instead of \usepackage[latin1]{inputenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage{enumitem}
\usepackage{listings}
\usepackage{sidecap}
\usepackage{float}
\usepackage{todonotes}
\usepackage{mathtools}

%This package creates a clickable table of contents in the PDF file.
\usepackage[hyphens]{url}
\usepackage{hyperref}

%This package is needed for one-and-a-half line spacing
\usepackage{setspace}

%Indentation of a new paragraph
\setlength{\parindent}{0em}

%Definition of the margins
\usepackage[paper=a4paper,left=30mm,right=30mm,top=30mm,bottom=30mm]{geometry}

%Links format
\hypersetup{
  colorlinks   = true, %Colours links instead of ugly boxes
  urlcolor     = blue, %Colour for external hyperlinks
  linkcolor    = blue, %Colour of internal links
  citecolor   = red %Colour of citations
}

%c++ code
\lstset{language=C++,
       	basicstyle=\ttfamily,
       	keywordstyle=\color{blue}\ttfamily,
       	stringstyle=\color{red}\ttfamily,
       	commentstyle=\color{green}\ttfamily,
  	   	frame=single,
  		xrightmargin=.5em,
  		xleftmargin=.5em
      }

%Pics..
\usepackage{graphicx}
\usepackage{caption}
\usepackage{csquotes}
\usepackage{chngcntr}
\graphicspath{ {media/} }

%Pics counter
\counterwithout{figure}{section}

%Footnote spacing
\deffootnote{1em}{1em}{\textsuperscript{\thefootnotemark\ }}

%Rules for the depth (section, subsection, subsubsection) up to which headings are shown (headings listed in the table of contents / headings that are numbered)
\setcounter{tocdepth}{3}
\setcounter{secnumdepth}{3}

% uncomment for bibliography
%\usepackage[backend=biber,
%style=numeric
%style=alphabetic
%style=reading
%style=authoryear-ibid
%]{bibtex}
%\addbibresource{literatur_seminararbeit}
%\defbibheading{head}{\section{Literaturverzeichnis}}
%-------------------
%End of the header
%-------------------

%-------------------
%Main
%-------------------
\begin{document}


%Start of the title page
\begin{titlepage}
\begin{small}
\vfill {Universität Passau || Siemens CERT || Master's thesis - Exposé}
\end{small}


\begin{center}
\begin{Large}
\vfill{\textsf{\textbf{
Evaluation of domain reputation scoring algorithms in the field of IT-Security and development of a probabilistic hostile activities accounting algorithm.
}}}
\end{Large}
\end{center}

\begin{small}
\vfill Felix Steghofer \\ \today \\ Advisor: Thomas Penteker \\ Supervisor: Prof. Dr. rer. nat. Joachim Posegga

\end{small}

\end{titlepage}
%End of the title page


%Table of contents (only updated after the second compilation run)
%\tableofcontents
\thispagestyle{empty}

%Start of a new page
\clearpage

%One-and-a-half line spacing from here on
\onehalfspacing

\pagestyle{plain}

\section{Introduction}
The Domain Name System (DNS) has been one of the cornerstones of the Internet
for a long time. It acts as a hierarchical, bidirectional translation device
between mnemonic domain names and network addresses. It also provides service
lookup or enrichment capabilities for a range of application protocols like
HTTP, SMTP, and SSH.
In the context of defensive IT security, investigating aspects of the DNS can
facilitate protection efforts tremendously. Estimating the reputation of
domains can help in identifying hostile activities. Such a score can, for
example, consider features like quickly changing network blocks for a given
domain or clustering of already known malicious domains and newly observed
ones.

The task of this work is to evaluate existing domain scoring mechanisms in
the specific context of IT security and to research the potential for combining
different measurement approaches. The ultimate goal is an improved
and evaluated algorithm for determining the probability of a domain being
related to hostile activities.

\section{Exposé}
For the improved algorithm we want to investigate several approaches. Some work on related topics already exists, with an active research group residing at the Georgia Institute of Technology. Antonakakis et al. have developed two dynamic domain reputation systems based on machine learning, Notos and Kopis. These are briefly introduced first, as they can be regarded as the state of the art in the field of \textit{DNS reputation score} and are the most popular according to Google Scholar citations \cite{GoogleScholarDNSReputSystemOnline} and Mendeley read counts \cite{MendeleyDNSReputSystemOnline}.

Notos uses passive monitoring of DNS query data. Its underlying idea is described as follows:
\begin{quote}The premise of this system is that malicious, agile use of DNS has unique characteristics and can be distinguished from legitimate, professionally provisioned DNS services \cite{antonakakis2010building}. \end{quote}

Kopis, on the other hand, operates in the upper DNS hierarchy and uses global DNS query resolution patterns to detect malware-related domains, with features such as the requester diversity, the requester profile, and the reputation of the involved IPs \cite{antonakakis2011detecting}. For a more detailed overview of how Notos and Kopis accomplish this task, see the \nameref{sec:related_work}~section.

A third algorithm, Exposure, has been developed by Bilge et al. \cite{bilge2011exposure}. It operates in the same DNS layer as Notos (passive DNS monitoring) but uses a different feature set to evaluate domains.

Furthermore, we have considered additional parameters that could be taken into account, such as the character distribution within the domain name, the device class of the machine from which the DNS request originates (e.g. a PC or an embedded device, as determined by passive OS fingerprinting), and further particularities in the request/response patterns.
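
One possible way to quantify the character distribution is the Shannon entropy of the domain name. This is only an illustrative sketch: the choice of entropy as the measure and the symbols $H$, $n_c$ and $p_c$ are assumptions made here, not a fixed design decision. For a name $d$ of length $|d|$ in which the character $c$ occurs $n_c$ times,
\[
p_c = \frac{n_c}{|d|}, \qquad H(d) = -\sum_{c} p_c \log_2 p_c .
\]
For example, the label \texttt{google} has $|d| = 6$ with $n_g = n_o = 2$ and $n_l = n_e = 1$, so
\[
H(\texttt{google}) = -2 \cdot \tfrac{1}{3} \log_2 \tfrac{1}{3} - 2 \cdot \tfrac{1}{6} \log_2 \tfrac{1}{6} \approx 1.92 \text{ bits},
\]
whereas a label of six distinct characters reaches the maximum of $\log_2 6 \approx 2.58$ bits. Whether such a value actually separates malicious from legitimate names would have to be evaluated on real data.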

In the first step of this work ($\sim$two months), all previous efforts to label domains with a reputation score will be investigated and evaluated as a basis for the succeeding algorithm. The following one to two months will be used to implement this algorithm and to evaluate it on a suitable dataset. In the last step ($\sim$two months), the thesis will be finalized.


\section{Related work}
\label{sec:related_work}
Malware-related dynamic domain reputation systems (machine learning approaches based on passive monitoring of DNS requests and responses):
\begin{itemize}
    \item Notos (passive monitoring of recursive DNS traffic) \cite{antonakakis2010building}
    \item Exposure (like Notos, but different feature set) \cite{bilge2011exposure}
    \item Kopis (working in the upper DNS hierarchy) \cite{antonakakis2011detecting}
\end{itemize}


Figure~\ref{exposure_features} shows an example of possible features, namely those extracted by Exposure to classify domains.
\begin{figure}[htbp]
\centering
\includegraphics[width=.7\textwidth]{exposure_features.png}
\caption{Features used in Exposure \cite{bilge2011exposure}}
\label{exposure_features}
\end{figure}

\textbf{In comparison, the features of Kopis:}

First, the following data is extracted from each DNS request/response pair: \\
\begin{math}
Q_j(d) = (T_j, R_j, d, \mathit{IPs}_j)
\end{math}
where
\begin{itemize}
    \item $T_j$ is the epoch (time of the request/response, e.g. on a daily basis),
    \item $R_j$ is the IP address of the request's initiator,
    \item $d$ is the queried domain, and
    \item $\mathit{IPs}_j$ is the set of resolved IPs for this domain as returned in the response.
\end{itemize}
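
As a purely hypothetical illustration of such a tuple (all values are invented, and the addresses are taken from the ranges reserved for documentation), a single observation for the domain \texttt{example.com} could look like
\[
Q_j(\texttt{example.com}) = (\text{2011-08-01},\; \texttt{198.51.100.7},\; \texttt{example.com},\; \{\texttt{192.0.2.10},\, \texttt{192.0.2.11}\}),
\]
that is, on the given day the requester \texttt{198.51.100.7} queried \texttt{example.com} and received two resolved addresses.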

Using this information, the following features are used to build the reputation score:

\begin{itemize}
    \item Requester Diversity: Where do the requests originate (overall)?
    \item Requester Profile: Is the requester a single computer, or does it itself serve many clients (e.g. the recursive DNS server of a large ISP)? Different profiles can therefore be weighted accordingly.
    \item Resolved-IPs Reputation (IPR): This checks a database for the reputation of all resolved IPs. In detail, the following aspects are examined:
    \begin{itemize}
        \item \textit{Malware Evidence}: Average number of known malware-related domains that have pointed to that IP in the last month (with respect to the epoch)
        \item \textit{SBL Evidence}: Very much like the Malware Evidence, but based on an external IP spam list (Spamhaus Block List \cite{SpamhausBlockingListOnline})
        \item \textit{Whitelist Evidence}: Number of IP addresses pointed to by known good domains (DNSWL \cite{DNSWLOnline} and the top 30 domains according to Alexa \cite{AlexaWebInformationOnline})
    \end{itemize}
\end{itemize}
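
The Malware Evidence can be written down as an average. The following is only a sketch under assumed notation: the set $M(ip, T)$ of known malware-related domains that pointed to the address $ip$ during the month preceding the epoch $T$, the set $\mathit{IPs}$ of resolved addresses, and the name $\mathit{ME}$ are introduced here for illustration and are not taken from the Kopis paper. It also assumes that the average is taken over the resolved addresses of the domain $d$.
\[
\mathit{ME}(d, T) = \frac{1}{|\mathit{IPs}|} \sum_{ip \,\in\, \mathit{IPs}} |M(ip, T)|
\]
For instance, if a domain resolves to two addresses to which three and one known malware-related domains have pointed, respectively, the resulting value is $(3 + 1)/2 = 2$.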


Comparing these three systems, Kopis is so far the most successful as a dynamic, independent, and global domain reputation scoring algorithm. It uses a supervised machine learning approach: in training mode, a set of labeled \textit{malware-related} and \textit{known legitimate} domain names is used to build a model based on query/response
patterns, which can later be used for statistical classification in operational mode. Overall, it achieves a high detection rate ($\sim$98.4\%) as well as a low false positive rate ($\sim$0.4\%).
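
For reference, and assuming the usual definitions of these two measures (the exact evaluation setup is given in \cite{antonakakis2011detecting}), let $\mathit{TP}$, $\mathit{FP}$, $\mathit{TN}$ and $\mathit{FN}$ denote the numbers of true positives, false positives, true negatives and false negatives. Then
\[
\text{detection rate} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}, \qquad
\text{false positive rate} = \frac{\mathit{FP}}{\mathit{FP} + \mathit{TN}} .
\]
On a hypothetical test set of $1000$ malware-related and $10\,000$ legitimate domains, a detection rate of $98.4\%$ and a false positive rate of $0.4\%$ would correspond to $984$ correctly detected malicious domains and $40$ legitimate domains flagged by mistake.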


%
% Bibliography
%
\bibliographystyle{abbrv}
\bibliography{bib}

%list of all pictures
\listoffigures

\end{document}
%-------------------
%End
%-------------------