\section{Exposure}
\label{sec:exposure}
\subsection{General}
\label{subsec:exposure_general}
\textit{Exposure} is ``a system that employs large-scale, passive DNS analysis techniques to detect domains that are involved in malicious activity''\fsCite{Bilge11exposure:finding}, which was first introduced in 2011 by the \textit{Institute Eurecom} in Sophia Antipolis, \textit{Northeastern University} in Boston and the \textit{University of California} in Santa Barbara. \textit{Exposure} is the second published system to detect malicious domains using passive DNS data and is built on the key premise that most malicious services depend on the Domain Name System and, compared to benign services, should expose enough behavioural differences for automated detection (see Section~\ref{subsec:exposure_features} for the differences the features target). The main analysis of \textit{Exposure} has been run on data covering a period of 2.5 months with more than 100 billion DNS queries. \textit{Exposure} is not targeted at a specific threat but rather covers a wide variety of malicious activities like phishing, Fast-Flux services, spamming, botnets (using domain generation algorithms), and similar others. It uses fifteen features, nine of which had not been proposed in previous research. Ultimately, \textit{Exposure} offers a real-time detection system which was made available to the public in 2014 \fsCite{Bilge:2014:EPD:2617317.2584679}. Unfortunately, the service was not accessible at the time of this writing.
\subsection{Architecture}
\label{subsec:exposure_architecture}
To distinguish benign from malicious domains reliably, \textit{Exposure} uses a large set of training data (seven days of traffic). The offline training has been powered by recursive DNS (RDNS) traffic gathered from the Security Information Exchange (SIE). Specifically, only the answers of the RDNS traffic have been used; each answer comprises the queried domain name, the timestamp of the request, the caching time (TTL) and the list of resolved IP addresses. The overall system consists of five main components. How these modules interact with each other and which input data each module requires can be seen in Figure~\ref{fig:exposure_system_overview}; a schematic sketch of the resulting pipeline is given after the following list.
\begin{itemize}
\item The \textit{Data Collector} module passively captures the DNS traffic in the monitored network.
\item The \textit{Feature Attribution} component attributes each captured domain with a vector containing the associated features.
\item The third component, the \textit{Malicious and Benign Domains Collector}, runs in parallel to the first two modules and constantly gathers information about known good and known bad domains. These lists are used to label the output of the \textit{Feature Attribution} module afterwards, as can be seen in Figure~\ref{fig:exposure_system_overview}. The list of benign domains is extracted from the Alexa top list \fsCite{AlexaWebInformationOnline} and externally confirmed \gls{whois} data. The list of known malicious domains is collected from several external sources, both professionally provisioned and user maintained, and includes domains of different threat classes, e.g., malwaredomains.com \fsCite{malwaredomainsInformationOnline}, Phishtank \fsCite{PhishtankInformationOnline}, Anubis (no longer available), the Zeus Block List \fsCite{zeusblocklistInformationOnline} and domains from DGAs for Conficker \fsCite{porras2009foray} and Mebroot \fsCite{Stone-Gross:2009:YBM:1653662.1653738}.
\item The labeled dataset is then fed into the \textit{Learning Module}, which trains the domain detection model used in the final step. The classifier may also be retrained on a regular basis (daily in \textit{Exposure}) to keep up with changing malicious behavior.
\item The \textit{Classifier} uses the decision model to classify unlabeled (new) domains into benign and malicious groups. For this, the same feature vector that is produced by the \textit{Feature Attribution} module is used.
\end{itemize}
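The interaction of these modules can be thought of as a simple batch pipeline. The following sketch (hypothetical Python; the record fields follow the RDNS answer tuple described above, while \texttt{extract\_features} and \texttt{train\_model} are placeholder callables a concrete implementation would have to provide) only illustrates the data flow, not the actual implementation of \textit{Exposure}.
\begin{lstlisting}[language=Python]
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set

@dataclass
class RdnsAnswer:
    domain: str        # queried domain name
    timestamp: float   # time of the request
    ttl: int           # caching time (TTL)
    ips: List[str]     # resolved IP addresses

def run_pipeline(answers: Iterable[RdnsAnswer],
                 benign: Set[str], malicious: Set[str],
                 extract_features: Callable, train_model: Callable) -> Dict[str, object]:
    answers = list(answers)                                        # Data Collector output
    domains = {a.domain for a in answers}
    vectors = {d: extract_features(d, answers) for d in domains}   # Feature Attribution

    # Malicious and Benign Domains Collector: label the known domains
    training = [(vectors[d], 'benign') for d in domains & benign]
    training += [(vectors[d], 'malicious') for d in domains & malicious]

    model = train_model(training)                                  # Learning Module

    # Classifier: score the remaining, unlabeled domains
    return {d: model.predict(vectors[d]) for d in domains - benign - malicious}
\end{lstlisting}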
\begin{figure}[!htbp]
\centering
\includegraphics[width=.9\textwidth, clip=true]{content/Evaluation_of_existing_Systems/Exposure/exposure_system_overview.png}
\caption{Exposure: System overview \fsCite[Figure 1]{Bilge11exposure:finding}}
\label{fig:exposure_system_overview}
\end{figure}
\subsection{Features}
\label{subsec:exposure_features}
\textit{Exposure} uses a total of fifteen features that have been chosen after several months of study with thousands of well-known benign and malicious domains. These features are grouped into four different categories, which can be seen in Table~\ref{tab:exposure_features}.
The first group, the \textit{Time-Based Features}, has not been approached in previous publications. These features investigate the times at which requests for domain \textit{d} have been issued. The main idea behind this group of features is to find malicious services that use techniques like \textit{domain flux}
(see Section~\ref{subsec:fast-flux_service_networks}) to circumvent take-downs and make their infrastructure more agile. \fsAuthor{Bilge:2014:EPD:2617317.2584679} infer that ``[\textit{Domain flux}] often show a sudden increase followed by a sudden decrease in the number of requests''. Domains of malicious services using a DGA only exist for a short period of time by design. \fsAuthor{Bilge:2014:EPD:2617317.2584679} define the first feature as follows: ``A domain is defined to be a short-lived domain [...] if it is queried only between time \(t_0\) and \(t_1\), and if this duration is comparably short (e.g., less than several days).'' The next three features are subject to the change point detection (CPD) problem: change point detection is about the identification of (abrupt) changes in the distribution of values, for example in time series. \textit{Exposure} implemented a CPD algorithm based on the popular CUSUM (cumulative sum) algorithm. At first, the time series of request timestamps is split into intervals of 3600 seconds (intervals of one hour were found to work well). After that, all time intervals are iterated and for each interval \(t\), the average request count of the previous eight intervals \(P_t^-\) and of the following eight intervals \(P_t^+\) is calculated. In the next step, the distance of these two values \(d(t)=|P_t^- - P_t^+|\) is calculated for each interval and the resulting ordered sequence of distances \(d(t)\) is fed to the CUSUM algorithm to finally retrieve all change points (for more information on the implemented CPD algorithm, see \fsCite[Section 3.1]{Bilge:2014:EPD:2617317.2584679}). To calculate the \textit{Daily similarity} features, the Euclidean distance between the daily time series of requests for \textit{d} is calculated. Intuitively, a low distance denotes similar time series and thus a high daily similarity, whereas two days with a higher distance show a less similar request volume. All features of this group naturally only perform well when a larger number of requests to \textit{d} is observed over a significant period of time.
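A minimal sketch of this idea is given below (Python). It assumes the request timestamps have already been aggregated into hourly counts; the \texttt{drift} and \texttt{limit} parameters of the CUSUM step are illustrative assumptions and not the values used by \textit{Exposure}.
\begin{lstlisting}[language=Python]
import math

def change_points(hourly_counts, window=8, drift=0.0, limit=20.0):
    """CUSUM-style change point detection on an hourly request-count series.

    For every interval t, the average count of the previous `window` intervals
    (P_t^-) and of the following `window` intervals (P_t^+) is compared; the
    distances d(t) = |P_t^- - P_t^+| are accumulated and a change point is
    reported once the cumulative sum exceeds `limit`.
    """
    points, cusum = [], 0.0
    for t in range(window, len(hourly_counts) - window):
        p_minus = sum(hourly_counts[t - window:t]) / window
        p_plus = sum(hourly_counts[t:t + window]) / window
        cusum = max(0.0, cusum + abs(p_minus - p_plus) - drift)
        if cusum > limit:
            points.append(t)
            cusum = 0.0
    return points

def daily_distance(day_a, day_b):
    """Euclidean distance between the hourly request counts of two days;
    a small distance corresponds to a high daily similarity."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(day_a, day_b)))
\end{lstlisting}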
The next group of features (\textit{DNS Answer-Based Features}) investigates the resolutions of the requested domain \textit{d}. While a single domain can map to multiple IP addresses even for benign services, most harmless services show a much smaller network profile in terms of, e.g., the location of the IP addresses or their distribution across \glspl{as}/\glspl{bgp}. To benefit from these findings, four features are extracted: the number of distinct IP addresses, the number of different countries these IP addresses are assigned to, the number of other domains that share an IP address \textit{d} resolves to, and the number of results of the reverse DNS queries for all IP addresses of \textit{d}. It is worth noting that some hosting providers also use one IP address for many domains, but in conjunction with the other features these false positives can be reduced.
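The following sketch (Python) shows how these four counts could be derived from aggregated resolution data; \texttt{geoip\_lookup} and \texttt{reverse\_dns} are hypothetical callables standing in for a GeoIP database and reverse DNS lookups.
\begin{lstlisting}[language=Python]
def dns_answer_features(domain, resolutions, geoip_lookup, reverse_dns):
    """Derive the four DNS answer-based features for `domain`.

    `resolutions` maps each observed domain to the set of IP addresses it
    resolved to during the observation period.
    """
    ips = resolutions[domain]
    countries = {geoip_lookup(ip) for ip in ips}
    shared = {d for d, other in resolutions.items() if d != domain and other & ips}
    ptr_names = [name for ip in ips for name in reverse_dns(ip)]
    return {
        'distinct_ips': len(ips),
        'distinct_countries': len(countries),
        'domains_sharing_ip': len(shared),
        'reverse_dns_results': len(ptr_names),
    }
\end{lstlisting}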
The \textit{TTL Value-Based Features} cover five individual features. Each answer to a DNS request contains the TTL attribute, which is the recommendation (configured by the operator of \textit{d}) of how long the resolution remains valid and should therefore be cached. Whereas RFC 1033 recommends a TTL of one day (86400 seconds) \fsCite{RFC1033}, it is becoming more common, especially for content delivery networks, to use much lower values (e.g., Cloudflare, one of the biggest managed DNS providers, uses a default of five minutes). Botnets also usually apply low TTL values to avoid long outages of C\&C servers and bots. As \fsAuthor{Bilge:2014:EPD:2617317.2584679} state, botnets also change their TTL values more frequently and use values in different ranges depending on the availability of their hosts: while high-bandwidth servers with low downtimes are assigned higher values, home computers behind a digital subscriber line are much more likely to go offline and are therefore assigned lower TTL values. For this reason, all TTL values for a domain are checked against the following ranges (in seconds): [0, 1], [1, 10], [10, 100], [100, 300], [300, 900], [900, \(\infty\)).
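A sketch of how the five TTL value-based features could be computed from the observed TTL values of a domain is given below (Python); the half-open interpretation of the range boundaries is an assumption.
\begin{lstlisting}[language=Python]
from statistics import mean, pstdev

# TTL ranges (in seconds); the last range is open-ended.
TTL_RANGES = [(0, 1), (1, 10), (10, 100), (100, 300), (300, 900), (900, float('inf'))]

def ttl_features(ttls):
    """Derive the five TTL value-based features from the observed TTLs of a domain."""
    changes = sum(1 for a, b in zip(ttls, ttls[1:]) if a != b)
    range_usage = [sum(1 for v in ttls if lo <= v < hi) / len(ttls)
                   for lo, hi in TTL_RANGES]
    return {
        'average_ttl': mean(ttls),
        'ttl_std_dev': pstdev(ttls),
        'distinct_ttl_values': len(set(ttls)),
        'ttl_changes': changes,
        'ttl_range_usage': range_usage,   # fraction of answers per range
    }
\end{lstlisting}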
The last group of features are the \textit{Domain Name-Based Features}. Benign services mostly use easy-to-remember domain names which consist of valid words, whereas attackers are often not interested in human-readable domain names; this is especially true for domains generated by a DGA. \textit{Exposure} extracts two statistical features from the domain name: the percentage of numerical characters and the ratio of the length of the longest meaningful (English) string (LMS) to the length of the domain name.
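A possible computation of these two features is sketched below (Python); the word list used to decide whether a substring is meaningful and the restriction to the second-level label are simplifying assumptions.
\begin{lstlisting}[language=Python]
def domain_name_features(domain, dictionary):
    """Derive the two domain name-based features for `domain`.

    `dictionary` is a set of meaningful English words; the longest meaningful
    string (LMS) is the longest substring of the second-level label that
    appears in this set.
    """
    label = domain.split('.')[-2]        # second-level label, e.g. 'example' for www.example.com
    substrings = (label[i:j] for i in range(len(label))
                  for j in range(i + 1, len(label) + 1))
    lms = max((s for s in substrings if s in dictionary), key=len, default='')
    return {
        'pct_numerical': sum(c.isdigit() for c in label) / len(label),
        'pct_lms_length': len(lms) / len(label),
    }
\end{lstlisting}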
\begin{table}[!htbp]
\centering
\caption{Exposure: Features}
\label{tab:exposure_features}
\begin{tabularx}{\textwidth}{|l|X|}
\hline
\textbf{Feature Set} & \textbf{Feature Name} \\ \hline
\multirow{4}{*}{\textit{Time-Based Features}} & Short life \\ \cline{2-2}
& Daily similarity \\ \cline{2-2}
& Repeating patterns \\ \cline{2-2}
& Access ratio \\ \hline
\multirow{4}{*}{\textit{DNS Answer-Based Features}} & Number of distinct IP addresses \\ \cline{2-2}
& Number of distinct countries \\ \cline{2-2}
& Number of domains sharing the same IP address \\ \cline{2-2}
& Reverse DNS query results \\ \hline
\multirow{5}{*}{\textit{TTL Value-Based Features}} & Average TTL \\ \cline{2-2}
& Standard Deviation of TTL \\ \cline{2-2}
& Number of distinct TTL values \\ \cline{2-2}
& Number of TTL changes \\ \cline{2-2}
& Percentage usage of specific TTL ranges \\ \hline
\multirow{2}{*}{\textit{Domain Name-Based Features}} & \% of numerical characters \\ \cline{2-2}
& \% of the length of the LMS \\ \hline
\end{tabularx}
\end{table}
\subsection{Reputation Engine}
\label{subsec:exposure_reputation_engine}
The reputation classifier of \textit{Exposure} is implemented using the \textit{J48} decision tree algorithm (see Section~\ref{sec:machine_learning} for details on decision trees; \textit{J48} is an implementation of the \textit{C4.5} decision tree algorithm). The performance of decision trees mainly depends on the quality of the training set. For this reason, a representative set of training data, with malicious domains from various threat classes, has to be chosen. Sources that have been used to identify malicious and benign domains can be found in Section~\ref{subsec:exposure_architecture}. In total, 3500 known bad as well as 3000 known good domains have been used for the initial training. In order to take advantage of the \textit{Time-Based Features}, the optimal training period has been observed to be seven days. The tree is then constructed using the feature attribute values and their corresponding labels. More specifically, the training set is iterated and, each time a set of samples can be separated using a single attribute (with respect to the assigned label), the tree branches out and a new node is created. Each branch is then split into more fine-grained subtrees as long as there is an \textit{information gain}, i.e., until all samples of a subset belong to the same class and are assigned the same label (see Section~\ref{subsec:decision_tree_classifier} for more on decision trees).
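A hedged sketch of the training step is shown below (Python). scikit-learn's CART implementation with the entropy criterion is used here only as an approximation of the \textit{J48}/\textit{C4.5} tree of \textit{Exposure}; it likewise selects splits by information gain, but is not the exact algorithm used by the authors.
\begin{lstlisting}[language=Python]
from sklearn.tree import DecisionTreeClassifier

def train_reputation_model(feature_vectors, labels):
    """Fit a decision tree on the labelled feature vectors (one row per domain,
    fifteen feature values each); refitting this model on an extended training
    set corresponds to the daily retraining of Exposure."""
    model = DecisionTreeClassifier(criterion='entropy')   # split by information gain
    model.fit(feature_vectors, labels)
    return model
\end{lstlisting}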
\subsection{Results}
\label{subsec:exposure_results}
The performance of classifiers with different feature sets has been tested using, among other methods, 10-fold cross validation (see Section~\ref{subsec:notos_results} for more details on 10-fold cross validation). To find the model with the minimum error rate, all combinations of feature sets (\textit{Time-Based Features} as F1, \textit{DNS Answer-Based Features} as F2, \textit{TTL Value-Based Features} as F3 and \textit{Domain Name-Based Features} as F4) have been trained using the same decision tree algorithm. Figure~\ref{fig:exposure_miss-classifier_instances} shows the error rate of these different classification models. The \textit{Time-Based Features} show the smallest error when single feature sets are inspected. Looking at models with multiple feature sets, the overall minimum error rate is produced when using all four feature groups.

The dataset collected for the initial analysis counted roughly 100 billion DNS queries. As processing all of these requests is not feasible in practice, two filtering steps have been introduced. The first one filters out all requests to domains in the Alexa top 1000 list. The assumption behind this filter is that no malicious domain will become this popular without being detected in some form. This step reduced the initial requests by about 20\%. The second step filters out all requests to domains that have been registered at least one year before the analysis. This filter applied to 45,000 domains (or 40 billion corresponding queries) and reduced the remaining traffic by another 50\%. The filtering process has been cross-checked against the Alexa top list, McAfee WebAdvisor (formerly McAfee SiteAdvisor) \fsCite{MCAfeeWebAdvisorOnline}, Google Safe Browsing \fsCite{GoogleSafeBrowsingOnline} and Norton Safe Web \fsCite{NortonSafeWebOnline}, and only 0.09\% of the filtered-out domains have been reported as risky. \fsAuthor{Bilge11exposure:finding} therefore state: ``We therefore believe that our filtering policy did not miss a significant number of malicious domains because of the pre-filtering we performed during the offline experiments.''
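The two pre-filtering steps can be summarised as follows (Python sketch); \texttt{registration\_date} is a hypothetical callable standing in for a \gls{whois} lookup.
\begin{lstlisting}[language=Python]
from datetime import timedelta

def prefilter(domains, alexa_top_1000, registration_date, analysis_start):
    """Drop domains that are either very popular or were registered more than
    a year before the analysis started."""
    one_year = timedelta(days=365)
    kept = []
    for d in domains:
        if d in alexa_top_1000:                                  # filtering step 1
            continue
        if registration_date(d) <= analysis_start - one_year:    # filtering step 2
            continue
        kept.append(d)
    return kept
\end{lstlisting}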
The accuracy of the classifier has been validated using two different methods. The first method was to classify the training set with 10-fold cross validation. The second method was to simply use 66\% of the dataset for training and the remaining data as the testing set.
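Both validation set-ups can be reproduced on any labelled feature matrix, for example with scikit-learn as sketched below; this is a stand-in for the \textit{J48}-based setup described above, not the original tooling.
\begin{lstlisting}[language=Python]
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

def validate(feature_vectors, labels):
    """Estimate the classifier accuracy with both validation methods."""
    model = DecisionTreeClassifier(criterion='entropy')

    # Method 1: 10-fold cross validation on the labelled dataset
    cv_accuracy = cross_val_score(model, feature_vectors, labels, cv=10).mean()

    # Method 2: train on 66% of the data, test on the remainder
    x_train, x_test, y_train, y_test = train_test_split(
        feature_vectors, labels, train_size=0.66)
    split_accuracy = model.fit(x_train, y_train).score(x_test, y_test)
    return cv_accuracy, split_accuracy
\end{lstlisting}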
\begin{figure}[!htbp]
\centering
\includegraphics[width=.9\textwidth, clip=true]{content/Evaluation_of_existing_Systems/Exposure/exposure_validation.png}
\caption{Exposure: Percentage of misclassified instances \fsCite[Figure 2]{Bilge11exposure:finding}}
\label{fig:exposure_miss-classifier_instances}
\end{figure}