
\section{Exposure}
\label{sec:exposure}

\subsection{General}
\label{subsec:exposure_general}

\textit{Exposure} is ``a system that employs large-scale, passive DNS analysis techniques to detect domains that are involved in malicious activity''. It was first introduced in 2011 by the \textit{Institute Eurecom} in Sophia Antipolis, \textit{Northeastern University} in Boston and the \textit{University of California} in Santa Barbara \fsCite{Bilge11exposure:finding}. \textit{Exposure} is the second published system to detect malicious domains using passive DNS data. It is built on the key premise that most malicious services depend on the domain name system and, compared to benign services, should expose enough differences in behavior to allow automated discovery. The main analysis for \textit{Exposure} was run on data covering a period of 2.5 months with more than 100 billion DNS queries. \textit{Exposure} is not targeted at a specific threat but rather covers a wide variety of malicious activities such as phishing, Fast-Flux services, spamming and botnets (using domain generation algorithms). It uses fifteen features, nine of which had not been proposed in previous research. Ultimately, \textit{Exposure} offers a real-time detection system, which was made available to the public in 2014 \fsCite{Bilge:2014:EPD:2617317.2584679}. Unfortunately, the service was not accessible at the time of this work.


\subsection{Architecture}
\label{subsec:exposure_architecture}

To distinguish benign from malicious domains reliably, \textit{Exposure} uses a large set of training data (seven days of traffic). The offline training was based on recursive DNS traffic (RDNS), gathered from the Security Information Exchange (SIE). Specifically, only the answers of the RDNS traffic were used, which comprise the queried domain name, the timestamp of the request, the caching time (TTL) and the list of resolved IP addresses. The overall system consists of five main components. The interaction of these modules can be seen in Figure~\ref{fig:exposure_system_overview}.

\begin{itemize}
    \item The \textit{Data Collector} module passively captures the DNS traffic in the monitored network.
    \item The \textit{Feature Attribution} component attributes the captured domains with the desired features.
    \item The third component, the \textit{Malicious and Benign Domains Collector}, runs in parallel to the first two modules and constantly gathers information about known good and known bad domains. These lists are used to label the output of the \textit{Feature Attribution} module afterwards, as can be seen in Figure~\ref{fig:exposure_system_overview}. The list of benign domains is extracted from the Alexa top list \fsCite{AlexaWebInformationOnline} and externally confirmed \gls{whois} data. The list of known malicious domains is collected from several external sources and includes domains in different threat classes, e.g., malwaredomains.com \fsCite{malwaredomainsInformationOnline}, Phishtank \fsCite{PhishtankInformationOnline} and Anubis (no longer available).
    \item The labeled dataset is then fed into the \textit{Learning Module}, which trains the domain detection model that is used in the final step. This classifier may also be retrained on a regular basis to keep up with changing malicious behavior (daily in \textit{Exposure}).
    \item The \textit{Classifier} uses the decision model to classify unlabeled (new) domains into benign and malicious groups. For this, the same feature vector that is produced by the \textit{Feature Attribution} module is used.
\end{itemize}
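
The data flowing between these components can be illustrated with a small Python sketch. It is not taken from \textit{Exposure}; the class \texttt{DnsAnswer}, the function \texttt{label\_domain} and the encoding of labels as 0 and 1 are assumptions made purely for illustration. The record holds the four fields of an RDNS answer named above, and the labeling step mimics how the lists of known benign and malicious domains could be applied to the attributed domains.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class DnsAnswer:
    """One RDNS answer, reduced to the four fields named in the text."""
    domain: str            # queried domain name
    timestamp: float       # time of the request (Unix seconds, an assumption)
    ttl: int               # caching time in seconds
    ips: Tuple[str, ...]   # list of resolved IP addresses


def label_domain(domain: str, benign: Set[str], malicious: Set[str]) -> Optional[int]:
    """Return 1 for a known malicious domain, 0 for a known benign one.

    Domains on neither list stay unlabeled (None) and are left for the
    classifier. Giving the malicious list precedence is an assumption.
    """
    if domain in malicious:
        return 1
    if domain in benign:
        return 0
    return None
```

Only labeled domains would enter the training set; unlabeled ones correspond to the new domains that the \textit{Classifier} has to decide on.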


\begin{figure}[!htbp]
    \centering
    \includegraphics[width=.9\textwidth, clip=true]{content/Evaluation_of_existing_Systems/Exposure/exposure_system_overview.png}
    \caption{Exposure: System overview \fsCite[Figure 1]{Bilge11exposure:finding}}
    \label{fig:exposure_system_overview}
\end{figure}


\subsection{Features}
\label{subsec:exposure_features}

\textit{Exposure} uses a total of fifteen features that were chosen after several months of study with thousands of well-known benign and malicious domains. These features are grouped into four different categories, which can be seen in Table~\ref{tab:exposure_features}.

The first group, the \textit{Time-Based Features}, had not been explored in earlier publications. These features investigate the time at which the request for domain \textit{d} was issued. The main idea behind this group of features is to find malicious services that use techniques like \textit{domain flux}
\todo{explain domain flux} to circumvent takedowns and make their infrastructure more agile. ``[\textit{Domain flux}] often show a sudden increase followed by a sudden decrease in the number of requests'' \fsCite[Section 3.1]{Bilge:2014:EPD:2617317.2584679}. Domains of malicious services using a DGA exist only for a short period of time by design. \fsAuthor{Bilge:2014:EPD:2617317.2584679} defines the first feature as follows: ``A domain is defined to be a short-lived domain [...] if it is queried only between time \(t_0\) and \(t_1\), and if this duration is comparably short (e.g., less than several days).'' The next three features are subject to the change point detection (CPD) problem: change point detection is about identifying (abrupt) changes in the distribution of values, for example in time series. \textit{Exposure} implements a CPD algorithm based on the popular CUSUM (cumulative sum) algorithm. First, the time series of request timestamps is split into intervals of 3600 seconds (one hour was found to work well). Then, for each interval \(t\), the average request count of the previous eight intervals \(P_t^-\) and of the following eight intervals \(P_t^+\) is calculated. In the next step, the distance between these two values, \(d(t)=|P_t^--P_t^+|\), is calculated for each interval, and the resulting ordered sequence \(d(t)\) of distances is fed to the CUSUM algorithm to finally retrieve all change points (for more information on the implemented CPD algorithm, see \fsCite[Section 3.1]{Bilge:2014:EPD:2617317.2584679}). To calculate feature two (\textit{Daily similarity}), the Euclidean distance between the time series of each day for \textit{d} is calculated. Intuitively, a low distance means similar time series and thus a high daily similarity, whereas two days with a higher distance show a less similar request volume.
Naturally, all features of this group only perform well when a larger number of requests to \textit{d} is available over a significant period of time.
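
The change point detection steps described above can be sketched as follows. This is a minimal illustration in Python and not the implementation used by \textit{Exposure}: the function names, the handling of intervals without two complete windows (their distance is set to zero) and the CUSUM parameters \texttt{allowance} and \texttt{threshold} are assumptions made for this example, and a basic one-sided CUSUM with reset is used in place of the exact variant from the paper.

```python
from typing import List


def hourly_counts(timestamps: List[float], start: float, end: float,
                  width: int = 3600) -> List[int]:
    """Count requests per interval of `width` seconds between start and end."""
    counts = [0] * (int((end - start) // width) + 1)
    for ts in timestamps:
        if start <= ts <= end:
            counts[int((ts - start) // width)] += 1
    return counts


def window_distances(counts: List[int], window: int = 8) -> List[float]:
    """d(t) = |P_t^- - P_t^+| for every interval t.

    P_t^- is the mean count of the `window` previous intervals, P_t^+ the
    mean of the `window` following ones. Intervals without two complete
    windows get distance 0.0 (an assumption to avoid edge artifacts).
    """
    distances = [0.0] * len(counts)
    for t in range(window, len(counts) - window):
        p_minus = sum(counts[t - window:t]) / window
        p_plus = sum(counts[t + 1:t + 1 + window]) / window
        distances[t] = abs(p_minus - p_plus)
    return distances


def cusum_change_points(distances: List[float], allowance: float,
                        threshold: float) -> List[int]:
    """Basic one-sided CUSUM: report t once the cumulative sum passes threshold."""
    points = []
    s = 0.0
    for t, value in enumerate(distances):
        # Accumulate only the part of d(t) that exceeds the allowance.
        s = max(0.0, s + value - allowance)
        if s > threshold:
            points.append(t)
            s = 0.0  # restart the accumulation after a detected change
    return points
```

For a domain whose hourly request count jumps abruptly, the distances \(d(t)\) peak around the jump and the cumulative sum crosses the threshold there, whereas a constant request volume yields no change points at all.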

The next group of features (\textit{DNS Answer-Based Features}) investigates the resolutions of the requested domain \textit{d}. Although a domain of a benign service can also map to multiple IP addresses, most harmless services show a much smaller network profile in terms of, e.g., location and \glspl{as}. To capture these findings, four features are extracted: the number of distinct IP addresses, the number of different countries these IP addresses are assigned to, the number of other domains that share an IP address \textit{d} resolves to, and the number of results of the reverse DNS queries for all IPs of \textit{d}. It is worth noting that some hosting providers also use one IP address for many domains, so an extra layer to prevent such false positives makes sense.
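
A minimal Python sketch of how these four values could be derived from observed resolutions is given below. The function \texttt{answer\_features} and its inputs are hypothetical: \texttt{ip\_country} stands in for a geolocation lookup and \texttt{reverse\_names} for the results of reverse DNS queries, both passed in as plain dictionaries so that the example stays self-contained. Counting distinct reverse names is one possible reading of the fourth feature.

```python
from typing import Dict, Iterable, List, Tuple


def answer_features(domain: str,
                    resolutions: Iterable[Tuple[str, str]],
                    ip_country: Dict[str, str],
                    reverse_names: Dict[str, List[str]]) -> Dict[str, int]:
    """Compute the four DNS answer-based values for `domain`.

    resolutions:   observed (domain, ip) pairs from the DNS answers
    ip_country:    hypothetical IP -> country code lookup
    reverse_names: hypothetical IP -> reverse DNS names lookup
    """
    pairs = set(resolutions)
    ips = {ip for name, ip in pairs if name == domain}
    countries = {ip_country[ip] for ip in ips if ip in ip_country}
    # Other domains that resolve to at least one IP of `domain`.
    shared = {name for name, ip in pairs if ip in ips and name != domain}
    ptr_names = {n for ip in ips for n in reverse_names.get(ip, [])}
    return {
        "distinct_ips": len(ips),
        "distinct_countries": len(countries),
        "domains_sharing_ip": len(shared),
        "reverse_dns_results": len(ptr_names),
    }
```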

The \textit{TTL Value-Based Features} cover five individual features. Each answer to a DNS request contains the TTL attribute, a recommendation configured by the operator of \textit{d} that states how long the resolution remains valid and may therefore be cached. Whereas RFC 1033 recommends a TTL of one day (86400 seconds) \fsCite{RFC1033}, it is becoming more common, especially for content delivery networks, to use much lower values (e.g., Cloudflare, one of the biggest managed DNS providers, uses a default of 5 minutes). Botnets also usually apply low TTL values to avoid long outages of C\&C servers and bots. As \fsAuthor{Bilge:2014:EPD:2617317.2584679} states, botnets also change their TTL values more frequently and use values in different ranges depending on the availability of their hosts. While high-bandwidth servers with low downtimes are assigned higher values, home computers behind a digital subscriber line are much more likely to fail and therefore get lower TTL values. For this reason, all TTL values for a domain are checked against the following ranges: \([0, 1)\), \([1, 10)\), \([10, 100)\), \([100, 300)\), \([300, 900)\), \([900, \infty)\).
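
The five TTL-based values can be illustrated with a short Python sketch. The function \texttt{ttl\_features} is hypothetical and not part of \textit{Exposure}; it assumes that the TTL values of a domain are given in the order in which they were observed, treats the ranges as half-open intervals so that each value falls into exactly one range, and uses the population standard deviation, which is a choice made for this example only.

```python
import statistics
from typing import Dict, List

# TTL ranges in seconds, read as half-open intervals [low, high).
TTL_RANGES = [(0, 1), (1, 10), (10, 100), (100, 300), (300, 900),
              (900, float("inf"))]


def ttl_features(ttls: List[int]) -> Dict[str, object]:
    """Compute the five TTL-based values from TTLs in observation order."""
    if not ttls:
        raise ValueError("at least one TTL value is required")
    # A change is counted whenever two consecutive answers differ in TTL.
    changes = sum(1 for a, b in zip(ttls, ttls[1:]) if a != b)
    shares = [sum(low <= t < high for t in ttls) / len(ttls)
              for low, high in TTL_RANGES]
    return {
        "average": statistics.mean(ttls),
        "stddev": statistics.pstdev(ttls),  # population std. dev. (assumption)
        "distinct": len(set(ttls)),
        "changes": changes,
        "range_shares": shares,
    }
```

A domain that always answers with the same TTL yields a standard deviation of zero, no changes and a single populated range, whereas frequently changing values spread over several ranges.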

The last group of features are the \textit{Domain Name-Based Features}. Domain names of benign services mostly use easy-to-remember names that consist of valid words. Attackers are often not interested in human-readable domain names. This is especially true for domains generated by a DGA. \textit{Exposure} extracts two statistical features from the domain name: the percentage of numerical characters and the length of the longest meaningful (English) string (LMS) relative to the length of the name.
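
Both domain name features can be sketched in a few lines of Python. The functions \texttt{numeric\_ratio} and \texttt{lms\_ratio} as well as the word list passed to them are assumptions for illustration. In particular, which part of the domain name is analyzed and which dictionary defines a meaningful string are not specified here, so the sketch simply works on a single given label and a caller-supplied set of words.

```python
from typing import Set


def numeric_ratio(label: str) -> float:
    """Share of numerical characters in the label."""
    if not label:
        return 0.0
    return sum(ch.isdigit() for ch in label) / len(label)


def lms_ratio(label: str, words: Set[str]) -> float:
    """Length of the longest substring found in `words`, relative to the label.

    `words` is a hypothetical dictionary of lower-case English words.
    """
    label = label.lower()
    if not label:
        return 0.0
    best = 0
    # Brute-force search over all substrings; fine for short domain labels.
    for i in range(len(label)):
        for j in range(i + 1, len(label) + 1):
            if j - i > best and label[i:j] in words:
                best = j - i
    return best / len(label)
```

A label composed of random characters and digits, as is typical for DGA output, tends towards a high numeric share and an LMS share close to zero, while a label built around a dictionary word shows the opposite pattern.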


\begin{table}[!htbp]
    \centering
    \caption{Exposure: Features}
    \label{tab:exposure_features}
    \begin{tabularx}{\textwidth}{|l|X|}
    \hline
    \textbf{Feature Set}                                & \textbf{Feature Name}                   \\ \hline
    \multirow{4}{*}{\textit{Time-Based Features}}       & Short life                              \\ \cline{2-2}
                                                        & Daily similarity                        \\ \cline{2-2}
                                                        & Repeating patterns                      \\ \cline{2-2}
                                                        & Access ratio                            \\ \hline
    \multirow{4}{*}{\textit{DNS Answer-Based Features}} & Number of distinct IP addresses         \\ \cline{2-2}
                                                        & Number of distinct countries            \\ \cline{2-2}
                                                        & Number of domains share the IP with     \\ \cline{2-2}
                                                        & Reverse DNS query results               \\ \hline
    \multirow{5}{*}{\textit{TTL Value-Based Features}}  & Average TTL                             \\ \cline{2-2}
                                                        & Standard Deviation of TTL               \\ \cline{2-2}
                                                        & Number of distinct TTL values           \\ \cline{2-2}
                                                        & Number of TTL change                    \\ \cline{2-2}
                                                        & Percentage usage of specific TTL ranges \\ \hline
    \multirow{2}{*}{\textit{Domain Name-Based Features}} & \% of numerical characters              \\ \cline{2-2}
                                                        & \% of the length of the LMS             \\ \hline
    \end{tabularx}
    \end{table}