\chapter{Development of DoresA}
\label{cha:development_of_doresa}

\todo{Remember: the system operates in a mostly safe environment (little malware should be in the field).}
The last part of this work covers the development of a dynamic domain reputation system. Many concepts for this system are adopted from the previously evaluated systems: most are taken from \textit{Exposure}, together with some general ideas from \textit{Notos} and \textit{Kopis}. In addition, some concepts that have not yet been proposed by those systems are investigated. In general, there are some limitations to be taken into account, which arise mostly from the specific type of data that is available for this work and from where it has been monitored. The passive DNS logs that have been provided for this work were collected on three recursive DNS servers of a large company at locations in Europe, Asia and the United States. As those logs contain sensitive data, the raw logs used in this work cannot be published, mostly for privacy reasons. It also has to be noted that the DNS requests are not available for this work for the same reason. The DNS responses should, however, be sufficient for the goal of this work.

\todo{Unlike Exposure: do not initially filter out domains? (Alexa top 1000 and older than one year)}

\section{Initial Situation and Goals}
\label{sec:initial_situation_and_goals}

Ultimately, this work should come up with an algorithm to find domains that are involved in malicious activities. Most of the recent work in this area has used machine learning techniques to build domain reputation scoring algorithms. As those publications have generally shown promising results (see Chapter~\ref{cha:evaluation_of_existing_systems}), this work also focuses on a dynamic approach involving machine learning algorithms. The network in which the logs for this work have been collected is different from most ISP or other public networks: a lot of effort is made to keep the network malware-free. This includes software solutions (like anti-virus software and firewalls) as well as a team that proactively and reactively monitors and removes malware. Another defensive task is to train the employees to be aware of current and upcoming threats (e.g., to pay attention to hyperlinks in emails, to distrust public USB sticks, and to follow physical access guidelines). Although this should lead to a mostly malware-free network with few requests to malicious domains, 2017 has shown to be the year of ransomware (see Section~\ref{sec:malware}): private internet users and companies have been infected with malware that encrypted their data and required the victim to pay an amount of money to decrypt it. There are, of course, other ongoing threats that have existed for many years, like spam campaigns (\fsCite{TrendMicroOnline}). The particular task in this work is to discover whether a dynamic reputation system for domains is useful and applicable under these circumstances.


\section{Dataset Preprocessing}
\label{sec:dataset_preprocessing}

\section{System Architecture}
\label{sec:system_architecture}

The overall system takes an approach similar to the one first introduced by \textit{Exposure} (see Section~\ref{sec:exposure}). In general, this involves an architecture with four different modules. The \textit{Malicious/Benign Domains Collector} works at the beginning of the analysis and fetches known malicious as well as known benign domains from several external services:
\begin{itemize}
\item The \textit{Malware Prevention through Domain Blocking} list from malwaredomains.com, a professionally maintained list of domains involved in malicious activities like the distribution of malware and spyware (\fsCite{malwaredomainsInformationOnline}).
\item \textit{Phishtank}: a list that targets domains engaged in phishing activities (\fsCite{PhishtankInformationOnline}).
\item \textit{ZeuS Tracker}: a blocking list for domains and IP addresses involved in the ZeuS botnet as command and control (C\&C) servers.
\item \textit{Alexa}: a list of the most popular domains in various countries as well as a global overview (a total of 2000 domains).
\end{itemize}
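
A minimal sketch of how such a collector could combine the fetched lists into a malicious and a benign domain set is given below. The file names are hypothetical; the actual collector is implemented as a set of bash scripts (see Section~\ref{sec:implementation}).

\begin{verbatim}
# Hypothetical sketch of the Malicious/Benign Domains Collector;
# file names and formats are assumptions, not the actual scripts.
def load_domains(path):
    """Read one domain per line, skipping blanks and comments."""
    with open(path) as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}

malicious = (load_domains("malwaredomains.txt")
             | load_domains("phishtank.txt")
             | load_domains("zeus_tracker.txt"))
benign = load_domains("alexa_top.txt")
# A domain appearing in both sources is treated as malicious
# here; the real collector may resolve such conflicts differently.
benign -= malicious
\end{verbatim}
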
The malicious domain lists from the first three services consisted of 28,367 individual entries when first collected. This information is later used to label benign and malicious domains in the training process. The \textit{Malicious/Benign Domains Collector} can be rerun at any time to keep up with newly known malicious and benign domains at a later stage and to increase the accuracy of \textit{DoresA}. The second module \todo{ref system architecture image}, the \textit{Data Aggregation Module}, collects all passive DNS logs and persists them. The \textit{Data Aggregation Module} is also responsible for persisting the information that is explicitly needed in the training step and is thus consumed by the \textit{Training Module}. The \textit{Training Module}'s primary concern is to learn a model that holds information about the resource usage of certain DNS responses, as well as to label those data samples. Due to the limited available time, the training period has been reduced to three days (starting from September 1, 2017) with a window of \todo{how many minutes roughly?}. The training model thus consisted of a total of \todo{how many in total} DNS responses and included resolutions for \todo{how many individual domains} individual domains. The accuracy of this model can also be increased by retraining it, e.g., once a day or week, to keep up with new characteristics of malicious usage. The trained model can then be used in the last module, the \textit{Classification Module}, to classify resolutions of unlabeled domains. The \textit{Classification Module} could, e.g., act as a real-time warning system when deployed in a network.
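
As an illustration, the labeling step could look roughly like the following sketch; the label encoding and all names are assumptions, with tiny placeholder sets standing in for the collector output and the logged domains.

\begin{verbatim}
# Hypothetical labeling step of the Training Module; domains not
# found in either list stay unlabeled and are skipped in training.
MALICIOUS, BENIGN, UNLABELED = 1, 0, -1

malicious = {"bad.example"}    # stand-in for the collector output
benign = {"example.com"}       # stand-in for the collector output
observed_domains = {"example.com", "bad.example", "new.example"}

def label(domain, malicious, benign):
    d = domain.lower().rstrip(".")
    if d in malicious:
        return MALICIOUS
    if d in benign:
        return BENIGN
    return UNLABELED

labels = {d: label(d, malicious, benign) for d in observed_domains}
training = {d: y for d, y in labels.items() if y != UNLABELED}
\end{verbatim}
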
The logs that are provided have been collected at different locations all over the world and are aggregated on a single machine as csv files. As operating on the raw csv logs in the training step has shown to be very inefficient \todo{benchmark here, roughly one week per day}, especially when performing multiple analysis cycles, a different solution for accessing the logs had to be found. Experiments with putting the raw passive DNS logs into a NoSQL database (MongoDB, \fsCite{MongoDBOnline}) as well as a relational database (MariaDB, \fsCite{MariaDBOnline}) did not show a significant decrease in data access time, so a slightly different approach has been used: an in-memory database (redis, \fsCite{RedisOnline}) that only keeps the information needed for the analysis has shown to give much better results \todo{benchmark here}. It has to be stated, though, that while most of the needed information is retained, information like the timestamp of individual requests could not be kept. See Table \todo{redis table} for the data that is stored inside the redis instance. Using an in-memory database for this application led to a different challenge: even though trimmed down to the minimum set of information, the data has an average size of \todo{numbers here} per day. For this reason, a machine with an appropriate amount of RAM had to be used. In this case, a machine with a total of 512 Gigabyte \todo{verify} of RAM and an Intel Xeon with 32 cores was available.
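
For illustration, the following sketch shows how such trimmed-down response information could be written to redis with the redis-py client; the key layout shown is only an assumption, as the actual schema is given in the table referenced above.

\begin{verbatim}
# Sketch of persisting trimmed DNS response data in redis via
# redis-py; the key schema shown is an assumption.
import redis

r = redis.Redis(host="localhost", port=6379)

def store_response(domain, ip, ttl, country):
    r.sadd("ips:" + domain, ip)             # distinct IP addresses
    r.sadd("countries:" + domain, country)  # distinct countries
    r.sadd("ttls:" + domain, ttl)           # distinct TTL values
    r.sadd("domains:" + ip, domain)         # domains sharing an IP
    # Per-request timestamps are deliberately not kept, matching
    # the trade-off described above.
\end{verbatim}
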

\todo{system architecture image}

\subsection{Decision Tree Classifier}
\label{subsec:decision_tree_classifier}

While evaluating previous work, mainly two classification algorithms have shown to provide good results in this area. A decision tree classifier has several advantages over other classification systems: the training time is comparably low, especially in contrast to neural networks; it delivers easily interpretable results when the resulting decision tree is plotted; it requires little data preparation (e.g., no normalization of the input is needed, unlike in many other algorithms, and it can handle both numerical and categorical inputs); and it is possible to validate the results of the training using techniques like cross-validation. In this work, the implementation of the Python library scikit-learn is used. The algorithm implemented in scikit-learn is called \textit{CART} (Classification and Regression Trees) and is similar to the C4.5 decision tree algorithm that is also used in \textit{Exposure}. For a detailed comparison of classification algorithms see \fsCite{Lim2000}.
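
As an illustration, training and cross-validating such a classifier with scikit-learn takes only a few lines; the feature values below are toy placeholders for the preprocessed feature matrix and labels.

\begin{verbatim}
# Minimal decision tree training with 10-fold cross-validation;
# X and y stand in for the preprocessed features and labels
# (benign = 0, malicious = 1) and are toy values only.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X = [[2, 1, 300.0, 0.0], [14, 6, 120.5, 0.4]] * 10
y = [0, 1] * 10

clf = DecisionTreeClassifier()  # CART-based implementation
scores = cross_val_score(clf, X, y, cv=10)
print("mean accuracy: %.3f" % scores.mean())
clf.fit(X, y)  # final model trained on the full training set
\end{verbatim}
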

\section{Feature Selection}
\label{sec:feature_selection}

The feature selection is primarily motivated by the results of the evaluation of previously proposed systems. As \textit{Exposure} has shown to be the system that shares the most similarities with the network and traffic available for this work, most features are adopted from \textit{Exposure} in the first place. Due to the restricted analysis time, the \textit{Time-Based Features} unfortunately cannot be used in this work; to recapitulate, at least one week of traffic has to be trained on to benefit from those features. Apart from that, nearly all features of \textit{Exposure} could be used for the training. See Table~\ref{tab:doresa_features} for all features that are used to model the resource usage characteristics of domains involved in legitimate and malicious activities; a sketch of how some of these features can be derived from the aggregated data follows the table. For a detailed explanation of why these features have been included, see Section~\ref{subsec:exposure_features}.
\begin{table}[!htbp]
\centering
\caption{DoresA: Features}
\label{tab:doresa_features}
\begin{tabularx}{\textwidth}{|l|X|}
\hline
\textbf{Feature Set} & \textbf{Feature Name} \\ \hline
\multirow{4}{*}{\textit{DNS Answer-Based Features}} & Number of distinct IP addresses \\ \cline{2-2}
 & Number of distinct countries \\ \cline{2-2}
 & Number of domains sharing the IP address \\ \cline{2-2}
 & Reverse DNS query results \\ \hline
\multirow{5}{*}{\textit{TTL Value-Based Features}} & Average TTL \\ \cline{2-2}
 & Standard deviation of TTL \\ \cline{2-2}
 & Number of distinct TTL values \\ \cline{2-2}
 & Number of TTL changes \\ \cline{2-2}
 & Percentage usage of specific TTL ranges \\ \hline
\multirow{2}{*}{\textit{Domain Name-Based Features}} & \% of numerical characters \\ \cline{2-2}
 & \% of the length of the longest meaningful substring (LMS) \\ \hline
\end{tabularx}
\end{table}
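
The following sketch illustrates how some of the listed features could be derived from the aggregated per-domain data; the function and field names are assumptions for illustration.

\begin{verbatim}
# Sketch of deriving some of the listed features from the
# aggregated per-domain data; names are illustrative only.
import statistics

def ttl_features(ttls):
    """TTL Value-Based Features from the observed TTL values."""
    return {
        "avg_ttl": statistics.mean(ttls),
        "stddev_ttl": statistics.pstdev(ttls),
        "distinct_ttls": len(set(ttls)),
    }

def numerical_ratio(domain):
    """Domain Name-Based Feature: % of numerical characters,
    computed over the first label (a simplification)."""
    name = domain.split(".")[0]
    return sum(c.isdigit() for c in name) / max(len(name), 1)
\end{verbatim}
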

\todo{additional features?}

\section{Implementation}
\label{sec:implementation}

The implementation of \textit{DoresA} includes several different pieces of software. The main part is implemented in Python and consists of the \textit{Training Module} and the \textit{Classification Module}. Apart from the main application, the \textit{Malicious/Benign Domains Collector} is a collection of bash scripts that fetch the filter lists and combine them into lists that can easily be consumed by the main application. The \textit{Data Aggregation Module} is written in C (\fsCite{kernighan2006c}), mostly for performance reasons, as the logs are aggregated in real time and fed into the redis database. Most of the \textit{Data Aggregation Module} implementation was already available for this work but had to be extended to also persist all TTL changes for a domain.

The main application works in two modes. In the training mode, all entries are first loaded from the raw csv logs for the given period. The next step extracts and calculates the values that are needed for each feature and uses the filter lists gathered by the \textit{Malicious/Benign Domains Collector} to label the dataset. After this, the feature values along with the labels are persisted as serialized Python objects. This persistence step is needed for the final step of training, but it is also useful if the training crashes or is stopped for some reason: it can then be continued where the previous training left off. The last step uses the preprocessed features and the corresponding labels to build the decision model, i.e., to generate the decision tree. The training can mostly (apart from the last step) be done in parallel to achieve a reasonable training time; the implementation in this work has been executed efficiently on 32 cores and took roughly two days for partially (see Section~\ref{sec:system_architecture}) training three days of input data \todo{figures}. In the second mode, the \textit{Classification Module} classifies a dataset as either benign or malicious. While the evaluated systems have a variable reputation score from zero to one, this system performs a binary classification of the dataset in the first place. This could be changed to a variable reputation score, e.g., by using the probability for each class that can also be retrieved from the scikit-learn decision tree implementation \todo{(see fscite SciKitProbOnline)}.
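
For instance, a variable reputation score could be derived from the class probabilities roughly as in the following sketch; the model file name and the label encoding are assumptions.

\begin{verbatim}
# Sketch of the Classification Module deriving a reputation
# score from class probabilities; file name is hypothetical.
import pickle

with open("doresa_model.pkl", "rb") as f:
    clf = pickle.load(f)  # the persisted decision tree model

def reputation_score(features):
    """Probability of the malicious class (1), in [0, 1]."""
    proba = clf.predict_proba([features])[0]
    return proba[list(clf.classes_).index(1)]

# Binary decision as currently implemented:
# label = clf.predict([features])[0]
\end{verbatim}
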

\todo{include picture of decision tree here}


\section{Evaluation}
\label{sec:evaluation}

\todo{include more graphs/pictures in general}