\chapter{Development of DoresA}
\label{cha:development_of_doresa}
The last part of this work describes the development of a dynamic domain reputation system, \textit{DoresA} (short for Domain reputation scoring Algorithm). Many concepts for this system are adopted from the previously evaluated systems: most are taken from \textit{Exposure}, combined with some general ideas of \textit{Notos} and \textit{Kopis}. In general, there are some limitations to be taken into account, which arise mostly from the specific type of data that is available for this work and where it has been monitored. The passive DNS logs that have been provided for this work have been collected on three recursive DNS servers in a large company network, in locations in Europe, Asia and the United States. As those logs contain sensitive data, the raw logs used in this work cannot be published, mostly for privacy reasons. It also has to be noted that the DNS requests are not available for this work for the same reason.
\section{Initial Situation and Goals}
\label{sec:initial_situation_and_goals}
Ultimately, this work should come up with an algorithm to find domains that are involved in malicious activities. Most recent related work has used machine learning techniques to build domain reputation scoring algorithms. As those publications have generally shown promising results (see Chapter~\ref{cha:evaluation_of_existing_systems}), this work also focuses on a dynamic approach involving machine learning algorithms. The network in which the logs used for the analysis and the development of the new algorithm have been collected is different from most ISP or other public networks. A lot of effort is made to keep the network malware-free. This includes both software solutions (like anti-virus software and firewalls) as well as a team that proactively and reactively monitors and removes malware. Another defensive task is to train the employees to be aware of current and upcoming threats (e.g., to pay attention to hyperlinks in emails, to distrust public USB sticks, and to follow physical access guidelines). Although this should lead to a mostly malware-free environment (in this particular network and similarly hardened ones) with few requests to malicious domains, 2017 has turned out to be the year of ransomware (see Section~\ref{sec:malware}). Companies as well as private internet users have been infected with malware that encrypted their data and required the target to pay an amount of money to decrypt it. There are of course other ongoing threats that have existed for many years, like spam campaigns (\fsCite{TrendMicroOnline}). The task in this work is to discover whether a dynamic reputation system for domains is useful and applicable under these circumstances. The ultimate goal (not part of this work) is an automated warning system that triggers when a malicious domain is requested.
\section{System Architecture}

The overall system takes a similar approach to the previously evaluated systems. The following publicly available lists are used to label benign and malicious domains:
\begin{itemize}
\item \textit{Malware Prevention through Domain Blocking}: a professionally maintained list from malwaredomains.com containing domains involved in malicious activities like the distribution of malware and spyware (\fsCite{malwaredomainsInformationOnline}).
\item \textit{Phishtank}: a list that targets domains engaged in phishing activities (\fsCite{PhishtankInformationOnline}).
\item \textit{ZeuS Tracker}: a blocking list for domains and IP addresses involved in the ZeuS botnet as command and control (C\&C) servers.
\item \textit{Alexa}: a list of the most popular domains from a global perspective (a total of 2000 domains).
\end{itemize}
\begin{figure}[!htbp]
\centering
% Figure content (image, caption and label) elided in this excerpt.
\end{figure}
The malicious domain lists from those three services consisted of 28367 individual entries when first collected. This information is later used to label benign and malicious domains in the training process. The \textit{Malicious/Benign Domains Collector} can be rerun at any time to keep up with known malicious and benign domains at a later stage and thus increase the accuracy of \textit{DoresA}. The second module, the \textit{Data Aggregation Module}, collects all passive DNS logs and persists them. It is also responsible for extracting and persisting all feature values that are needed in the training step and subsequently consumed by the \textit{Training Module}. The \textit{Training Module}'s primary concern is to label the data samples and learn a model that holds information about the resource usage of certain DNS responses. Due to the limitation of available time, the training period has been reduced to three days (starting from the first of September 2017) and for simplicity has been reduced to 1 million samples (chosen randomly over the three days). The accuracy of this model can also be increased by retraining the model, e.g. once a week, to keep up with new characteristics of malicious usage. The training model can then be used in the last module, the \textit{Classification Module}, to classify resolutions (feature vectors) of unlabeled domains. The \textit{Classification Module} could, e.g., be used to act as a real-time warning system when deployed on a resolver.
The logs that are provided have been collected in different locations all over the world and are aggregated on a single machine as CSV files. As operating on the raw CSV logs in the training step has shown to be very inefficient (roughly one week of training time for a single day of traffic), especially when performing multiple analysis cycles, a different solution for accessing the logs had to be found. Experiments with putting the raw passive DNS logs into a NoSQL database (MongoDB \fsCite{MongoDBOnline}) as well as a relational database (MariaDB \fsCite{MariaDBOnline}) did not show a significant decrease in access time, so a slightly different approach has been used. Using an in-memory database (Redis \fsCite{RedisOnline}) and keeping only the information that is needed for the analysis has shown to give much better results: for a training set with 1 million samples, the execution time could be reduced to two days. It has to be stated, though, that while retaining most of the needed information, aspects like the timestamp of individual requests could not be kept. As the time patterns of single requests could not be used for the classification anyway, due to caching in lower hierarchies, this has not shown to be a problem. The following attributes are stored inside the Redis instance (a sketch of an assumed key layout follows the list).
\begin{itemize}
\item \textbf{Resource record}, i.e. the domain name in this scope.
\item The \textbf{type} of the resource record. DoresA only takes A records into account, as most features cannot be extracted from other types. See Section~\ref{subsubsec:dns_resource_records} for an explanation of the DNS resource record types.
\item All \textbf{TTL} values that have been assigned to this domain in the analysis period.
\item \textbf{Resolution}: the IP addresses that the record resolved to.
\item \textbf{First/last-seen}: Timestamps of when the domain has been seen for the first and last time.
\item Additionally, all \textbf{reverse DNS} results are persisted, e.g. to find all historic domains that resolved to a known IP address.
\end{itemize}
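
A minimal sketch of how such an observation could be persisted with the \textit{redis-py} client is shown below. The key layout and the helper name are illustrative assumptions, not the exact schema used by \textit{DoresA}.

\begin{verbatim}
# Sketch (assumed key layout, not DoresA's exact schema): persist one
# passive DNS observation of an A record in Redis.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def persist_observation(domain, ip, ttl, timestamp):
    key = "rr:" + domain
    r.sadd(key + ":ttls", ttl)               # all distinct TTL values seen
    r.sadd(key + ":ips", ip)                 # all IPs the domain resolved to
    r.setnx(key + ":first_seen", timestamp)  # set only on the first sighting
    r.set(key + ":last_seen", timestamp)     # always updated
    r.sadd("rev:" + ip, domain)              # reverse index: IP -> domains

persist_observation("example.com", "93.184.216.34", 3600, 1504224000)
\end{verbatim}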
Using an in-memory database for this application led to a different challenge. Even though trimmed down to the minimum set of information, the Redis database used 3354 megabytes of memory for one week of traffic. For this reason, a machine with an appropriate amount of internal RAM had to be used. In this case, a total of 512 gigabytes of RAM was available together with an Intel Xeon CPU with 32 cores to be able to perform the analysis in a reasonable time. As training on a single core with these amounts of data was not feasible in the available time, a multi-core processing approach has been targeted.
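
Since feature extraction is independent per domain, it parallelizes well. The following is a minimal sketch of the multi-core approach using Python's \textit{multiprocessing} module; the function name and the placeholder extraction logic are assumptions.

\begin{verbatim}
# Sketch of the multi-core processing approach (names are assumptions):
# feature extraction for independent domains is embarrassingly parallel.
from multiprocessing import Pool

def extract_features(domain):
    # Placeholder: read the domain's attributes from Redis and compute
    # its feature vector here.
    return [0.0] * 19

if __name__ == "__main__":
    domains = ["example.com", "example.org"]  # 1 million domains in practice
    with Pool(processes=32) as pool:          # one worker per core
        feature_vectors = pool.map(extract_features, domains)
\end{verbatim}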
\subsection{Decision Tree Classifier}
\label{subsec:decision_tree_classifier}
While evaluating previous work, two classification algorithms have shown to provide good results in this area. While a random forest implementation gave good results in the case of \textit{Kopis}, a decision tree classifier has one major advantage over a random forest. Performance has shown to be a major challenge in this work, and as a random forest essentially consists of multiple (arbitrarily sized) decision trees, the training time increases with the number of trees the random forest generates. As \textit{Exposure} and \textit{Notos} have proved to achieve good results with a decision tree, this classification algorithm will also be used in this work. Decision tree classification has further advantages: it delivers easily interpretable results when plotting the resulting tree, it requires little data preparation (e.g. no normalization of the input is needed, unlike in many other algorithms, and it can handle both numerical and categorical inputs) and it is possible to validate the results of the training using techniques like cross validation. In this work, the implementation of the Python library scikit-learn is used. The current implementation in scikit-learn is called \textit{CART} (Classification and Regression Trees) and is similar to the C4.5 decision tree implementation that is also used in \textit{Exposure}. For a detailed comparison of classification algorithms see \fsCite{Lim2000}.
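
A minimal sketch of training and validating such a tree with scikit-learn is given below; the synthetic input merely stands in for the extracted feature vectors and labels.

\begin{verbatim}
# Sketch: training and cross-validating the CART decision tree
# (synthetic data stands in for the extracted feature vectors).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(1000, 19)       # feature vectors (19 values, see below)
y = np.random.randint(0, 2, 1000)  # labels: 0 = benign, 1 = malicious

clf = DecisionTreeClassifier()     # CART, Gini impurity by default
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
clf.fit(X, y)
print("mean CV accuracy: %.3f" % scores.mean())
\end{verbatim}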
\section{Feature Selection}
\label{sec:feature_selection}
The feature selection is primarily motivated by the results of the evaluation of previously proposed systems. As \textit{Exposure} has shown to be the system that shares most similarities, especially regarding the traffic that is available, most features are also adopted from \textit{Exposure} in the first place. Due to the restricted analysis time, the \textit{Time-Based Features} can unfortunately not be used in this work. To recapture, at least one week of traffic has to be trained on to benefit from those features. Apart from that, nearly all features of \textit{Exposure} could be used for the training. See Table~\ref{tab:doresa_features} for all features that are used to model the resource usage characteristics of domains used in legitimate and malicious activities. For a detailed explanation of why these features have been included, see Section~\ref{subsec:exposure_features}. This sums up to a total of eleven different features, with some features having multiple feature values in the feature vector: the \textit{Reverse DNS query results} contain the ratio of IPs that can not be resolved (NX domains), the number of all resolved IPs for a domain, the ratio of IP addresses that are known to be used as digital subscriber lines (DSL), the ratio of IPs that are used for web hosting, the ratio of IPs that are used by internet service providers (ISPs) and the ratio of IPs that can be matched with a valid domain name. Please note that the software that would have been used to generate these features could not be shipped in time and the NX domains have not yet been available in the database, so these features are ignored in the sample training. The percentage usage of specific TTL ranges includes the following individual features (ranges in seconds): [0, 1], [1, 10], [10, 100], [100, 300], [300, 900], [900, inf].
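
The following is a minimal sketch of how the percentage usage of the TTL ranges could be computed from the list of observed TTL values; the exact binning in \textit{DoresA}'s implementation may differ.

\begin{verbatim}
# Sketch: percentage usage of specific TTL ranges (bins as named in
# the text; the exact binning in DoresA may differ).
def ttl_range_usage(ttl_values):
    bounds = [0, 1, 10, 100, 300, 900, float("inf")]
    counts = [0] * (len(bounds) - 1)
    for ttl in ttl_values:
        for i in range(len(counts)):
            if bounds[i] <= ttl < bounds[i + 1]:
                counts[i] += 1
                break
    return [c / float(len(ttl_values)) for c in counts]

# prints [0.0, 0.0, 0.25, 0.0, 0.5, 0.25]:
print(ttl_range_usage([60, 300, 300, 3600]))
\end{verbatim}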
\begin{table}[!htbp]
\centering
\caption{DoresA: Features}
\label{tab:doresa_features}
\begin{tabularx}{\textwidth}{|l|X|l|}
\hline
\textbf{Feature Set} & \textbf{Feature Name} & \textbf{\# in Vector} \\ \hline
\multirow{4}{*}{\textit{DNS Answer-Based Features}} & Number of distinct IP addresses & \#1 \\ \cline{2-3}
& Number of distinct countries & \#2 \\ \cline{2-3}
& Number of domains the IP is shared with & \#3 \\ \cline{2-3}
& Reverse DNS query results & \#4 - \#8 \\ \hline
\multirow{5}{*}{\textit{TTL Value-Based Features}} & Average TTL & \#9 \\ \cline{2-3}
& Standard Deviation of TTL & \#10 \\ \cline{2-3}
& Number of distinct TTL values & \#11 \\ \cline{2-3}
& Number of TTL changes & \#12 \\ \cline{2-3}
& Percentage usage of specific TTL ranges & \#13 - \#17 \\ \hline
\multirow{2}{*}{\textit{Domain Name-Based Features}} & \% of numerical characters & \#18 \\ \cline{2-3}
& \% of the length of the LMS & \#19 \\ \hline
\end{tabularx}
\end{table}
\section{Implementation}
\label{sec:implementation}
The implementation of \textit{DoresA} includes several different pieces of software. The main part is implemented in Python and consists of the \textit{Training Module} and the \textit{Classification Module}. Apart from the main application, the \textit{Malicious/Benign Domains Collector} is a collection of bash scripts to fetch the filter lists and combine them into lists that can easily be consumed by the main application. The \textit{Data Aggregation Module} is written in C (\fsCite{kernighan2006c}), mostly for performance reasons, as these logs are aggregated in real time and fed into the Redis database. For the \textit{Data Aggregation Module}, a previously available implementation could be extended to also persist all TTL changes for a domain. To further decrease training time, the Redis database actually consists of nine different instances which can be accessed in parallel. To actually benefit from multiple instances, the domain (acting as a key) is hashed and the modulo operation is used to evenly fill the instances.
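
A minimal sketch of this hash-and-modulo distribution over the nine instances follows; the ports and the choice of hash function are illustrative assumptions.

\begin{verbatim}
# Sketch: spreading domains over nine Redis instances via hash + modulo
# (ports and the use of MD5 are illustrative assumptions).
import hashlib
import redis

NUM_INSTANCES = 9
instances = [redis.Redis(host="localhost", port=6379 + i)
             for i in range(NUM_INSTANCES)]

def instance_for(domain):
    # Hashing the key and taking the modulo fills the instances evenly
    # while keeping lookups deterministic.
    digest = hashlib.md5(domain.encode("utf-8")).hexdigest()
    return instances[int(digest, 16) % NUM_INSTANCES]

instance_for("example.com").sadd("rr:example.com:ttls", 3600)
\end{verbatim}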
The main application works in two modes. In the training mode, all entries are first loaded from the raw CSV logs for the given period. The next step extracts and calculates the values that are needed for each feature and uses the filter lists gathered by the \textit{Malicious/Benign Domains Collector} to label the dataset. After this, the feature values along with the labels are persisted. The last step uses the preprocessed features and the corresponding labels to build the decision model, i.e. generate the decision tree. The training can mostly (apart from the last step) be done in parallel to achieve a reasonable training time – the implementation in this work has efficiently been executed on 32 cores and took roughly two days for training a dataset with 1 million samples. In the second mode, the \textit{Classification Module} classifies a dataset as either benign or malicious. While the evaluated systems have a variable reputation score from zero to one, this system does a binary classification in the first place. This could be changed to a variable reputation score, e.g. using the probability for each class that can also be retrieved from the scikit-learn decision tree implementation. A variable score has one major advantage over a binary classification: operators can set a threshold to make sure that no false positives occur, for example in an automated blocking system.
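
A minimal sketch of deriving such a variable score from class probabilities is shown below; the toy training data and the threshold value are illustrative assumptions.

\begin{verbatim}
# Sketch: turning the binary decision into a variable reputation score
# via class probabilities (toy data and threshold are assumptions).
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit([[0, 60], [1, 30], [0, 86400], [0, 43200]],  # toy feature vectors
        [1, 1, 0, 0])                                # 0 = benign, 1 = malicious

def reputation_score(vector):
    # predict_proba returns [P(benign), P(malicious)] per sample;
    # the malicious probability serves as the score.
    return clf.predict_proba([vector])[0][1]

THRESHOLD = 0.95  # operator-chosen to keep false positives near zero
if reputation_score([1, 45]) >= THRESHOLD:
    print("would trigger the warning system")
\end{verbatim}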
Figure~\ref{fig:doresa_selection_decision_tree} shows an excerpt of the resulting decision tree from the test training with 1 million data samples. Looking at the root node (see Figure~\ref{fig:doresa_selection_decision_tree_root}), we can see that the overall model consists of 1 million samples, and the first row shows that feature twelve (the number of TTL changes), \(X[11]\), has the most information gain to split the initial dataset. The second row shows the Gini impurity of the node, where zero means that all samples belong to a single class and 0.5 (for two classes) means an equal distribution over both classes. Considering any leaf in Figure~\ref{fig:doresa_selection_decision_tree}, we can see the number of samples that belong to each class and the resulting class/label (zero represents a benign, one a malicious domain).
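
Such a plot can be reproduced from a trained classifier with scikit-learn's Graphviz export; a minimal sketch follows (the file name and the external \texttt{dot} call are assumptions).

\begin{verbatim}
# Sketch: exporting a trained tree for plotting (file name assumed).
from sklearn.tree import DecisionTreeClassifier, export_graphviz

clf = DecisionTreeClassifier()
clf.fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])  # toy data

export_graphviz(clf, out_file="doresa_tree.dot")
# Render afterwards with Graphviz: dot -Tpng doresa_tree.dot -o tree.png
\end{verbatim}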
\begin{figure}[!htbp]
\centering
\includegraphics[width=.8\textwidth, clip=true]{content/Development_of_DoresA/doresa_example_tree_root.png}
\caption{DoresA: root node of resulting decision tree}
\label{fig:doresa_selection_decision_tree_root}
\end{figure}
\begin{figure}[!htbp]
\centering
% Image not shown in this excerpt (binary file): excerpt of the resulting decision tree.
\caption{DoresA: excerpt of the resulting decision tree}
\label{fig:doresa_selection_decision_tree}
\end{figure}