rush hour 2

This commit is contained in:
2018-01-30 20:54:29 +01:00
parent ece9b4afcf
commit 95d35f2470
20 changed files with 171 additions and 146 deletions


@@ -36,7 +36,7 @@ The overall system architecture can be seen in Figure~\ref{fig:kopis_system_over
\textbf{Benign domain sources: }
\begin{itemize}
\item Domain and IP whitelists from DNSWL \fsCite{DNSWLOnline}
\item Address space of the top 30 Alexa domains \fsCite{AlexaWebInformationOnline}
\item Dihe's IP-Index Browser \fsCite{DIHEOnline}
\end{itemize}
@@ -67,7 +67,7 @@ This group of features tries to map the requester diversity, i.e. where the requ
\subsubsection{Requester Profile (RP)}
\label{subsubsec:kopis_requester_profile}
The \textit{Requester Profile} features aim to separate requests coming from hardened networks (like enterprise networks) from those coming from less secure networks, e.g. ISP networks. Most smaller networks, like enterprise or university networks, are much better protected against malware in general and as such should issue fewer requests to malicious domains. ISPs, on the other hand, usually do not invest much effort into cleaning their networks of malware and do not offer a high level of protection against malware propagation inside the network. As \textit{Kopis} operates in the upper DNS layers, it is often not possible to simply measure the population behind the requesting RDNS server (due to e.g. caching \fsCite{10.1007/978-3-540-24668-8_15}), so a different metric has to be found to measure the size of the network a request originates from. Assume traffic monitored at a large AuthNS in epoch \(E_t\), where the AuthNS is authoritative for a set of domains \(D\) and \(R\) is the set of all unique requesting IP addresses. For each requester IP \(R_k \in R\), the number of distinct domains queried by \(R_k\) in \(E_t\) is counted as \(c_{t,k}\). A weight can then be assigned to each requester \(R_k\) as \(w_{t,k} = \frac{c_{t,k}}{\max_{l=1}^{|R|} c_{t,l}}\). Consequently, the more domains in \(D\) a requester \(R_k\) queries, the higher its weight will be. This way, high weights correspond to larger networks and, following the reasoning above, to a higher likelihood that the requester is infected with malicious software. Given a domain \textit{d}, let \(R_d\) be the set of all IP addresses requesting \textit{d}; the count \(c_{t,k}\) is computed for each epoch \(E_t\) as described above. Each count is then multiplied by the weight \(w_{t-n,k}\), i.e. the weight computed \textit{n} days before epoch \(E_t\), to get the set of weighted counts of \textit{d} during \(E_t\): \(WC_t(d) = \{c_{t,k} \cdot w_{t-n,k}\}_k\). 
Finally, five different feature values are calculated from the values of \(WC_t(d)\): the average, the biased and unbiased standard deviation, and the biased and unbiased variance. The biased and unbiased estimators for the standard deviation of a random variable \textit{X} are defined as \(\sqrt{\frac{1}{N}\sum_{i=1}^N (X_i - \mu)^2}\) and \(\sqrt{\frac{1}{N-1}\sum_{i=1}^N (X_i - \mu)^2}\) respectively (with \(N\) being the number of samples and \(\mu\) the empirical mean).
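The weighting scheme and the five statistics above can be sketched in a few lines of Python. This is an illustrative sketch, not the published \textit{Kopis} code; the function names and the dictionary-based representation of per-requester counts are assumptions:

```python
def requester_weights(counts):
    """w_{t,k} = c_{t,k} / max_l c_{t,l}: normalise each requester's
    distinct-domain count by the largest count seen in the epoch."""
    m = max(counts.values())
    return {k: c / m for k, c in counts.items()}

def weighted_counts(counts_t, weights_prev):
    """WC_t(d): each count c_{t,k} in epoch E_t multiplied by the
    requester's weight w_{t-n,k} from n days earlier (0 if unseen then)."""
    return [c * weights_prev.get(k, 0.0) for k, c in counts_t.items()]

def rp_features(wc):
    """The five RP feature values over WC_t(d): mean, biased and unbiased
    standard deviation, biased and unbiased variance."""
    n = len(wc)
    mu = sum(wc) / n
    var_biased = sum((x - mu) ** 2 for x in wc) / n
    var_unbiased = sum((x - mu) ** 2 for x in wc) / (n - 1)
    return (mu, var_biased ** 0.5, var_unbiased ** 0.5, var_biased, var_unbiased)
```

For example, a requester that queried half as many domains as the most active requester in the earlier epoch receives weight 0.5, so its current count contributes only half its raw value to \(WC_t(d)\).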
\subsubsection{Resolved-IPs Reputation (IPR)}
@@ -86,7 +86,7 @@ The set of \textit{Resolved-IPs Reputation} features consists of nine individual
\subsection{Results}
\label{subsec:kopis_results}
\textit{Kopis} used DNS traffic captured at two major domain name registrars (AuthNS servers) between 01.01.2010 and 31.08.2010, as well as at a country code top level domain server (.ca) from 26.08.2010 to 18.10.2010. As the TLD server was operated in delegate-only mode, passive DNS traffic had to be collected additionally to obtain the resolutions for these queries. In total, this led to 321 million lookups per day on average. This amount of data proved to be a significant problem, and the overall traffic volume to be analysed had to be reduced. The most significant reduction was to remove all duplicate queries and only take unique requests into account. After this step, about 12.5 million unique daily requests remained on average. Using the \textit{KB}, which consists of various sources (see \fsCite{subsec:kopis_architecture}), a sample of 225,429 unique RRs (corresponding to 28,915 unique domain names) could be split into groups of 27,317 malicious and 1,598 benign domains. All raw data was indexed in a relational database and enriched with information like first and last seen timestamps. Like for any system that uses a machine learning approach, it was important for \textit{Kopis} to select significant features and a training period sufficient to deliver good results. Figure~\ref{fig:kopis_train_period_selection} shows the \glspl{roc} (ROC) of different models, generated with data from periods of one up to five days and validated using 10-fold cross validation. According to \fsAuthor[Section 5.3]{Antonakakis:2011:DMD:2028067.2028094}: ``When we increased the observation window beyond the mark of five days we did not see a significant improvement in the detection results.'' Using these models, the best classification algorithm had to be found. This was accomplished using a technique called model selection (see e.g. \fsCite{Kohavi:1995:SCB:1643031.1643047}). 
The most accurate classifier for these models proved to be the \textit{random forest} implementation, with a true positive rate of 98.4\% and a false positive rate of 0.3\% (with training data from a period of five days). A \textit{random forest} is an ensemble of decision trees, each trained on a different training set or a different subset of features. Unfortunately, the exact random forest classification implementation of \textit{Kopis} has not been published. Other classifiers that were experimented with are: Naive Bayes, k-nearest neighbors, Support Vector Machines, MLP Neural Network and random committee.
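The 10-fold cross-validation underlying both the training-period selection and the model selection can be sketched in plain Python. This is an illustrative sketch with hypothetical `train_fn`/`eval_fn` hooks, not the evaluation harness actually used by \textit{Kopis}:

```python
def k_fold_indices(n, k=10):
    """Split sample indices 0..n-1 into k roughly equal contiguous folds
    (a real harness would shuffle or stratify first)."""
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(train_fn, eval_fn, data, k=10):
    """k-fold CV: train on k-1 folds, score on the held-out fold,
    and return the mean score over all k rotations."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test_set = [data[j] for j in test_idx]
        train_set = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(train_set)
        scores.append(eval_fn(model, test_set))
    return sum(scores) / k
```

Model selection then reduces to running `cross_validate` once per candidate classifier (random forest, Naive Bayes, k-nearest neighbors, etc.) and keeping the candidate with the highest mean score.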
\begin{figure}[!htbp]
\centering