\section{Evaluation Schema}
\label{sec:evaluation_scheme}

For a comprehensive evaluation, all input and output data as well as the exact implementations (and/or the corresponding parameters used for the analysis) of the algorithms would be needed. Unfortunately, none of the publications we are dealing with here have released any (raw) input data, specifically the passive DNS logs and the filter lists for the training set. Neither have any of the algorithms' actual implementations been published. For this reason, the evaluation of the existing systems focuses on the results that have been published individually, most importantly the detection rate and the false positive rate. Another important aspect for this overview is which data has actually been used for training and classification and where that data has been obtained. Passive DNS logs may be collected at different stages of the DNS resolution and might, e.g. due to caching in a lower DNS hierarchy, lead to the extraction of different information. A resolver running on the user's machine observes much more traffic and can thus benefit from, e.g., time-based patterns that are not visible to higher-level DNS servers, because responses cached on resolvers lower in the DNS hierarchy never reach them.

\input{content/Evaluation_of_existing_Systems/Notos/Notos.tex}
\section{Results and Comparison}
\label{sec:results_and_comparison}

After investigating those three systems, we want to point out the major differences and similarities. The results discussed here are the basis for the implementation of our own algorithm. All three systems are based on machine learning techniques: two of the systems use a decision tree classifier, while \textit{Kopis} uses a random forest classifier, which is not significantly different from a decision tree but requires more effort to implement. One major difference between these systems is the data they work with. While \textit{Notos} and \textit{Exposure} operate on data collected at recursive DNS servers in lower DNS layers, \textit{Kopis} gathers traffic from a top-level domain name server and two AuthNS from major domain name registrars. As the data available for this work has also been gathered at RDNS servers in a lower DNS hierarchy and no data from higher DNS layers is available, most concepts of \textit{Kopis} cannot be used for the system proposed in this work. Nevertheless, there are general aspects of \textit{Kopis} that can be useful, e.g. which sources have been used to build the knowledge base for the classification of test samples in the training, or how the overall architecture has been designed. It also has to be noted, though, that \textit{Kopis} is the only system that is able to operate without having reputation information for domains and IPs available. Having data available that is collected similarly to \textit{Notos} and \textit{Exposure} does not mean that all concepts and features can be applied in the new system. A company network has very different characteristics than a network operated by e.g. an ISP. The network in which the logs for this work have been collected is hardened with much more effort, so that malware should generally be rare. \textit{Notos} in particular uses public traffic from an ISP RDNS server that handles clients of this ISP network, which, by design, cannot be taken care of like in a closed company network and is much more likely to contain a lot of different malware. One major difference between \textit{Notos} and \textit{Exposure} is the complexity of the overall system. \textit{Notos}, being the first dynamic domain reputation system, uses a much higher number of features. Some of these features, like the network-based features (see Table~\ref{tab:notos_network-based_features}), are much more fine-grained (e.g. independently operating on the top-level, second-level and third-level domains) compared to the similar group of features in \textit{Exposure} (see Table~\ref{tab:exposure_features}, \textit{DNS Answer-Based Features}). For this reason, \textit{Notos} also needs much more detailed reputation information, e.g. for the IP spaces. Although it lacks such fine-grained features, \textit{Exposure} shows detection rates similar to \textit{Notos}. Other general advantages of \textit{Exposure} over \textit{Notos} are the reduced training time (again, for example, due to fewer features) and that it does not need information about malware gathered in self-hosted honeypots (which, done right, is a completely different topic on its own and therefore not part of this work).
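
To make the classifier distinction concrete, the following minimal sketch contrasts the two model families on synthetic placeholder data using scikit-learn; the actual features, data and parameters of the published systems are not available, so everything here is illustrative only.

\begin{verbatim}
# Sketch: decision tree vs. random forest on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for labeled domain feature vectors.
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

for clf in (DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(n_estimators=100, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
    print(type(clf).__name__, round(scores.mean(), 3))
\end{verbatim}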

It also has to be noted that while all three systems show a high detection rate in general, with a high true positive and low false positive rate, they cannot be operated with a 100\% success rate and should always be deployed along with other detection systems like firewalls, malware detection software and/or traditional filter systems like DNS black- and whitelists. Dynamic reputation systems can, however, be used to find domains used in malicious activities before other systems are aware of the threat.

\subsection{General}
\label{subsec:exposure_general}

\textit{Exposure} is ``a system that employs large-scale, passive DNS analysis techniques to detect domains that are involved in malicious activity''\fsCite{Bilge11exposure:finding}, which was first introduced in 2011 by the \textit{Institute Eurecom} in Sophia Antipolis, the \textit{Northeastern University} in Boston and the \textit{University of California} in Santa Barbara. \textit{Exposure} is the second published system to detect malicious domains using passive DNS data and is built on the key premise that most malicious services are dependent on the Domain Name System and, compared to benign services, should expose enough differences in behaviour for an automated discovery; see Section~\ref{subsec:exposure_features} for the differences the features are targeted at. The main analysis for \textit{Exposure} has been run on data from a period of 2.5 months with more than 100 billion DNS queries. \textit{Exposure} is not targeted at a specific threat but rather covers a wide variety of malicious activities like phishing, Fast-Flux services, spamming, botnets (using domain generation algorithms) and similar others. It uses fifteen features, nine of which had not been proposed in previous research. Ultimately, \textit{Exposure} offers a real-time detection system which has been made available to the public in 2014 \fsCite{Bilge:2014:EPD:2617317.2584679}. Unfortunately, the service was not accessible at the time of this writing.

\subsection{Architecture}
\label{subsec:exposure_architecture}

For the distinction of benign and malicious domains to perform well, a large set of training data is used in \textit{Exposure} (seven days). The offline training has been powered by recursive DNS traffic (RDNS), gathered from the Security Information Exchange (SIE). Specifically, only the answers of the RDNS traffic have been used, comprising the queried domain name, the timestamp of the request, the caching time (TTL) and the list of resolved IP addresses. The overall system consists of five main components. How those modules interact with each other and which input data is required for each module can be seen in Figure~\ref{fig:exposure_system_overview}; a minimal sketch of the data passed between them follows the component list below.

\begin{itemize}
\item The \textit{Data Collector} module passively captures the DNS traffic in the monitored network.
\item The \textit{Feature Attribution} component annotates each captured domain with a vector containing the associated features.
\item The third component, the \textit{Malicious and Benign Domains Collector}, runs in parallel to the first two modules and constantly gathers information about known good and known bad domains. These lists are used to label the output of the \textit{Feature Attribution} module afterwards, as can be seen in Figure~\ref{fig:exposure_system_overview}. The list of benign domains is extracted from the Alexa top list \fsCite{AlexaWebInformationOnline} and externally confirmed \gls{whois} data. The list of known malicious domains is collected from several external, both professionally provisioned and user-maintained, sources and includes domains in different threat classes, e.g., malwaredomains.com \fsCite{malwaredomainsInformationOnline}, Phishtank \fsCite{PhishtankInformationOnline}, Anubis (no longer available), the Zeus Block List \fsCite{zeusblocklistInformationOnline} and domains from DGAs for Conficker \fsCite{porras2009foray} and Mebroot \fsCite{Stone-Gross:2009:YBM:1653662.1653738}.
\item The labeled dataset is then fed into the \textit{Learning Module} and trains the domain detection model that is used in the final step. This classifier may also be retrained on a regular basis to keep up with malicious behavior (daily in \textit{Exposure}).
\item The \textit{Classifier} uses the decision model to classify unlabeled (new) domains into benign and malicious groups. For this, the same feature vector that is produced by the \textit{Feature Attribution} module is used.
\end{itemize}
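
As a rough illustration of the data flowing between these components, the following sketch models one RDNS answer record and the labeling step; all names are hypothetical, since the original implementation is unpublished.

\begin{verbatim}
# Hypothetical sketch of the record and labeling step described above.
from dataclasses import dataclass

@dataclass
class RdnsAnswer:
    domain: str        # queried domain name
    timestamp: int     # time of the request (epoch seconds)
    ttl: int           # caching time recommended by the operator
    ips: list[str]     # resolved IP addresses

def label(domain: str, benign: set[str], malicious: set[str]) -> str | None:
    # Labels come from the Malicious and Benign Domains Collector;
    # unlisted domains stay unlabeled and are classified later.
    if domain in benign:
        return "benign"
    if domain in malicious:
        return "malicious"
    return None
\end{verbatim}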

\textit{Exposure} uses a total of fifteen features that have been chosen after several months of study with thousands of well-known benign and malicious domains. These features are grouped into four different categories, which can be seen in Table~\ref{tab:exposure_features}.

The first group, the \textit{Time-Based Features}, had not been approached in publications before. These features investigate the times at which requests for domain \textit{d} have been issued. The main idea behind this group of features is to find malicious services that use techniques like \textit{domain flux} (see Section~\ref{subsec:fast-flux_service_networks}) to circumvent take-downs and make their infrastructure more agile. \fsAuthor{Bilge:2014:EPD:2617317.2584679} infer that ``[\textit{Domain flux}] often show a sudden increase followed by a sudden decrease in the number of requests''. Domains of malicious services using a DGA only exist for a short period of time by design. \fsAuthor{Bilge:2014:EPD:2617317.2584679} define the first feature as follows: ``A domain is defined to be a short-lived domain [...] if it is queried only between time \(t_0\) and \(t_1\), and if this duration is comparably short (e.g., less than several days).'' The next three features are subject to the change point detection (CPD) problem: change point detection is about the identification of (abrupt) changes in the distribution of values, for example in time series. \textit{Exposure} implemented a CPD algorithm based on the popular CUSUM (cumulative sum) algorithm. At first, the time series of request timestamps is split into periods of 3600 seconds (one hour was tested to work well). After that, all time intervals are iterated and for each interval, the average request count of the previous eight intervals \(P_t^-\) and the following eight intervals \(P_t^+\) is calculated. In the next step, the distance of these two values is calculated for each interval, \(d(t)=|P_t^- - P_t^+|\), and the resulting ordered sequence \(d(t)\) of distances is fed to the CUSUM algorithm to finally retrieve all change points (for more information on the implemented CPD algorithm, see \fsCite[Section 3.1]{Bilge:2014:EPD:2617317.2584679}). To calculate the \textit{Daily similarity} features, the Euclidean distance of the time series of each day for \textit{d} is calculated. Intuitively, a low distance denotes similar time series and thus high daily similarity, whereas two days with a higher distance show a less similar request volume. All the features of this group naturally only perform well when a larger number of requests to \textit{d} over a significant period of time is available.
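
The interval statistics described above translate into a short computation; the following sketch implements the hourly binning and the distance sequence \(d(t)\), while the concluding CUSUM step is only referenced, since its exact parameters are not public.

\begin{verbatim}
# Sketch of the distance sequence feeding Exposure's CPD step.
def interval_counts(timestamps, interval=3600):
    # Split the request time series into fixed one-hour periods.
    start = min(timestamps)
    counts = [0] * ((max(timestamps) - start) // interval + 1)
    for t in timestamps:
        counts[(t - start) // interval] += 1
    return counts

def cpd_distances(counts, w=8):
    # d(t) = |P_t^- - P_t^+|: average request count of the previous
    # w intervals vs. the following w intervals.
    distances = []
    for t in range(w, len(counts) - w):
        p_minus = sum(counts[t - w:t]) / w
        p_plus = sum(counts[t:t + w]) / w
        distances.append(abs(p_minus - p_plus))
    return distances  # handed to a CUSUM algorithm for change points
\end{verbatim}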

The next group of features (\textit{DNS Answer-Based Features}) investigates resolutions of the requested domain \textit{d}. While one domain can map to multiple IP addresses for benign services, most harmless services show a much smaller network profile in terms of e.g. the location or the distribution of \glspl{as}/\glspl{bgp}. To benefit from those findings, four features have been extracted: the number of distinct IP addresses, the number of different countries these IP addresses are assigned to, the number of other domains that share an IP address \textit{d} resolves to, and the number of results of the reverse DNS query for all IPs of \textit{d}. It is worth noting that some hosting providers also use one IP address for many domains, but in conjunction with other features those false positives can be reduced.
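
A rough sketch of these four counts could look as follows, assuming hypothetical lookup helpers against a pDNS and geolocation database (none of which are part of the published system).

\begin{verbatim}
# ips_of(d): distinct resolved IPs of d; country_of(ip), domains_on(ip)
# and reverse_dns(ip) are hypothetical database lookups.
def answer_based_features(d, ips_of, country_of, domains_on, reverse_dns):
    ips = set(ips_of(d))
    countries = {country_of(ip) for ip in ips}
    shared_domains = {dom for ip in ips for dom in domains_on(ip)} - {d}
    ptr_names = [name for ip in ips for name in reverse_dns(ip)]
    return len(ips), len(countries), len(shared_domains), len(ptr_names)
\end{verbatim}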

The \textit{TTL Value-Based Features} cover five individual features. Each answer to a DNS request contains the TTL attribute, which is the recommendation (configured by the operator of \textit{d}) of how long the resolution will be valid and should therefore be cached. Whereas RFC 1033 recommends a TTL of one day (86400 seconds) \fsCite{RFC1033}, it is getting more common, especially for content delivery networks, to use much lower values (e.g. Cloudflare, one of the biggest managed DNS providers, uses a default of 5 minutes). Botnets also usually apply low TTL values to avoid long outages of C\&C servers and bots. As \fsAuthor{Bilge:2014:EPD:2617317.2584679} state, botnets also change their TTL values more frequently and use values in different ranges depending on their availability. While higher values are applied to high-bandwidth servers with low downtimes, home computers behind a digital subscriber line are much more likely to go offline and are therefore assigned lower TTL values. For this reason, all TTL values for a domain are checked against the following ranges (in seconds): [0, 1], [1, 10], [10, 100], [100, 300], [300, 900], [900, inf].
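
Counting a domain's observed TTL values per range can be sketched as follows; note that treating the ranges as half-open intervals is an assumption, as the boundaries in the listing above overlap.

\begin{verbatim}
# Sketch: count a domain's observed TTL values per range (seconds).
RANGES = [(0, 1), (1, 10), (10, 100), (100, 300), (300, 900),
          (900, float("inf"))]

def ttl_range_counts(ttls):
    counts = [0] * len(RANGES)
    for ttl in ttls:
        for i, (low, high) in enumerate(RANGES):
            if low <= ttl < high:  # assumed half-open interval
                counts[i] += 1
                break
    return counts
\end{verbatim}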

The last group of features are the \textit{Domain Name-Based Features}. Domain names of benign services mostly use easy-to-remember names which consist of valid words. Attackers are often not interested in human-readable domain names; this is especially true for domains generated by a DGA. \textit{Exposure} extracts two statistical features out of the domain name: first, the percentage of numerical characters, and second, the length of the longest (English) meaningful string (LMS).
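
Both name features are simple string statistics; a minimal sketch follows, where the small placeholder word list stands in for a real English dictionary.

\begin{verbatim}
# Placeholder dictionary; a real English word list would be used.
WORDS = {"mail", "bank", "shop", "secure", "login"}

def numeric_ratio(name):
    # Percentage of numerical characters in the domain name.
    return sum(c.isdigit() for c in name) / len(name)

def longest_meaningful_string(name):
    # Length of the longest substring that is a known English word.
    best = 0
    for i in range(len(name)):
        for j in range(i + 1, len(name) + 1):
            if name[i:j] in WORDS:
                best = max(best, j - i)
    return best
\end{verbatim}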
\label{fig:kopis_system_overview}
\end{figure}

The overall system architecture can be seen in Figure~\ref{fig:kopis_system_overview}. The first step in the reputation system is to gather all (streamed) DNS queries and responses and divide this traffic into fixed epochs (e.g. one day in \textit{Kopis}). After collecting the traffic of each epoch \(E_i\), different statistics about a domain \textit{d} are extracted by the \textit{Feature Computation} function into a feature vector \(v_d^i\). A detailed table of which features are used is given in Section~\ref{subsec:kopis_features}. \textit{Kopis} tries to separate benign from malicious domains by characteristics like the volume of DNS requests to domain \textit{d}, the diversity of IP addresses of the querying machines and the historic information relating to the IP space \textit{d} is pointing to. Like the first two investigated systems, \textit{Kopis} operates in two different modes. In training mode, the reputation model is built in an offline fashion (\textit{Learning Module}); it is later used in the operational mode (\textit{Statistical Classifier}) to assign \textit{d} a reputation score in a streamed fashion. The \textit{Learning Module} takes the feature vectors of a period of \textit{m} days that are generated by the \textit{Feature Computation} function as input and uses the \textit{Knowledge Base (KB)} to label each sample in that training set as being a malicious or legitimate domain (training set: \(V_{train} = \{v_d^i\}_{i=1..m}, \forall d \in \textit{KB}\)). The \textit{KB} consists of various public and undisclosed sources: \\

\textbf{Malicious domain sources: }
\begin{itemize}
\item Information about malware from a commercial feed with a volume between 400 MB and 2GB a day.
\item Malware, captured from two corporate networks.
\item Public blacklists, e.g., malwaredomains.com \fsCite{malwaredomainsInformationOnline} and the Zeus Block List \fsCite{zeusblocklistInformationOnline}. \\
\end{itemize}
\textbf{Benign domain sources: }
\begin{itemize}
\item Domain and IP whitelists from DNSWL \fsCite{DNSWLOnline}.
\item Address space of the top 30 Alexa domains \fsCite{AlexaWebInformationOnline}.
\item Dihe's IP-Index Browser \fsCite{DIHEOnline}.
\end{itemize}

The operational mode first captures all DNS traffic streams. At the end of each epoch \(E_j\), the feature vector \(v_{d'}^j\) for all unknown domains \(d' \notin \textit{KB}\) is extracted and the \textit{Statistical Classifier} assigns a label (either malicious or legitimate) \(l_{d', j}\) and a confidence score \(c(l_{d', j})\). While the label classifies whether the domain \textit{d'} is expected to be malicious or legitimate, the confidence score expresses the probability of this label. For the final reputation score, \textit{Kopis} first computes a series of label/confidence tuples for \textit{m} epochs starting at epoch \(E_t\): \(S(v_{d'}^j) = \{l_{d', j}, c(l_{d', j})\}, j = t, \ldots, (t + m)\). By averaging the confidence scores of the malicious labels (\textit{M}), the reputation score can be expressed as \(\overline{C}_M = avg_j\{c(l_{d', j})\}\).
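
The final score is thus a plain average over the malicious-labeled epochs, as the following minimal sketch illustrates (data layout assumed).

\begin{verbatim}
# series: one (label, confidence) tuple per epoch E_j for domain d'.
def reputation_score(series):
    malicious = [c for label, c in series if label == "malicious"]
    return sum(malicious) / len(malicious) if malicious else 0.0

# Example: labeled as malicious in two of three epochs.
print(reputation_score([("malicious", 0.9), ("legitimate", 0.7),
                        ("malicious", 0.8)]))  # -> 0.85
\end{verbatim}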
\subsection{Features}
\label{subsec:kopis_features}

Much like the previously investigated systems, \textit{Kopis} extracts different features that are grouped into three sets. Two of those groups, the \textit{Requester Diversity} and the \textit{Requester Profile} features, had not been proposed in research before and, due to the system architecture, differ from those used in \textit{Notos} and \textit{Exposure}. In contrast to \textit{Notos} and \textit{Exposure}, which use traffic monitored at recursive DNS servers in lower DNS layers, \textit{Kopis} operates with data from two large AuthNS as well as a country-level TLD server (.ca space) in the upper DNS layers (see Figure~\ref{fig:kopis_data_sources}). Operating at this level in the DNS hierarchy leads to different challenges as well. A top-level domain server rarely answers a request itself but most of the time only delegates the request to a more specific server, e.g. a server responsible for the zone of a second-level domain in a company. For this reason, to get the actual resolved record (IP), the delegated name server can be queried directly or a passive DNS database (e.g. from the Security Information Exchange \fsCite{SIEOnline}) can be engaged.

The first step of extracting features out of the captured traffic for each DNS query \(q_j\) (to resolve a domain \textit{d}) is to find the epoch \(T_j\) in which the request has been made, the IP address of the machine \(R_j\) that ran the query and the resolved records \(IPs_j\). Using these raw values, \textit{Kopis} extracts the following specific features:
\subsubsection{Requester Diversity (RD)}
\label{subsubsec:kopis_requester_diversity}

This group of features tries to map the requester diversity, i.e. where the requests originate, into values that can be used in the \textit{Feature Computation} function. In general, this aims to find out whether the machines related to a domain \textit{d} are globally distributed or acting in a bounded location. To map an IP address to its corresponding ASN, country and BGP prefix, the Team Cymru IP TO ASN MAPPING database has been leveraged \fsCite{CymruOnline}. This set of features is motivated by the premise that the machines involved with a domain used for malicious purposes usually have a different distribution than those involved in legitimate usage. While benign services will show a consistent pattern of IP addresses that look up \textit{d}, malicious domains are queried by many machines from different locations around the world, e.g. bots in a botnet or spambots involved in a spam campaign; recall that botnets are usually not targeted at specific geographical regions. Figure~\ref{fig:kopis_requester_distribution} shows the distribution of the ASNs as well as the country codes, calculated by the cumulative distribution function (CDF). In both cases, benign domains have an either low or very high diversity (a bimodal distribution). In contrast, malicious domains show a larger spectrum of diversities, mainly depending on how successfully the malware is spreading. There are mainly three values involved here. For all requester IP addresses \(\{R_j\}_{j=1..m}\), the BGP prefixes, the autonomous system numbers and the country codes (CC) are resolved. After this, the distribution of the occurrence frequencies of these three sets is computed and for each distribution, the mean, standard deviation and variance are calculated (a total of nine features). Another four features are extracted simply using the total number of distinct requester IP addresses, the number of BGP prefixes of these IPs, the total number of different ASNs and the total number of distinct countries these IPs reside in.
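
The nine distribution statistics can be sketched as follows; the lookup functions are hypothetical stand-ins for the Team Cymru mapping.

\begin{verbatim}
# bgp_of/asn_of/cc_of map an IP to its BGP prefix, ASN and country code.
from collections import Counter
from statistics import mean, pstdev, pvariance

def diversity_stats(requester_ips, bgp_of, asn_of, cc_of):
    features = []
    for lookup in (bgp_of, asn_of, cc_of):
        freq = Counter(lookup(ip) for ip in requester_ips)
        values = list(freq.values())  # occurrence frequency distribution
        features += [mean(values), pstdev(values), pvariance(values)]
    return features  # nine of the thirteen RD features
\end{verbatim}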
\begin{figure}[!htbp]
\centering
\subsubsection{Requester Profile (RP)}
\label{subsubsec:kopis_requester_profile}

The \textit{Requester Profile} features aim to separate requests coming from hardened networks (like enterprise networks) from those of less secure networks, e.g. ISP networks. Most smaller networks like enterprise or university networks are much better protected against malware in general and thus should show fewer requests to malicious domains. On the other hand, ISPs usually do not invest much effort into cleaning their network from malware and do not offer a high level of protection against malware propagation inside the network. As \textit{Kopis} operates in the upper DNS layers, it is often not possible to simply measure the population behind the requesting RDNS server (due to e.g. caching \fsCite{10.1007/978-3-540-24668-8_15}) and a different metric has to be found to measure the size of the network a request has been submitted from. Assume traffic monitored at a large AuthNS in epoch \(E_t\) that has authority for a set of domains \(D\), and let \(R\) be the set of all unique requesting IP addresses. For each requester IP \(R_k \in R\), the number of different domains queried by \(R_k\) in \(E_t\) is counted as \(c_{t,k}\). A weight can then be assigned to each requester \(R_k\) as \(w_{t,k} = \frac{c_{t,k}}{\max_{l=1}^{|R|}c_{t,l}}\). Consequently, the more domains in \(D\) a requester \(R_k\) is querying, the higher the weight will be. This way, high weights correspond to larger networks, which, following the explanation above, are more likely to contain requesters infected with malicious software. Given a domain \textit{d}, let \(R_d\) be the set of all its requester IP addresses; the count \(c_{t,k}\) is computed for each epoch \(E_t\) as previously described. In the following, the count for each epoch is multiplied with each weight \(w_{t-n,k}\) (where \(w_{t-n,k}\) are the weights of \textit{n} days before epoch \(E_t\)) to get the set of weighted counts of \textit{d} during \(E_t\): \(WC_t(d) = \{c_{t,k} \cdot w_{t-n,k}\}_k\). Finally, five different feature values are calculated from the values of \(WC_t(d)\): the average, the biased and unbiased standard deviation, and the biased and unbiased variance. The biased and unbiased estimators for the standard deviation of a random variable \textit{X} are defined as \(\sqrt{\frac{1}{N}\sum_{i=1}^N (X_i - \mu)^2}\) and \(\sqrt{\frac{1}{N-1}\sum_{i=1}^N (X_i - \mu)^2}\), respectively (with \(N\) being the number of samples and \(\mu\) the empirical mean).
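
The weighting scheme reduces to a few lines; the following sketch assumes per-epoch dictionaries of the counts \(c_{t,k}\).

\begin{verbatim}
# counts_t[k] is c_{t,k}: distinct domains queried by requester k
# in epoch t.
def weights(counts_t):
    top = max(counts_t.values())
    return {k: c / top for k, c in counts_t.items()}  # w_{t,k}

def weighted_counts(counts_t, weights_t_minus_n, requesters_of_d):
    # WC_t(d) = { c_{t,k} * w_{t-n,k} } over d's requesters.
    return [counts_t[k] * weights_t_minus_n.get(k, 0.0)
            for k in requesters_of_d if k in counts_t]
\end{verbatim}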
\subsubsection{Resolved-IPs Reputation (IPR)}
\begin{itemize}
\item \textit{Malware Evidence: } contains three individual features: the number of IP addresses in the last month (with respect to \(E_t\)) that have been pointed to by any malicious domain and, likewise, the number of BGP prefixes and AS numbers that malicious domains have resolved to.
\item \textit{SBL Evidence: } using the domains in the Spamhaus Block List \fsCite{SBLOnline}, the average number of IP addresses, BGP prefixes and ASNs that have been pointed to by these domains is calculated.
\item \textit{Whitelist Evidence: } the list of domains \(WL\) that are suspected to be legitimate is constructed using the DNS whitelist of DNSWL \fsCite{DNSWLOnline} and the top 30 popular domains from Alexa \fsCite{AlexaWebInformationOnline}. Then, the set of known good IPs \(WL_{IPs}\) is resolved from all domains in the whitelist \(WL\). Let \(IPs(d,t)\) be all addresses that \textit{d} points to; similarly to the first two groups, the number of matching IP addresses as well as the number of ASNs and BGP prefixes that include IP addresses of \(WL_{IPs}\) are calculated (see the sketch after this list).
\end{itemize}
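
As an illustration of the evidence counting, the following sketch computes the three \textit{Whitelist Evidence} values; the prefix and AS lookups are the same hypothetical helpers as before.

\begin{verbatim}
# ips_of_d: IPs(d,t); wl_ips: WL_IPs; bgp_of/asn_of map IP -> prefix/ASN.
def whitelist_evidence(ips_of_d, wl_ips, bgp_of, asn_of):
    matching_ips = ips_of_d & wl_ips
    wl_bgps = {bgp_of(ip) for ip in wl_ips}
    wl_asns = {asn_of(ip) for ip in wl_ips}
    bgp_hits = {bgp_of(ip) for ip in ips_of_d if bgp_of(ip) in wl_bgps}
    asn_hits = {asn_of(ip) for ip in ips_of_d if asn_of(ip) in wl_asns}
    return len(matching_ips), len(bgp_hits), len(asn_hits)
\end{verbatim}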
\subsection{Results}
\label{subsec:kopis_results}

\textit{Kopis} used DNS traffic captured at two major domain name registrars (AuthNS servers) between 01.01.2010 and 31.08.2010, as well as at a country code top-level domain server (.ca) from 26.08.2010 up to 18.10.2010. As the TLD server was operated in delegate-only mode, passive DNS traffic had to be collected additionally to get the resolutions for these queries. In total, this led to 321 million lookups a day on average. This amount of data proved to be a significant problem and the overall traffic size to be analysed had to be reduced. The most significant reduction was to remove all duplicate queries and only take unique requests into account. Finally, about 12.5 million daily unique requests remained on average. Using the \textit{KB}, which consists of various sources (see Section~\ref{subsec:kopis_architecture}), a sample with 225,429 unique RRs (corresponding to 28,915 unique domain names) could be split into groups of 27,317 malicious and 1,598 benign domains. All raw data was indexed in a relational database and was enriched with information like first and last seen timestamps. Like for any system that uses a machine learning approach, it was important for \textit{Kopis} to select significant features and a period that was sufficient for the training to deliver good results. Figure~\ref{fig:kopis_train_period_selection} shows the \glspl{roc} (ROC) of different models, generated with data from periods of one up to five days and validated using 10-fold cross validation. According to \fsAuthor{Antonakakis:2011:DMD:2028067.2028094}: ``When we increased the observation window beyond the mark of five days we did not see a significant improvement in the detection results.'' Using these models, the best classification algorithm had to be found. This has been accomplished using a technique called model selection (see e.g. \fsCite{Kohavi:1995:SCB:1643031.1643047}). The most accurate classifier for these models has been shown to be a \textit{random forest} implementation with a true positive rate of 98.4\% and a false positive rate of 0.3\% (with training data from a period of five days). A \textit{random forest} is a combination of different decision trees, either trained on different training sets or using different sets of features. Unfortunately, the exact random forest classification implementation of \textit{Kopis} has not been published. Other classifiers that have been experimented with are: Naive Bayes, k-nearest neighbors, Support Vector Machines, MLP Neural Network and random committee.
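
Model selection over a pool of candidate classifiers with 10-fold cross validation can be sketched as below; the candidate list covers only a subset of the classifiers named above, and all parameters are illustrative.

\begin{verbatim}
# Pick the classifier with the best mean 10-fold CV score.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def select_model(X, y):
    candidates = [RandomForestClassifier(n_estimators=100),
                  GaussianNB(), KNeighborsClassifier(), SVC()]
    return max(candidates,
               key=lambda clf: cross_val_score(clf, X, y, cv=10).mean())
\end{verbatim}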
\begin{figure}[!htbp]
\centering
\textbf{Top-level domain:} TLD, where \(TLD(d)\) is the top-level domain of \textit{d}. \\
\textbf{Second-level domain:} \(2LD(d)\) being the second-level domain of domain \textit{d}. \\
\textbf{Third-level domain: } \(3LD(d)\) containing the three rightmost substrings separated by period for \textit{d}.
\item Given domain \(d\), \(Zone(d)\) describes the set of domains that include \textit{d} and all subdomains of \textit{d}.
\item \(D = \{d_1, d_2, ..., d_m\}\) representing a set of domains and \(A(D)\) all IP addresses that, at any time, any domain \(d \in D\) resolved to.
\item \(BGP(a)\) consists of all IP addresses that are residing in the same \gls{bgp} prefix as \textit{a}.
\item Analogously, \(AS(a)\) as the set of IP addresses located in the same \gls{as} as \textit{a}.
\label{subsec:notos_architecture}
The main goal of \textit{Notos} is to assign a dynamic reputation score to domain names. Domains that are likely to be involved in malicious activities are tagged with a low reputation score, whereas legitimate Internet services are assigned a high reputation score.
\textit{Notos'} primary source of information for the training and classification is a database that contains historical data about domains and resolved IP addresses. This database is built using DNS traffic from two recursive ISP DNS servers (RDNS) and pDNS logs collected by the Security Information Exchange (SIE), which covers authoritative name servers in North America and Europe. For building a list of known malicious domain names, several honeypots and spam traps have been deployed. A large list of known benign domains has been gathered from the top sites list on \textit{alexa.com}, which ranks the most popular websites in several regions \fsCite{AlexaWebInformationOnline}. These two lists are referred to as the \textit{knowledge base} and are used in particular to train the reputation model.
To assign a reputation score to a domain \textit{d}, the most current set of IP addresses \(A_{c}(d) = \left\{a_{i}\right\}_{i=1..m}\) to which \textit{d} points is first fetched. Afterwards, the pDNS database is queried for further information about this domain \textit{d}. The \textit{Related Historic IPs (RHIPs)} form the set of all IP addresses that ever pointed to this domain. In case domain \textit{d} is a third-level domain, all IP addresses that pointed to the corresponding second-level domain are also included; see Section~\ref{subsec:domain_names} for more information on the structure of domain names. If \textit{d} is a second-level domain, then all IPs that are pointed to by any of its third-level subdomains are also added to the RHIPs. The reason why second- and third-level domains are combined here is that, according to \fsAuthor{Antonakakis:2010:BDR:1929820.1929844}, most third-level domains are related to their corresponding second-level domain and are therefore treated similarly. In the next step, the set of \textit{Related Historic Domains (RHDNs)} is queried, which covers all domains that are related to the currently processed domain \textit{d}; specifically, all domains which ever resolved to an IP address residing in any of the ASNs of those IPs that \textit{d} currently resolves to.

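The construction of both sets can be summarized in a short Python sketch. The passive DNS store and the \texttt{asn\_of} lookup are hypothetical stand-ins, since no reference implementation of \textit{Notos} has been published; the naive second-level split also ignores public-suffix rules (e.g. co.uk).

\begin{verbatim}
from collections import defaultdict

class PassiveDNS:
    """Toy pDNS store built from historic (domain, ip) A-records."""
    def __init__(self, records):
        self.ips_of = defaultdict(set)      # domain -> historic IPs
        self.domains_of = defaultdict(set)  # ip -> historic domains
        for domain, ip in records:
            self.ips_of[domain].add(ip)
            self.domains_of[ip].add(domain)

def second_level(domain):
    """'a.example.com' -> 'example.com' (naive, no public-suffix list)."""
    return ".".join(domain.split(".")[-2:])

def rhips(pdns, d):
    """RHIPs: all IPs that ever pointed to d; a third-level domain also
    inherits the IPs of its second-level domain, and a second-level
    domain inherits the IPs of all of its third-level subdomains."""
    ips = set(pdns.ips_of[d])
    sld = second_level(d)
    if d != sld:
        ips |= pdns.ips_of[sld]
    else:
        for dom, dom_ips in pdns.ips_of.items():
            if dom.endswith("." + sld):
                ips |= dom_ips
    return ips

def rhdns(pdns, d, asn_of):
    """RHDNs: all domains that ever resolved into any ASN of the IPs
    of d (here, d's stored IPs stand in for its current set A_c(d))."""
    asns = {asn_of(ip) for ip in pdns.ips_of[d]}
    return {dom for ip, doms in pdns.domains_of.items()
            if asn_of(ip) in asns for dom in doms}
\end{verbatim}
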
There are three types of features extracted from the database for \textit{Notos} that are used for training the reputation model (quotation from \fsCite[Section 3.1]{Antonakakis:2010:BDR:1929820.1929844}):
\begin{quote}
\begin{enumerate}
\item \textbf{Network-based features:} The first group of statistical features is extracted from the set of RHIPs. We measure quantities such as the total number of IPs historically associated with \textit{d}, the diversity of their geographical location, the number of distinct autonomous systems (ASs) in which they reside, etc.
\item \textbf{Zone-based features:} The second group of features we extract are those from the RHDNs set. We measure the average length of domain names in RHDNs, the number of distinct TLDs, the occurrence frequency of different characters, etc.
\item \textbf{Evidence-based features:} The last set of features includes the measurement of quantities such as the number of distinct malware samples that contacted the domain \textit{d}, the number of malware samples that connected to any of the IPs pointed by \textit{d}, etc.
\end{enumerate}
\end{quote}
Figure~\ref{fig:notos_system_overview} shows the overall system architecture of \textit{Notos}. After all features are extracted from the passive DNS database and prepared for further processing, the reputation engine is initialized. \textit{Notos'} reputation engine operates in two modes. In offline mode, the reputation model is constructed for a set of domains using the feature set of each domain and the classification, which can be calculated using the \textit{knowledge base} with black- and whitelist (also referred to as training). This model can later be used in the online mode to dynamically assign a reputation score. In online mode, the same features that are used for the initial training are extracted for a new domain (resource record or RR, see Section~\nameref{subsubsec:dns_resource_records}) and \textit{Notos} uses the trained reputation engine to calculate a dynamic reputation rating (see Figure~\ref{fig:notos_online_offline_mode}). The data for labeling domains and IPs originates from various sources: the blacklist primarily consists of filter lists from malware services like malwaredomainlist.com and malwaredomains.com. Additional blacklists for IP and domain labeling are the Spamhaus Block List (\fsCite{SBLOnline}) and the ZeuS blocklist from ZeuS Tracker (\fsCite{zeusblocklistInformationOnline}). The knowledge base has been downloaded before the main analysis period (fifteen days starting on the first of August 2009) and, as filter lists usually lag behind state-of-the-art malware, the blacklists have continuously been updated. The whitelist was built using the top 500 popular Alexa websites. Additionally, the 18 most common second-level domains from various content delivery networks (for classifying the CDN clusters) and a list of 464 dynamic DNS second-level domains (for identifying domains and IPs in dynamic DNS zones) have been gathered.

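The labeling against the \textit{knowledge base} can be pictured as in the following sketch; the loader function and the source identifiers are hypothetical placeholders for the sources named above.

\begin{verbatim}
def build_knowledge_base(load_list):
    """Union the blacklist sources and fetch the Alexa whitelist;
    load_list(source) is an assumed fetcher returning a set of
    domains and/or IPs."""
    blacklist = set()
    for src in ("malwaredomainlist.com", "malwaredomains.com",
                "spamhaus_sbl", "zeus_tracker"):
        blacklist |= load_list(src)
    whitelist = load_list("alexa_top500")
    return blacklist, whitelist

def label(domain, ips, blacklist, whitelist):
    """Training label: +1 benign, -1 malicious, None if unknown."""
    if domain in whitelist:
        return +1
    if domain in blacklist or any(ip in blacklist for ip in ips):
        return -1
    return None
\end{verbatim}
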
% [figure and the introductory lines of the statistical features subsection elided in this excerpt]

\subsubsection{Network-based features}
\label{subsubsec:notos_network-based_features}
The first group of features handles network-related keys. This group mostly describes how the operators owning \textit{d} allocate network resources to achieve different goals. While most legitimate and professionally operated Internet services have a rather stable network profile, malicious usage usually involves short-lived domain names and IP addresses with high agility to circumvent blacklisting and other simple types of resource blocking. Botnets usually contain machines in many different networks (\glspl{as} and \glspl{bgp} prefixes) operated by different organizations in different countries. Legitimate companies mostly acquire bigger IP blocks and thus use consecutive IPs for their services in the same address space. This homogeneity also applies to other registration-related information like registrars and registration dates. To measure this level of agility and homogeneity, eighteen statistical network-based features are extracted from the RHIPs (see Table~\ref{tab:notos_network-based_features}).

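A few representative features of this group could be computed as follows; the complete set of eighteen features is given in Table~\ref{tab:notos_network-based_features}, and the lookup helpers are again assumed stand-ins.

\begin{verbatim}
from statistics import pstdev

def network_features(rhip_set, asn_of, bgp_prefix_of, country_of):
    """Representative agility/diversity measures over the RHIPs."""
    ips = list(rhip_set)
    asns = [asn_of(ip) for ip in ips]
    return {
        "ip_count":     len(ips),                        # historic IPs
        "distinct_as":  len(set(asns)),                  # AS diversity
        "distinct_bgp": len({bgp_prefix_of(ip) for ip in ips}),
        "distinct_cc":  len({country_of(ip) for ip in ips}),
        # spread of IPs across ASes as a simple homogeneity measure
        "as_spread_std": pstdev([asns.count(a) for a in set(asns)])
                         if asns else 0.0,
    }
\end{verbatim}
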
% [Table~\ref{tab:notos_network-based_features} elided in this excerpt]

\subsubsection{Zone-based features}
\label{subsubsec:notos_zone-based_features}
The second group covers zone-based features and is extracted from the RHDNs. In contrast to the network-based features, which compare characteristics of the historic IPs, the zone-based features capture characteristics of all historically involved domains. While legitimate services often involve many domains, they usually share similarities. For example, google.com, googlemail.com, googleplus.com, etc. are all services provided by Google and contain the string 'google' in their domains. In contrast, randomly generated domains used in spam campaigns rarely share similarities. By calculating the mean, median and standard deviation of some key, the overall shape of its distribution is summarized \fsCite{Antonakakis:2010:BDR:1929820.1929844}. To capture this level of diversity, seventeen features are extracted, which can be found in Table~\ref{tab:notos_zone-based_features}.

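Representative string statistics over the RHDNs could look like the following sketch; the seventeen actual features are listed in Table~\ref{tab:notos_zone-based_features}.

\begin{verbatim}
from statistics import mean, median, pstdev

def zone_features(rhdn_set):
    """Representative diversity measures over the RHDNs."""
    lengths = [len(d) for d in rhdn_set] or [0]
    tlds = {d.rsplit(".", 1)[-1] for d in rhdn_set}
    # character-occurrence statistics, e.g. share of digits per name
    digit_ratio = [sum(c.isdigit() for c in d) / len(d)
                   for d in rhdn_set if d]
    return {
        "len_mean":         mean(lengths),
        "len_median":       median(lengths),
        "len_std":          pstdev(lengths),
        "distinct_tlds":    len(tlds),
        "digit_ratio_mean": mean(digit_ratio) if digit_ratio else 0.0,
    }
\end{verbatim}
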
% [Table~\ref{tab:notos_zone-based_features} elided in this excerpt]

\subsubsection{Evidence-based features}
\label{subsubsec:notos_evidence-based_features}
For the evidence-based features, public information and data from honeypots and spam-traps were collected. This \textit{knowledge base} primarily helps to discover whether a domain \textit{d} is in some way interacting with known malicious IPs and domains. As domain names are much cheaper to obtain than IP addresses, malware authors tend to reuse IPs with updated domain names. Consequently, the blacklist features indicate the reuse of known malicious resources like IP addresses, \gls{bgp} prefixes and \glspl{as}.

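Typical evidence-based measurements could be sketched as follows; \texttt{malware\_contacts} is an assumed mapping from a resource (domain or IP) to the set of distinct malware samples observed contacting it.

\begin{verbatim}
def evidence_features(d, rhip_set, malware_contacts, blacklist,
                      asn_of, bgp_prefix_of):
    """Reuse of known-bad resources by d and its historic IPs."""
    ip_samples = set().union(
        *(malware_contacts.get(ip, set()) for ip in rhip_set))
    return {
        "malware_on_domain": len(malware_contacts.get(d, set())),
        "malware_on_ips":    len(ip_samples),
        "blacklisted_ips":   sum(ip in blacklist for ip in rhip_set),
        "blacklisted_asns":  sum(asn_of(ip) in blacklist
                                 for ip in rhip_set),
        "blacklisted_bgp":   sum(bgp_prefix_of(ip) in blacklist
                                 for ip in rhip_set),
    }
\end{verbatim}
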
% [table of evidence-based features elided in this excerpt]
% [paragraph on Figure~\ref{fig:notos_features}, showing how the three feature groups are combined, elided]

\subsection{Reputation Engine}
\label{subsec:notos_reputation_engine}
The reputation engine is used to dynamically assign a reputation score to a domain \textit{d}. In the first step, the engine has to be trained with the available training set (temporally delimited as the \textit{training period}). The training is performed in an offline fashion, which means all data is statically available at the beginning of this step. The training mode consists of three modules: the \textit{Network Profile Model} is a model of how known good domains use network resources. This model uses popular content delivery networks (e.g., Akamai, Amazon CloudFront) and large sites (e.g., google.com, yahoo.com) as a base. In total, the \textit{Network Profile Model} consists of five classes of domains: \textit{Popular Domains}, \textit{Common Domains}, \textit{Akamai Domains}, \textit{CDN Domains} and \textit{Dynamic DNS Domains}. The second module, \textit{Domain Name Clusters}, performs a general clustering of all domains (respectively their statistical feature vectors) of the training set. There are two consecutive clustering processes: the \textit{network-based} clustering aims to group domains with similar characteristics in terms of agility, e.g. how often DNS resources are changed. To refine those clusters, a \textit{zone-based} clustering is performed, which groups domains that are similar in terms of their RHDNs (see the explanation of the \textit{zone-based features}). Those clusters of domains with similar characteristics can then be used to identify benign and malicious sets of domains. In the last step of the offline mode, the \textit{Reputation Function} is built. As seen in Figure~\ref{fig:notos_online_offline_mode}, this module takes the results of the \textit{Network Profile Model} (\(NM(d_i)\)) and the \textit{Domain Name Clusters} (\(DC(d_i)\)) for each domain \(d_i, i = 1..n\) as inputs, calculates an \textit{Evidence Features Vector} \(EV(d_i)\), which essentially checks whether \(d_i\) or any of its resolved IPs is known to be benign or malicious, and builds a model that can assign a reputation score between zero and one to \textit{d}. This \textit{Reputation Function} is implemented as a statistical classifier. These three modules form the reputation model that is used in the last step to compute the reputation score. The training model can be rebuilt at any time, for example given an updated training set. A sketch of this offline pipeline is given below.

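The following is a minimal sketch of the offline pipeline, with scikit-learn components as stand-ins for the unpublished implementation. The cluster counts are assumptions, the inputs are NumPy arrays, and each network cluster is assumed to hold at least \texttt{k\_zone} samples.

\begin{verbatim}
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier  # LogitBoost-like

def train_offline(net_X, zone_X, combined_X, labels,
                  k_net=30, k_zone=10):
    # 1) network-based clustering over the network feature vectors
    net_clusters = KMeans(n_clusters=k_net).fit(net_X)
    # 2) zone-based refinement inside each network cluster
    zone_clusters = {
        c: KMeans(n_clusters=k_zone)
             .fit(zone_X[net_clusters.labels_ == c])
        for c in range(k_net)
    }
    # 3) reputation function: a statistical classifier over the
    #    combined NM/DC/EV vectors and the knowledge-base labels
    rep_fn = GradientBoostingClassifier().fit(combined_X, labels)
    return net_clusters, zone_clusters, rep_fn
\end{verbatim}
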
The final stage of the reputation engine is the online (streaming-like) mode. Any considered domain \textit{d} is first supplied to the \textit{network profiles} module, which returns a probability vector \(NM(d) = \{c_1, c_2, ..., c_5\}\) of how likely \textit{d} belongs to each of the five classes (e.g. the probability \(c_1\) that \textit{d} belongs to the \textit{Popular Domains}). \(DC(d)\) is the resulting vector of the \textit{domain clusters} module and is computed as follows: for the domain \textit{d} of interest, the network-based features are extracted and the closest network-based cluster \(C_d\), generated in the training mode by the \textit{Domain Name Clusters} module, is determined. The following step takes all zone-based feature vectors \(v_j \in C_d\) and eliminates those vectors that satisfy neither \(dist(z_d, v_j) < R\), where \(z_d\) is the zone-based feature vector for \textit{d} and \textit{R} is a predefined radius, nor \(v_j \in KNN(z_d)\), with \(KNN(z_d)\) being the k nearest neighbors of \(z_d\). Each vector \(v_i\) of the resulting subset \(V_d \subseteq C_d\) is then assigned one of these eight labels: \textit{Popular Domains}, \textit{Common Domains}, \textit{Akamai}, \textit{CDN}, \textit{Dynamic DNS}, \textit{Spam Domains}, \textit{Flux Domains}, and \textit{Malware Domains}. The next step is to calculate the five statistical features that form the resulting vector \(DC(d) = \{l_1, l_2, ..., l_5\}\), the first three of which are listed below (a small sketch of this computation follows the list).

\begin{enumerate}
\item \(l_1\), the \textit{majority class label} \textit{L}, i.e. the most common label among all \(v_i \in V_d\) (e.g. \textit{Spam Domains}).
\item \(l_2\), the standard deviation of the occurrence frequency of each label.
\item \(l_3\), the mean of the distribution of distances between \(z_d\) and the vectors \(v_j \in V_{d}^{(L)}\), where \(V_{d}^{(L)} \subseteq V_d\) is the subset of those vectors associated with the \textit{majority class label} \textit{L}.
\end{enumerate}
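
Assuming a distance function over the zone-based feature vectors, the computation of \(V_d\) and the first three statistics could be sketched like this:

\begin{verbatim}
from collections import Counter
from statistics import mean, pstdev

def dc_statistics(z_d, vectors, labels, dist, R, k):
    """Filter the closest network cluster down to V_d and compute
    l1..l3; vectors/labels describe the cluster members, and V_d
    is assumed to be non-empty."""
    order = sorted(range(len(vectors)),
                   key=lambda j: dist(z_d, vectors[j]))
    knn = set(order[:k])                       # k nearest neighbors
    V_d = [j for j in order
           if dist(z_d, vectors[j]) < R or j in knn]
    freq = Counter(labels[j] for j in V_d)
    L, _ = freq.most_common(1)[0]              # l1: majority label
    l2 = pstdev(freq.values()) if len(freq) > 1 else 0.0
    dists_L = [dist(z_d, vectors[j]) for j in V_d if labels[j] == L]
    l3 = mean(dists_L)                         # l3: mean distance to L
    return L, l2, l3
\end{verbatim}
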
Having the \textit{Network Profile Model} output \(NM(d)\), the \textit{Domain Name Clusters} output \(DC(d)\), and the \textit{Evidence Features Vector} \(EV(d)\), these vectors are combined into a sixteen-dimensional feature vector \(v(d)\) which is then fed into the trained reputation function. This results in a reputation score \textit{S} in the range of \([0, 1]\), where values close to zero represent a low reputation and thus more likely indicate malicious usage of the domain.

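Assuming the evidence vector contributes the remaining six dimensions (the exact split is not spelled out above, but is consistent with the stated dimensionality), the combination can be written as
\[
v(d) = \big(\, NM(d),\; DC(d),\; EV(d) \,\big)
     = (c_1, \dots, c_5,\; l_1, \dots, l_5,\; e_1, \dots, e_6),
\qquad
S = f\big(v(d)\big) \in [0, 1],
\]
where \(f\) denotes the trained reputation function.
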
\subsection{Experimental Results}

In the last Section of the evaluation of \textit{Notos}, the experimental results that have been published are presented. This covers metrics about the usage of raw data, lessons learned in the analysis process (i.e. examined algorithms) and final figures such as precision and accuracy of the classification.

\textit{Notos} was the first dynamic reputation system in the context of domain names and it was able to identify malicious domain names before they appeared in public filter lists, which ultimately led to the discovery of a previously unknown ZeuS botnet \fsCite{Antonakakis:2010:BDR:1929820.1929844}. To be able to assign reputation scores to new domains, \fsAuthor{Antonakakis:2010:BDR:1929820.1929844} used historic passive DNS logs spanning 68 days with a total volume of 27,377,461 unique, successful A-type resolutions, mainly from two recursive ISP DNS servers in North America (plus pDNS logs from various networks, aggregated by the SIE, see Section~\ref{subsec:notos_architecture}). Figure~\ref{fig:notos_volume_new_rr} shows that after a few days, the number of new RRs stabilizes at about 100,000 to 150,000 new domains a day, compared to a much higher total load of unique resource records (about 94.7\% duplicates, see Figure~\ref{fig:notos_total_volume_unique_rr}). The amount of new IPs is analogously nearly constant. After a few weeks, even big content delivery networks with a large (but nearly constant) number of IP addresses will have been observed completely, in contrast to botnets, where new machines are continuously infected. The authors infer that a relatively small pDNS database is therefore sufficient for \textit{Notos} to produce good results.

% [Figures fig:notos_volume_new_rr and fig:notos_total_volume_unique_rr elided in this excerpt]

To get optimal results with the \textit{Reputation Function}, several classifiers have been tested and evaluated against the given requirements (time complexity, detection results and precision, i.e. true positives over all positives). A decision tree with the Logit-Boost strategy (see \fsCite{Friedman98additivelogistic} for implementation details) has been shown to provide the best results, with a low false positive rate (FP) of 0.38\% and a high true positive rate (TP) of 96.8\%. These results have been verified using a 10-fold cross validation with a reputation score threshold of 0.5. This 10-fold cross validation method splits the dataset into ten partitions/folds (each partition optimally containing roughly the same class label distribution). One fold is used as the validation sample (testing set) and the remaining nine partitions are used as the training set. The training set is used to train the model, which is then validated against the testing set. This step is repeated ten times using the same partitions, each partition being the testing set once. For the validation in \textit{Notos}, a dataset of 20,249 domains with 9,530 known bad RRs has been used. As the list of known good domains, the Alexa top 500 websites have been used. Taking a bigger amount of Alexa popular sites has been shown to decrease the accuracy of the overall system, which could be interpreted as smaller/less popular sites being more likely to get compromised. To compare \textit{Notos'} performance with static filter lists, a pre-trained instance has been fed with 250,000 unique domains collected on the first of August 2009. 10,294 distinct entries have been reported with a reputation score below 0.5; 7,984 of these 10,294, or 77.6\%, could be found in at least one blacklist (see Section~\nameref{subsec:notos_architecture} for a list of included blacklists). The remaining 22.4\% could not be conclusively verified. It is worth noting that 7,980 of the 7,984 confirmed bad domain names were assigned a reputation score of less than or equal to 0.15.

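The described validation scheme corresponds to a stratified 10-fold cross validation, sketched below with scikit-learn; \texttt{GradientBoostingClassifier} again stands in for the LogitBoost-style decision tree, and \texttt{X}, \texttt{y} are assumed NumPy arrays with \texttt{y} holding 0/1 labels (1 meaning malicious).

\begin{verbatim}
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, threshold=0.5, folds=10):
    tp_rates, fp_rates = [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=folds).split(X, y):
        clf = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
        score = clf.predict_proba(X[test_idx])[:, 1]   # P(malicious)
        pred = score >= threshold
        truth = y[test_idx].astype(bool)
        tp_rates.append((pred & truth).sum() / max(truth.sum(), 1))
        fp_rates.append((pred & ~truth).sum() / max((~truth).sum(), 1))
    return np.mean(tp_rates), np.mean(fp_rates)
\end{verbatim}
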
\subsection{Limitations}