rush hour

2018-01-29 22:52:15 +01:00
parent 817b68b025
commit ece9b4afcf
14 changed files with 284 additions and 110 deletions


@@ -35,7 +35,7 @@ For the distinction of benign and malicious domains to perform well, a large set
\textit{Exposure} uses a total of fifteen features that have been chosen after several months of study with thousands of well-known benign and malicious domains. These features are grouped into four different categories, which can be seen in Table~\ref{tab:exposure_features}.
The first group, \textit{Time-Based Features}, has not been approached in publications before. These features investigate the time at which requests for a domain \textit{d} are issued. The main idea behind this group of features is to find malicious services that use techniques like \textit{domain flux}
(see Section~\ref{subsec:fast-flux_service_networks}) to circumvent take-downs and make their infrastructure more agile. ``[\textit{Domain flux}] often show a sudden increase followed by a sudden decrease in the number of requests'' \fsCite[Section 3.1]{Bilge:2014:EPD:2617317.2584679}. Domains of malicious services using a DGA only exist for a short period of time by design. \fsAuthor{Bilge:2014:EPD:2617317.2584679} defines the first feature as follows: ``A domain is defined to be a short-lived domain [...] if it is queried only between time \(t_0\) and \(t_1\), and if this duration is comparably short (e.g., less than several days).'' The next three features are subject to the change point detection (CPD) problem: change point detection is about the identification of (abrupt) changes in the distribution of values, for example in time series. \textit{Exposure} implements a CPD algorithm based on the popular CUSUM (cumulative sum) algorithm. At first, the time series of request timestamps is split into intervals of 3600 seconds (one hour was found to work well). After that, all time intervals are iterated and for each interval \(t\), the average request count of the previous eight intervals \(P_t^-\) and of the following eight intervals \(P_t^+\) is calculated. In the next step, the distance \(d(t)=|P_t^- - P_t^+|\) of these two values is calculated for each interval and the resulting ordered sequence of distances \(d(t)\) is fed to the CUSUM algorithm to finally retrieve all change points (for more information on the implemented CPD algorithm, see \fsCite[Section 3.1]{Bilge:2014:EPD:2617317.2584679}). To calculate feature two (\textit{Daily similarity}), the Euclidean distance between the time series of each day for \textit{d} is calculated. Intuitively, a low distance means similar time series and thus a high daily similarity, whereas two days with a higher distance show a less similar request volume. Naturally, all features of this group only perform well when a larger number of requests to \textit{d} is observed over a significant period of time.
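The following sketch is not taken from \textit{Exposure} itself; it merely illustrates the steps described above (hourly binning, the distance sequence \(d(t)\) over windows of eight intervals, a simple CUSUM pass, and the daily Euclidean distance). The interval length, window size and threshold are assumptions based on this description.
\begin{verbatim}
# Sketch of the described change point detection (assumptions: hourly bins,
# a window of 8 intervals, and an illustrative CUSUM threshold).
import numpy as np

def hourly_request_counts(timestamps, interval=3600):
    """Bin raw request timestamps (seconds) into fixed-size intervals."""
    bins = np.arange(min(timestamps), max(timestamps) + interval, interval)
    counts, _ = np.histogram(timestamps, bins=bins)
    return counts

def distance_sequence(counts, w=8):
    """d(t) = |P_t^- - P_t^+|: averages of the previous/following w intervals."""
    d = np.zeros(len(counts))
    for t in range(w, len(counts) - w):
        p_minus = counts[t - w:t].mean()
        p_plus = counts[t:t + w].mean()
        d[t] = abs(p_minus - p_plus)
    return d

def cusum_change_points(d, threshold=None):
    """Very simple CUSUM: report intervals where the cumulative deviation
    from the mean of d(t) exceeds a threshold (an assumed heuristic here)."""
    mean = d.mean()
    threshold = threshold if threshold is not None else 3 * d.std()
    s, change_points = 0.0, []
    for t, value in enumerate(d):
        s = max(0.0, s + value - mean)
        if s > threshold:
            change_points.append(t)
            s = 0.0
    return change_points

def daily_similarity(day_a_counts, day_b_counts):
    """Feature two: Euclidean distance between the time series of two days."""
    return np.linalg.norm(np.asarray(day_a_counts) - np.asarray(day_b_counts))
\end{verbatim}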
The next group of features (\textit{DNS Answer-Based Features}) investigates the resolutions of the requested domain \textit{d}. While a single domain can map to multiple IP addresses even for benign services, most harmless services show a much smaller network profile in terms of, e.g., location and \glspl{as}. To capture these findings, four features have been extracted: the number of distinct IP addresses, the number of different countries these IP addresses are assigned to, the number of other domains that share an IP address \textit{d} resolves to, and the number of results of the reverse DNS queries for all IPs of \textit{d}. It is worth noting that some hosting providers also use one IP address for many domains, so an extra layer to prevent such false positives makes sense.
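A minimal sketch of how these four answer-based features could be derived from collected A-record resolutions; the record structure as well as the GeoIP and reverse lookup helpers are assumptions for illustration and not part of \textit{Exposure}:
\begin{verbatim}
# Sketch only: `resolutions` maps each domain to the set of IPs it resolved
# to; `ip_to_country` and `reverse_dns` stand in for GeoIP and PTR lookups.
def answer_based_features(domain, resolutions, ip_to_country, reverse_dns):
    ips = resolutions[domain]
    sharing = {d for d, other_ips in resolutions.items()
               if d != domain and other_ips & ips}
    return {
        "distinct_ips": len(ips),                                      # feature 1
        "distinct_countries": len({ip_to_country(ip) for ip in ips}),  # feature 2
        "domains_sharing_ip": len(sharing),                            # feature 3
        "reverse_dns_results": sum(len(reverse_dns(ip)) for ip in ips) # feature 4
    }
\end{verbatim}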
@@ -73,14 +73,14 @@ The last group of features are the \textit{Domain Name-Based Features}. Domain n
\subsection{Reputation Engine}
\label{subsec:exposure_reputation_engine}
The reputation classifier of \textit{Exposure} is implemented as a \textit{J48} decision tree algorithm. The performance of decision trees mainly depends on the quality of the training set. For this reason, a representative set of training data with malicious domains from various threat classes has to be chosen. Sources that have been used to identify malicious and benign domains can be found in Section~\ref{subsec:exposure_architecture}. In total, a list of 3500 known bad as well as 3000 known good domains has been used for the initial training. In order to take advantage of the \textit{Time-Based Features}, the optimal training period has been observed to be seven days. The tree is then constructed using the feature attribute values and their corresponding labels. More specifically, the whole training set is iterated and each time a set of samples can be separated using one single attribute (with respect to the assigned label), the tree is branched out and a new leaf is created. Each branch is then split into more fine-grained subtrees as long as there is an \textit{information gain}, until all samples of a subset belong to the same class, i.e. are assigned the same label (see more on decision trees in Section~\ref{subsec:decision_tree_classifier}).
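\textit{J48} is Weka's implementation of the C4.5 algorithm; the sketch below uses scikit-learn's CART-based decision tree merely as a stand-in to illustrate training on labelled feature vectors. The variable names and data layout are placeholders, not the original implementation.
\begin{verbatim}
# Illustration only: scikit-learn's DecisionTreeClassifier (CART) stands in
# for Weka's J48/C4.5. X holds the fifteen feature values per domain,
# y the labels (1 = malicious, 0 = benign).
from sklearn.tree import DecisionTreeClassifier

def train_reputation_tree(X, y):
    # Splits are chosen by information gain ("entropy"), as in C4.5.
    tree = DecisionTreeClassifier(criterion="entropy")
    return tree.fit(X, y)
\end{verbatim}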
\subsection{Results}
\label{subsec:exposure_results}
The performance of classifiers with different feature sets has been tested using, e.g., 10-fold cross validation. To find the model with the minimum error rate, all combinations of feature sets (\textit{Time-Based Features} as F1, \textit{DNS Answer-Based Features} as F2, \textit{TTL Value-Based Features} as F3 and \textit{Domain Name-Based Features} as F4) have been trained using the same decision tree algorithm. Figure~\ref{fig:exposure_miss-classifier_instances} shows the error rate of these different classification models. The \textit{Time-Based Features} show the smallest error when inspecting single feature sets only. Looking at models with multiple feature sets, the overall minimum error rate is produced when using all four feature groups. The dataset collected for the initial analysis contained roughly 100 billion DNS queries. As processing all of these requests is not feasible in practice, two filtering steps have been introduced. The first one filters out all requests to a domain in the Alexa top 1000 list. The assumption behind this filter is that no malicious domain will get this popular without being detected in some form. This step removed about 20\% of the initial requests. The second step filters out all requests to domains that have been registered at least one year before the analysis. This filter applied to 45,000 domains (or 40 billion corresponding queries) and reduced the remaining traffic by another 50\%. The filtering process has been cross-checked against the Alexa top list, McAfee WebAdvisor (formerly McAfee SiteAdvisor) \fsCite{MCAfeeWebAdvisorOnline}, Google Safe Browsing \fsCite{GoogleSafeBrowsingOnline} and Norton Safe Web \fsCite{NortonSafeWebOnline}, and only 0.09\% have been reported to be risky. For this reason, \fsAuthor{Bilge11exposure:finding} states: ``We therefore believe that our filtering policy did not miss a significant number of malicious domains because of the pre-filtering we performed during the offline experiments.''
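The two pre-filtering steps could be sketched as follows; the whitelist and the registration-date source (e.g. WHOIS data) are assumptions for illustration only:
\begin{verbatim}
# Sketch of the two filtering steps: drop queries for Alexa top 1000 domains
# and for domains registered at least one year before the analysis.
from datetime import timedelta

def prefilter(queries, alexa_top_1000, registration_date, analysis_date):
    """queries: iterable of (timestamp, domain); registration_date: dict
    mapping domain -> datetime (stand-in for WHOIS data)."""
    cutoff = analysis_date - timedelta(days=365)
    for ts, domain in queries:
        if domain in alexa_top_1000:
            continue                      # step 1: popular, assumed benign
        reg = registration_date.get(domain)
        if reg is not None and reg < cutoff:
            continue                      # step 2: long-registered domain
        yield ts, domain
\end{verbatim}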
The accuracy of the classifier has been validated using two different methods. The first method was to classify the training set with 10-fold cross validation. This validation method splits the dataset into ten partitions (folds), with each partition ideally containing roughly the same class label distribution. One fold is then used as the validation sample (testing set) and the remaining nine partitions are used as the training set. The training set is used to train the model, which is then validated against the testing set. This step is repeated ten times using the same partitions, with each partition being the testing set exactly once. The second method is to simply use 66\% of the dataset for training and the remaining 33\% as the testing set.
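Both validation schemes can be sketched as follows; scikit-learn is used purely for illustration here, whereas the original evaluation was carried out with Weka:
\begin{verbatim}
# Illustration of the two validation methods described above, using
# scikit-learn as a stand-in for the original Weka-based evaluation.
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

def validate(X, y):
    clf = DecisionTreeClassifier(criterion="entropy")

    # Method 1: stratified 10-fold cross validation (each fold keeps
    # roughly the same class label distribution).
    cv_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))

    # Method 2: simple hold-out split, 66% training / rest testing.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, stratify=y)
    holdout_score = clf.fit(X_tr, y_tr).score(X_te, y_te)

    return cv_scores.mean(), holdout_score
\end{verbatim}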
\begin{figure}[!htbp]