rush hour 2

2018-01-30 20:54:29 +01:00
parent ece9b4afcf
commit 95d35f2470
20 changed files with 171 additions and 146 deletions
--- a/Thesis/content/Technical_Background/DNS/DNS.tex
+++ b/Thesis/content/Technical_Background/DNS/DNS.tex
@@ -1,7 +1,7 @@
 \section{Domain Name System}
 \label{sec:DNS}

-The \gls{dns} is one of the cornerstone of the internet as it is known today. Nearly every device, connected to the internet is using DNS. Initial designs have been proposed in 1983 and evolved over the following four years into the first globally adapted standard RFC 1034 \fsCite{rfc1034} (see also RFC 1035 for implementation and specification details \fsCite{rfc1035}). The main idea of the \gls{dns} is translating human readable domain names to network addresses. There are many extensions to the initial design including many security related features and enhancements or the support for \gls{ipv6} in 1995. 
+The \gls{dns} is one of the cornerstones of the internet as it is known today. Nearly every device connected to the internet uses DNS. Initial designs were proposed in 1983 and evolved over the following four years into the first globally adopted standard, RFC 1034 \fsCite{rfc1034} (see also RFC 1035 for implementation and specification details \fsCite{rfc1035}). The main idea of the \gls{dns} is to translate human-readable domain names into network addresses. There have been many extensions to the initial design, including security-related features and enhancements as well as support for \gls{ipv6} in 1995.

 In order to understand how the \gls{dns} is misused for malicious activities and how to prevent these attacks, it is necessary to explain some basic mechanisms.

@@ -9,15 +9,15 @@ In order to understand how the \gls{dns} is misused for malicious activities and
 \subsection{Basics}
 \label{subsec:basics}

-In the early days of the internet the mapping between host names and ip addresses has been accomplished using a single file, \texttt{HOSTS.TXT}. This file was maintained on a central instance, the \gls{sri-nic}, and distributed to all hosts in the internet via \gls{ftp}. As this file grew and more machines got connected to the internet, the costs for distributing the mappings were increasing up to an unacceptable effort. Additionally, the initial trend of the internet, the \gls{arpanet} connecting multiple hosts together into one network, got outdated. The new challenge of the internet was to connect multiple local networks (which itself contain many machines) into a global, interactive and \gls{tcp/ip} based grid. With the amount of machines quickly increasing and the costs for distributing the \texttt{HOSTS.TXT} file exponentially rising, a new system for a reliable and fast resolution of addresses to host names had to be developed.
+In the early days of the internet, the mapping between host names and IP addresses was accomplished using a single file, \texttt{HOSTS.TXT}. This file was maintained on a central instance, the \gls{sri-nic}, and distributed to all hosts on the internet via \gls{ftp}. As this file grew and more machines were connected to the internet, the cost of distributing the mappings rose to an unacceptable level. Additionally, the initial approach of the internet, the \gls{arpanet} connecting multiple hosts together into one network, became outdated. The new challenge was to connect multiple local networks (which themselves contain many machines) into a global, interactive and \gls{tcp/ip} based grid. With the number of machines quickly increasing and the cost of distributing the \texttt{HOSTS.TXT} file rising, a new system for reliable and fast resolution of host names to addresses had to be developed.

 \citeauthor{mockapetris1988development} proposed five conditions that had to be met by the base design of \gls{dns} \fsCite[p. 124]{mockapetris1988development}:

 \begin{itemize}
-\item Provide at least all of the same information as HOSTS.TXT.
+\item Provide at least the same information as HOSTS.TXT.
 \item Allow the database to be maintained in a distributed manner.
 \item Have no obvious size limits for names, name components, data associated with a name, etc.
-\item Interoperate across the DARPA Internet as many other environments as possible.
+\item Interoperate across the DARPA Internet and as many other environments as possible.
 \item Provide tolerable performance.
 \end{itemize}

@@ -28,14 +28,21 @@ In general, avoid as many constraints and support as many implementation structu
 \subsubsection{Architecture}
 \label{subsubsec:architecture}

-The \gls{dns} primarily builds on two types of components: name servers and resolvers. A name server holds information that can be used to handle incoming requests e.g. to resolve a domain name into an ip address. Although resolving domain names into ip addresses might be the primary use case, name servers can possess arbitrary information and provide service to retrieve this information. A resolver interacts with client software and implements algorithms to find a name server that holds the information requested by the client. Depending on the functionality needed, these two components may be split to different machines and locations or running on one machine. Where in former days the power of a workstation may not has been sufficient to run a resolver on, today it is more interesting to benefit from cached information for performance reasons. In a company network it is common to have multiple resolvers e.g. one per organizational unit.
+The \gls{dns} primarily builds on two types of components: name servers and resolvers. A name server holds information that can be used to handle incoming requests, e.g. to resolve a domain name into an IP address. Although resolving domain names into IP addresses might be the primary use case, name servers can hold arbitrary information (within the limits of DNS resource records, see Table~\ref{tab:resource_record_types}) and provide a service to retrieve it. A resolver interacts with client software and implements algorithms to find a name server that holds the information requested by the client (see Section~\ref{subsec:resolution} for how resolution works). Depending on the functionality needed, these two components may be split across different machines and locations or run on one machine. Whereas in former days the bandwidth of a workstation may not have been sufficient to run a resolver, today it is more attractive to benefit from cached information for performance reasons. In a company network it is common to have multiple resolvers, e.g. one per organizational unit.



 \subsubsection{Name space}
 \label{subsubsec:name_space}

-The \gls{dns} is based on a naming system that consists of a hierarchical and logical tree structure and is called the domain namespace. It contains a single root node (\textit{top level domain} or \textit{TLD})and an arbitrary amount of nodes in subordinate levels in variable depths (descending called second level, third level domain, and so forth). Each node is uniquely identifiable through a \gls{fqdn} and usually represents a domain, machine or service in the network. Furthermore, every domain can be subdivided into more fine-grained domains. These can again be specific machines or domains, called subdomains. This subdividing is an important concept for the internet to continue to grow and each responsible instance of a domain (e.g. a company or cooperative) is responsible for the maintenance and subdivision of the domain. 
+The \gls{dns} is based on a naming system that consists of a hierarchical and logical tree structure, called the domain namespace. It contains a single root node, directly beneath which the \textit{top level domains} (\textit{TLDs}) reside, and an arbitrary number of nodes in subordinate levels of variable depth (in descending order called second level domain, third level domain, and so forth). Each node is uniquely identifiable through a \gls{fqdn} and usually represents a domain, machine or service in the network. The FQDN is constructed by traversing the DNS tree from the node up to the root; see Figure~\ref{fig:dns_tree_web_de} for the DNS tree of www.web.de (note that the root node is often abbreviated with a simple dot). Furthermore, every domain can be subdivided into more fine-grained domains. These can again be specific machines or domains, called subdomains. This subdivision is an important concept that allows the internet to keep growing: the entity responsible for a domain (e.g. a company or cooperative) is also responsible for its maintenance and further subdivision.
+
+\begin{figure}[!htbp]
+    \centering
+    \includegraphics[width=.4\textwidth, clip=true]{content/Technical_Background/DNS/dns_tree_web_de.png}
+    \caption{DNS: Tree structure}
+    \label{fig:dns_tree_web_de}
+\end{figure}
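The construction of an FQDN from the tree can be sketched as follows. This is a minimal illustration and not part of the thesis code: the `Node` class and the `fqdn` function are hypothetical names, and the tree is reduced to the single path for www.web.de. Labels are collected while walking from a node up to the root and joined with dots; the root's empty label produces the trailing dot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of the DNS tree (hypothetical helper for illustration)."""
    label: str                      # empty string for the root node
    parent: Optional["Node"] = None


def fqdn(node: Node) -> str:
    """Build the FQDN by walking from the node up to the root."""
    labels = []
    current: Optional[Node] = node
    while current is not None:
        labels.append(current.label)
        current = current.parent
    # The root's empty label yields the trailing dot; the root alone is ".".
    return ".".join(labels) if len(labels) > 1 else "."


# Reduced tree containing only the path root -> de -> web -> www.
root = Node("")
de = Node("de", root)
web = Node("web", de)
www = Node("www", web)

print(fqdn(www))
```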


 \subsubsection{\gls{dns} Resource Records}
@@ -46,7 +53,7 @@ See Table~\ref{tab:resource_record_types} for an list of built-in resource types

 \begin{table}[!htbp]
 \centering
-\caption{Resource Record Types}
+\caption{DNS: Resource Record Types}
 \label{tab:resource_record_types}
 \begin{tabular}{@{}llll@{}}
 \toprule
@@ -66,7 +73,7 @@ Value & Text Code & Type
 \subsubsection{Payload}
 \label{subsubsec:payload}

-In this section we will introduce the actual payload a \gls{dns} request as well as the response is built on. The format of each message that is shared between a resolver and \gls{dns} server has been initially defined in RFC 1035 \fsCite{rfc1035} and consecutively extended with new opcodes, response codes etc. This general format applies to both requests as well as responses and consists of five sections:
+This section introduces the payload on which \gls{dns} requests and responses are built. The format of each message exchanged between a resolver and a \gls{dns} server was initially defined in RFC 1035 \fsCite{rfc1035} and subsequently extended with new opcodes, response codes etc. This general format applies to both requests and responses and consists of five sections:

 \begin{enumerate}
    \item Message Header
@@ -77,12 +84,12 @@ In this section we will introduce the actual payload a \gls{dns} request as well
 \end{enumerate}

 \paragraph{Message Header:}
-\label{par:message_header}with
+\label{par:message_header}
 The Message Header is obligatory for all types of communication and may not be empty. It contains different types of flags that are used to control the transaction. The header specifies e.g. which further sections are present, whether the message is a query or a response and more specific opcodes.

 \begin{table}[!htbp]
 \centering
-\caption{Message Header}
+\caption{DNS: Message Header}
 \label{tab:message_header}
 \begin{tabular}{@{}cccccccccccccccc@{}}
 \toprule
@@ -99,14 +106,14 @@ QR & \multicolumn{4}{c}{OPCODE} & AA & TC & RD & RA & Z & AD & CD & \multicolumn
 Table~\ref{tab:message_header} shows the template of a \gls{dns} message header. In the following listing, an explanation for the respective variables and flags is given:

 \begin{itemize}
-    \item \textbf{Message ID:} 16 bit identifier supplied by the requester (any kind of software that generates a request) and resend back unchanged by the responder to identify the transaction and enables the requester to match up replies to outstanding request.
+    \item \textbf{Message ID:} 16-bit identifier supplied by the requester (any kind of software that generates a request) and sent back unchanged by the responder. It identifies the transaction and enables the requester to match replies to outstanding requests.
    
     \item \textbf{QR:} Query/Response Flag – one-bit field indicating whether this message is a query (0) or a response (1).
    
     \item \textbf{OPCODE:} Four-bit field that specifies the kind of query for this message. It is set by the requester and copied into the response. Possible values for the opcode field can be found in Table~\ref{tab:message_header_opcodes}.
    \begin{table}[!htbp]
    \centering
-    \caption{Message Header Opcodes}
+    \caption{DNS: Message Header Opcodes}
    \label{tab:message_header_opcodes}
    \begin{tabular}{@{}lll@{}}
    \toprule
@@ -139,7 +146,7 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.
    
    \begin{table}[!htbp]
    \centering
-    \caption{Message Header Response Codes}
+    \caption{DNS: Message Header Response Codes}
    \label{tab:message_header_response_codes}
    \begin{tabular}{@{}lll@{}}
    \toprule
@@ -165,7 +172,7 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.

 \begin{table}[!htbp]
 \centering
-\caption{Question Section}
+\caption{DNS: Question Section}
 \label{tab_question_section}
 \begin{tabular}{@{}ccccccccc@{}}
 \toprule
@@ -177,7 +184,7 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.


 \begin{itemize}
-    \item \textbf{Question Name:} Contains a variably sized payload including the domain, zone name or general object that is subject of the query. Encoded using standard \gls{dns} name notation. Depending on the Question Type, for example requesting an A Record will typically require an host part, such as www.domain.tld. A MX query will usually only contain a base domain name (domain.tld).
+    \item \textbf{Question Name:} Contains a variably sized payload naming the domain, zone or general object that is the subject of the query, encoded using standard \gls{dns} name notation. Its content depends on the Question Type: for example, a request for an A record will require a host part, such as www.domain.tld, whereas an MX query will usually only contain a base domain name (domain.tld).
    
    \item \textbf{Question Type:} Specifies the type of question being asked. This field may contain a code number corresponding to a particular type of resource being requested, see Table~\ref{tab:resource_record_types} for common resource types.
    
@@ -204,11 +211,11 @@ There are mainly two different types of DNS requests that are performed here. Th
 \begin{figure}[!htbp]
 \centering
 \includegraphics[scale=.5, clip=true]{content/Technical_Background/DNS/DNS_address-resolution.pdf}
-\caption{Address Resolution}
+\caption{DNS: Address Resolution}
 \label{fig:address_resolution}
 \end{figure}

 \subsection{Passive DNS}
 \label{subsec:passive_dns}

-A Passive DNS database is a database that contains a history of all resolved DNS queries in a network. The traffic can be observed at any appropriate location in a network, e.g. on a resolver. A Passive DNS database can be used in a variety of actions to harden a network from different threats. Projects like the Security Information Exchange (SIE) collect passive DNS data from multiple sources and analyse the databases to find e.g. inconsistencies in the resolutions (\fsCite{SIEOnline}). Passive DNS databases can also be used by researchers or service providers to find performance issues, identify anomalies or generate usage statistics \fsCite{Deri:2012:TPD:2245276.2245396}.
+A Passive DNS database is a database that contains a history of all resolved DNS queries in a network. The traffic can be observed at any appropriate location in a network, e.g. on a resolver. The main advantage of passively collecting DNS traffic is that no operational changes are needed to collect logs of resolutions (one simple way is to mirror the DNS port on the resolver and persist the traffic into files). A Passive DNS database can be used in a variety of ways to harden a network against different threats. Projects like the Security Information Exchange (SIE) collect passive DNS data from multiple sources and analyse the databases to find e.g. inconsistencies in the resolutions \fsCite{SIEOnline}. Passive DNS databases can also be used by researchers or service providers to find performance issues, identify anomalies or generate usage statistics \fsCite{Deri:2012:TPD:2245276.2245396}.
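The idea of a history of observed resolutions can be sketched as follows. This is only an illustration under assumptions: the text does not specify a schema, so the `PassiveDnsStore` class, its `observe` and `history` methods and the first-seen/last-seen/count fields are all hypothetical, and the addresses come from documentation-only ranges.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Observation:
    first_seen: int   # Unix timestamp of the first sighting
    last_seen: int    # Unix timestamp of the most recent sighting
    count: int        # number of times this resolution was observed


class PassiveDnsStore:
    """In-memory history of observed resolutions (assumed schema)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str, str], Observation] = {}

    def observe(self, name: str, rtype: str, rdata: str, ts: int) -> None:
        """Record one observed answer (name, record type, record data)."""
        # Normalise so "WWW.domain.tld." and "www.domain.tld" share a key.
        key = (name.lower().rstrip("."), rtype.upper(), rdata)
        obs = self._records.get(key)
        if obs is None:
            self._records[key] = Observation(ts, ts, 1)
        else:
            obs.first_seen = min(obs.first_seen, ts)
            obs.last_seen = max(obs.last_seen, ts)
            obs.count += 1

    def history(self, name: str) -> Dict[Tuple[str, str, str], Observation]:
        """Return every resolution ever observed for a name."""
        wanted = name.lower().rstrip(".")
        return {k: v for k, v in self._records.items() if k[0] == wanted}


store = PassiveDnsStore()
store.observe("www.domain.tld.", "A", "192.0.2.1", 1517342069)
store.observe("www.domain.tld", "a", "192.0.2.1", 1517345669)
store.observe("www.domain.tld", "A", "198.51.100.7", 1517349269)
```

Keeping one entry per distinct resolution, rather than one per packet, is a design choice made here so that a name that suddenly resolves to a new address shows up as a new entry in its history.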
--- a/Thesis/content/Technical_Background/DNS/dns_tree_web_de.png
+++ b/Thesis/content/Technical_Background/DNS/dns_tree_web_de.png
--- a/Thesis/content/Technical_Background/Technical_Background.tex
+++ b/Thesis/content/Technical_Background/Technical_Background.tex
@@ -2,5 +2,14 @@
 \label{cha:technical_background}

 \input{content/Technical_Background/DNS/DNS}
-\input{content/Technical_Background/Detecting_Malicious_Domain_Names/Detecting_Malicious_Domain_Names}
-\input{content/Technical_Background/Benchmarks/Benchmarks}
+
+\section{Machine Learning}
+\label{sec:machine_learning}
+
+Machine learning is a broad field in computer science that aims to give computers the ability to learn without being explicitly programmed for a special purpose. There are many different approaches, each with advantages and disadvantages in different areas. Machine learning in this work is mostly limited to decision tree learning, an approach modelled on how humans make decisions: given a set of attributes, humans are able to decide, e.g., whether to buy one product or another. Machine learning algorithms use a technique called training to build a model which can later be used to make decisions. A decision tree consists of three components: nodes represent the test of a certain attribute that splits up the tree, leaves are terminal nodes and represent the prediction (the class or label) for the path from the root node to the leaf, and edges correspond to the results of a test and establish a connection to the next node or leaf. Training starts from an arbitrarily large dataset (the training set) with a fixed number of features (attributes), in which each sample is assigned a label. The number of labels is arbitrary (but limited); in a binary classification there are two different labels (e.g. malicious or benign in the case of domains). In the first step of the training, the whole training set is iterated, and each time a set of samples can be separated using one single attribute (with respect to the assigned label), the tree is branched out and a new leaf is created. Each branch is then split into more fine-grained subtrees as long as there is an \textit{information gain}; splitting stops once all samples of a subset belong to the same class, i.e. are assigned the same label. The model can later be queried with an unlabeled data sample and returns the probability with which the sample can be assigned to a class/label.
+
+This way, by learning the characteristics of a labeled training set of limited size, unlabeled data can be classified.
+
+
+%\input{content/Technical_Background/Detecting_Malicious_Domain_Names/Detecting_Malicious_Domain_Names}
+\input{content/Technical_Background/Benchmarks/Benchmarks}
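The information gain mentioned in the Machine Learning section can be made concrete with a small sketch. It assumes Shannon entropy as the impurity measure, which the text does not specify; the function names `entropy` and `information_gain` and the toy labels are hypothetical and only serve the illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy in bits of a list of class labels (assumed measure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(parent: Sequence[str], subsets: List[Sequence[str]]) -> float:
    """Entropy reduction achieved by splitting `parent` into `subsets`."""
    total = len(parent)
    # Weighted average entropy of the subsets after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - remainder


labels = ["malicious", "malicious", "benign", "benign"]

# A split that separates the classes completely.
pure_split = [["malicious", "malicious"], ["benign", "benign"]]
# A split that leaves both subsets as mixed as the parent.
mixed_split = [["malicious", "benign"], ["malicious", "benign"]]

print(information_gain(labels, pure_split))
print(information_gain(labels, mixed_split))
```

A split with high gain is worth making, while a split with zero gain does not bring the subsets any closer to containing a single label.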