rush hour 3

2018-02-01 01:40:07 +01:00
parent 95d35f2470
commit 970af03c09
18 changed files with 183 additions and 177 deletions

@@ -1,17 +1,17 @@
\section{Domain Name System}
\label{sec:DNS}
The \gls{dns} is one of the cornerstones of the internet as it is known today. Nearly every device, connected to the internet is using DNS. Initial designs have been proposed in 1983 and evolved over the following four years into the first globally adapted standard RFC 1034 \fsCite{rfc1034} (see also RFC 1035 for implementation and specification details \fsCite{rfc1035}). The main idea of the \gls{dns} is translating human readable domain names to network addresses. There are many extensions to the initial design including many security related features and enhancements or the support for \gls{ipv6} in 1995.
The Domain Name System (DNS) is one of the cornerstones of the internet as it is known today. Nearly every device connected to the internet uses DNS. Initial designs were proposed in 1983 and evolved over the following four years into the first globally adopted standard, RFC 1034 \fsCite{rfc1034} (see also RFC 1035 for implementation and specification details \fsCite{rfc1035}). The main idea of the DNS is to translate human-readable domain names into network addresses. There are many extensions to the initial design, including many security-related features and enhancements as well as the support for IPv6 in 1995.
In order to understand how the \gls{dns} is misused for malicious activities and how to prevent these attacks, it is necessary to explain some basic mechanisms.
In order to understand how the DNS is misused for malicious activities and how to prevent these attacks, it is necessary to explain some basic mechanisms.
\subsection{Basics}
\label{subsec:basics}
In the early days of the internet the mapping between host names and IP addresses has been accomplished using a single file, \texttt{HOSTS.TXT}. This file was maintained on a central instance, the \gls{sri-nic}, and distributed to all hosts in the internet via \gls{ftp}. As this file grew and more machines got connected to the internet, the costs for distributing the mappings were increasing up to an unacceptable effort. Additionally, the initial trend of the internet, the \gls{arpanet} connecting multiple hosts together into one network, got outdated. The new challenge of the internet was to connect multiple local networks (which itself contain many machines) into a global, interactive and \gls{tcp/ip} based grid. With the amount of machines quickly increasing and the costs for distributing the \texttt{HOSTS.TXT} file rising, a new system for a reliable and fast resolution of addresses to host names had to be developed.
In the early days of the internet, the mapping between host names and IP addresses was accomplished using a single file, \texttt{HOSTS.TXT}. This file was maintained on a central instance, the SRI-NIC (Stanford Research Institute - Network Information Center), and distributed to all hosts on the internet via FTP (File Transfer Protocol). As this file grew and more machines were connected to the internet, the cost of distributing the mappings rose to an unacceptable level. Additionally, the initial model of the internet, the Advanced Research Projects Agency Network (ARPANET) connecting multiple hosts together into one network, became outdated. The new challenge of the internet was to connect multiple local networks (which themselves contain many machines) into a global, interactive and TCP/IP based grid. With the number of machines quickly increasing and the cost of distributing the \texttt{HOSTS.TXT} file rising, a new system for a reliable and fast resolution of host names to addresses had to be developed.
\citeauthor{mockapetris1988development} proposed five conditions that had to be met by the base design of \gls{dns} \fsCite[p. 124]{mockapetris1988development}:
\citeauthor{mockapetris1988development} proposed five conditions that had to be met by the base design of DNS \fsCite[p. 124]{mockapetris1988development}:
\begin{itemize}
\item Provide at least the same information as HOSTS.TXT.
@@ -21,21 +21,21 @@ In the early days of the internet the mapping between host names and IP addresse
\item Provide tolerable performance.
\end{itemize}
For the \gls{dns} to be globally acceptable, it should furthermore not give too many restrictions on how the distributed local networks and the hosts are designed and operated. This includes i.e. not limiting the system to work for a single \gls{os} or software architecture, backing different network topologies or the support of encapsulation of other name spaces.
For the DNS to be globally acceptable, it should furthermore not impose too many restrictions on how the distributed local networks and the hosts are designed and operated. This includes, e.g., not limiting the system to a single operating system (OS) or software architecture, supporting different network topologies and supporting the encapsulation of other name spaces.
In general, the design should avoid as many constraints and support as many implementation structures as possible.
\subsubsection{Architecture}
\label{subsubsec:architecture}
The \gls{dns} primarily builds on two types of components: name servers and resolvers. A name server holds information that can be used to handle incoming requests e.g. to resolve a domain name into an IP address. Although resolving domain names into IP addresses might be the primary use case, name servers can possess arbitrary (within the limits of DNS records see \ref{tab:resource_record_types}) information and provide service to retrieve this information. A resolver interacts with client software and implements algorithms to find a name server that holds the information requested by the client (see also Section~\ref{subsec:resolution} for how the resolution is working). Depending on the functionality needed, these two components may be split to different machines and locations or running on one machine. Whereas in former days the bandwidth of a workstation may not have been sufficient to run a resolver on, today it is more interesting to benefit from cached information for performance reasons. In a company network it is common to have multiple resolvers e.g. one per organizational unit.
The DNS primarily builds on two types of components: name servers and resolvers. A name server holds information that can be used to handle incoming requests, e.g. to resolve a domain name into an IP address. Although resolving domain names into IP addresses might be the primary use case, name servers can hold arbitrary information (within the limits of DNS records, see Table~\ref{tab:resource_record_types}) and provide a service to retrieve this information. A resolver interacts with client software and implements algorithms to find a name server that holds the information requested by the client (see also Section~\ref{subsec:resolution} for how resolution works). Depending on the functionality needed, these two components may be split across different machines and locations or run on one machine. Whereas in former days the bandwidth of a workstation may not have been sufficient to run a resolver on, today it is more interesting to benefit from cached information for performance reasons. In a company network it is common to have multiple resolvers, e.g. one per organizational unit.
\subsubsection{Name space}
\label{subsubsec:name_space}
The \gls{dns} is based on a naming system that consists of a hierarchical and logical tree structure and is called the domain namespace. It contains a single root node (\textit{top level domain} or \textit{TLD})and an arbitrary amount of nodes in subordinate levels in variable depths (descending called second level, third level domain, and so forth). Each node is uniquely identifiable through a \gls{fqdn} and usually represents a domain, machine or service in the network. The FQDN can be constructed by fully iterating the DNS tree, see Figure~\ref{fig:dns_tree_web_de} for an example of how the DNS tree for www.web.de is looking like (note that the Root node is often abbreviated with a simple dot). Furthermore, every domain can be subdivided into more fine-grained domains. These can again be specific machines or domains, called subdomains. This subdividing is an important concept for the internet to continue to grow and each responsible instance of a domain (e.g. a company or cooperative) is responsible for the maintenance and subdivision of the domain.
The DNS is based on a naming system that consists of a hierarchical and logical tree structure and is called the domain namespace. It contains a single root node and an arbitrary number of nodes in subordinate levels of variable depth (in descending order called \textit{top level domain} or \textit{TLD}, second level domain, third level domain, and so forth). Each node is uniquely identifiable through a fully qualified domain name (FQDN) and usually represents a domain, machine or service in the network. The FQDN can be constructed by traversing the path in the DNS tree between the node and the root node. See Figure~\ref{fig:dns_tree_web_de} for an example of what the DNS tree for www.web.de looks like (note that the root node is often abbreviated with a simple dot). Furthermore, every domain can be subdivided into more fine-grained domains. These can again be specific machines or domains, called subdomains. This subdivision is an important concept that allows the internet to continue to grow, and each authority in charge of a domain (e.g. a company or cooperative) is responsible for the maintenance and subdivision of that domain.
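The construction of an FQDN from the tree can be sketched in a few lines of Python. This is only an illustration: the function name \texttt{build\_fqdn} and the representation of a path as a list of labels are assumptions made for this example and are not part of the DNS specification.

```python
def build_fqdn(path_from_root):
    """Join the labels on the path from the root down to a node into an FQDN.

    The labels are given top-down (TLD first). The root node carries an
    empty label, which is why the resulting FQDN ends with a single dot.
    """
    labels = list(reversed(path_from_root))  # most specific label first
    return ".".join(labels) + "."


# Path root -> de -> web -> www, as in the www.web.de example
fqdn = build_fqdn(["de", "web", "www"])
print(fqdn)
```

The trailing dot in the result stands for the root node mentioned above.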
\begin{figure}[!htbp]
\centering
@@ -45,10 +45,10 @@ The \gls{dns} is based on a naming system that consists of a hierarchical and lo
\end{figure}
\subsubsection{\gls{dns} Resource Records}
\subsubsection{DNS Resource Records}
\label{subsubsec:dns_resource_records}
See Table~\ref{tab:resource_record_types} for an list of built-in resource types in the DNS. Those built-in resource records do serve different purposes and are more or less frequently used.
See Table~\ref{tab:resource_record_types} for a list of built-in resource types in the DNS.
\begin{table}[!htbp]
@@ -60,9 +60,9 @@ See Table~\ref{tab:resource_record_types} for an list of built-in resource types
Value & Text Code & Type & Description \\ \midrule
1 & A & Address & \begin{tabular}[c]{@{}l@{}}Returns the 32 bit IPv4 address of a host. \\ Most commonly used for name resolution \\ of a host.\end{tabular} \\
28 & AAAA & IPv6 address & \begin{tabular}[c]{@{}l@{}}Similar to the A record, this returns the \\ address of a host. For IPv6 this has 128 bits.\end{tabular} \\
2 & NS & \begin{tabular}[c]{@{}l@{}}Name\\ Server\end{tabular} & \begin{tabular}[c]{@{}l@{}}Specifies the name of a \gls{dns} name server \\ that is authoritative for the zone. Each \\ zone must have at least one NS record \\ that points to its primary name server.\end{tabular} \\
5 & CNAME & \begin{tabular}[c]{@{}l@{}}Canonical\\ Name\end{tabular} & \begin{tabular}[c]{@{}l@{}}The CNAME records allows to define \\ aliases that point to the real canonical \\ name of the node. This can e.g. be used\\ to hide internal \gls{dns} structures and \\ provide a stable interface for outside users.\end{tabular} \\
6 & SOA & \begin{tabular}[c]{@{}l@{}}Start of\\ Authority\end{tabular} & \begin{tabular}[c]{@{}l@{}}The SOA record marks the start of a \gls{dns} \\ zone and provides important information \\ about the zone. Every zone must have \\ exactly one SOA records containing \\ e.g. name of the zone, primary \\ authoritative server name and the \\ administration email address.\end{tabular} \\
2 & NS & \begin{tabular}[c]{@{}l@{}}Name\\ Server\end{tabular} & \begin{tabular}[c]{@{}l@{}}Specifies the name of a DNS name server \\ that is authoritative for the zone. Each \\ zone must have at least one NS record \\ that points to its primary name server.\end{tabular} \\
5 & CNAME & \begin{tabular}[c]{@{}l@{}}Canonical\\ Name\end{tabular} & \begin{tabular}[c]{@{}l@{}}The CNAME record allows defining \\ aliases that point to the real canonical \\ name of the node. This can e.g. be used\\ to hide internal DNS structures and \\ provide a stable interface for outside users.\end{tabular} \\
6 & SOA & \begin{tabular}[c]{@{}l@{}}Start of\\ Authority\end{tabular} & \begin{tabular}[c]{@{}l@{}}The SOA record marks the start of a DNS \\ zone and provides important information \\ about the zone. Every zone must have \\ exactly one SOA record containing \\ e.g. name of the zone, primary \\ authoritative server name and the \\ administration email address.\end{tabular} \\
12 & PTR & Pointer & \begin{tabular}[c]{@{}l@{}}Provides a pointer to a different record\\ in the name space.\end{tabular} \\
15 & MX & Mail Exchange & \begin{tabular}[c]{@{}l@{}}Returns the host that is responsible for\\ handling emails sent to this domain.\end{tabular} \\
16 & TXT & Text String & \begin{tabular}[c]{@{}l@{}}Record which allows arbitrary \\ additional texts to be stored that are\\ related to the domain.\end{tabular} \\ \bottomrule
@@ -70,10 +70,11 @@ Value & Text Code & Type
\end{table}
\newpage
\subsubsection{Payload}
\label{subsubsec:payload}
In this section we will introduce the actual payload a \gls{dns} request as well as the response are built on. The format of each message that is shared between a resolver and \gls{dns} server has been initially defined in RFC 1035 \fsCite{rfc1035} and consecutively extended with new opcodes, response codes etc. This general format applies to both requests as well as responses and consists of five sections:
In this section we introduce the payload that a DNS request and its response are built on. The format of each message exchanged between a resolver and a DNS server was initially defined in RFC 1035 \fsCite{rfc1035} and subsequently extended with new opcodes (opcodes refer to different actions in DNS, e.g. to query or update a record, see Table~\ref{tab:message_header_opcodes}), response codes etc. This general format applies to both requests and responses and consists of five sections:
\begin{enumerate}
\item Message Header
@@ -103,14 +104,14 @@ QR & \multicolumn{4}{c}{OPCODE} & AA & TC & RD & RA & Z & AD & CD & \multicolumn
\end{tabular}
\end{table}
Table~\ref{tab:message_header} shows the template of a \gls{dns} message header. In the following listing, an explanation for the respective variables and flags is given:
Table~\ref{tab:message_header} shows the template of a DNS message header. In the following listing, an explanation for the respective variables and flags is given:
\begin{itemize}
\item \textbf{Message ID:} 16 bit identifier supplied by the requester (any kind of software that generates a request) and sent back unchanged by the responder to identify the transaction. It enables the requester to match up replies to outstanding requests.
\item \textbf{QR:} Query/Response Flag one bit field whether this message is a query(0) or a response(1)
\item \textbf{QR:} Query/Response Flag, a one bit field that indicates whether this message is a query (0) or a response (1).
\item \textbf{OPCODE:} Four bit field that specifies the kind of query for this message. This is set by the requester and copied into the response. Possible values for the opcode field can be found in Table~\ref{tab:message_header_opcodes}
\item \textbf{OPCODE:} Four bit field that specifies the kind of query for this message. This is set by the requester and copied into the response. Possible values for the opcode field can be found in Table~\ref{tab:message_header_opcodes}.
\begin{table}[!htbp]
\centering
\caption{DNS: Message Header Opcodes}
@@ -130,7 +131,7 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.
\item \textbf{AA:} Authoritative Answer this flag is set to 1 by the responding server if it is an authority for the domain name in the question section. If set to 0 this usually means that a cached record is returned.
\item \textbf{TC:} The Truncated bit is set to 1 if the response is larger then the permitted transmission channel length and the message has been truncated therefore. This usually indicates that \gls{dns} over \gls{udp} is used and the response payload size increases the maximum 512 bytes. The client may either requery over \gls{tcp} (with no size limits) or not bother at all if the truncated data was part of the Additional section. Set on all truncated messages except for the last one.
\item \textbf{TC:} The Truncated bit is set to 1 if the response is larger than the permitted transmission channel length and the message has therefore been truncated. This usually indicates that DNS over UDP (User Datagram Protocol) is used and the response payload size exceeds the maximum of 512 bytes. The client may either requery over TCP (Transmission Control Protocol), which is not subject to this limit, or not bother at all if the truncated data was part of the Additional section. The bit is set on all truncated messages except for the last one.
\item \textbf{RD:} Recursion Desired this bit may be set in a query and is copied into the response if the name server supports recursion. If recursion is refused by this name server, e.g. it has been configured as authoritative only, the response does not have this bit set. Recursive query support is optional.
@@ -142,7 +143,7 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.
\item \textbf{CD:} Checking Disabled, also used by \gls{dnssec}. It may be set in a request to show that non-verified data is acceptable to the requester. If \gls{dnssec} is not available in the resolver, this is always set to 0.
\item \textbf{RCODE:} Response Code only available in response messages, these four bits are used to reveal errors while processing the query. Available error codes are listed in Table~\ref{tab:message_header_response_codes}. Error codes 0 to 5 have been initially available whereas error codes 6 to 10 are used for dynamic \gls{dns} defined in RFC 2136 \fsCite{rfc2136}.
\item \textbf{RCODE:} Response Code, only available in response messages. These four bits are used to report errors that occurred while processing the query. Available error codes are listed in Table~\ref{tab:message_header_response_codes}. Error codes 0 to 5 were available initially, whereas error codes 6 to 10 are used for dynamic DNS defined in RFC 2136 \fsCite{rfc2136}.
\begin{table}[!htbp]
\centering
@@ -173,7 +174,7 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.
\begin{table}[!htbp]
\centering
\caption{DNS: Question Section}
\label{tab_question_section}
\label{tab:question_section}
\begin{tabular}{@{}ccccccccc@{}}
\toprule
0 & 4 & 8 & 12 & 16 & 20 & 24 & 28 & 32 \\ \midrule
@@ -183,29 +184,30 @@ Table~\ref{tab:message_header} shows the template of a \gls{dns} message header.
\end{table}
See Table~\ref{tab:question_section} for the query layout.
\begin{itemize}
\item \textbf{Question Name:} Contains a variably sized payload including the domain, zone name or general object that is subject of the query. Encoded using standard \gls{dns} name notation. Depending on the Question Type, for example requesting an A Record will require an host part, such as www.domain.tld. A MX query will usually only contain a base domain name (domain.tld).
\item \textbf{Question Name:} Contains a variably sized payload including the domain, zone name or general object that is the subject of the query. It is encoded using standard DNS name notation. Its content depends on the Question Type: for example, requesting an A record will require a host part, such as www.domain.tld, whereas an MX query will usually only contain a base domain name (domain.tld).
\item \textbf{Question Type:} Specifies the type of question being asked. This field may contain a code number corresponding to a particular type of resource being requested, see Table~\ref{tab:resource_record_types} for common resource types.
\item \textbf{Question Class:} The class of the resource records that are being requested (unsigned 16 bit value). Usually Internet, question classes are assigned by the IANA where all can be found (\fsCite{IANADNSClassesOnline})
\item \textbf{Question Class:} The class of the resource records that are being requested (unsigned 16 bit value). This is usually Internet. Question classes are assigned by the IANA, which publishes the complete list (\fsCite{IANADNSClassesOnline}).
\end{itemize}
There are more parameters available that can be specified when requesting a resource but do not have a higher relevance here.
There are more parameters available that can be specified when requesting a resource, but they are not relevant to this work.
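To make the header and question layout more tangible, the following Python sketch assembles a minimal query and decodes the 16 bit flag field described above. It is an illustration under stated assumptions rather than a complete implementation: the helper names \texttt{encode\_name}, \texttt{build\_query} and \texttt{decode\_flags} are hypothetical, only a single question is supported, and name compression as well as the answer, authority and additional sections are left out.

```python
import struct


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending with a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"  # the empty root label terminates the name


def build_query(message_id, name, qtype=1, qclass=1, recursion_desired=True):
    """Build a query with one question (default: type A, class Internet)."""
    flags = 0x0100 if recursion_desired else 0x0000  # only the RD bit is set
    # Header: ID, flags, question count, answer/authority/additional counts
    header = struct.pack("!HHHHHH", message_id, flags, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, qclass)
    return header + question


def decode_flags(flags):
    """Split the 16 bit flag field into the header fields listed above."""
    return {
        "QR": (flags >> 15) & 0x1,
        "OPCODE": (flags >> 11) & 0xF,
        "AA": (flags >> 10) & 0x1,
        "TC": (flags >> 9) & 0x1,
        "RD": (flags >> 8) & 0x1,
        "RA": (flags >> 7) & 0x1,
        "Z": (flags >> 6) & 0x1,
        "AD": (flags >> 5) & 0x1,
        "CD": (flags >> 4) & 0x1,
        "RCODE": flags & 0xF,
    }


query = build_query(0x1234, "www.domain.tld")
message_id, flags = struct.unpack("!HH", query[:4])
print(hex(message_id), decode_flags(flags))
```

Network byte order (\texttt{!} in the format strings) is used because all multi-byte fields of a DNS message are transmitted big-endian.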
\subsection{Domain Names}
\label{subsec:domain_names}
The structure of domain names is generally managed by the corresponding registrar, e.g. the DENIC e.G. (\fsCite{DENICOnline}) for .de domains. This includes for example which characters are allowed in second-level domains and the overall registration process. In the .de space, the second-level domain must contain between one and 63 characters, all characters of the latin alphabet can be used in addition to numbers, hyphen and all 93 characters of the internationalized domain name. The first, third, fourth and last characters is additionally not allowed to be a hyphen. Many different registrars use similar rules like this example which makes it hard to easily distinguish valid from non-valid domain names.
The structure of domain names is generally managed by the corresponding registry, e.g. the DENIC e.G. (\fsCite{DENICOnline}) for .de domains. For example, this includes which characters are allowed in second-level domains and the overall registration process for domains. In the .de space, the second-level domain must contain between one and 63 characters. All characters of the Latin alphabet can be used in addition to digits, the hyphen and all characters of the internationalized domain name specification (\fsCite{IDNOnline}). Additionally, the first, third, fourth and last characters must not be a hyphen. Many registries use rules similar to this example, which makes it hard to generally distinguish valid from non-valid domain names.
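The rules for the .de space as summarized above can be expressed as a small check. The following Python sketch is a simplification and not the registry's own validation logic: the function name \texttt{is\_valid\_de\_label} is hypothetical, the hyphen positions are implemented exactly as stated in the paragraph above, and the additional internationalized characters are deliberately left out, so only ASCII labels are accepted.

```python
import string

# Assumption: ASCII subset only; internationalized characters are not handled.
ALLOWED_CHARACTERS = set(string.ascii_lowercase + string.digits + "-")


def is_valid_de_label(label):
    """Check a second-level label against the rules summarized in the text."""
    label = label.lower()
    if not 1 <= len(label) <= 63:
        return False
    if any(character not in ALLOWED_CHARACTERS for character in label):
        return False
    # Zero-based indices of the first, third, fourth and last character
    hyphen_free_positions = {0, 2, 3, len(label) - 1}
    return all(
        label[index] != "-"
        for index in hyphen_free_positions
        if index < len(label)
    )


for candidate in ["web", "-web", "web-", "ab-cd", "abcd-ef"]:
    print(candidate, is_valid_de_label(candidate))
```

Because registries differ in the details, such a check would have to be adapted per top-level domain.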
\subsection{Resolution}
\label{subsec:resolution}
Figure~\ref{fig:address_resolution} quickly describes the process of how domain names are resolved from the perspective of a requesting machine. Each step here assumes that the request has not been performed before and such is not available in any cache. In the first step, the \textit{Operating System} is contacting the local resolver, e.g. a router in a private network or a dedicated resolve server in a larger company. As the \textit{DNS Resolver} does know nothing about the domain, it contacts the \textit{Root NS} to return the address of the responsible top-level domain server (\textit{TLD NS} for .com in this example). The resolver then asks the \textit{TLD NS} server to return back the address of the second-level domain server that is in charge of the requested zone (e.g. google.com). Finally the resolver queries the \textit{Google NS} server for the IP address of the \textit{Google Webserver} and sends it back to the \textit{Operating System} which can then establish a connection to the \textit{Google Webserver}.
Figure~\ref{fig:address_resolution} briefly describes how domain names are resolved from the perspective of a requesting machine. Each step assumes that the request has not been performed before and thus is not available in any cache. In the first step, the \textit{Operating System} contacts the local resolver, e.g. a router in a private network or a dedicated resolver in a larger company, to resolve the domain name (www.google.com in this example). As the \textit{DNS Resolver} knows nothing about the domain, it asks the \textit{Root NS} for the address of the responsible top-level domain server (\textit{TLD NS} for .com in this case). The resolver then asks the \textit{TLD NS} server for the address of the second-level domain server that is in charge of the requested zone, e.g. google.com. Finally, the resolver queries the \textit{Google NS} server for the IP address of the \textit{Google Webserver} (or www.google.com) and sends it back to the \textit{Operating System}, which can then establish a connection to the Google web page.
There are mainly two different types of DNS requests that are performed here. The \textit{Operating System} is sending a recursive request to the \textit{DNS Resolver} which itself is successively sending iterative requests to the higher level DNS servers. Usually most public servers do not allow recursive queries due to security risks (denial of service attacks).
There are mainly two different types of DNS requests, both of which are performed here. The \textit{Operating System} sends a recursive request to the \textit{DNS Resolver}, which in turn successively sends iterative requests to the higher level DNS servers. Most public servers do not allow recursive queries due to security risks (denial-of-service attacks).
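The interplay of the recursive request and the iterative requests can be sketched with a small simulation. The following Python code does not send any network traffic. The delegation table \texttt{NAME\_SERVERS}, the server names and the address 192.0.2.10 (taken from a documentation address range) are assumptions made purely for illustration.

```python
# Hypothetical delegation data; names and the address are illustrative only.
NAME_SERVERS = {
    "Root NS": {"referrals": {"com.": "TLD NS"}},
    "TLD NS": {"referrals": {"google.com.": "Google NS"}},
    "Google NS": {"answers": {"www.google.com.": "192.0.2.10"}},
}


def query(server, name):
    """Iterative query: the server answers directly or refers to another server."""
    data = NAME_SERVERS[server]
    if name in data.get("answers", {}):
        return "answer", data["answers"][name]
    for zone, next_server in data.get("referrals", {}).items():
        if name == zone or name.endswith("." + zone):
            return "referral", next_server
    raise LookupError(f"{server} cannot help with {name}")


def resolve(name, cache):
    """Recursive service of the resolver, built on top of iterative queries."""
    if name in cache:
        return cache[name], []  # answered from the cache, no server contacted
    server, contacted = "Root NS", []
    while True:
        contacted.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            cache[name] = value
            return value, contacted
        server = value  # follow the referral to the next name server


cache = {}
print(resolve("www.google.com.", cache))
print(resolve("www.google.com.", cache))
```

The second call shows why caching matters: the resolver answers without contacting any name server.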
\begin{figure}[!htbp]
@@ -218,4 +220,4 @@ There are mainly two different types of DNS requests that are performed here. Th
\subsection{Passive DNS}
\label{subsec:passive_dns}
A Passive DNS database is a database that contains a history of all resolved DNS queries in a network. The traffic can be observed at any appropriate location in a network, e.g. on a resolver. The main advantage of passively collecting DNS traffic is that there are no operational changes needed to collect logs of resolutions, one simple way is to mirror the DNS port on the resolver and persist the traffic into files). A Passive DNS database can be used in a variety of actions to harden a network from different threats. Projects like the Security Information Exchange (SIE) collect passive DNS data from multiple sources and analyse the databases to find e.g. inconsistencies in the resolutions (\fsCite{SIEOnline}). Passive DNS databases can also be used by researchers or service providers to find performance issues, identify anomalies or generate usage statistics \fsCite{Deri:2012:TPD:2245276.2245396}.
A Passive DNS database is a database that contains a history of all resolved DNS queries in a network. The traffic can be observed at any appropriate location in a network, e.g. on a resolver. The main advantage of passively collecting DNS traffic is that there are no operational changes needed to collect logs of resolutions. One simple way is to mirror the DNS port on the resolver and persist the traffic into files. A Passive DNS database can be used in a variety of actions to harden a network from different threats. Projects like the Security Information Exchange (SIE) collect passive DNS data from multiple sources and analyse the databases to find e.g. inconsistencies in the resolutions (\fsCite{SIEOnline}). Passive DNS databases can also be used by researchers or service providers to find performance issues, identify resolution anomalies or generate usage statistics \fsCite{Deri:2012:TPD:2245276.2245396}.
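A minimal in-memory sketch of such a database is given below. The record layout (first seen, last seen and a counter per observed name, type and answer) is an assumption chosen for this example and does not describe the schema of SIE or any other project. The class name \texttt{PassiveDnsStore} and the sample observations are hypothetical as well.

```python
class PassiveDnsStore:
    """Keeps a history of observed DNS resolutions."""

    def __init__(self):
        self._records = {}

    def observe(self, timestamp, qname, rtype, rdata):
        """Record one observed answer, e.g. parsed from mirrored DNS traffic."""
        key = (qname.lower(), rtype, rdata)
        entry = self._records.get(key)
        if entry is None:
            self._records[key] = {
                "first_seen": timestamp,
                "last_seen": timestamp,
                "count": 1,
            }
        else:
            entry["first_seen"] = min(entry["first_seen"], timestamp)
            entry["last_seen"] = max(entry["last_seen"], timestamp)
            entry["count"] += 1

    def answers_for(self, qname, rtype="A"):
        """Return all distinct answers that were ever observed for a name."""
        return sorted(
            rdata
            for (name, record_type, rdata) in self._records
            if name == qname.lower() and record_type == rtype
        )

    def history(self, qname, rtype, rdata):
        """Return first seen, last seen and count for one observed answer."""
        return self._records.get((qname.lower(), rtype, rdata))


store = PassiveDnsStore()
store.observe(100, "www.example.org", "A", "192.0.2.1")
store.observe(250, "www.example.org", "A", "192.0.2.1")
store.observe(300, "www.example.org", "A", "198.51.100.7")
print(store.answers_for("www.example.org"))
```

A name that suddenly resolves to many different addresses is one example of an inconsistency that can be found with such a history.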

@@ -1,8 +1,6 @@
\section{Detecting Malicious Domain Names}
\label{sec:detecting_malicious_domain_names}
\todo{literature exposure section 6.1}
\subsection{Domain Name Characteristics}
\label{subsec:domain_name_characteristics}

@@ -6,10 +6,6 @@
\section{Machine Learning}
\label{sec:machine_learning}
Machine learning is broad field in computer science that aims to give computers the ability to learn without being explicitly programmed for a special purpose. There are many different approaches available that have advantages and disadvantages in different areas. Machine learning in this work is mostly limited to decision tree learning. Decision tree learning is an approach that is generally adopted from how humans are making decisions. Given a set of attributes, humans are able to decide, e.g. whether to buy one or another product. Machine learning algorithms use a technique called training to build a model which can later be used to make decisions. A decision tree consists of three components: a node represents the test of a certain attribute to split up the tree, leafs are terminal nodes and represent the prediction (the class or label) of the path from the root node to the leaf, and edges correspond to the results of a test and establish a connection to the next node or leaf. This training is performed in multiple steps: Given an arbitrarily large dataset (training set) with an fixed size of features (attributes) and each sample in the training set is assigned a label. The amount of labels is arbitrary (but limited), in a binary classification there are two different labels (e.g. malicious or benign in cases for domains). In the first step of the training, the whole training set is iterated and each time, a set of samples can be separated using one single attribute (in perspective to the assigned label) it is branched out and a new leaf is created. Each branch is then split into more fine grained subtrees as long as there is an \textit{information gain}, which means that all samples of the subset belong to the same class, i.e. are assigned the same label. The model can later be queried with an unlabeled data sample and the model returns the probability with which the data sample can be assigned to a class/label.
Machine learning is a broad field in computer science that aims to give computers the ability to learn without being explicitly programmed for a special purpose. There are many different approaches available that have advantages and disadvantages in different areas like object recognition in images, self-driving cars or forecasting. Machine learning in this work is mostly limited to decision tree learning. Decision tree learning is an approach that is generally adopted from how humans make decisions. Given a set of attributes, humans are able to decide, e.g. whether to buy one or another product. Machine learning algorithms use a technique called training to build a model which can later be used to make decisions and e.g. classify a dataset. A decision tree consists of three components: a node represents the test of a certain attribute to split up the tree, leaves are terminal nodes and represent a prediction (class/label) using all attributes on the path from the root node to the leaf, and edges correspond to the results of a test and establish a connection to the next node or leaf. The training is performed in multiple steps. The input for the training is an arbitrarily large dataset (training set) with a fixed number of features (attributes), and for each sample in the training set the corresponding label has to be known. The number of labels or classes is arbitrary (but limited); in a binary classification there are two different labels (e.g. malicious or benign in the case of this work). In the first step of the training, the whole training set is iterated and, each time a set of samples can be separated using a single attribute (with respect to the assigned label), the tree is branched out and a new leaf is created. Each branch is then split into more fine-grained subtrees as long as there is an \textit{information gain}, which means that not all samples of the subset belong to the same class, i.e. are assigned the same label.
The model can later be queried with an unlabeled data sample and the model returns the probability with which the data sample can be assigned to a class.
This way, having a labeled training set with limited size and by learning the characteristics of the labeled test sample, unlabeled data can be classified.
%\input{content/Technical_Background/Detecting_Malicious_Domain_Names/Detecting_Malicious_Domain_Names}
\input{content/Technical_Background/Benchmarks/Benchmarks}
This way, given a labeled training set of limited size and by learning the characteristics of the labeled training samples, unlabeled data can be classified. One of the most popular decision tree algorithms is \textit{C4.5} \fsCite{Salzberg1994}. Implementations like \textit{J48} are based on \textit{C4.5}, while others use the related \textit{CART} algorithm (Classification and Regression Trees \fsCite{SciKitOnline}).
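The notion of information gain used during training can be made concrete with a short calculation. The following Python sketch computes the entropy-based information gain of a single attribute. It is not the implementation of \textit{C4.5} or \textit{CART}, which use further criteria, and the attribute names \texttt{has\_digits} and \texttt{long\_name} as well as the four samples are invented for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(samples, labels, attribute):
    """Entropy reduction achieved by splitting the samples on one attribute."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[attribute], []).append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Hypothetical training set: two binary attributes per domain name
samples = [
    {"has_digits": True, "long_name": True},
    {"has_digits": True, "long_name": False},
    {"has_digits": False, "long_name": True},
    {"has_digits": False, "long_name": False},
]
labels = ["malicious", "malicious", "benign", "benign"]

for attribute in ("has_digits", "long_name"):
    print(attribute, information_gain(samples, labels, attribute))
```

In this toy data set the first attribute separates the two classes perfectly, whereas the second one carries no information about the label, so a tree learner would split on the first attribute.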