4. DESIGN OF THE FRAMEWORK
Having identified the requirements for conducting malware forensics, the aims of a framework to address these requirements must next be determined.
4.1 Aims
Malware investigations can draw on a variety of software tools, some of which claim to be suited to malware analysis. The proposed framework, referred to as the Malware Analysis Tool Evaluation Framework (MATEF), should provide a mechanism to evaluate these tools by quantifying their ability to detect artefacts produced by real-world malware samples (see Aim 1, Table 3).
Malin et al. (2008) argued that malware analysis could be divided into three broad techniques: temporal, relational, and functional analysis. Temporal analysis is concerned with the timeline of events surrounding reported activity, while relational analysis refers to the interactions between components of the malware and its environment. Finally, functional analysis relates to the actions the malware is reported to have performed.
The MATEF provides a mechanism to evaluate dynamic analysis software tools. It provides a means to measure the extent to which tools detect the artefacts produced by malware behaviour (see Aim 2, Table 3). On a Windows computer, this behaviour typically manifests itself in the form of file-, registry-, process-, and network-based artefacts.
Unlike regular software, which is largely predictable, malware can behave unpredictably in that some behaviour (and hence some artefacts) may not be observed. This can happen when the required (and unknown) trigger conditions for a given binary are not met (Nataraj, Karthikeyan, Jacob, & Manjunath, 2011). Thus, the behaviour of malware can be non-deterministic and vary between runs, particularly if it is of a type that communicates with a Command and Control (C&C) server (Akinrolabu, Agrafiotis, & Erola, 2018).
Furthermore, malware can include ‘measures to impede automatic and manual analyses’ (D’Elia, Coppa, Palmaro, & Cavallaro, 2020). Strategies include code obfuscation (Singh & Singh, 2018), detection of debuggers or virtual machines (Chen, Huygens, Desmet, & Joosen, 2016), and deployment of ‘split-personality’ malware techniques that change the behaviour of code when it is subjected to analysis (Murali, Ravi, & Agarwal, 2020). Such techniques are designed to give misleading results under analysis. Hence, mitigation against such risks should be considered when drawing conclusions from the testing of tools used to study malware (see Aim 3, Table 3).
Having identified the aims of the framework, consideration was then given as to how to achieve these aims. Hence, the following section seeks to identify the main components of the framework.
Table 3. Aims of the framework.
4.2 Identifying and selecting the main components of the framework
The MATEF includes a number of components to satisfy the aims identified in Table 3. Each of these components is briefly explored in the following sections, starting with the malware binaries themselves.
Table 2. Proposed requirements.
Malware sample source To maximise the validity of the evaluation process, real-world malware (i.e. malware ‘in the wild’) is used instead of fabricated malware (see Aim 1, Table 3). Stored malware is held in password-protected zip files to minimise the risk of contamination during handling, and all samples are analysed offline (see Requirement 1, Table 2). Malware can be obtained from any source and imported into the malware library.
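As an illustration, the import step described above might be scripted as follows. This is a minimal sketch rather than the MATEF implementation: the password `infected` (a common convention for shared malware archives) and the hash-based destination names are assumptions, and Python's `zipfile` module only decrypts legacy ZipCrypto archives.

```python
import hashlib
import zipfile
from pathlib import Path

def import_sample(archive_path: str, library_dir: str,
                  password: bytes = b"infected") -> list:
    """Extract each binary from a password-protected zip into the
    malware library, returning the SHA-256 of each extracted sample.

    Samples are written under their hash so every binary has a
    consistent, automation-friendly name.
    """
    library = Path(library_dir)
    library.mkdir(parents=True, exist_ok=True)
    digests = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # pwd is only consulted for encrypted (ZipCrypto) entries
            data = archive.read(name, pwd=password)
            digest = hashlib.sha256(data).hexdigest()
            (library / digest).write_bytes(data)
            digests.append(digest)
    return digests
```

In practice the extraction should of course happen inside the isolated analysis environment, never on the investigator's workstation.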
Malware library The malware library is a store of malware executables, each accessible through a consistent file naming convention, thus facilitating automation and use of VMs (see Requirement 10, Table 2). Access to this library is restricted to authorised users of the framework only (see Requirement 1, Table 2).
In addition to the malware binary itself, information on its expected behaviour should also be stored locally (see Requirement 14, Table 2). To make it readily available, this information is stored in a malware database.
Malware database The malware database stores properties of each malware binary held in the malware library. As a minimum, the details stored include the hash value of the binary and the number of artefacts generated as a result of creating, modifying, or deleting files or registry keys. Also stored are the number of ports opened and processes spawned when the malware is executed via the automation scripts.
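A minimal sketch of such a database, assuming SQLite as the store and the column names below (which are illustrative, not taken from the MATEF):

```python
import sqlite3

# One row per binary: its hash plus the artefact counts described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS malware (
    sha256            TEXT PRIMARY KEY,  -- hash of the binary in the library
    files_created     INTEGER,
    files_modified    INTEGER,
    files_deleted     INTEGER,
    keys_created      INTEGER,
    keys_modified     INTEGER,
    keys_deleted      INTEGER,
    ports_opened      INTEGER,
    processes_spawned INTEGER
)
"""

def open_database(path: str = "matef.db") -> sqlite3.Connection:
    """Open (or create) the malware database holding per-binary artefact counts."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

Keying the table on the binary's hash ties each database record directly to the correspondingly named file in the malware library.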
Manager scripts The manager scripts perform two fundamental roles. The first is the management of the database and the tool testing process, through tasks such as initiating a bank of virtual machines (see Requirement 10, Table 2). The second is the movement of software tools and malware into the VMs and the extraction of log files created within these environments.
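The first role might be sketched as follows. VirtualBox is an assumption here (the framework does not prescribe a hypervisor); the `run` hook is injectable so the orchestration logic can be exercised without VirtualBox installed.

```python
import subprocess
from typing import Callable, Sequence

def start_vm_bank(vm_names: Sequence,
                  run: Callable = subprocess.run) -> list:
    """Start a bank of VMs headlessly via VBoxManage (assumed on PATH).

    Returns the commands issued, so a test harness can verify the
    orchestration without a real hypervisor.
    """
    commands = []
    for name in vm_names:
        cmd = ["VBoxManage", "startvm", name, "--type", "headless"]
        run(cmd, check=True)  # raise if the VM fails to start
        commands.append(cmd)
    return commands
```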
The Oracle In the absence of any theoretical or easily determined ‘ground truth’, the MATEF obtains the expected quantity of artefacts from an independent source (see Requirement 14, Table 2). Given the variable nature of the artefacts generated by malware, this reported expected value is little more than an approximation of the ground truth. The source, referred to as the ‘Oracle’, could conceivably be any one of a number of online environments, such as those provided by F-Secure (2011) and JoeSandbox (2020).
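One plausible way to derive such an approximation, assuming the Oracle's reports are reduced to an artefact count per run, is to average repeated observations and report their spread alongside the mean:

```python
from statistics import mean, stdev

def expected_artefact_count(observed_counts: list) -> tuple:
    """Approximate the 'ground truth' artefact count for one binary.

    Returns (mean over repeated runs, sample standard deviation); the
    spread indicates how stable the approximation is across runs.
    """
    spread = stdev(observed_counts) if len(observed_counts) > 1 else 0.0
    return mean(observed_counts), spread
```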
Unlike online sandbox solutions for analysing malware, the use of offline tools gives the investigator greater control over the test environment. Control measures include the configuration of virtual machines and the ability to run tests repeatedly over extended periods of time to identify predictable artefacts. The investigator can also control the distribution of potentially personally identifiable information that may be hard-coded into a custom-built malware binary. This mitigates the risk of malware authors becoming alerted to an ongoing investigation when such binaries are published to a public online platform (Malin et al., 2008).
Test environment The test environment is managed via automated scripts and enables multiple tests to be run in parallel, thus reducing the time required for large-scale tool testing. In addition, this improves the statistical power (and hence the statistical significance) of the results (Smith, 2012), helping to address the anticipated variability of the malware under analysis.
Internet simulation The provision of network services (see Requirement 11, Table 2) gives the MATEF an added level of realism for malware running within the test environment. Lee et al. (2019) report that, as of 2017, over 90% […]. It is important that this network provision is simulated, to minimise any risk of the malware stealing data or committing unauthorised access against other networks (see Requirement 1, Table 2). Requests and responses should be passed to and from common network services that are exposed to the test environment through this component.
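A minimal stand-in for one such simulated service, here a fake HTTP responder that answers every request locally without touching the internet, might look like the sketch below. (In practice, dedicated tools such as INetSim provide a full suite of simulated services.)

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeHTTP(BaseHTTPRequestHandler):
    """Answer every GET with a benign placeholder page, so malware in the
    test environment sees a 'live' web service that goes nowhere."""

    def do_GET(self):
        body = b"<html>OK</html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet; network activity is logged elsewhere

def serve(port: int = 0) -> HTTPServer:
    """Start the fake service on localhost; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), FakeHTTP)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```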
Logs of network activity, together with those generated by the tool under test, form a significant product of the test environment and feed into the analysis component.
Analysis component To undertake analysis of a software tool, the analysis component needs to establish three things. The first is what the tool is to be compared against. As argued above, this should be the expected quantity of artefacts observed (the ‘Expected value’), as opposed to their specific values.
Secondly, the analysis component needs the capability to extract the number of artefacts observed by the tool under test (the ‘Observed value’) from a log file whose filename can be determined programmatically. This allows multiple log files from different VMs and test runs to coexist (see Requirement 15, Table 2).
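One hypothetical naming scheme meeting this requirement encodes the tool name, sample hash, VM number, and run number in the filename, so each can be recovered later. The exact fields and the tool name `procmon` used below are illustrative assumptions, not the MATEF's actual convention.

```python
import re

# tool_<sha256>_vmNN_runNNN.log
LOG_PATTERN = re.compile(
    r"(?P<tool>[a-z0-9_]+)_(?P<sha256>[0-9a-f]{64})_vm(?P<vm>\d+)_run(?P<run>\d+)\.log"
)

def make_log_name(tool: str, sha256: str, vm: int, run: int) -> str:
    """Build a log filename that encodes tool, sample, VM, and run number."""
    return "%s_%s_vm%02d_run%03d.log" % (tool, sha256, vm, run)

def parse_log_name(name: str) -> dict:
    """Recover the test parameters from a log filename."""
    match = LOG_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError("not a recognised log name: %s" % name)
    return match.groupdict()
```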
Thirdly, the analysis component must establish an assessment of the difference between the Expected and Observed values (see Requirement 7, Table 2). This is a critical value and contributes to establishing the validity of the tool under test. Variation of this value under repeated testing also provides a measure of repeatability.
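Assuming artefact counts per run, this comparison might be sketched as follows; the two returned figures correspond to validity (mean difference near zero) and repeatability (small spread across repeats):

```python
from statistics import mean, stdev

def tool_accuracy(expected: float, observed_runs: list) -> tuple:
    """Compare a tool's Observed artefact counts against the Oracle's
    Expected value.

    Returns (mean difference, spread of the differences across repeats).
    """
    diffs = [obs - expected for obs in observed_runs]
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    return mean(diffs), spread
```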
Summary Figure 1 shows how the components described above combine to form the MATEF, together with the information flows between them. Note that boxes shown in grey are external components that currently sit outside the MATEF. At present, statistical analysis is performed using an independent statistical tool; it is envisaged that future development will incorporate a statistical component within the MATEF itself. The next section discusses and evaluates the extent to which the design of the framework has met these requirements.