Multimodal-Fusion-Strategy-to-Classify-Malware

Research Papers:

Cluster Computing: Deep learning fusion for effective malware detection: leveraging visual features
ArXiv: Deep learning fusion for effective malware detection: leveraging visual features

Resources used in this work (will be updated soon):

Big2015 Binary Dataset : kaggle Microsoft Challenge
MalHub Binary Dataset : malhub_binary_root

This work focuses on proposing a novel approach towards classifying malware binaries by extracting visual features from malware executables.

The dataset used in this work, Big2015, comes from the Kaggle Microsoft Malware Classification Challenge.

The Big2015 malware dataset consists of 9 families and 10,868 malware binary samples. It is highly unbalanced: a few families have more than 2,000 samples, a few have more than 1,000, and the rest have fewer than 500.

The first malware visual representation used in this work is the Grayscale image, generated from the decimal representation of the hex code extracted from the malware executables with a hex dump tool.

The image above shows an example snippet of the hexadecimal values extracted from the hex code of a malware sample, together with the corresponding decimal values.

The logic below extracts the hexadecimal values from the hex code (refer to hex_HDec.py).

    import re

    # `contents` holds the text of one hex dump
    hex_regex = r'\b[0-9A-F]{2}\b'   # two-digit upper-case hex values
    hex_codes = re.findall(hex_regex, contents)
    hex_str = "".join(hex_codes)

The code below converts the hexadecimal values to decimal (refer to HDec_Dec.py).

    table = {'0': 0, '1': 1, '2': 2, '3': 3,
      '4': 4, '5': 5, '6': 6, '7': 7,
      '8': 8, '9': 9, 'A': 10, 'B': 11,
      'C': 12, 'D': 13, 'E': 14, 'F': 15}
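    dec_list = []
    # `hex_list` is a list of hexadecimal strings, one per value
    for ele in hex_list:
      hexadecimal = ele.strip().upper()
      res = 0
      size = len(hexadecimal) - 1   # exponent of the most significant digit
      for num in hexadecimal:
        res = res + table[num]*16**size
        size = size - 1
      dec_list.append(res)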
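The two steps above can also be sketched in a compact, self-contained form. This is a minimal sketch rather than the repository's code: the function name `hex_dump_to_decimals` and the sample dump line are made up for illustration, and Python's built-in `int(token, 16)` stands in for the lookup table.

```python
import re

# Two-character upper-case hex tokens, the same pattern as the regex above.
HEX_BYTE = re.compile(r'\b[0-9A-F]{2}\b')

def hex_dump_to_decimals(contents):
    """Return the byte values (0-255) found in a hex dump string."""
    return [int(token, 16) for token in HEX_BYTE.findall(contents)]

# Hypothetical dump line: an 8-digit address followed by four bytes.
sample = "00401000 56 8D 44 FF"
print(hex_dump_to_decimals(sample))  # [86, 141, 68, 255]
```

The 8-digit address is skipped because the word-boundary anchors only match tokens that are exactly two characters long.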

Grayscale Image (GS)

The extracted hexadecimal values are converted to decimal, and the decimal values are then used to generate Grayscale (GS) images. The code for generating GS images is in GS_Img.py.

The generated grayscale images are then used to train an independent VGG-16 model. We chose VGG-16 because, compared with other deep convolutional neural network models such as ResNet-50 and InceptionNet, we found it to be the most lightweight.
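GS_Img.py is not reproduced here. The following is a minimal sketch of one way byte values could be laid out as a grayscale image. The function name `bytes_to_grayscale`, the fixed `width=256`, and the zero-padding of the last row are assumptions made for illustration, not details taken from the repository.

```python
import numpy as np

def bytes_to_grayscale(dec_list, width=256):
    """Arrange byte values (0-255) row by row into a 2-D uint8 array."""
    data = np.asarray(dec_list, dtype=np.uint8)
    height = -(-len(data) // width)        # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(data)] = data              # zero-pad the last row (assumption)
    return padded.reshape(height, width)

pixels = bytes_to_grayscale([86, 141, 68, 255, 0, 17], width=4)
print(pixels.shape)  # (2, 4)
```

Each array element is one pixel intensity, so the resulting array can be handed to an imaging library and saved as an 8-bit grayscale image.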

Entropy Graph (EG)

The Entropy Graph is generated from the same decimal values, that is, the values converted from the hexadecimal extracted from the hex code of each malware sample. Here, entropy measures how random the byte values within a segment of the binary are.

Below is the logic used for entropy extraction from decimal values:

    # creating an average entropy list of all the segments in the array
    import math
    import numpy as np

    # `arr` holds the decimal values of one sample;
    # `en` is an entropy helper defined elsewhere (not shown here)
    segment_size = 256
    averages = []
    for i in range(0, len(arr), segment_size):
      subset = arr[i:i+segment_size]
      entropy = 0
      for element in subset:
        prob = np.unique(element, return_counts = True)
        entropy += en(prob)
      average_entropy = entropy / segment_size
      average_entropy = float(average_entropy)
      averages.append(average_entropy)
      #average_str = str(average_entropy)

Run EntropyGraph.py to generate the Entropy Graph from the decimal values of the malware samples.
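For reference, per-segment entropy can be sketched with the standard Shannon definition. This is a minimal sketch under that assumption and is not necessarily identical to the averaging scheme in EntropyGraph.py. The function names `shannon_entropy` and `segment_entropies` are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy, in bits, of the values in one segment."""
    total = len(segment)
    if total == 0:
        return 0.0
    counts = Counter(segment)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def segment_entropies(values, segment_size=256):
    """Entropy of each consecutive, non-overlapping segment."""
    return [shannon_entropy(values[i:i + segment_size])
            for i in range(0, len(values), segment_size)]

# A constant segment carries no information (0 bits);
# 256 distinct byte values reach the maximum of 8 bits.
entropies = segment_entropies([0] * 256 + list(range(256)))
print(entropies)
```

Plotting the resulting list against the segment index gives one entropy curve per sample.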

Simhash Image (SH)

Unlike the Grayscale Image and the Entropy Graph, Simhash Images are not generated from the decimal values. Instead, we extract the operation codes (opcodes) from the assembly code of a malware sample and pass them through a hash function such as MD5 to generate Simhash images.

We start from the assembly code of a malware sample, from which we extract the opcodes (e.g., push, mov, call, test). These mnemonics are then used to generate a Simhash signature for each malware sample using the MD5 hash function. Refer to asm_op.py for the mnemonic extraction logic.

op_sim.py generates the Simhash signatures from the mnemonics of the malware samples. The logic for generating a Simhash signature is shown below.
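asm_op.py is not reproduced here. As a rough illustration of mnemonic extraction, the sketch below keeps the first known mnemonic on each line of a disassembly listing. The listing layout in `sample`, the function name `extract_opcodes`, and the short `KNOWN_OPCODES` set are all assumptions made for this example. A real extractor would need a much fuller mnemonic list.

```python
# Small illustrative mnemonic set (assumption, far from complete).
KNOWN_OPCODES = {"push", "mov", "call", "test", "pop",
                 "jmp", "add", "sub", "xor", "retn"}

def extract_opcodes(asm_text):
    """Return the first known mnemonic of each line, in order of appearance."""
    opcodes = []
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]          # drop trailing comments
        for token in code.split():
            if token.lower() in KNOWN_OPCODES:
                opcodes.append(token.lower())
                break                         # one mnemonic per instruction
    return opcodes

# Hypothetical listing lines: section:address, raw bytes, then the instruction.
sample = """.text:00401000 56          push    esi
.text:00401001 8B F1       mov     esi, ecx
.text:00401003 E8 10 00    call    sub_401018 ; helper
.text:00401008             db 0CCh"""
print(extract_opcodes(sample))  # ['push', 'mov', 'call']
```

Matching whole tokens keeps operands such as `sub_401018` from being mistaken for the `sub` mnemonic.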

    # Calculate the hash value for each keyword and update the 'v' vector
    for keyword in keywords:
        b = hash_function(keyword)
        for i in range(n):
            if (b >> i) & 1 == 1:
                v[i] += 1
            else:
                v[i] -= 1

    # After all keywords are processed, derive the signature bits from 'v'
    for i in range(n):
        if v[i] > 0:
            s[i] = 1
        else:
            s[i] = 0
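The fragment above leaves `hash_function`, `n`, `v`, and `s` undefined. The following is a minimal, self-contained sketch of the same idea. The helper `md5_int` and the choice `n=128` (the MD5 digest size) are assumptions. The reshape to 16 × 32 used for the images implies 512 values per signature, and how the repository reaches that length is not shown here.

```python
import hashlib

def md5_int(keyword):
    """128-bit integer value of the MD5 digest of a string."""
    return int.from_bytes(hashlib.md5(keyword.encode("utf-8")).digest(), "big")

def simhash(keywords, n=128, hash_function=md5_int):
    """Return the n-bit Simhash signature as a list of 0/1 values."""
    v = [0] * n
    for keyword in keywords:
        b = hash_function(keyword)
        for i in range(n):
            # Vote +1 for a set bit and -1 for a cleared bit.
            v[i] += 1 if (b >> i) & 1 else -1
    # A bit is 1 when the set-bit votes are in the majority.
    return [1 if weight > 0 else 0 for weight in v]

sig_a = simhash(["push", "mov", "call", "test"])
sig_b = simhash(["push", "mov", "call", "pop"])
hamming = sum(x != y for x, y in zip(sig_a, sig_b))
print(len(sig_a), hamming)
```

Opcode sequences that share most of their mnemonics tend to produce signatures with a small Hamming distance, which is the property Simhash is designed for.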

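These Simhash signatures are then used to generate Simhash images (refer to SimImg.py).

    import numpy as np
    from PIL import Image as im   # assumed alias: `im` is Pillow's Image module

    # `content` holds one signature as whitespace-separated 0/1 values
    sim = content.split()
    sim_list = [int(ele) for ele in sim]

    # 512 values -> 16 x 32 pixels, scaled to 0 (black) or 255 (white)
    array_2d = np.array(sim_list).reshape(16, 32) * 255

    image = im.fromarray(array_2d.astype(np.uint8), mode='L')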

The generated Simhash images are non-square and cannot be processed directly by the proposed VGG-16 model, so we resize them with bilinear interpolation without losing their integrity. In bilinear interpolation, the original image of size (m × n) is resized to (a × b); in this work a and b are both set to 224 to match the VGG input size. (Refer to BilinearInterpolation.py)

import math
import numpy as np

def bl_resize(original_img, new_h, new_w):
	#get dimensions of original image
	old_h, old_w = original_img.shape
	#create an array of the desired shape.
	#We will fill-in the values later.
	resized = np.zeros((new_h, new_w))
	#Calculate horizontal and vertical scaling factor
	w_scale_factor = (old_w ) / (new_w ) if new_w != 0 else 0
	h_scale_factor = (old_h ) / (new_h ) if new_h != 0 else 0
	for i in range(new_h):
		for j in range(new_w):
			#map the coordinates back to the original image
			x = i * h_scale_factor
			y = j * w_scale_factor
			#calculate the coordinate values for 4 surrounding pixels.
			x_floor = math.floor(x)
			x_ceil = min( old_h - 1, math.ceil(x))
			y_floor = math.floor(y)
			y_ceil = min(old_w - 1, math.ceil(y))


			if (x_ceil == x_floor) and (y_ceil == y_floor):
				q = original_img[int(x), int(y)]
			elif (x_ceil == x_floor):
				q1 = original_img[int(x), int(y_floor)]
				q2 = original_img[int(x), int(y_ceil)]
				q = q1 * (y_ceil - y) + q2 * (y - y_floor)
			elif (y_ceil == y_floor):
				q1 = original_img[int(x_floor), int(y)]
				q2 = original_img[int(x_ceil), int(y)]
				q = (q1 * (x_ceil - x)) + (q2 * (x - x_floor))
			else:
				v1 = original_img[x_floor, y_floor]
				v2 = original_img[x_ceil, y_floor]
				v3 = original_img[x_floor, y_ceil]
				v4 = original_img[x_ceil, y_ceil]

				q1 = v1 * (x_ceil - x) + v2 * (x - x_floor)
				q2 = v3 * (x_ceil - x) + v4 * (x - x_floor)
				q = q1 * (y_ceil - y) + q2 * (y - y_floor)
				#print(q)
			resized[i,j] = q
	return resized.astype(np.uint8)
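The general case in the final `else` branch can be written out as a formula. The notation is introduced here for illustration only: x1 and x2 stand for `x_floor` and `x_ceil`, y1 and y2 for `y_floor` and `y_ceil`, and v1 to v4 are the four neighbouring pixels as named in the code.

```latex
\begin{aligned}
q_1 &= v_1\,(x_2 - x) + v_2\,(x - x_1),\\
q_2 &= v_3\,(x_2 - x) + v_4\,(x - x_1),\\
q   &= q_1\,(y_2 - y) + q_2\,(y - y_1),
\end{aligned}
\qquad \text{with } x_2 - x_1 = y_2 - y_1 = 1 .
```

For instance, with v1 = v3 = 0, v2 = v4 = 255 and x − x1 = 0.25, both q1 and q2 equal 63.75, so q = 63.75 for any y strictly between y1 and y2. The final cast to `uint8` truncates this to 63.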

Family	Grayscale Image	Entropy Graph	Simhash Image
Gatak
Kelihos_ver1
Kelihos_ver3

Proposed Methodology

Experiment 1:

Effectiveness of GS, EG, and SH VGG16 models in classifying malware binaries

The primary experiment in this work evaluates the performance of VGG16 models on the individual malware visual features. For this, we designed a new architecture that adds layers on top of VGG16 while freezing its pre-trained weights.

The image below depicts the architecture of the proposed model.

Each malware visual feature (Grayscale Image, Entropy Graph, and Simhash Image) is trained separately on its own instance of the proposed VGG16 architecture, and the performances are analysed.

The table below shows the performance of the independent Grayscale (GS), Entropy Graph (EG), and Simhash (SH) VGG-16 models.

Refer to VGG_16_Independent for the code of the independent VGG16 models trained on the three malware visual features (GS, EG, SH). The code below defines the model that was designed.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Conv1D
from tensorflow.keras import layers
from keras_tuner.tuners import RandomSearch
from tensorflow import keras
from tensorflow.keras.applications import VGG16


def build_model(hp):

    model_1 = Sequential()

    vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

    # Freeze the weights of all layers in the VGG16 model
    for layer in vgg.layers:
        layer.trainable = False

    # Add the VGG16 model to your own model
    model_1.add(vgg)

    # Add a 1D convolutional layer
    model_1.add(Conv1D(filters=32, kernel_size=3, activation='relu'))  # Example parameters, you can tune these

    # Remove the Flatten layer to maintain the spatial structure
    # model_1.add(Flatten())

    # Add the dense layer
    model_1.add(Dense(units=hp.Int('dense_units', min_value=32, max_value=512, step=32),
                    activation='relu'))

    model_1.add(Dropout(hp.Float('dropout', min_value=0.0, max_value=0.5, step=0.1)))

    # adding batch normalization layer
    model_1.add(keras.layers.BatchNormalization())

    model_1.add(Dense(units = hp.Int('extra_dense_units', min_value = 32, max_value = 512, step = 32), activation = 'relu'))

    # Add another Conv1D layer before the output layer
    model_1.add(Conv1D(filters=64, kernel_size=3, activation='relu'))

    # Flatten the output before the final dense layer
    model_1.add(Flatten())

    # Add the output layer
    model_1.add(Dense(units=9, activation='softmax'))

    # Compile the model
    model_1.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-3])),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model_1

Feature	Accuracy	Precision	Recall	F1-Score	Time
GS	0.998	1.0	0.996	0.997	0.01
EG	0.999	0.998	0.998	0.998	0.01
SH	0.996	0.966	0.963	0.963	0.01

The results show that, among GS, EG, and SH, the Entropy Graph is the best-performing malware visual feature, with the highest accuracy, recall, and F1-score.
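As a reminder of how the reported metrics relate to predictions, the sketch below computes accuracy together with precision, recall, and F1-score from true and predicted labels. Macro averaging over the families is an assumption, since the averaging scheme behind the table is not stated. The function name `macro_scores` and the toy labels are made up for illustration.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with three families; one sample of family 1 is misclassified.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

On an unbalanced dataset such as Big2015, macro averaging gives every family the same weight, so accuracy and macro F1-score can differ noticeably.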

Experiment 2:

In this work we go further and fuse the visual features by merging both the inputs and the models, to see whether the merged model outperforms the independent VGG16 models. The merging is done with the merging layers of the Keras library.

Refer to https://keras.io/api/layers/merging_layers/

There are several merge operators/layers, but in this work we focus only on:

Concatenate Layer
Add Layer
Average Layer
Maximum Layer

Each layer has its own use cases and advantages. Our results indicate that, for malware detection and classification with merged visual features, the Concatenate layer is the most effective.
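To make the difference between the four operators concrete, the sketch below reproduces their element-wise behaviour with NumPy on toy arrays. It mimics what the Keras Concatenate, Add, Average, and Maximum layers compute but does not use Keras. The arrays `gs` and `eg` are invented stand-ins for the outputs of two modality branches.

```python
import numpy as np

# Toy branch outputs: a batch of 2 samples with 3 features each.
gs = np.array([[1.0, 4.0, 2.0], [0.0, 5.0, 3.0]])
eg = np.array([[3.0, 2.0, 2.0], [6.0, 1.0, 3.0]])

merged = {
    "concatenate": np.concatenate([gs, eg], axis=-1),  # features side by side
    "add": gs + eg,                                    # element-wise sum
    "average": (gs + eg) / 2.0,                        # element-wise mean
    "maximum": np.maximum(gs, eg),                     # element-wise maximum
}
for name, out in merged.items():
    print(name, out.shape)
```

Concatenation is the only one of the four that keeps every feature of every branch, so its output is wider. The other three require identically shaped inputs and combine them into a single tensor of that shape.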

Features	GS	EG	SH
GS	❌	✅	✅
EG	❌	❌	✅
GS, EG	❌	❌	✅

Each of these combinations is evaluated with all four operators (Concatenation, Add, Average, Maximum).
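The resulting experiment grid can be enumerated in a few lines. This is only a sketch of the bookkeeping, with variable names chosen for illustration.

```python
from itertools import combinations, product

modalities = ["GS", "EG", "SH"]
operators = ["Concatenate", "Add", "Average", "Maximum"]

# Every pair plus the full triple: GS+EG, GS+SH, EG+SH, GS+EG+SH.
combos = [c for size in (2, 3) for c in combinations(modalities, size)]
experiments = list(product(combos, operators))
print(len(combos), len(experiments))  # 4 16
```

Four modality combinations times four operators gives sixteen merged models to compare.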

Concatenation

Concatenates a list of inputs. It takes as input a list of tensors, all of the same shape except for the concatenation axis, and returns a single tensor that is the concatenation of all inputs.

Before concatenating, we load the three modalities of malware visual features and split each of them into train and test sets. We then define a VGG16 model for each modality; these models are concatenated to form the proposed model.

The code below shows one of the VGG16 models, followed by the concatenation.

VGG16:

# Designing the first VGG16 model for Hex Images
hex_ = tf.keras.applications.VGG16(weights = 'imagenet', include_top = False, input_shape = (224, 224, 3))

hex_._name = 'hex_vgg'

# Freeze the layers in the VGG16 model so that they are not trained during training
for layer in hex_.layers:
  layer.trainable = False

# Pass the input through the VGG16 model
hex_vgg_output = hex_(hex_in)

# Add a classifier on top of the model
#hex_model = Flatten(name = 'hex_flatten')(hex_vgg_output)
hex_model = Dense(512, activation='relu', name='hex_dense')(hex_vgg_output)

merge_operations loads the independent modality models and then merges them using the different operators.

Concatenation

# Concatenate the outputs of the three models
merged = concatenate([hex_model, eg_model, sh_model])

# Add one or more dense layers on top of the merged output

# Add 1D convolutional layers
conv1 = Conv1D(filters=515, kernel_size=3, strides=1, activation='relu')(merged)
flatten = Flatten()(conv1)
dense1 = Dense(512, activation = 'relu')(flatten)
dense2 = Dense(256, activation = 'relu')(dense1)
dense3 = Dense(128, activation = 'relu')(dense2)
dense4 = Dense(64, activation = 'relu',kernel_regularizer = l2(0.01))(dense3)
output = Dense(9, activation = 'softmax')(dense4)

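# Define the model
# Each branch needs its own input tensor; `eg_in` and `sh_in` are assumed names
# for the EG and SH input layers, mirroring `hex_in`
merged_model = Model(inputs = [hex_in, eg_in, sh_in], outputs = output, name = 'merged_model')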

# Compile the model
merged_model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

After analysing the outputs of all four operators on the different combinations of malware modalities, we conclude that the concatenation operator applied to all three modalities (GS, EG, SH) yields the highest-performing model, with an F1-score of 0.99.
