Cyber security is full of complex jargon, and certain terms are widely used yet understood by few outside the ranks of malware analysts. To help clear away some of the confusion, we’ll be breaking down these esoteric concepts and giving them practical, meaningful value. The first concept we’re looking at is code entropy. To understand code entropy, we first need to examine entropy as it relates to computer science in general: the degree of randomness in a system.

Entropy in Action

For a practical example of entropy at work, think about broken glass vs. a jigsaw puzzle. Broken glass is in complete disorder, i.e., high entropy. When we look at a jigsaw puzzle, we can identify consistency in shapes and imagery, so its level of entropy is lower.

Each bit of disorder adds on to the previous bit of disorder, and the more random and disordered the pieces are, the higher the entropy will be.

Why is this related to code entropy? We’ll get to that at the end of our post.

Entropy’s Role in Obfuscation

When it comes to malware and code, entropy measures the randomness of data, and it is especially revealing in areas that produce random-looking data, like code obfuscation and data compression.

Developers use obfuscation techniques either to protect legitimate intellectual property, such as software, or to make malware more difficult to understand. Malware writers invest heavily in obfuscating their products for two distinct reasons: 1. To make the malware harder to reverse engineer and to hide its true nature. 2. To alter the code in such a way that it avoids detection by AV products.

Obfuscation techniques employ a wide range of tools to achieve their goal of making textual or binary data unreadable. They use compression packages, encoding conversion and various encryption techniques to pack the original code into something different. What emerges from the process of obfuscation is a binary that is hard to read and analyze, and that may evade scanners and simple reversing techniques.

Calculating Code Entropy with Shannon’s Formula

Measuring code entropy helps malware researchers determine whether a sample has been obfuscated in any way, i.e., compressed or encrypted. The most popular way to measure entropy in code is Shannon’s formula, which scores each binary on a scale from 0 to 8. The lower the entropy, the lower the chances that the code has been obfuscated in any way; the higher the entropy, the greater the chances that the code is compressed or encrypted.
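To make this concrete, Shannon’s formula is H = −Σ p(x) · log₂ p(x), where p(x) is the frequency of each possible byte value in the file; applied to bytes, H lands on the 0–8 scale mentioned above. Here is a minimal Python sketch of that calculation. The function name shannon_entropy is our own, not part of any standard library:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte sequence, in bits per byte (0-8)."""
    if not data:
        return 0.0
    counts = Counter(data)          # frequency of each byte value present
    length = len(data)
    # H = -sum over all byte values of p(x) * log2(p(x))
    return -sum((c / length) * math.log2(c / length) for c in counts.values())
```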

Regular text files rely on linguistic rules (like the fact that we know “q” is generally followed by “u”, and that your brain knows how to interpret “M lw re  st nks” despite the missing vowels) and demonstrate a low entropy of about 4.5. Going back to our broken glass example: it would clearly be easier to put a jigsaw puzzle back together than to glue glass back into its original shape. The glass is at high entropy, and there is no way we could identify which pieces connect, unlike the puzzle, whose consistent shapes and imagery add context and order to its pieces. And as the glass pieces get further mixed up among themselves, it becomes even more difficult to perceive what they might have been in the first place.

Ordinary (not packed, compressed or encrypted) binary files usually have an entropy of about 5; compressed malware has an entropy of about 6. The closer the entropy gets to 8 (the maximum possible), the more likely it is that the code is encrypted. At that point, our broken glass is nothing more than impossible-to-repair shards, bearing little, if any, likeness to their original form.
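To see these thresholds in practice, here is a short sketch that reuses the shannon_entropy function from above and compares plain English text, the same text after zlib compression, and purely random bytes. The exact numbers will vary with the input, but the ordering should match the ranges described here:

```python
import os
import zlib

# reuses shannon_entropy() from the earlier sketch
text = ("Regular English text relies on linguistic rules, "
        "so some byte values appear far more often than others. " * 50).encode()

print(f"plain text: {shannon_entropy(text):.2f}")                 # typically ~4-5
print(f"compressed: {shannon_entropy(zlib.compress(text)):.2f}")  # typically 6+
print(f"random:     {shannon_entropy(os.urandom(4096)):.2f}")     # approaches 8
```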

Analyzing the level of entropy exhibited by code helps analysts quickly and accurately identify compressed and encrypted malware samples. It’s a critical step in the journey to defeating malware with the least damage in the shortest amount of time.
