# Modelling machine code semantics in C++

The life of an instruction in Remill

Peter Goodman peter@trailofbits.com

## We want to analyze machine code

- Is this program vulnerable to memory corruption, return-oriented programming attacks, or other exploits?
- Are these two functions equivalent?

- Thousands of instructions, many with complex side-effects
- Legacy (e.g. x87) and modern (e.g. AVX) features
- Memory is flat and opaque, no high-level types

### Remill translates x86/amd64 and AArch64 (ARMv8) instructions into LLVM bitcode

- Motivation: LLVM bitcode is easier to analyze, and many analyses for LLVM bitcode already exist
- Challenge: Need LLVM bitcode semantics for all machine code instructions
- Solution: Implement instruction semantics with C++ functions, compile them to LLVM bitcode with Clang

#### Program source code is compiled into...

int \*RDX = ...;
for (long RDI = 0; ...; ++RDI) {
 RDX[RDI] = 1;
}

Assembly, a textual representation of...

mov dword ptr [RDX + RDI \* 4], 0x1

Machine code, which we want to analyze

c7 04 ba 01 00 00 00
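To make the link between the assembly and the seven bytes concrete, here is a minimal sketch that pulls apart this one encoding (opcode `C7 /0`, a ModRM byte, a SIB byte, and a 32-bit immediate). `DecodedMov` and `DecodeMovMemImm32` are hypothetical names for illustration only; they are not part of Remill, which relies on a full instruction decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fields of the single encoding discussed above. Hypothetical type.
struct DecodedMov {
  uint8_t mod;    // ModRM bits 7..6 (0 = memory operand, no displacement)
  uint8_t reg;    // ModRM bits 5..3 (the /0 opcode extension)
  uint8_t rm;     // ModRM bits 2..0 (4 = a SIB byte follows)
  uint8_t scale;  // Multiplier applied to the index register
  uint8_t index;  // Index register number (7 = RDI)
  uint8_t base;   // Base register number (2 = RDX)
  uint32_t imm;   // 32-bit immediate, stored little-endian
};

// Hypothetical helper: handles only `C7 /0` followed by ModRM, SIB, imm32.
inline DecodedMov DecodeMovMemImm32(const uint8_t *bytes, size_t size) {
  assert(size >= 7 && bytes[0] == 0xC7);
  DecodedMov d{};
  const uint8_t modrm = bytes[1];
  d.mod = static_cast<uint8_t>(modrm >> 6);
  d.reg = static_cast<uint8_t>((modrm >> 3) & 7);
  d.rm = static_cast<uint8_t>(modrm & 7);
  const uint8_t sib = bytes[2];
  d.scale = static_cast<uint8_t>(1u << (sib >> 6));
  d.index = static_cast<uint8_t>((sib >> 3) & 7);
  d.base = static_cast<uint8_t>(sib & 7);
  d.imm = static_cast<uint32_t>(bytes[3]) |
          (static_cast<uint32_t>(bytes[4]) << 8) |
          (static_cast<uint32_t>(bytes[5]) << 16) |
          (static_cast<uint32_t>(bytes[6]) << 24);
  return d;
}
```

For `c7 04 ba 01 00 00 00`, the ModRM byte `04` announces a SIB byte, and the SIB byte `ba` encodes base RDX, index RDI, and scale 4, which matches `[RDX + RDI * 4]`.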

#### Instructions implemented as C++ functions...

template <typename D, typename S>
DEF\_SEM(MOV, D dst, const S src) {
 WriteZExt(dst, Read(src));
 return memory;
}
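The following is a simplified, self-contained sketch of how one templated semantics function can serve several operand kinds. It assumes that `DEF_SEM` produces a function whose first two parameters are `memory` and `state`, as the basic block call later in the deck suggests. The toy `Memory`, `State`, `R32W` type, and the explicit `memory` argument to `WriteZExt` are assumptions made for illustration; they are not Remill's actual definitions.

```cpp
#include <cstdint>
#include <map>

// Toy stand-ins for Remill's types (hypothetical simplifications).
struct Memory { std::map<uint64_t, uint8_t> bytes; };
struct State { uint64_t rdx = 0; uint64_t rdi = 0; };

// Operand wrappers: the C++ type encodes operand kind and width.
struct I32 { uint32_t value; };   // 32-bit immediate source
struct M32W { uint64_t addr; };   // Writable 32-bit memory destination
struct R32W { uint64_t *reg; };   // Writable 32-bit view of a 64-bit register

inline uint32_t Read(I32 src) { return src.value; }

// Memory destination: store the four bytes in little-endian order.
inline Memory *WriteZExt(Memory *memory, M32W dst, uint32_t value) {
  for (int i = 0; i < 4; ++i) {
    memory->bytes[dst.addr + static_cast<uint64_t>(i)] =
        static_cast<uint8_t>(value >> (8 * i));
  }
  return memory;
}

// Register destination: on amd64, a 32-bit write zeroes the upper 32 bits.
inline Memory *WriteZExt(Memory *memory, R32W dst, uint32_t value) {
  *dst.reg = value;
  return memory;
}

// One generic MOV; overload resolution picks the right read and write.
template <typename D, typename S>
Memory *MOV(Memory *memory, State &state, D dst, const S src) {
  (void) state;  // Unused here; real semantics may update flags in it.
  return WriteZExt(memory, dst, Read(src));
}
```

Calling `MOV(&mem, state, M32W{0x1000}, I32{1})` stores four bytes at address `0x1000`, while `MOV(&mem, state, R32W{&state.rdx}, I32{7})` overwrites the whole of `rdx` with the zero-extended value.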

#### Operating on registers in a C++ structure

struct State : public ArchState {
 ArithFlags aflag;
 GPR gpr;
 ...
};

#### And specialized by different instruction operand types

DEF\_ISEL\_MnW\_In(MOV\_MEMv\_IMMz, MOV);
// extern "C" constexpr auto MOV\_MEMv\_IMMz\_32 = MOV<M32W, I32>;

#### Remill uses instruction decoder information to select a C++ semantics function...

(AMD64 100000fb1 7 (BYTES c7 04 ba 01 00 00 00) MOV\_MEMv\_IMMz\_32 (WRITE\_OP (DWORD\_PTR (ADD (REG\_64 RDX) (MUL (REG\_64 RDI) (IMM\_64 0x4))))) (READ\_OP (SIGNED\_IMM\_32 0x1)))
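The `WRITE_OP` above carries its address as a small expression tree: `(ADD (REG_64 RDX) (MUL (REG_64 RDI) (IMM_64 0x4)))`. As a rough illustration of what that tree means, the sketch below models it with a hypothetical `Expr` node type and evaluates it against concrete register values. None of these names come from Remill, which turns such operands into LLVM values rather than interpreting them.

```cpp
#include <cstdint>

// Hypothetical node type mirroring the s-expression operand shown above.
struct Expr {
  enum class Kind { kReg, kImm, kAdd, kMul };
  Kind kind = Kind::kImm;
  const uint64_t *reg = nullptr;  // Used when kind == kReg
  uint64_t imm = 0;               // Used when kind == kImm
  const Expr *lhs = nullptr;      // Used by kAdd and kMul
  const Expr *rhs = nullptr;      // Used by kAdd and kMul
};

inline Expr Reg(const uint64_t *r) {
  Expr e; e.kind = Expr::Kind::kReg; e.reg = r; return e;
}
inline Expr Imm(uint64_t v) {
  Expr e; e.kind = Expr::Kind::kImm; e.imm = v; return e;
}
inline Expr Add(const Expr *l, const Expr *r) {
  Expr e; e.kind = Expr::Kind::kAdd; e.lhs = l; e.rhs = r; return e;
}
inline Expr Mul(const Expr *l, const Expr *r) {
  Expr e; e.kind = Expr::Kind::kMul; e.lhs = l; e.rhs = r; return e;
}

// Recursively computes the value of the tree (wrapping on overflow).
inline uint64_t Eval(const Expr &e) {
  switch (e.kind) {
    case Expr::Kind::kReg: return *e.reg;
    case Expr::Kind::kImm: return e.imm;
    case Expr::Kind::kAdd: return Eval(*e.lhs) + Eval(*e.rhs);
    case Expr::Kind::kMul: return Eval(*e.lhs) * Eval(*e.rhs);
  }
  return 0;
}
```

With `RDX = 0x1000` and `RDI = 3`, building `Add(Reg(RDX), Mul(Reg(RDI), Imm(4)))` and evaluating it yields the destination address `0x100c`.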

#### And calls the semantics within a "basic block" function with pre-defined "register" variables

Memory \*\_\_remill\_basic\_block(State &state, addr\_t pc, Memory \*memory) {
 auto &RDX = state.gpr.rdx.qword; // Pre-defined
 auto &RDI = state.gpr.rdi.qword; // Pre-defined
 memory = MOV\_MEMv\_IMMz\_32(memory, state, RDX + RDI \* 0x4, 0x1);
 return memory;
}

#### Remill aggressively optimizes the result into LLVM bitcode equivalent to the following

Memory \*\_\_remill\_basic\_block(State &state, addr\_t pc, Memory \*memory) {
 return \_\_remill\_write\_memory\_32(
 memory, state.gpr.rdx.qword + state.gpr.rdi.qword \* 4, 1);
}
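To show why the two forms are interchangeable, here is a small runnable sketch that writes the basic block both ways against a toy memory model and lets the results be compared. Everything here is a hypothetical stand-in: `toy_write_memory_32` plays the role of `__remill_write_memory_32` and is given a concrete definition only so the example can run, and the `State` layout keeps just the two registers used.

```cpp
#include <cstdint>
#include <map>

using addr_t = uint64_t;

// Toy stand-ins (assumptions, not Remill's definitions).
struct Memory { std::map<addr_t, uint8_t> bytes; };
struct Reg { uint64_t qword = 0; };
struct GPR { Reg rdx; Reg rdi; };
struct State { GPR gpr; };

// Stand-in for the memory-write intrinsic: a little-endian 32-bit store.
inline Memory *toy_write_memory_32(Memory *memory, addr_t addr, uint32_t value) {
  for (int i = 0; i < 4; ++i) {
    memory->bytes[addr + static_cast<addr_t>(i)] =
        static_cast<uint8_t>(value >> (8 * i));
  }
  return memory;
}

// Stand-in for the specialized MOV semantics function.
inline Memory *ToyMovMem32Imm32(Memory *memory, State &, addr_t dst, uint32_t src) {
  return toy_write_memory_32(memory, dst, src);
}

// Shape of the block before optimization: register aliases, then a call.
inline Memory *BlockUnoptimized(State &state, addr_t, Memory *memory) {
  auto &RDX = state.gpr.rdx.qword;
  auto &RDI = state.gpr.rdi.qword;
  memory = ToyMovMem32Imm32(memory, state, RDX + RDI * 0x4, 0x1);
  return memory;
}

// Shape of the block after inlining and simplification.
inline Memory *BlockOptimized(State &state, addr_t, Memory *memory) {
  return toy_write_memory_32(
      memory, state.gpr.rdx.qword + state.gpr.rdi.qword * 4, 1);
}
```

Running both functions from the same starting state leaves identical bytes in memory, which is the sense in which the optimized bitcode is "equivalent" to the original call sequence.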