Getting started with capstone2llvmir: how to translate assembly into LLVM IR

L剑仙 · posted 2021-5-10 10:08

Last edited by L剑仙 on 2021-5-22 21:32

[TOC]

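Preface

This post briefly walks through the capstone2llvmir source code and how to build and run it locally. It is meant for readers who want a first look at how assembly is lifted to IR, make small changes of their own, build and run them, and end up with a simple asm2llvmir tool. Once you have LLVM IR you can optimize it, deobfuscate it, and get up to mischief; see my earlier posts:

https://www.52pojie.cn/thread-1353006-1-1.html
Using compiler optimizations to remove bogus control flow
https://www.52pojie.cn/thread-1386178-1-1.html
Using compiler optimizations to remove control-flow flattening
PS: the source has no translation for some of the more complex instructions. You have to try writing those yourself, fill them in gradually, and then build the result into your own tool.

capstone2llvmir and bin2llvmir are two important parts of the RetDec source code; their job is to translate assembly into LLVM IR. I studied this excellent tool carefully and kept these notes.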

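Source code: https://retdec-tc.avast.com/repository/download/Retdec_DoxygenBuild/.lastSuccessful/build/doc/doxygen/html/files.html

The RetDec wiki has an introduction to capstone2llvmirtool, which I summarize below: https://github.com/avast/retdec/wiki/Capstone2LlvmIr

Getting started with capstone2llvmirtool

Four translation strategies for different instructions

1. Full semantic translation: the instruction's semantics are translated completely into IR. This is only done for sufficiently simple instructions. PS: many rarely used instructions have no translation in the source; if you run into one, write it yourself following the existing code.

2. Translation to an intrinsic call: some instructions are translated into intrinsic functions that most compilers understand, for example some jumps.

3. Translation to a pseudo-assembly call: a pseudo call is created from Capstone's disassembly information. PS: when you see such a call, the corresponding instruction was not really translated. If it matters for optimization, that is the cue to write a translation function yourself; if it does not, just ignore it.

4. No translation: some instructions that are hard to translate are ignored.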

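How it works, in outline

First, create the translator module

1. Create an empty LLVM IR module

2. Initialize the Capstone engine and other data structures

3. Create the architecture's runtime environment, i.e. the register-related data structures and the like

3.1 Record each assembly address in an IR global variable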

@_asm_program_counter = internal global i64 0

; ...

; add eax, 0x1234 @ 0x1000

store volatile i64 4096, i64* @_asm_program_counter

; ... LLVM IR sequence for the add instruction

; sub ebx, 0x1234 @ 0x1005

store volatile i64 4101, i64* @_asm_program_counter

; ... LLVM IR sequence for the sub instruction
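
To make the address bookkeeping above concrete, here is a minimal Python sketch that only produces the text of those stores. The helper names `pc_store` and `annotate` are my own and are not part of RetDec; the real translator emits the store through an IRBuilder rather than building strings.

```python
def pc_store(address: int, bits: int = 64) -> str:
    """Return the IR line that records `address` in the program-counter global."""
    return f"store volatile i{bits} {address}, i{bits}* @_asm_program_counter"


def annotate(instructions):
    """instructions: iterable of (address, asm_text) pairs."""
    lines = ["@_asm_program_counter = internal global i64 0"]
    for address, asm in instructions:
        lines.append(f"; {asm} @ {address:#x}")
        lines.append(pc_store(address))  # one store in front of every instruction
        lines.append(f"; ... LLVM IR sequence for the {asm.split()[0]} instruction")
    return lines


listing = annotate([(0x1000, "add eax, 0x1234"), (0x1005, "sub ebx, 0x1234")])
print("\n".join(listing))
```

The second address is 0x1005 because the add instruction is 5 bytes long, so the stores let a later pass recover where each instruction started.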

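3.2 Generate the control-flow pseudo functions. Plain IR branches are not used here because IR jumps to basic-block labels, not to addresses as assembly does.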

Control-flow-related pseudo functions are generated.

; void (i<architecture_size> target_address)

declare void @__pseudo_call(i32)

; void (i<architecture_size> target_address)

declare void @__pseudo_return(i32)

; void (i<architecture_size> target_address)

declare void @__pseudo_branch(i32)

; void (i1 condition, i<architecture_size> target_address)

declare void @__pseudo_cond_branch(i1, i32)
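
The only thing that varies in these declarations is the address width. A small Python sketch, assuming the four names shown above; the functions `pseudo_control_flow_decls` and `pseudo_branch_call` are hypothetical helpers written for this post, not RetDec APIs:

```python
def pseudo_control_flow_decls(arch_bits: int) -> list:
    """Text of the four pseudo-function declarations for a given address width."""
    addr = f"i{arch_bits}"
    return [
        f"declare void @__pseudo_call({addr})",
        f"declare void @__pseudo_return({addr})",
        f"declare void @__pseudo_branch({addr})",
        f"declare void @__pseudo_cond_branch(i1, {addr})",
    ]


def pseudo_branch_call(target: int, arch_bits: int) -> str:
    """A jump to `target` stays an address, wrapped in a call instead of a `br`."""
    return f"call void @__pseudo_branch(i{arch_bits} {target})"


for line in pseudo_control_flow_decls(32):
    print(line)
print(pseudo_branch_call(0x2000, 32))
```

Keeping the target as a plain integer argument is what lets a later stage decide which basic block, if any, that address belongs to.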

3.3 Initialize global variables for the architecture's registers

@eax = internal global i32 0

@ecx = internal global i32 0

; ...

@st0 = internal global x86_fp80 0xK00000000000000000000

@st1 = internal global x86_fp80 0xK00000000000000000000

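Then, run the translation through the translator

1. Disassemble the binary with the Capstone engine. For a single instruction it provides roughly the following information: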

add eax, 0x1234:

       General info:
                id     :  8 (add)
                addr   :  1000
                size   :  5
                bytes  :  05 34 12 00 00
                mnem   :  add
                op str :  eax, 0x1234
        Detail info:
                R regs :  0
                W regs :  1
                        25 (eflags)
                groups :  0
        Architecture-dependent info:
                prefix :  00 00 00 00  (-, -, -, -)
                opcode :  05 00 00 00
                rex    :  0
                addr sz:  4
                modrm  :  0
                sib    :  0
                disp   :  0
                sib idx:  0 (-)
                sib sc :  0
                sib bs :  0 (-)
                sse cc :  X86_SSE_CC_INVALID
                avx cc :  X86_AVX_CC_INVALID
                avx sae:  false
                avx rm :  X86_AVX_RM_INVALID
                op cnt :  2

                        type   :  X86_OP_REG
                        reg    :  19 (eax)
                        size   :  4
                        access :  CS_AC_READ + CS_AC_WRITE
                        avx bct:  X86_AVX_BCAST_INVALID
                        avx 0 m:  false

                        type   :  X86_OP_IMM
                        imm    :  1234
                        size   :  4
                        access :  CS_AC_INVALID
                        avx bct:  X86_AVX_BCAST_INVALID
                        avx 0 m:  false
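
To see where the size, bytes and imm fields in this dump come from, the sketch below decodes this one encoding by hand in Python. It only understands opcode 0x05 (add eax, imm32 in 32-bit mode) and is in no way a replacement for Capstone; the function name `decode_add_eax_imm32` is made up for this example.

```python
import struct


def decode_add_eax_imm32(code: bytes, address: int) -> dict:
    """Decode the single encoding `05 imm32` (add eax, imm32) by hand."""
    if len(code) < 5 or code[0] != 0x05:
        raise ValueError("not an `add eax, imm32` encoding")
    (imm,) = struct.unpack_from("<I", code, 1)  # little-endian 32-bit immediate
    return {
        "addr": address,
        "size": 5,  # 1 opcode byte + 4 immediate bytes
        "bytes": " ".join(f"{b:02x}" for b in code[:5]),
        "mnem": "add",
        "op_str": f"eax, {imm:#x}",
        "imm": imm,
    }


insn = decode_add_eax_imm32(bytes.fromhex("0534120000"), 0x1000)
print(insn)
```

The bytes `34 12 00 00` read as a little-endian integer give 0x1234, which is the `imm` value in the operand section of the dump.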

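2. Find a translation method and translate the instruction to IR. The id field identifies which instruction it is.

2.1 Capstone ID is mapped to an ID-specific routine: each id has its own translation routine

2.2 Capstone ID is mapped to a specific pseudo assembly generation method

The id maps to a method that generates a pseudo-assembly call in one of these forms: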

__asm_<mnem>(op0)
op0 = __asm_<mnem>(op0)
__asm_<mnem>(op0, op1)
op0 = __asm_<mnem>(op1)
op0 = __asm_<mnem>(op0, op1)
__asm_<mnem>(op0, op1, op2)

2.3 Capstone ID is not mapped to any value

Nothing matched. A call is created automatically from the Capstone-provided instruction info, so the result depends on the quality of the information Capstone provides.
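
The three cases 2.1 to 2.3 amount to a lookup with two fallbacks. A minimal Python sketch, assuming two separate tables; the table contents and every id except 8 (add, taken from the dump above) are invented for illustration:

```python
def pick_strategy(insn_id, routines, pseudo_methods):
    """Return which of the cases 2.1, 2.2, 2.3 applies to a Capstone id."""
    if insn_id in routines:
        return "2.1 id-specific routine"
    if insn_id in pseudo_methods:
        return "2.2 pseudo assembly generation method"
    return "2.3 call generated from Capstone info"


# Hypothetical tables: only id 8 (add) comes from the dump in this post.
routines = {8: "translateAdd"}
pseudo_methods = {1000: "op0 = __asm_<mnem>(op0, op1)"}

for insn_id in (8, 1000, 2000):
    print(insn_id, "->", pick_strategy(insn_id, routines, pseudo_methods))
```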

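Source layout:

Public interface: include/retdec/capstone2llvmir
Private implementation: src/capstone2llvmir

Interface: Capstone2LlvmIrTranslator
Implementation: Capstone2LlvmIrTranslator_impl
Per-architecture implementation: Capstone2LlvmIrTranslatorArm

The capstone2llvmir entry point

Go straight to the entry point, the main function in capstone2llvmirtool/capstone2llvmir.cpp. (There is another caller in retdec\src\bin2llvmir\optimizations\decoder; studying these two shows how to use the translate function to turn asm into IR.) main first creates an llvm::Function and gives it a basic block with a return, then creates a translator for the CPU architecture with Capstone2LlvmIrTranslator::createArch, and finally calls the translate function declared in capstone2llvmir/capstone2llvmir.h, passing data, size and base together with the IRBuilder irb that receives the generated IR.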

main
{
 llvm::Function::Create
 llvm::BasicBlock::Create
 Capstone2LlvmIrTranslator::createArch
 translate(po.code.data(), po.code.size(), po.base, irb)
}

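The translate function

cs_malloc allocates a cs_insn for the Capstone handle, and cs_disasm_iter then uses that handle to disassemble the bytes one instruction at a time into insn.

generateSpecialAsm2LlvmInstr is a key function: it stores the insn's address into an LLVM global variable. Every architecture has a program counter that records which address execution has reached (pc on ARM), and it changes with every instruction. Here that pc value comes from the global value written by generateSpecialAsm2LlvmInstr.

translateInstruction is where insn is actually translated to IR. The paths inside it correspond to the translation strategies described earlier. A quick look at the skeleton: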

Capstone2LlvmIrTranslator_impl<CInsn, CInsnOp>::translate
{
 cs_malloc
 cs_disasm_iter  
 generateSpecialAsm2LlvmInstr
 translateInstruction // virtual function declared in capstone2llvmir_impl.h; each architecture has its own implementation
 {
  *f=*(_i2fm.find(i->id)) // if a translation function is found in the instruction translation map _i2fm, call it through the pointer; corresponds to 1
  {
   translateAdd
   translateB
   ...
  }
  or translatePseudoAsmGeneric // if none is found, fall back to translatePseudoAsmGeneric; corresponds to 2
  {
   loadOp
   loadRegister
   getPseudoAsmFunction
   CreateCall                  // corresponds to 3
   storeOp
   storeRegister
  }
 }
}
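
The outer loop of this skeleton can be imitated in a few lines of Python. This is a toy under loud assumptions: `disasm_iter` here is a stand-in of my own that only knows three real x86 opcodes (0x90 nop, 0xC3 ret, 0x05 add eax, imm32), and the "IR" is just text. It shows the order of operations: decode one instruction, store its address, translate it, advance.

```python
# Toy opcode table: opcode byte -> (mnemonic, instruction size in bytes).
SIZES = {0x90: ("nop", 1), 0xC3: ("ret", 1), 0x05: ("add", 5)}


def disasm_iter(code: bytes, offset: int):
    """Toy stand-in for cs_disasm_iter: (mnemonic, size) or None on failure."""
    entry = SIZES.get(code[offset])
    if entry is None or offset + entry[1] > len(code):
        return None
    return entry


def translate(code: bytes, base: int):
    """Return (ir_lines, number_of_bytes_translated)."""
    ir, offset = [], 0
    while offset < len(code):
        insn = disasm_iter(code, offset)
        if insn is None:  # undecodable bytes: stop, like a failed cs_disasm_iter
            break
        mnem, size = insn
        # 1. record the address (the generateSpecialAsm2LlvmInstr step)
        ir.append(f"store volatile i64 {base + offset}, i64* @_asm_program_counter")
        # 2. translate the instruction itself (the translateInstruction step)
        ir.append(f"; ... IR for {mnem}")
        offset += size
    return ir, offset


ir, translated = translate(bytes.fromhex("053412000090c3ff"), 0x1000)
print(translated, "bytes translated")
print("\n".join(ir))
```

The trailing 0xff byte is not in the toy table, so translation stops after 7 bytes and the caller learns how far it got.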

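The translateInstruction function

translateInstruction is the key function that converts an assembly insn into LLVM IR. It is a virtual function declared in capstone2llvmir_impl.h.

Each architecture has its own translateInstruction implementation; the ARM one is in src\capstone2llvmir\arm\arm.cpp.

An important structure here is _cs_insn. Capstone is installed for Python 3 on my machine, so we can look at its layout in the Python bindings.

My earlier notes on reading Capstone: https://www.52pojie.cn/thread-1145301-1-1.html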
class _cs_insn(ctypes.Structure):
    _fields_ = (
        ('id', ctypes.c_uint),
        ('address', ctypes.c_uint64),
        ('size', ctypes.c_uint16),
        ('bytes', ctypes.c_ubyte * 16),
        ('mnemonic', ctypes.c_char * 32),
        ('op_str', ctypes.c_char * 160),
        ('detail', ctypes.POINTER(_cs_detail)),
    )
class _cs_detail(ctypes.Structure):
    _fields_ = (
        ('regs_read', ctypes.c_uint16 * 12),
        ('regs_read_count', ctypes.c_ubyte),
        ('regs_write', ctypes.c_uint16 * 20),
        ('regs_write_count', ctypes.c_ubyte),
        ('groups', ctypes.c_ubyte * 8),
        ('groups_count', ctypes.c_ubyte),
        ('arch', _cs_arch),
    )
class _cs_arch(ctypes.Union):
    _fields_ = (
        ('arm64', arm64.CsArm64),
        ('arm', arm.CsArm),
        ('m68k', m68k.CsM68K),
        ('mips', mips.CsMips),
        ('x86', x86.CsX86),
        ('ppc', ppc.CsPpc),
        ('sparc', sparc.CsSparc),
        ('sysz', systemz.CsSysz),
        ('xcore', xcore.CsXcore),
        ('tms320c64x', tms320c64x.CsTMS320C64x),
        ('m680x', m680x.CsM680x),
        ('evm', evm.CsEvm),
    )    
/// Instruction structure
typedef struct cs_arm {
    bool usermode;    ///< User-mode registers to be loaded (for LDM/STM instructions)
    int vector_size;     ///< Scalar size for vector instructions
    arm_vectordata_type vector_data; ///< Data type for elements of vector instructions
    arm_cpsmode_type cps_mode;    ///< CPS mode for CPS instruction
    arm_cpsflag_type cps_flag;    ///< CPS mode for CPS instruction
    arm_cc cc;            ///< conditional code for this insn
    bool update_flags;    ///< does this insn update flags?
    bool writeback;        ///< does this insn write-back?
    arm_mem_barrier mem_barrier;    ///< Option for some memory barrier instructions

    /// Number of operands of this instruction,
    /// or 0 when instruction has no operand.
    uint8_t op_count;

    cs_arm_op operands[36];    ///< operands for this instruction.
} cs_arm;  
typedef enum arm_cc {
    ARM_CC_INVALID = 0,
    ARM_CC_EQ,            ///< Equal                      Equal
    ARM_CC_NE,            ///< Not equal                  Not equal, or unordered
    ARM_CC_HS,            ///< Carry set                  >, ==, or unordered
    ARM_CC_LO,            ///< Carry clear                Less than
    ARM_CC_MI,            ///< Minus, negative            Less than
    ARM_CC_PL,            ///< Plus, positive or zero     >, ==, or unordered
    ARM_CC_VS,            ///< Overflow                   Unordered
    ARM_CC_VC,            ///< No overflow                Not unordered
    ARM_CC_HI,            ///< Unsigned higher            Greater than, or unordered
    ARM_CC_LS,            ///< Unsigned lower or same     Less than or equal
    ARM_CC_GE,            ///< Greater than or equal      Greater than or equal
    ARM_CC_LT,            ///< Less than                  Less than, or unordered
    ARM_CC_GT,            ///< Greater than               Greater than
    ARM_CC_LE,            ///< Less than or equal         <, ==, or unordered
    ARM_CC_AL             ///< Always (unconditional)     Always (unconditional)
} arm_cc;
/// Instruction operand
typedef struct cs_arm_op {
    int vector_index;    ///< Vector Index for some vector operands (or -1 if irrelevant)

    struct {
        arm_shifter type;
        unsigned int value;
    } shift;

    arm_op_type type;    ///< operand type

    union {
        int reg;    ///< register value for REG/SYSREG operand
        int32_t imm;            ///< immediate value for C-IMM, P-IMM or IMM operand
        double fp;            ///< floating point value for FP operand
        arm_op_mem mem;        ///< base/index/scale/disp value for MEM operand
        arm_setend_type setend; ///< SETEND instruction's operand type
    };

    /// in some instructions, an operand can be subtracted or added to
    /// the base register,
    /// if TRUE, this operand is subtracted. otherwise, it is added.
    bool subtracted;

    /// How is this operand accessed? (READ, WRITE or READ|WRITE)
    /// This field is combined of cs_ac_type.
    /// NOTE: this field is irrelevant if engine is compiled in DIET mode.
    uint8_t access;

    /// Neon lane index for NEON instructions (or -1 if irrelevant)
    int8_t neon_lane;
} cs_arm_op;
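
The anonymous union inside cs_arm_op is the part that trips people up: reg, imm and fp share the same storage, and the type field says which one is valid. A small self-contained ctypes sketch of that idea follows. The classes `OpValue` and `Operand` and the tag values are simplified stand-ins of my own, not the real Capstone bindings.

```python
import ctypes

OP_REG, OP_IMM, OP_FP = 1, 2, 3  # illustrative type tags for this sketch


class OpValue(ctypes.Union):
    # All three members start at offset 0 and overlap, as in the C union.
    _fields_ = (
        ("reg", ctypes.c_int),
        ("imm", ctypes.c_int32),
        ("fp", ctypes.c_double),
    )


class Operand(ctypes.Structure):
    _anonymous_ = ("value",)  # lets us write op.imm instead of op.value.imm
    _fields_ = (
        ("type", ctypes.c_int),
        ("value", OpValue),
    )


def describe(op: Operand) -> str:
    """Read only the union member that the type tag says is valid."""
    if op.type == OP_REG:
        return f"reg {op.reg}"
    if op.type == OP_IMM:
        return f"imm {op.imm:#x}"
    if op.type == OP_FP:
        return f"fp {op.fp}"
    return "other"


op = Operand(type=OP_IMM)
op.imm = 0x1234
print(describe(op))
```

Reading `op.reg` here also yields 0x1234 because both members are 32-bit integers over the same bytes, which is exactly why translation code checks the type first.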

A rough look at the translateInstruction code

void Capstone2LlvmIrTranslatorArm_impl::translateInstruction(
        cs_insn* i,
        llvm::IRBuilder<>& irb)
{
    _insn = i;

    cs_detail* d = i->detail;
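    cs_arm* ai = &d->arm; // ARM-specific instruction details live here

    auto fIt = _i2fm.find(i->id); // id identifies which instruction this is
    // _i2fm is a hash table initialized in arm_init.cpp that maps ARM instruction ids to translation methods, e.g. ARM_INS_ADC maps to Capstone2LlvmIrTranslatorArm_impl::translateAdc; many instructions still have no translation function
    if (fIt != _i2fm.end() && fIt->second != nullptr) // found in the hash table
    {
        auto f = fIt->second; // the translation method f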

        bool branchInsn = i->id == ARM_INS_B || i->id == ARM_INS_BX
                || i->id == ARM_INS_BL || i->id == ARM_INS_BLX
                || i->id == ARM_INS_CBZ || i->id == ARM_INS_CBNZ;
        if (ai->cc == ARM_CC_AL || ai->cc == ARM_CC_INVALID || branchInsn)
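        // Separates unconditional from conditional execution (cc means condition code); branch instructions always take this path
        {
            _inCondition = false;
            (this->*f)(i, ai, irb); // call f through the member pointer so it emits into irb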
        }
        else
        {
            _inCondition = true;

            auto* cond = generateInsnConditionCode(irb, ai); // conditional case: generateIfThen first builds an if-then and returns bodyIrb for its body
            auto bodyIrb = generateIfThen(cond, irb);

            (this->*f)(i, ai, bodyIrb);
        }
    }
    else
    {
        throwUnhandledInstructions(i);

        if (ai->cc == ARM_CC_AL || ai->cc == ARM_CC_INVALID)
        {
            _inCondition = false;
            translatePseudoAsmGeneric(i, ai, irb); // no entry in the _i2fm hash table: generate IR with translatePseudoAsmGeneric, defined in capstone2llvmir_impl.cpp
        }
        else
        {
            _inCondition = true;

            auto* cond = generateInsnConditionCode(irb, ai);
            auto bodyIrb = generateIfThen(cond, irb);

            translatePseudoAsmGeneric(i, ai, bodyIrb);
        }
    }
}
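
The control flow of this function is easier to see without the LLVM types. Below is a minimal Python sketch of it, assuming a dict in place of _i2fm and strings in place of IR. `condition_holds` spells out the standard ARM meaning of each condition code on the N, Z, C, V flags; the real generateInsnConditionCode builds the equivalent comparison as IR instead of evaluating it. All names in the sketch are my own.

```python
BRANCHES = {"b", "bx", "bl", "blx", "cbz", "cbnz"}


def condition_holds(cc: str, n: bool, z: bool, c: bool, v: bool) -> bool:
    """Evaluate an ARM condition code against the N, Z, C, V flags."""
    table = {
        "EQ": z, "NE": not z,
        "HS": c, "LO": not c,
        "MI": n, "PL": not n,
        "VS": v, "VC": not v,
        "HI": c and not z, "LS": (not c) or z,
        "GE": n == v, "LT": n != v,
        "GT": (not z) and n == v, "LE": z or n != v,
        "AL": True,
    }
    return bool(table[cc])


def pseudo_generic(mnem: str) -> str:
    """Fallback used when the table has no handler for the mnemonic."""
    return f"call @__asm_{mnem}"


def translate_instruction(mnem: str, cc: str, handlers: dict) -> list:
    handler = handlers.get(mnem)
    unconditional = cc in ("AL", "INVALID")
    if handler is not None:
        # Branch handlers deal with their own condition, so never wrap them.
        unconditional = unconditional or mnem in BRANCHES
    else:
        handler = pseudo_generic
    body = handler(mnem)
    if unconditional:
        return [body]
    return [f"if {cc}:", "  " + body]  # the generateIfThen wrapper


handlers = {"add": lambda m: "add-ir", "b": lambda m: "b-ir"}
print(translate_instruction("add", "EQ", handlers))
print(translate_instruction("b", "NE", handlers))
```

Note that, as in the C++ above, the branch exemption only applies on the path where a handler was found.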

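An important hash table here is _i2fm, short for instruction translation map. It maps each assembly instruction to a pointer to the function that translates it to IR; for example ARM_INS_ADC (add with carry) maps to &Capstone2LlvmIrTranslatorArm_impl::translateAdc.

Two other important structures defined in arm_init.cpp are the hash tables r2n, which maps register ids to names, and r2t, which maps register ids to types. Together they model the ARM registers as C++ data structures.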
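
As a rough picture of what the two register tables are for, here is a Python sketch that turns a name table and a type table into the register globals shown earlier in this post. The register ids, the choice of registers and the i1 flag are assumptions for illustration; the real tables are keyed by Capstone register enums and hold LLVM types, not strings.

```python
# Hypothetical register ids; the real code uses Capstone's ARM_REG_* values.
REG_R0, REG_R1, REG_SP, REG_FLAG_N = range(1, 5)

r2n = {REG_R0: "r0", REG_R1: "r1", REG_SP: "sp", REG_FLAG_N: "cpsr_n"}
r2t = {REG_R0: "i32", REG_R1: "i32", REG_SP: "i32", REG_FLAG_N: "i1"}

ZERO = {"i32": "0", "i1": "false"}  # zero initializer text per type


def register_globals(names: dict, types: dict) -> list:
    """One `internal global` line per register, in id order."""
    return [
        f"@{names[r]} = internal global {types[r]} {ZERO[types[r]]}"
        for r in sorted(names)
    ]


for line in register_globals(r2n, r2t):
    print(line)
```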

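The translatePseudoAsmGeneric function

It translates asm into a generic pseudo function, i.e. it handles instructions that have no translation function in the _i2fm table.

1. From the instruction info Capstone provides, work out how many register and non-register reads and writes the generated IR needs, which LLVM types and values must be created, whether the function has a return value, and so on.

2. Use those LLVM types and values to build the parameters and the return type, name the function by concatenating __asm_ with the mnemonic insn->mnemonic, and generate an empty-shell pseudo function.

3. When reading generated IR, a function named after an assembly mnemonic tells you that instruction had no translation function, and you can then write a full translation yourself. That said, the translation functions already in _i2fm are enough in about 99% of cases.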

void Capstone2LlvmIrTranslator_impl<CInsn, CInsnOp>::translatePseudoAsmGeneric(
                cs_insn* i,
                CInsn* ci,
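                llvm::IRBuilder<>& irb) // note the difference: cs_insn is the full structure with address, mnemonic, op_str and detail, while CInsn is the architecture-specific part (e.g. cs_arm) that holds the operands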
{
        std::vector<llvm::Value*> vals;
        std::vector<llvm::Type*> types;

        unsigned writeCnt = 0;
        llvm::Type* writeType = getDefaultType();
        bool writesOp = false;
        for (std::size_t j = 0; j < ci->op_count; ++j) // first walk the operands in CInsn to decide which values and types the generated IR needs
        {
                auto& op = ci->operands[j];
                auto access = getOperandAccess(op); // is this operand read, written, or something else
// regs_read: taken literally, the list of implicit registers read; in my tests only pc, lr, sp and the status register show up in it
// regs_write: taken literally, the list of implicit registers written; in my tests only pc, lr, sp and the status register show up in it
// regs_access: the two results above combined
// # Access types for instruction operands.
// CS_AC_INVALID  = 0        # Invalid/uninitialized access type.
// CS_AC_READ     = (1 << 0) # Operand that is read from.
// CS_AC_WRITE    = (1 << 1) # Operand that is written to.
                if (access == CS_AC_INVALID || (access & CS_AC_READ)) // if the operand is read (or its access is unknown), loadOp yields the LLVM value and type, which go into vals and types
                {
                        auto* o = loadOp(op, irb);
                        vals.push_back(o);
                        types.push_back(o->getType());
                }

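                if (access & CS_AC_WRITE) // the operand is written: set writesOp and work out the return type writeType
                                                                // a written register contributes its own type; anything else (e.g. a memory operand) falls back to the default type
                {
                        writesOp = true;
                        ++writeCnt;

                        if (isOperandRegister(op)) // written to a register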
                        {
                                auto* t = getRegisterType(op.reg);
                                if (writeCnt == 1 || writeType == t)
                                {
                                        writeType = t;
                                }
                                else
                                {
                                        writeType = getDefaultType();
                                }
                        }
                        else
                        {
                                writeType = getDefaultType();
                        }
                }
        }

        if (vals.empty()) // if vals is still empty after that loop, walk the implicitly read registers via detail->regs_read_count and load them into vals
        {
                // All registers must be ok, or don't use them at all.
                std::vector<uint32_t> readRegs;
                readRegs.reserve(i->detail->regs_read_count);
                for (std::size_t j = 0; j < i->detail->regs_read_count; ++j)
                {
                        auto r = i->detail->regs_read[j];
                        if (getRegister(r))
                        {
                                readRegs.push_back(r);
                        }
                        else
                        {
                                readRegs.clear();
                                break;
                        }
                }

                for (auto r : readRegs)
                {
                        auto* op = loadRegister(r, irb); // emit a load of the register through irb and get its value
                        vals.push_back(op);
                        types.push_back(op->getType());
                }
        }

        auto* retType = writesOp ? writeType : irb.getVoidTy(); // if anything is written the return type is writeType, otherwise void
        llvm::Function* fnc = getPseudoAsmFunction( // creates the LLVM function prototype for cs_insn* i with parameter types `types` and return type retType
                        i,                                 // note this is only a shell with no IR body; getPseudoAsmFunctionName names it by concatenating __asm_ with insn->mnemonic
                        retType,                           // so when it shows up in the generated IR we know the instruction had no translation function and can write one like Capstone2LlvmIrTranslatorArm_impl::translateAdc
                        types);

        auto* c = irb.CreateCall(fnc, vals); // emit the call c to the pseudo function fnc with arguments vals

        std::set<uint32_t> writtenRegs;
        if (retType)
        {
                for (std::size_t j = 0; j < ci->op_count; ++j) // first walk the operands via op_count, looking for written ones
                {
                        auto& op = ci->operands[j];
                        if (getOperandAccess(op) & CS_AC_WRITE) // only operands that are written
                        {
                                storeOp(op, c, irb); // storeOp emits the IR that stores the call result

                                if (isOperandRegister(op))
                                {
                                        writtenRegs.insert(op.reg); // remember it in the writtenRegs set
                                }
                        }
                }
        }

        // All registers must be ok, or don't use them at all.
        std::vector<uint32_t> writeRegs;
        writeRegs.reserve(i->detail->regs_write_count);
        for (std::size_t j = 0; j < i->detail->regs_write_count; ++j) // then walk the implicitly written registers from detail (skipping those already in writtenRegs) and collect them in writeRegs
        {
                auto r = i->detail->regs_write[j];
                if (writtenRegs.count(r))
                {
                        // silently ignore
                }
                else if (getRegister(r))
                {
                        writeRegs.push_back(r);
                }
                else
                {
                        writeRegs.clear();
                        break;
                }
        }

        for (auto r : writeRegs)
        {
                llvm::Value* val = retType->isVoidTy()
                                ? llvm::cast<llvm::Value>(
                                                llvm::UndefValue::get(getRegisterType(r)))
                                : llvm::cast<llvm::Value>(c);
                storeRegister(r, val, irb); // storeRegister emits the store for each entry of writeRegs; registers already handled by storeOp were excluded above to avoid duplicate stores
        }
}
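
Stripped of the LLVM plumbing, the operand loop above decides three things: which operands become call arguments, what the call returns, and which operands receive the result. Here is a Python sketch of just that decision, with operands given as plain tuples. It leaves out the implicit-register handling (regs_read and regs_write), and the mnemonics `foo` and `bar` are placeholders, not instructions RetDec is known to leave untranslated.

```python
READ, WRITE = 1, 2  # same bit values as CS_AC_READ and CS_AC_WRITE


def pseudo_asm_call(mnemonic, operands, default_type="i32"):
    """operands: list of (name, llvm_type, access, is_register) tuples.

    Returns (declaration, call, names_of_operands_that_get_the_result).
    """
    args, stores = [], []
    writes, write_cnt, write_type = False, 0, default_type
    for name, ty, access, is_reg in operands:
        if access == 0 or access & READ:  # unknown access counts as a read
            args.append((ty, name))
        if access & WRITE:
            writes = True
            write_cnt += 1
            stores.append(name)
            # Keep a register's type only while every written operand agrees.
            if is_reg and (write_cnt == 1 or write_type == ty):
                write_type = ty
            else:
                write_type = default_type
    ret = write_type if writes else "void"
    param_types = ", ".join(ty for ty, _ in args)
    call_args = ", ".join(f"{ty} {name}" for ty, name in args)
    decl = f"declare {ret} @__asm_{mnemonic}({param_types})"
    call = f"call {ret} @__asm_{mnemonic}({call_args})"
    return decl, call, stores


decl, call, stores = pseudo_asm_call(
    "foo", [("%r0", "i32", WRITE, True), ("%r1", "i32", READ, True)]
)
print(decl)
print(call, "-> stored to", stores)
```

With one written and one read register this gives the form `op0 = __asm_<mnem>(op1)` from the list of pseudo call shapes earlier in the post.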

Building it yourself

git clone https://github.com/avast/retdec.git
cd retdec
mkdir build && cd build
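Syntax: cmake .. -DCMAKE_INSTALL_PREFIX=<path> -DRETDEC_ENABLE_<component>=ON

cmake ../ -DRETDEC_ENABLE_CAPSTONE2LLVMIRTOOL=ON builds only the CAPSTONE2LLVMIR front end. This is the unmodified logic that translates asm to IR instruction by instruction, which is what this post covers.

//cmake ../ -DRETDEC_ENABLE_BIN2LLVMIRTOOL=ON Note: this is from earlier versions. There is no BIN2LLVMIRTOOL any more, only a library.

cmake ../ -DRETDEC_ENABLE_RETDECTOOL=ON builds only the RETDECTOOL front end, the successor of the earlier bin2llvmir front end. It takes the IR produced by CAPSTONE2LLVMIR and then runs many passes that analyze and optimize that initial IR. The reaching-definitions analysis and the constructor/destructor analysis, among others, are very clever and worth studying. The key interface function is retdec::disassemble(po.inputFile, &fs).

PS: the build downloads the capstone, keystone and llvm related libraries from git. That was slow for me; you can switch to a domestic mirror: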
```
git remote set-url --push origin  https://github.com/Hackergeek/architectur
```
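Run make -jN (N is usually the number of cores + 1), then look for the executables under retdec\build\src\, as listed below:

retdec-decompiler: bin2llvmir2cpp

retdectool: bin2llvmir (capstone2llvmir followed by several optimization passes)

capstone2llvmirtool: plain capstone2llvmir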

./retdec-decompiler --help
./retdec-decompiler:
Mandatory arguments:
        INPUT_FILE File to decompile.
General arguments:
        [-o|--output FILE] Output file (default: INPUT_FILE.c if OUTPUT_FORMAT is plain, INPUT_FILE.c.json if OUTPUT_FORMAT is json|json-human).
        [-s|--silent] Turns off informative output of the decompilation.
        [-f|--output-format OUTPUT_FORMAT] Output format [plain|json|json-human] (default: plain).
        [-m|--mode MODE] Force the type of decompilation mode [bin|raw] (default: bin).
        [-p|--pdb FILE] File with PDB debug information.
        [-k|--keep-unreachable-funcs] Keep functions that are unreachable from the main function.
        [--cleanup] Removes temporary files created during the decompilation.
        [--config] Specify JSON decompilation configuration file.
        [--disable-static-code-detection] Prevents detection of statically linked code.
Selective decompilation arguments:
        [--select-ranges RANGES] Specify a comma separated list of ranges to decompile (example: 0x100-0x200,0x300-0x400,0x500-0x600).
        [--select-functions FUNCS] Specify a comma separated list of functions to decompile (example: fnc1,fnc2,fnc3).
        [--select-decode-only] Decode only selected parts (functions/ranges). Faster decompilation, but worse results.
Raw or Intel HEX decompilation arguments:
        [-a|--arch ARCH] Specify target architecture [mips|pic32|arm|thumb|arm64|powerpc|x86|x86-64].
                         Required if it cannot be autodetected from the input (e.g. raw mode, Intel HEX).
        [-e|--endian ENDIAN] Specify target endianness [little|big].
                             Required if it cannot be autodetected from the input (e.g. raw mode, Intel HEX).
        [-b|--bit-size SIZE] Specify target bit size [16|32|64] (default: 32).
                             Required if it cannot be autodetected from the input (e.g. raw mode).
        [--raw-section-vma ADDRESS] Virtual address where section created from the raw binary will be placed.

retdectool

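retdectool is the former bin2llvmir executable. Study it from the entry point to see how the libraries are combined to turn assembly into IR and how the analysis and optimization passes then produce very readable IR. The main function is in retdec-master\src\retdectool\retdec.cpp, and the key call is disassemble. Its first parameter, a string reference, is the path of the input file, inputPath; the second, the FunctionSet pointer fs, receives the result. It is worth looking at the retdec::common::Function data structure, which stores the function type, IR and other useful information.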

main
{
LlvmModuleContextPair disassemble(
        const std::string& inputPath,
        retdec::common::FunctionSet* fs)
 {
    auto context = std::make_unique<llvm::LLVMContext>();
    auto module = createLlvmModule(*context);

    config::Config c;
    c.parameters.setInputFile(inputPath);

    // Create a PassManager to hold and optimize the collection of passes we
    // are about to build.
    llvm::legacy::PassManager pm; // a PassManager manages passes: add as many as needed, then run() executes them all in order

    pm.add(new bin2llvmir::ProviderInitialization(&c));
    // ProviderInitialization derives from ModulePass (src\bin2llvmir\optimizations\provider_init); its runOnModule is executed
    pm.add(new bin2llvmir::Decoder());
    // the Decoder pass is a further wrapper around capstone2llvmir
    // Now that we have all of the passes ready, run them.
    pm.run(*module);

    fillFunctions(*module, fs);

    return LlvmModuleContextPair{std::move(module), std::move(context)};
 }
}
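
For readers who have not used an LLVM pass manager, the pattern is simply a list of callables run in order over one module. Here is a toy Python sketch of the ordering dependency in disassemble, where the decoder needs the providers set up first. The class and both passes are stand-ins of my own and say nothing about what the real ProviderInitialization and Decoder do internally.

```python
class PassManager:
    """Runs the added passes in insertion order over one module."""

    def __init__(self):
        self._passes = []

    def add(self, module_pass):
        self._passes.append(module_pass)

    def run(self, module) -> bool:
        changed = False
        for module_pass in self._passes:
            changed = bool(module_pass(module)) or changed
        return changed


def provider_initialization(config):
    def run_on_module(module):
        module["config"] = config  # make the config available to later passes
        return False  # in this toy, setup does not count as changing the IR
    return run_on_module


def decoder(module):
    if "config" not in module:
        raise RuntimeError("providers must be initialized first")
    module["functions"] = [f"decoded from {module['config']['input']}"]
    return True


pm = PassManager()
pm.add(provider_initialization({"input": "a.out"}))
pm.add(decoder)
module = {}
changed = pm.run(module)
print(changed, module["functions"])
```

Swapping the two `add` calls makes the decoder raise, which is the reason the order of the `pm.add` lines in the C++ matters.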

Appendix: capstone2llvmir directory layout and sources

capstone2llvmir main directory
arm subdirectory
arm.cpp	ARM implementation of `Capstone2LlvmIrTranslator` (definitions of the ARM translation functions)
arm_impl.h	ARM implementation of `Capstone2LlvmIrTranslator` (declaration of the ARM translator class)
arm_init.cpp	Initializations for ARM implementation of `Capstone2LlvmIrTranslator` (initialization)
capstone2llvmir.cpp	Converts bytes to Capstone representation, and Capstone representation to LLVM IR (key interface)
capstone2llvmir_impl.cpp	Common public interface for translators converting bytes to LLVM IR (key interface implementation)
capstone2llvmir_impl.h	Common private implementation for translators converting bytes to LLVM IR
capstone_utils.h	Utility functions for types, enums, etc. defined in Capstone
exceptions.cpp	Definitions of exceptions used in capstone2llvmir library
llvmir_utils.cpp	LLVM IR utilities
llvmir_utils.h	LLVM IR utilities

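zchao33 · posted 2021-5-10 11:15

Had a look, nice!

SIYEisland · posted 2021-5-10 20:46

Thanks for sharing

cptw · posted 2021-5-12 12:55

Thanks for sharing

yutu925 · posted 2021-5-13 16:02

Great post, learned a lot, thanks for sharing.

L剑仙 · posted 2021-5-13 23:56

Thanks to h大 for marking this as a featured post

寿阳炎 · posted 2021-5-18 11:47

Impressive work

hzwang1966 · posted 2021-5-19 15:07

This is expert-level stuff, really strong

4396的梦想 · posted 2021-5-20 15:34

I strongly agree!

l441669899 · posted 2021-5-21 07:47

Learning from this, thanks!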
