Introduction
EVM-IR Specification v0.1
Status: Draft
Audience: Compiler engineers, VM architects, language designers
Purpose of This Specification
This document defines EVM-IR, a low-level, SSA-based intermediate representation for compiling high-level smart contract languages to Ethereum Virtual Machine (EVM) bytecode.
EVM-IR is designed to:
- Serve as a common backend target for multiple languages (Ora and others)
- Provide a precise, analyzable model of EVM-style execution
- Enable optimizations and formal reasoning at a level higher than bytecode
- Remain EVM-aware without mirroring the EVM’s stack machine structure
- Be simple to implement in existing compiler toolchains (e.g., MLIR dialects)
EVM-IR is not a source language, runtime, or bytecode format. It is a compiler IR that sits between high-level language frontends and final EVM code generation.
Design Goals
EVM-IR is guided by the following design goals:
Language-Agnostic
EVM-IR must not embed assumptions from a single language (e.g. Ora, Solidity, Vyper). Any frontend that can lower its semantics to a typed, SSA-based control-flow graph, without composite types or high-level exceptions, should be able to target EVM-IR.
EVM-Aware but Not Stack-Shaped
The EVM is a stack machine. EVM-IR is not.
Instead, EVM-IR:
- Models EVM concepts explicitly (storage, memory, calldata, transient storage)
- Uses SSA values and typed operations instead of implicit stacks
- Defers stack concerns to a later stackification phase
This separation makes the IR easier to reason about, optimize, and verify.
Canonical SSA Form
Canonical EVM-IR is a structured SSA form with:
- Typed SSA values
- Basic blocks with explicit terminators
- No critical edges
- No unreachable blocks after legalization
- No PHI operations in the final canonical form (state merges use memory)
A separate legalizer pass normalizes arbitrary EVM-IR into this canonical form.
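As an illustration of the "no critical edges" requirement, here is a minimal sketch of how a legalizer might detect and split critical edges. It assumes the standard definition (an edge from a block with several successors to a block with several predecessors) and a hypothetical CFG encoding as a dictionary from block label to successor labels; neither the encoding nor the function names come from this specification.

```python
from collections import defaultdict


def find_critical_edges(successors):
    """Return edges (src, dst) where src has several successors
    and dst has several predecessors."""
    predecessors = defaultdict(set)
    for src, dsts in successors.items():
        for dst in dsts:
            predecessors[dst].add(src)

    critical = set()
    for src, dsts in successors.items():
        if len(set(dsts)) > 1:
            for dst in dsts:
                if len(predecessors[dst]) > 1:
                    critical.add((src, dst))
    return sorted(critical)


def split_critical_edges(successors):
    """Return a new CFG where each critical edge goes through a fresh block."""
    result = {block: list(dsts) for block, dsts in successors.items()}
    for src, dst in find_critical_edges(successors):
        split = f"{src}_to_{dst}"  # hypothetical naming scheme for new blocks
        result[src] = [split if d == dst else d for d in result[src]]
        result[split] = [dst]
    return result


# A conditional branch in "entry" that either runs "then" or skips to "merge".
cfg = {
    "entry": ["then", "merge"],
    "then": ["merge"],
    "merge": [],
}

print(find_critical_edges(cfg))
print(split_critical_edges(cfg))
```

In this example the edge from entry to merge is critical, and splitting it gives the legalizer a dedicated block in which to place any stores needed by the merge.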
Deterministic Lowering to Bytecode
Given a canonical EVM-IR module, lowering to EVM bytecode should be:
- Deterministic: no hidden semantics or non-local choices
- Predictable: small IR changes produce small codegen changes
- Reasonable: preserves obvious performance expectations
To achieve this, the specification defines:
- Allowed operations and their semantics
- Control-flow and structural invariants
- Stackification rules (SSA → stack machine)
Explicit, Minimal Semantics
EVM-IR avoids embedding high-level language features such as:
- exceptions
- generics
- inheritance
- traits / interfaces
- language-specific error models
Instead, it exposes a minimal set of operations that can encode these features using control flow, memory, storage, and calls.
Non-Goals
EVM-IR intentionally does not attempt to:
- Define a new source language or syntax
- Replace the Ethereum ABI specification
- Define gas pricing or performance guarantees
- Encode every EVM quirk as a first-class concept
- Serve as a human-friendly assembly language
- Model non-EVM backends (e.g., WASM) directly
Language frontends are responsible for:
- Type checking and high-level type systems
- Ownership / borrowing / capability models
- Semantic checks (e.g. overflow policies)
- Composite type lowering (structs, arrays, enums)
- High-level optimizations
EVM-IR assumes that all high-level decisions have been made before lowering.
Position in the Compilation Pipeline
EVM-IR sits in the middle of the compilation pipeline:
High-Level Language (Ora, etc.)
│
▼
Frontend IR / AST / MLIR dialect
│
(language-specific lowering)
▼
EVM-IR
│
[Legalizer Pass]
▼
Canonical EVM-IR (SSA)
│
[Stackifier Pass]
▼
Stack-Oriented IR
│
[Backend / Codegen]
▼
EVM Bytecode
Key properties:
- Frontends lower their own constructs to EVM-IR, not directly to bytecode.
- Legalizer ensures EVM-IR satisfies canonical constraints (no PHI, no composites, normalized CFG).
- Stackifier converts SSA into linear, stack-based code suitable for bytecode emission.
- Bytecode generator encodes the stack-level program into raw EVM opcodes (out of scope for v0.1).
Module and Function Model
An EVM-IR compilation unit is a module.
A module contains:
- Functions: code bodies with typed parameters and return types
- Global metadata: debug info, ABI information, target properties
- Optionally: declarations of external functions or runtime hooks
Each function:
- Has a name (e.g., @transfer or @__entry)
- Has a signature: list of parameter types and result types
- Contains a non-empty set of basic blocks
- Has exactly one entry block
- Uses SSA values produced by operations within the function
Basic blocks:
- Contain a sequence of operations
- End with a terminator (branch, conditional branch, return, revert, etc.)
- May have zero or more predecessors
- Have no implicit control-flow edges
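The module, function, and block structure described above can be sketched as plain data classes with a small structural checker. This is only an illustration: the class names, the operation names (const, br, cond_br, return, revert), and the u256 type spelling are assumptions, not definitions from this specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical terminator names; the real set is defined in the Operations section.
TERMINATORS = {"br", "cond_br", "return", "revert"}


@dataclass
class Operation:
    name: str
    operands: List[str] = field(default_factory=list)
    result: Optional[str] = None  # SSA value defined by this operation, if any


@dataclass
class BasicBlock:
    label: str
    ops: List[Operation]


@dataclass
class Function:
    name: str
    param_types: List[str]
    result_types: List[str]
    blocks: List[BasicBlock]  # by convention here, blocks[0] is the entry block


def check_function(fn):
    """Return a list of structural errors; an empty list means the checks pass."""
    if not fn.blocks:
        return [f"{fn.name}: function has no blocks"]
    errors = []
    defined = set()
    for block in fn.blocks:
        # Every block must end with a terminator, and only there.
        if not block.ops or block.ops[-1].name not in TERMINATORS:
            errors.append(f"{block.label}: missing terminator")
        for op in block.ops[:-1]:
            if op.name in TERMINATORS:
                errors.append(f"{block.label}: terminator before end of block")
        # SSA: each value is defined exactly once within the function.
        for op in block.ops:
            if op.result is not None:
                if op.result in defined:
                    errors.append(f"{op.result}: defined more than once")
                defined.add(op.result)
    return errors


ok = Function("@transfer", ["u256"], ["u256"], [
    BasicBlock("entry", [Operation("const", ["1"], "%0"),
                         Operation("return", ["%0"])]),
])
bad = Function("@__entry", [], [], [
    BasicBlock("entry", [Operation("const", ["1"], "%0"),
                         Operation("const", ["2"], "%0")]),
])

print(check_function(ok))
print(check_function(bad))
```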
Canonical IR Requirements (High-Level)
While the detailed canonicalization rules are specified in the Legalizer section, the following high-level requirements hold for canonical EVM-IR:
- SSA Form: Each value is assigned exactly once and used any number of times. No mutable variables; state is represented via memory, storage, or transient storage.
- No PHI Nodes in Final Form: IR may temporarily use PHI-like constructs (MLIR block arguments, etc.), but the legalizer must ultimately lower all control-flow merges to explicit memory operations (e.g., stores in predecessors, loads in merge blocks).
- Explicit Control Flow: Every block ends with a terminator. No fall-through without an explicit branch.
- No Composite Types: Structs, tuples, arrays, and maps are not first-class types in EVM-IR. They must be lowered to primitive values and memory operations.
- Well-Formed CFG: No unreachable blocks. No critical edges that violate canonicalization rules.
- Well-Typed Operations: Every operation must satisfy type rules defined in the Operations section. Implicit conversions are not allowed.
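The PHI requirement (stores in predecessors, loads in merge blocks) can be sketched as follows. The block encoding, the textual mload/mstore spelling, and the 0x80 slot base are assumptions made for this example; the actual rules are those of the Legalizer section.

```python
def lower_block_args(blocks, slot_base=0x80):
    """Replace block parameters with memory traffic.

    blocks maps a label to a dict with:
      "params":   names of block parameters (PHI-like values)
      "ops":      operations as plain strings
      "branches": list of (target_label, [argument names])
    """
    # Give every (block, parameter) pair its own 32-byte memory slot.
    slots = {}
    next_slot = slot_base
    for label, block in blocks.items():
        for param in block["params"]:
            slots[(label, param)] = next_slot
            next_slot += 32

    lowered = {}
    for label, block in blocks.items():
        # Merge block: load each former parameter from its slot.
        ops = [f"{p} = mload {slots[(label, p)]:#x}" for p in block["params"]]
        ops.extend(block["ops"])
        # Predecessor: store each branch argument before the terminator.
        branches = []
        for target, args in block["branches"]:
            for param, arg in zip(blocks[target]["params"], args):
                ops.append(f"mstore {slots[(target, param)]:#x}, {arg}")
            branches.append((target, []))
        lowered[label] = {"params": [], "ops": ops, "branches": branches}
    return lowered


blocks = {
    "then": {"params": [], "ops": ["%a = const 1"],
             "branches": [("merge", ["%a"])]},
    "else": {"params": [], "ops": ["%b = const 2"],
             "branches": [("merge", ["%b"])]},
    "merge": {"params": ["%x"], "ops": [], "branches": []},
}

lowered = lower_block_args(blocks)
for label, block in lowered.items():
    print(label, block["ops"])
```

Because each parameter has a private slot and the loads happen at the top of the merge block, the stored arguments are always SSA values, so this sketch does not need parallel-copy handling.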
EVM Awareness and Address Spaces
Although EVM-IR is not a stack machine, it is explicitly aware of the distinct EVM address spaces:
- Memory: transient per-call, byte-addressable (accessed in 32-byte words), zero-initialized
- Storage: persistent contract key-value store
- Calldata: read-only call input buffer
- Transient Storage: EIP-1153 temporary key-value store, cleared at end of transaction
- Code: read-only contract code region for EXTCODE* operations
These are expressed via typed pointers with address space identifiers, rather than via raw numeric offsets alone. This allows:
- static verification of address space usage
- more robust lowering to EVM opcodes
- analysis and optimization passes to reason about memory vs. storage vs. transient vs. calldata
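To illustrate how typed pointers enable static checks and opcode selection, here is a minimal sketch. The Pointer class and the helper names are hypothetical; the opcode mnemonics are the standard EVM ones, and the code address space is left out of the load table because code is read through copy-style opcodes that this sketch does not model.

```python
from dataclasses import dataclass
from enum import Enum


class AddressSpace(Enum):
    MEMORY = "memory"
    STORAGE = "storage"
    CALLDATA = "calldata"
    TRANSIENT = "transient"
    CODE = "code"


# Word-sized load/store opcodes per address space.
LOAD_OPCODE = {
    AddressSpace.MEMORY: "MLOAD",
    AddressSpace.STORAGE: "SLOAD",
    AddressSpace.CALLDATA: "CALLDATALOAD",
    AddressSpace.TRANSIENT: "TLOAD",
}
STORE_OPCODE = {
    AddressSpace.MEMORY: "MSTORE",
    AddressSpace.STORAGE: "SSTORE",
    AddressSpace.TRANSIENT: "TSTORE",
}


@dataclass(frozen=True)
class Pointer:
    space: AddressSpace
    offset: int


def select_load(ptr):
    """Pick the load opcode for a pointer, rejecting unsupported spaces."""
    if ptr.space not in LOAD_OPCODE:
        raise TypeError(f"no word load for address space {ptr.space.value}")
    return LOAD_OPCODE[ptr.space]


def select_store(ptr):
    """Pick the store opcode for a pointer, rejecting read-only spaces."""
    if ptr.space not in STORE_OPCODE:
        raise TypeError(f"cannot store to address space {ptr.space.value}")
    return STORE_OPCODE[ptr.space]


print(select_load(Pointer(AddressSpace.CALLDATA, 4)))
print(select_store(Pointer(AddressSpace.STORAGE, 0)))
```

A store through a calldata pointer is rejected before any bytecode is emitted, which is the kind of static verification the address space identifiers make possible.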
Relationship With Frontends (e.g., Ora)
EVM-IR is intentionally not tied to the Ora language, but Ora is a primary expected frontend.
Frontends are responsible for:
- Introducing EVM-IR functions that implement their language functions
- Lowering language-level constructs (errors, results, enums, structs, etc.) into memory and control flow
- Providing ABI metadata used by the ABI lowering stage
- Injecting verification-friendly patterns if desired
The IR itself does not know about:
- Ora’s refinement types
- Ora’s proof obligations
- Language-level error unions
Those are erased or encoded into EVM-IR primitives before or during lowering.
ABI and Entry Semantics (High-Level View)
EVM-IR supports:
- A special entry dispatcher function that:
  - examines calldata
  - extracts the 4-byte selector
  - routes execution to the appropriate function body
  - handles unknown selectors and ETH-only transfers
- Per-function ABI decode logic that:
  - reads argument words from calldata
  - decodes static types into SSA values
  - decodes dynamic ABI types into memory regions
- Per-function ABI encode logic that:
  - writes return values to memory in ABI layout
  - returns a pointer + length pair via return/revert
Details are provided in the ABI Lowering section. The key point for this introduction is that EVM-IR is the stage at which ABI obligations become explicit.
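The dispatcher and the static-type decode and encode steps can be modelled in a few lines. This is a behavioural sketch only: the function names, the @__receive and @__fallback labels, and the selector value are hypothetical, and dynamic ABI types are not handled.

```python
def route(calldata, selectors, has_receive=False, has_fallback=False):
    """Choose the function body to run for a call, or "revert"."""
    # Empty calldata models an ETH-only transfer.
    if len(calldata) == 0 and has_receive:
        return "@__receive"
    # The selector is the first 4 bytes of calldata.
    if len(calldata) >= 4 and calldata[:4] in selectors:
        return selectors[calldata[:4]]
    # Unknown or truncated selector.
    return "@__fallback" if has_fallback else "revert"


def decode_static_words(calldata, count):
    """Read `count` 32-byte big-endian argument words after the selector."""
    words = []
    for i in range(count):
        start = 4 + 32 * i
        chunk = calldata[start:start + 32]
        if len(chunk) != 32:
            raise ValueError("calldata too short")
        words.append(int.from_bytes(chunk, "big"))
    return words


def encode_static_words(values):
    """Lay out return values as consecutive 32-byte big-endian words."""
    return b"".join(v.to_bytes(32, "big") for v in values)


selector = bytes.fromhex("12345678")  # made-up selector for illustration
table = {selector: "@transfer"}
call = selector + encode_static_words([7, 42])

print(route(call, table))
print(decode_static_words(call, 2))
print(route(b"", table, has_receive=True))
```

In EVM-IR itself these steps would appear as explicit calldata loads, comparisons, and branches in the entry function rather than as host-language helpers.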
Debug Information and Tooling
EVM-IR defines a minimal, language-agnostic debug model:
- Source locations (file, line, column)
- Variable metadata (name, type, location)
- Scope information (function/block scopes)
The specification also describes how to export this information into a format compatible with the ethdebug/format project, allowing integration with:
- debuggers
- tracers
- on-chain analysis tools
Details are provided in the Debug Information section.
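As a rough picture of the three kinds of debug records listed above, the following sketch models them as data classes and serializes them to JSON. The field names, the file name, and the JSON shape are assumptions for illustration; this is not the ethdebug/format schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SourceLocation:
    file: str
    line: int
    column: int


@dataclass
class VariableInfo:
    name: str
    type: str
    location: str  # where the value lives, e.g. an SSA value name (assumed encoding)


@dataclass
class Scope:
    kind: str  # "function" or "block"
    start: SourceLocation
    variables: List[VariableInfo] = field(default_factory=list)
    children: List["Scope"] = field(default_factory=list)


scope = Scope(
    kind="function",
    start=SourceLocation("token.ora", 10, 1),
    variables=[VariableInfo("amount", "u256", "%0")],
    children=[Scope("block", SourceLocation("token.ora", 12, 5))],
)

# asdict converts nested data classes and lists recursively.
dumped = json.dumps(asdict(scope), sort_keys=True, indent=2)
print(dumped)
```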
Stackification and Backend
EVM-IR is not directly executable. It is lowered in three stages:
- Canonical EVM-IR (SSA), produced by the legalizer
- Stack-Oriented IR, produced by the stackifier
- EVM bytecode, produced by a backend encoder (out of scope for v0.1)
The stackifier is responsible for:
- choosing frame layouts for stack and memory
- scheduling instructions according to stack constraints
- introducing DUP/SWAP operations when necessary
- ensuring efficient code generation
The backend then maps Stack-Oriented IR to raw opcodes.
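To make the SSA-to-stack step concrete, here is a deliberately naive stackifier for a single straight-line block, together with a tiny stack interpreter to check it. It is a sketch under several assumptions: operations are (result, opcode, operands) tuples, "PUSH n" is pseudo-assembly rather than a sized PUSH opcode, every operand is copied with DUP instead of being consumed or reordered with SWAP, and dead values are never popped. A real stackifier would do much better on all of these points.

```python
def stackify(ops):
    """Turn straight-line SSA ops into stack code using only PUSH and DUP."""
    stack = []  # symbolic stack of SSA names; the top is the last element
    code = []
    for result, opcode, operands in ops:
        if opcode == "const":
            code.append(f"PUSH {operands[0]}")
        else:
            # Copy operands to the top, last operand first, so the first
            # operand ends up on top, as EVM opcodes expect.
            for name in reversed(operands):
                depth = stack[::-1].index(name) + 1  # 1 means top of stack
                if depth > 16:
                    raise ValueError(f"{name} is too deep for DUP")
                code.append(f"DUP{depth}")
                stack.append(name)
            code.append(opcode.upper())
            del stack[len(stack) - len(operands):]  # the opcode pops its operands
        stack.append(result)  # and pushes its result
    return code


def run(code):
    """Interpret the small instruction subset used above (256-bit wraparound)."""
    stack = []
    binary = {"ADD": lambda a, b: a + b,
              "SUB": lambda a, b: a - b,
              "MUL": lambda a, b: a * b}
    for instr in code:
        if instr.startswith("PUSH "):
            stack.append(int(instr[5:]))
        elif instr.startswith("DUP"):
            stack.append(stack[-int(instr[3:])])
        else:
            a, b = stack.pop(), stack.pop()  # a is the former top of stack
            stack.append(binary[instr](a, b) % 2**256)
    return stack


ssa = [
    ("%0", "const", [10]),
    ("%1", "const", [3]),
    ("%2", "sub", ["%0", "%1"]),   # 10 - 3
    ("%3", "mul", ["%2", "%2"]),   # 7 * 7
]

code = stackify(ssa)
print(code)
print(run(code)[-1])
```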
Document Structure
This specification is split into the following standalone sections:
- types.md: Type system (primitive types, pointers, address spaces)
- ops.md: Operations (semantics, type rules, constraints)
- legalizer.md: Canonical form, legalizer rules, CFG normalization
- stackifier.md: SSA → stack machine transformation
- debug.md: Debug metadata and ethdebug integration
- abi-lowering.md: ABI decoding, encoding, dispatcher, fallback/receive
- examples.md: Worked examples of complete lowering pipelines
- appendix.md: Reference tables, glossary, and auxiliary material
Each file can be read independently, but together they describe the full EVM-IR model.