In the previous article (BIOS execution in QEMU: where it all starts) we described how QEMU starts executing the BIOS image and what binary translation means. Even though we found the BIOS image instructions in the host memory and established a sort of correlation to emulated physical addresses executed by QEMU’s CPU, there was still a leap in-between. That is what we will explore now.
Modern CPUs view memory as a virtual, contiguous range. A hardware component called the MMU (Memory Management Unit) translates virtual addresses into the physical addresses that ultimately go onto the system memory bus. To increase performance, MMUs implement a caching scheme called the TLB (Translation Lookaside Buffer). These same concepts of ‘translating between two disjoint address spaces’ and ‘caching translations’ are implemented by QEMU’s software MMU (softmmu).
QEMU’s CPU executes instructions on an emulated physical address space (emu-phy-addresses, from now on), when in real mode. These instructions are located in QEMU’s process memory (host virtual space); so there has to be a way of finding and fetching them based on their emu-phy-address while generating Translation Blocks. The same applies to read/write CPU instructions that access the main memory or memory-mapped devices.
What we described is essentially a mapping problem, and the key structure QEMU uses to tackle it is MemoryRegion. Each Memory Region instance describes an emu-phy-address range. Looking at some of its members (address, size, name, read-only status, sub-regions, aliases, etc.) should give an idea of what the region is for. However, the effect of an access will depend on the region's type.
Let’s have a look at a Memory Regions map just before CPU execution begins:
- system [0x0 – UINT64_MAX)
- ram-below-4g [0x0 – 0x8000000)
- Alias of pc.ram
- smram-region [0xa0000 – 0xc0000)
- Alias of pci
- pam-pci [0xf0000 – 0x100000)
- Alias of pci
- pam-rom [0xf0000 – 0x100000)
- Alias of pc.ram
- pam-pci [0xf0000 – 0x100000)
- Alias of pc.ram
- pam-ram [0xf0000 – 0x100000)
- Alias of pc.ram
- …
- Several pam-* regions similar to the previous (omitted)
- ioapic [0xfec00000 – 0xfec01000)
- hpet [0xfed00000 – 0xfed00400)
- apic-msi [0xfee00000 – 0xfef00000)
- pci [0x0 – UINT64_MAX)
- vga-lowmem [0xa0000 – 0xc0000)
- pc.rom [0xc0000 – 0xe0000)
- isa-bios [0xe0000 – 0x100000)
- Alias of pc.bios
- pc.bios [0xfffc0000 – 0x100000000)
- ram-below-4g [0x0 – 0x8000000)
- io
- piix4-pm [0x0 – 0x40)
- acpi-evt [0x0 – 0x4)
- acpi-cnt [0x4 – 0x6)
- acpi-tmr [0x8 – 0xc)
- dma-chan [0x0 – 0x8)
- dma-cont [0x8 – 0x10)
- pic [0x20 – 0x22)
- pit [0x40 – 0x44)
- i8042-data [0x60]
- pcspk [0x61]
- i8042-cmd [0x64]
- rtc [0x70 – 0x72)
- rtc-index [0x0]
- kvmvapic [0x7e – 0x80)
- ioport80 [0x80]
- dma-page [0x81 – 0x84)
- dma-page [0x87]
- dma-page [0x89 – 0x8c)
- dma-page [0x8f]
- port92 [0x92]
- pic [0xa0 – 0xa2)
- apm-io [0xb2 – 0xb4)
- dma-chan [0xc0 – 0xd0)
- dma-cont [0xd0 – 0xe0)
- ioportF0 [0xf0]
- ide [0x170 – 0x178)
- vbe [0x1ce – 0x1d1)
- ide [0x1f0 – 0x1f8)
- ide [0x376]
- parallel [0x378 – 0x380)
- vga [0x3b4 – 0x3b6)
- vga [0x3ba]
- vga [0x3c0 – 0x3d0)
- vga [0x3d4 – 0x3d6)
- vga [0x3da]
- fdc [0x3f1 – 0x3f6)
- ide [0x3f6]
- fdc [0x3f7]
- serial [0x3f8 – 0x400)
- elcr [0x4d0]
- elcr [0x4d1]
- fwcfg [0x510 – 0x512)
- fwcfg.dma [0x514 – 0x51c)
- pci-conf-idx [0xcf8 – 0xcfc)
- piix3-reset-control [0xcf9]
- pci-conf-data [0xcfc – 0xd00)
- vmport [0x5658]
- acpi-pci-hotplug [0xae00 – 0xae14)
- acpi-cpu-hotplug [0xaf00 – 0xaf20)
- acpi-gpe0 [0xafe0 – 0xafe4)
- pm-smbus [0xb100 – 0xb140)
- piix4-pm [0x0 – 0x40)
(*) A breakpoint in memory_map_init gives pointers to navigate this map.
(**) In bold the Memory Regions of interest for this article.
If the CPU starts executing at 0xfffffff0 emu-phy-address, pc.bios seems to be the Memory Region involved when generating the first Translation Block. pc.bios is a read-only RAM region. As such, its ram_block member refers to a RAMBlock structure; which in turn has a host pointer to the actual chunk in host’s memory.
As seen in the previous article, rom_add_file reads the BIOS image from a file and a Rom structure indicates its location in the host memory. This location has no relation to pc.bios, though. Bytes have to be copied from one place to the other. That’s precisely what address_space_write_rom_internal does: pc.bios’s RAM block is filled with BIOS bytes from the Rom instance upon a virtual machine reset.
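The copy step can be pictured with a minimal sketch. The ToyRom and ToyRamBlock types and the toy_rom_reset function below are hypothetical stand-ins invented for illustration; they are not QEMU's Rom, RAMBlock or address_space_write_rom_internal, which handle many more cases.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the structure that holds the image read from a file. */
typedef struct ToyRom {
    const uint8_t *data; /* image bytes in host memory */
    size_t size;
} ToyRom;

/* Hypothetical stand-in for the host memory chunk backing a RAM region. */
typedef struct ToyRamBlock {
    uint8_t *host;
    size_t size;
} ToyRamBlock;

/* On "reset", copy as much of the image as fits into the RAM block.
 * Returns the number of bytes copied. */
static size_t toy_rom_reset(const ToyRom *rom, ToyRamBlock *block)
{
    size_t n = rom->size < block->size ? rom->size : block->size;
    memcpy(block->host, rom->data, n);
    return n;
}
```

Before toy_rom_reset runs, the block holds whatever it was initialized with; afterwards, the same offset in both buffers shows the same bytes, which is what the test below checks in gdb.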
Time for a quick test.
BIOS image bytes in the host memory, pointed by a Rom structure:
(gdb) x/5xb ((Rom*)0x5555563d6ac0)->data+0x3fff0
0x5555567cadf0: 0xea 0x5b 0xe0 0x00 0xf0
Note that 0x3fff0 is the offset to the BIOS image entry point, and what we are seeing there are the bytes that belong to the first instruction.
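As a side note, those five bytes can be decoded by hand. In 16-bit mode, opcode 0xEA is a far jump followed by a little-endian 16-bit offset and a 16-bit segment. The sketch below decodes that single encoding; the type and function names are made up for this example and it is not a general x86 decoder.

```c
#include <stdint.h>

/* Target of a 16-bit far jump (JMP ptr16:16). */
typedef struct ToyFarJump {
    uint16_t segment;
    uint16_t offset;
} ToyFarJump;

/* Decode "EA <offset16> <segment16>". Returns 0 on success,
 * -1 if the first byte is not the 0xEA opcode. */
static int toy_decode_far_jump(const uint8_t bytes[5], ToyFarJump *out)
{
    if (bytes[0] != 0xEA) {
        return -1;
    }
    out->offset = (uint16_t)(bytes[1] | (bytes[2] << 8));
    out->segment = (uint16_t)(bytes[3] | (bytes[4] << 8));
    return 0;
}

/* Real-mode linear address: segment * 16 + offset. */
static uint32_t toy_real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

Fed with 0xea 0x5b 0xe0 0x00 0xf0, this yields segment 0xf000 and offset 0xe05b, that is, a jump to linear address 0xfe05b.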
pc.bios RAM block before copy:
(gdb) x/5xb ((MemoryRegion*)0x55555622ee00)->ram_block->host+0x3fff0
0x7fffd7c3fff0: 0x00 0x00 0x00 0x00 0x00
pc.bios RAM block after copy:
(gdb) x/5xb ((MemoryRegion*)0x55555622ee00)->ram_block->host+0x3fff0
0x7fffd7c3fff0: 0xea 0x5b 0xe0 0x00 0xf0
The only pointer to the pc.bios RAM chunk is stored at &(ram_block->host). If 0xfffffff0 is ever translated to a host memory address, that pointer has to be read. If the bytes of the first instruction are ever retrieved, the memory at ram_block->host+0x3fff0 has to be read too. We will set memory breakpoints on both locations and see what happens.
The memory breakpoint on &(ram_block->host) is hit first, within tlb_set_page_with_attrs. The call stack gives us an idea of how we got there. Before starting a new Translation Block for the 0xfffffff0 emu-phy-address, a hash table lookup checks whether one has already been generated (see tb_htable_lookup). The host address that corresponds to the looked-up emu-phy-address is used as part of the hash key. As a result, 0xfffffff0 needs to be converted, for the first time, into a host address (see get_page_addr_code).
Mimicking real MMUs, there is a cache scheme provided by a TLB software implementation. get_page_addr_code will try to get the host address from the TLB, and receives two parameters to accomplish that: a pointer to the CPU state (CPUX86State) and the emu-phy-address to be looked up (0xfffffff0). The first step is to obtain a pointer to the CPU’s CPUTLB instance, which contains all the TLB information. CPUX86State is a structure within an enclosing X86CPU one. Thus, we can obtain a pointer to X86CPU from the one we have to CPUX86State by subtracting its offset:
(gdb) print/x (((X86CPU*)(((char*)env) - (size_t)(&(((X86CPU*)0)->env)))))
$10 = 0x555556656400
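The gdb expression above is the classic "container of" idiom. Here is a minimal sketch of it in C, using toy structures (ToyCPU and ToyEnv are hypothetical and only mimic the nesting of CPUX86State inside X86CPU; the real layouts are far larger):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the CPU state structure. */
typedef struct ToyEnv {
    uint64_t eip;
} ToyEnv;

/* Toy stand-in for the enclosing CPU object. */
typedef struct ToyCPU {
    uint64_t parent_state[4]; /* placeholder for whatever precedes env */
    ToyEnv env;               /* embedded by value, not a pointer */
} ToyCPU;

/* Recover the enclosing ToyCPU from a pointer to its embedded env member
 * by subtracting the member's offset within the structure. */
static ToyCPU *toy_cpu_from_env(ToyEnv *env)
{
    return (ToyCPU *)((char *)env - offsetof(ToyCPU, env));
}
```

offsetof(ToyCPU, env) plays the role of the (size_t)(&(((X86CPU*)0)->env)) term in the gdb expression.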
Once we have an X86CPU pointer we can navigate to its CPUTLB:
(gdb) print/x &(((X86CPU*)0x555556656400)->neg->tlb)
$13 = 0x55555665e6c0
In CPUTLB we find two sets of tables with different information, intended for fast and slow access (a common pattern to avoid polluting CPU cache lines with information that is not frequently needed). Each set has 3 MMU tables in i386: MMU_USER (for non-privileged mode), MMU_KSMAP (for privileged mode with SMAP protection) and MMU_KNOSMAP (for privileged mode without SMAP). MMU_KNOSMAP is used in real mode, which is where we are now.
Finally we got to the fast-access MMU_KNOSMAP table container: CPUTLBDescFast. This container consists of a mask and a table of CPUTLBEntry entries. Each entry has an address (different values for read, write and execute access) and an addend. If a TLB entry exists for a looked up address, its host counterpart can be obtained by adding the addend to it.
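To make the role of the addend concrete, here is a simplified model of a TLB entry. The field names follow what gdb prints for CPUTLBEntry, but ToyTLBEntry itself and toy_host_address are assumptions made for this sketch (QEMU's real entry is a union padded to a fixed size, and a 64-bit host is assumed).

```c
#include <stdint.h>

/* Simplified model of a TLB entry: one page-aligned address per access
 * type, plus the value that turns an emulated address into a host one. */
typedef struct ToyTLBEntry {
    uint64_t addr_read;
    uint64_t addr_write;
    uint64_t addr_code;
    uint64_t addend; /* host address minus emulated address */
} ToyTLBEntry;

/* Translation on a TLB hit is a single addition. */
static uint64_t toy_host_address(const ToyTLBEntry *entry, uint64_t emu_addr)
{
    return emu_addr + entry->addend;
}
```

The point of the design is that, once an entry exists for a page, translating any address within that page costs one addition and no further lookups.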
Let’s see what we have there:
(gdb) print/x (((X86CPU*)0x555556656400)->neg->tlb->f[2].mask)
$15 = 0x1fe0
(gdb) print *(((X86CPU*)0x555556656400)->neg->tlb->f[2].table)
$17 = {{{addr_read = 18446744073709551615, addr_write = 18446744073709551615, addr_code = 18446744073709551615, addend = 18446744073709551615}, dummy = '\377' <repeats 32 times>}}
Seems like the first entry in the table is not valid (all bytes set to 0xff). But, is that entry the one that corresponds to 0xfffffff0 emu-phy-address? To get the exact index into the table, we need to do:
(emu-phy-address >> TARGET_PAGE_BITS) & (mask >> CPU_TLB_ENTRY_BITS)
For 4096 bytes pages (TARGET_PAGE_BITS = 12) and a 0xfffffff0 emu-phy-address, the first term is 0xfffff. The second term is 0x1fe0 (mask) >> 5 (CPU_TLB_ENTRY_BITS = 5) = 0xff. The resulting entry index is 0xff.
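The same arithmetic as a small C sketch. The TOY_ constants mirror the values quoted above, and the function names are invented for this example rather than taken from QEMU:

```c
#include <stdint.h>

#define TOY_TARGET_PAGE_BITS   12 /* 4096-byte pages */
#define TOY_CPU_TLB_ENTRY_BITS 5  /* 32-byte table entries */

/* Index of the TLB entry responsible for addr, given the table mask. */
static uint64_t toy_tlb_index(uint64_t addr, uint64_t mask)
{
    uint64_t size_mask = mask >> TOY_CPU_TLB_ENTRY_BITS;
    return (addr >> TOY_TARGET_PAGE_BITS) & size_mask;
}

/* Byte offset of that entry from the start of the table. */
static uint64_t toy_tlb_entry_offset(uint64_t addr, uint64_t mask)
{
    return toy_tlb_index(addr, mask) << TOY_CPU_TLB_ENTRY_BITS;
}
```

With addr = 0xfffffff0 and mask = 0x1fe0, toy_tlb_index returns 0xff. Note that the mask, as stored, is already expressed in bytes: it equals the byte offset of the last entry in a 256-entry table.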
Each entry on the table is sizeof(CPUTLBEntry) = 32 bytes long. We can now read the right entry by adding index * sizeof(CPUTLBEntry) to the table pointer:
(gdb) print *(CPUTLBEntry*)((char*)(((X86CPU*)0x555556656400)->neg->tlb->f[2].table)+(32*0xff))
$28 = {{{addr_read = 18446744073709551615, addr_write = 18446744073709551615, addr_code = 18446744073709551615, addend = 18446744073709551615}, dummy = '\377' <repeats 32 times>}}
The entry for 0xfffffff0 still looks invalid. For an entry to be valid, entry-address & (TARGET_PAGE_MASK | TLB_INVALID_MASK) must be equal to looked-up-address & TARGET_PAGE_MASK. That is a simple comparison of page-aligned addresses, with a catch introduced by TLB_INVALID_MASK. The reason for this catch is to disambiguate valid entries from invalid ones. If TLB_INVALID_MASK were not used, a TLB entry whose entry-address is 0xFFFFFFFFFFFFFFFF could be considered valid when compared to a 0xFFFFFFFFFFFFFFF0 looked-up-address (0xFFFFFFFFFFFFF000 == 0xFFFFFFFFFFFFF000). When TLB_INVALID_MASK is part of the mask, the ‘TARGET_PAGE_BITS – 1’ bit of the entry-address term will be 1 for an invalid entry and 0 for a valid one. If it is 1, there cannot be a match against the looked-up-address & TARGET_PAGE_MASK term, because the lower bits there are 0 due to page alignment. The previous invalid case ends up being 0xFFFFFFFFFFFFF800 != 0xFFFFFFFFFFFFF000.
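The validity check, and the reason a plain page comparison is not enough, can be sketched as follows. The TOY_ macros reproduce the values used in the text (4096-byte pages, invalid bit at position TARGET_PAGE_BITS - 1); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_TARGET_PAGE_BITS 12
#define TOY_TARGET_PAGE_MASK (~((UINT64_C(1) << TOY_TARGET_PAGE_BITS) - 1))
#define TOY_TLB_INVALID_MASK (UINT64_C(1) << (TOY_TARGET_PAGE_BITS - 1))

/* Check described in the text: the invalid bit takes part in the
 * comparison, so an all-ones entry can never match. */
static bool toy_tlb_hit(uint64_t entry_addr, uint64_t lookup_addr)
{
    return (entry_addr & (TOY_TARGET_PAGE_MASK | TOY_TLB_INVALID_MASK))
           == (lookup_addr & TOY_TARGET_PAGE_MASK);
}

/* Naive variant without the invalid bit, shown only to expose the
 * false positive it would produce. */
static bool toy_tlb_hit_naive(uint64_t entry_addr, uint64_t lookup_addr)
{
    return (entry_addr & TOY_TARGET_PAGE_MASK)
           == (lookup_addr & TOY_TARGET_PAGE_MASK);
}
```

For an all-ones entry and a 0xFFFFFFFFFFFFFFF0 lookup, toy_tlb_hit_naive reports a hit while toy_tlb_hit correctly reports a miss.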
There is an observation that I want to stress before moving on: TLB translations occur at a page level. This means that translations for addresses in the range 0xfffff000 – 0xffffffff will all be handled by the same TLB entry.
A few pointers to the code that finds the current CPU’s MMU table, the TLB entry for a looked-up address and decides whether or not it is valid:
All these functions are used by get_page_addr_code. After verifying that the TLB entry for 0xfffffff0 is not valid, x86_cpu_tlb_fill and handle_mmu_fault are called.
In an x86 CPU operating in protected mode, the PG bit of the CR0 register indicates whether paging is enabled. When it is, the Page Tables pointed to by CR3 need to be walked to find a proper mapping, if there is one, or to signal a page fault. However, our CPU is in real mode and paging is disabled: we jump pretty quickly to do_mapping.
do_mapping calculates a few values which are then passed to tlb_set_page_with_attrs. In real mode, and going back to our 0xfffffff0 case, these values are pretty straightforward:
- vaddr = 0xfffff000
- paddr = 0xfffff000
- prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC
- mmu_idx = MMU_KNOSMAP (0x2)
- page_size = 4096
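A minimal sketch of how those values come about when paging is disabled. Everything here is illustrative: ToyMapping and toy_real_mode_mapping are invented names, and the numeric values of the protection flags are assumptions made for this example, not definitions copied from QEMU.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE   UINT64_C(4096)
#define TOY_PAGE_MASK   (~(TOY_PAGE_SIZE - 1))

/* Protection flags; the numeric values are assumptions for this sketch. */
#define TOY_PAGE_READ   0x1
#define TOY_PAGE_WRITE  0x2
#define TOY_PAGE_EXEC   0x4

#define TOY_MMU_KNOSMAP 2 /* MMU table index used in real mode */

typedef struct ToyMapping {
    uint64_t vaddr;
    uint64_t paddr;
    uint64_t page_size;
    int prot;
    int mmu_idx;
} ToyMapping;

/* Without paging there is nothing to walk: the page-aligned address maps
 * to itself and every kind of access is allowed. */
static ToyMapping toy_real_mode_mapping(uint64_t addr)
{
    ToyMapping m;
    m.vaddr = addr & TOY_PAGE_MASK;
    m.paddr = m.vaddr;
    m.page_size = TOY_PAGE_SIZE;
    m.prot = TOY_PAGE_READ | TOY_PAGE_WRITE | TOY_PAGE_EXEC;
    m.mmu_idx = TOY_MMU_KNOSMAP;
    return m;
}
```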
I won’t go into every tlb_set_page_with_attrs detail while creating and appending the TLB entry to the table, but want to mention how the entry’s addend is determined and how it relates to the pc.bios Memory Region.
To locate the pc.bios Memory Region starting from a CPUX86State instance, the following calls are performed:
From a CPUX86State instance we reach an AddressSpaceDispatch structure. The latter contains a PhysPageMap map, which has a table of MemoryRegionSection entries. Each of these entries refers to a fragment within a Memory Region. Using those structures, address_space_lookup_region finds the MemoryRegionSection instance that corresponds to the 0xfffff000 emu-phy-address.
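Conceptually, the lookup answers the question "which section contains this address?". The sketch below does that with a plain linear scan over a flat array, which is an intentional simplification: QEMU's PhysPageMap is organized for much faster lookups. ToySection and toy_lookup_section are hypothetical names; only the two offset fields are borrowed from MemoryRegionSection.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified section: a fragment of a region placed in the address space. */
typedef struct ToySection {
    const char *region_name;              /* canonical region it belongs to */
    uint64_t offset_within_address_space; /* first emu-phy-address covered */
    uint64_t size;                        /* bytes covered */
    uint64_t offset_within_region;        /* where the fragment starts in the region */
} ToySection;

/* Return the section that contains addr, or NULL if none does. */
static const ToySection *toy_lookup_section(const ToySection *sections,
                                            size_t count, uint64_t addr)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t start = sections[i].offset_within_address_space;
        if (addr >= start && addr - start < sections[i].size) {
            return &sections[i];
        }
    }
    return NULL;
}
```

With one section for isa-bios (0xe0000, 128 KB, offset 0x20000 into pc.bios) and one for pc.bios itself (0xfffc0000, 256 KB, offset 0), looking up 0xfffff000 returns the second one.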
A few definitions now:
- section is the MemoryRegionSection instance found for 0xfffff000 (belonging to the pc.bios Memory Region)
- section->mr points to the pc.bios Memory Region
- vaddr_page is vaddr & TARGET_PAGE_MASK = 0xfffff000 & 0xfffffffffffff000 = 0xfffff000
- paddr_page is paddr & TARGET_PAGE_MASK = 0xfffff000 & 0xfffffffffffff000 = 0xfffff000
The TLB entry addend for 0xfffff000 is calculated as follows:
xlat = (paddr_page - section->offset_within_address_space) + section->offset_within_region
addend = section->mr->ram_block->host + xlat - vaddr_page
In real mode, paddr_page and vaddr_page are the same value and can be removed leading to a simpler equation:
addend = section->mr->ram_block->host - section->offset_within_address_space + section->offset_within_region
section->offset_within_address_space is an offset to the start of the Memory Region in terms of emu-phy-addresses. When a region is an alias of a canonical one, section->mr points to the canonical region and section->offset_within_region is the alias offset within it.
Let’s see this in numbers:
(gdb) print *(MemoryRegionSection*)$rax
$3 = {size = 262144, mr = 0x55555622ee00, fv = 0x55555773c270, offset_within_region = 0, offset_within_address_space = 4294705152, readonly = true, nonvolatile = false}
(gdb) print/x paddr_page
$5 = 0xfffff000
(gdb) print/x ((MemoryRegionSection*)$rax)->offset_within_address_space
$6 = 0xfffc0000
(gdb) print/x ((MemoryRegionSection*)$rax)->offset_within_region
$7 = 0x0
(gdb) print/x ((MemoryRegionSection*)$rax)->mr->ram_block->host
$8 = 0x7fffd7c00000
(*) RAX register is section.
addend is then 0x7fffd7c00000 - 0xfffc0000 = 0x7ffed7c40000.
Now that we have the addend, a simple sanity check can be applied: TLB’s addend + 0xfffffff0 emu-phy-address must contain the first BIOS instruction in host’s memory:
(gdb) x/5xb 0x7ffed7c40000 + 0xfffffff0
0x7fffd7c3fff0: 0xea 0x5b 0xe0 0x00 0xf0
We got it right! With the TLB entry for 0xfffffff0 filled, the Translation Block generation proceeds (an existing one was, obviously, not found).
Calculating the addend looks a bit convoluted but it’s easier when focusing on our pc.bios case only. We want to add a fixed number to an emu-phy-address and get a host address, because that would be a fast and efficient translation strategy. If the pc.bios Memory Region starts at 0xfffc0000 and its ram_block is located at 0x7fffd7c00000, then we have a simple equation for the first region emu-phy-address: 0x7fffd7c00000 = addend + 0xfffc0000. As a result, addend has to be 0x7ffed7c40000 for any emu-phy-address that belongs to pc.bios.
The complexity is there to handle Memory Region aliases. isa-bios (starting at 0xe0000) is an alias of the last 128 KB of pc.bios. The offset to the BIOS entry point, in the isa-bios range, is 0x1fff0. This means that 0xffff0 (0xe0000 + 0x1fff0) is an emu-phy-address within the isa-bios range that maps to the same host address as 0xfffffff0: 0x7fffd7c3fff0. Intuitively, 0x7fffd7c3fff0 = isa-bios-addend + 0xffff0; so isa-bios-addend must be 0x7fffd7b40000.
Let’s independently verify that. section->offset_within_address_space is 0xe0000 because that is the isa-bios start address. section->offset_within_region is 0x20000 because that’s the offset of isa-bios within pc.bios. Going back to the simplified addend equation, we have: addend = 0x7fffd7c00000 - 0xe0000 + 0x20000 = 0x7fffd7b40000.
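Both addend calculations can be replayed with a few lines of C. toy_tlb_addend is a hypothetical helper that transcribes the two equations (xlat, then addend) using plain 64-bit integers instead of QEMU's structures:

```c
#include <stdint.h>

/* addend = host_base + xlat - vaddr_page, where
 * xlat = (paddr_page - offset_within_address_space) + offset_within_region.
 * host_base stands for section->mr->ram_block->host. */
static uint64_t toy_tlb_addend(uint64_t host_base,
                               uint64_t offset_within_address_space,
                               uint64_t offset_within_region,
                               uint64_t paddr_page,
                               uint64_t vaddr_page)
{
    uint64_t xlat = (paddr_page - offset_within_address_space)
                    + offset_within_region;
    return host_base + xlat - vaddr_page;
}
```

Calling it with (0x7fffd7c00000, 0xfffc0000, 0x0, 0xfffff000, 0xfffff000) gives the pc.bios addend, 0x7ffed7c40000. Calling it with (0x7fffd7c00000, 0xe0000, 0x20000, 0xff000, 0xff000) gives the isa-bios addend, 0x7fffd7b40000. Adding each addend to its own entry point address (0xfffffff0 and 0xffff0 respectively) lands on the same host address.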
After this detour, we continue execution. The memory breakpoint set in the address containing the BIOS entry point instruction is hit within cpu_ldub_kernel_ra. Looks like this function received a 0xfffffff0 emu-phy-address in its target_ulong ptr parameter. Hard to check with the debugger due to severe inlining:
(gdb) x/1i $rip
=> 0x555555968fa3 <disas_insn+307>: mov %r10d,%edx
(gdb) print/x $r10d
$2 = 0xea
For those who are a bit lazy (like me!) and don’t want to recompile QEMU with inlining disabled for cpu_ldub_kernel_ra, some reverse engineering should be enough to fill in the blanks. We are at disas_insn+307 and r10d contains the first byte of the entry point instruction. Let’s look a bit backwards:
(gdb) disas disas_insn
...
0x0000555555968f29 <+185>: callq 0x55555595cd60 <advance_pc>
0x0000555555968f2e <+190>: mov $0x1,%r8d
0x0000555555968f34 <+196>: mov %rax,%rsi
...
0x0000555555968f9e <+302>: movzbl (%rax,%rsi,1),%r10d
=> 0x0000555555968fa3 <+307>: mov %r10d,%edx
(gdb) print/x $rsi
$4 = 0xfffffff0
The RSI register has a 0xfffffff0 emu-phy-address and is the result of calling advance_pc. If we analyze disas_insn source code and in particular the first call to advance_pc, it’s now evident that RSI has the target_ulong ptr parameter value. Anecdotally, the inlined cpu_ldub_kernel_ra code goes between disas_insn+190 and disas_insn+307, with a few additional instructions between disas_insn+1328 and disas_insn+1347.
A TLB lookup procedure is initiated in disas_insn+190; similar to the one previously described but this time we have a TLB hit for 0xfffffff0. Execution continues with the retrieval of the BIOS entry point instruction and the generation of a Translation Block.
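To close the loop, here is a toy model of the whole flow: the first fetch misses and fills the entry, later fetches on the same page hit and go straight to host memory. This is only a sketch under strong assumptions: a single backing buffer, a 256-entry table, no bounds or permission checks, and names (ToyMMU, toy_ldub_code, and so on) that are invented for illustration and do not exist in QEMU.

```c
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~((UINT64_C(1) << TOY_PAGE_BITS) - 1))
#define TOY_INVALID   (UINT64_C(1) << (TOY_PAGE_BITS - 1))
#define TOY_TLB_SIZE  256

typedef struct ToyEntry {
    uint64_t addr_code;
    uintptr_t addend;
} ToyEntry;

typedef struct ToyMMU {
    ToyEntry tlb[TOY_TLB_SIZE];
    uint8_t *host_base;    /* host buffer backing the emulated region */
    uint64_t region_start; /* first emu-phy-address of that region */
    unsigned fills;        /* how many times the slow path ran */
} ToyMMU;

static void toy_mmu_init(ToyMMU *mmu, uint8_t *host_base, uint64_t region_start)
{
    memset(mmu->tlb, 0xff, sizeof(mmu->tlb)); /* all entries start invalid */
    mmu->host_base = host_base;
    mmu->region_start = region_start;
    mmu->fills = 0;
}

/* Fetch one code byte. The caller must pass an address inside the region. */
static uint8_t toy_ldub_code(ToyMMU *mmu, uint64_t addr)
{
    ToyEntry *e = &mmu->tlb[(addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1)];

    if ((e->addr_code & (TOY_PAGE_MASK | TOY_INVALID)) != (addr & TOY_PAGE_MASK)) {
        /* Miss: slow path, compute the addend and fill the entry. */
        e->addr_code = addr & TOY_PAGE_MASK;
        e->addend = (uintptr_t)mmu->host_base - (uintptr_t)mmu->region_start;
        mmu->fills++;
    }
    /* Hit (or freshly filled): a single addition yields the host address. */
    return *(const uint8_t *)(e->addend + (uintptr_t)addr);
}
```

Backing the region at 0xfffc0000 with a 256 KB buffer whose byte at offset 0x3fff0 is 0xea, a first toy_ldub_code call for 0xfffffff0 returns 0xea and bumps fills to 1; a second call for 0xfffffff1 leaves fills untouched.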