In our previous article, Callsite resolution in the JVM, we discussed what a monomorphic callsite looks like after inline caching:
    public static void j2(I i) throws Throwable {
        i.m();
    }
Java
    ...
    0x7f5db4e0b7cd: movabs $0x801000400,%rax
    0x7f5db4e0b7d7: callq 0x7fffe55f9800
    ...
x86_64 assembly (generated by C1)
As a reminder, the RAX register is loaded with the Klass* for which the optimization is valid: a receiver object of the A1 type. The callq instruction that follows goes to the unverified entry of the A1::m method, where the Klass* in RAX is checked against the receiver object’s class. If the check succeeds, execution proceeds to the actual method. In this article we will see what happens when the check fails or, in other words, when the receiver object is not an instance of A1 and the optimization is no longer valid.
Refreshing our previous class hierarchy, the I interface has two implementors that provide concrete m methods: A1::m and A2::m. For this experiment, we will send an instance of A2 to the callsite optimized for A1:
    public static void main(String[] args) throws Throwable {
        j2(new A1()); // Optimization of the callsite for A1 instances
        j2(new A2()); // A2 reaches the A1-optimized callsite
    }
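For readers who do not have the previous article at hand, the hierarchy can be approximated as follows. This is only a reconstruction, not the original code: it assumes that m is declared in H and that I extends H without redeclaring it, which is consistent with the resolved class (I) and resolved method (H::m) reported later in this article. The class name HierarchySketch and the String return type are made up here so that the result is observable.

```java
public class HierarchySketch {
    // Assumption: m is declared in H, and I extends H without redeclaring it.
    interface H { String m(); }
    interface I extends H { }

    static final class A1 implements I {
        public String m() { return "A1::m"; }
    }

    static final class A2 implements I {
        public String m() { return "A2::m"; }
    }

    // Same shape as j2 in the article: a single interface call through I.
    static String j2(I i) { return i.m(); }

    public static void main(String[] args) {
        System.out.println(j2(new A1())); // first receiver type seen by the callsite
        System.out.println(j2(new A2())); // second receiver type, same callsite
    }
}
```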
This is the unverified entry of A1::m, hit by a receiver object that is an instance of the A2 class:
    0x7fffe55f9800: mov 0x8(%rsi),%r10d
    0x7fffe55f9804: movabs $0x800000000,%r11
    0x7fffe55f980e: add %r11,%r10
    0x7fffe55f9811: cmp %rax,%r10
    0x7fffe55f9814: jne 0x7fffe5699f20
The RSI register holds a pointer to the receiver object. At offset 0x8 we find the object’s compressed Klass*. The instructions that follow expand the Klass* into R10. Finally, the check against the Klass for which the optimization is valid (A1, available in the RAX register as seen before) takes place. In our case, the check fails as A1 != A2, and execution moves to 0x7fffe5699f20.
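The check performed by these five instructions can be modelled in a few lines of Java. This is a sketch and not HotSpot code: the base value 0x800000000 and the A1 Klass* 0x801000400 come from the listings above, while the narrow Klass value used for A2 and all the names are hypothetical. It also assumes that decoding is a plain addition with no shift, as in this listing; other JVM configurations may decode differently.

```java
public class InlineCacheCheck {
    // Base added by the movabs/add pair in the listing above.
    static final long NARROW_KLASS_BASE = 0x800000000L;

    /** Mirrors mov/movabs/add: zero-extend the 32-bit narrow Klass and add the base. */
    static long decodeKlass(int narrowKlass) {
        return NARROW_KLASS_BASE + (narrowKlass & 0xFFFFFFFFL);
    }

    /** Mirrors cmp/jne: true falls through to the method, false is an inline cache miss. */
    static boolean inlineCacheHit(long expectedKlass, int receiverNarrowKlass) {
        return decodeKlass(receiverNarrowKlass) == expectedKlass;
    }

    public static void main(String[] args) {
        long a1Klass = 0x801000400L; // value loaded into RAX at the callsite
        int a1Narrow = 0x01000400;   // narrow form of that same Klass*
        int a2Narrow = 0x01000800;   // hypothetical narrow Klass* for A2

        System.out.println("A1 receiver hits: " + inlineCacheHit(a1Klass, a1Narrow));
        System.out.println("A2 receiver hits: " + inlineCacheHit(a1Klass, a2Narrow));
    }
}
```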
At 0x7fffe5699f20 we find the same code blob generated by SharedRuntime::generate_resolve_blob that we saw before, but this time the function invoked in the JVM is SharedRuntime::handle_wrong_method_ic_miss(JavaThread*):
    0x7fffe5699f20: push %rbp
    0x7fffe5699f21: mov %rsp,%rbp
    0x7fffe5699f24: pushfq
    0x7fffe5699f25: sub $0x8,%rsp
    0x7fffe5699f29: sub $0x80,%rsp
    0x7fffe5699f30: mov %rax,0x78(%rsp)
    ...
    0x7fffe5699f8c: mov %rsp,0x2d8(%r15)
    0x7fffe5699f93: mov %r15,%rdi
    0x7fffe5699f96: callq 0x7ffff6c48f98 <SharedRuntime::handle_wrong_method_ic_miss(JavaThread*)>
Inside the JVM, the SharedRuntime::handle_ic_miss_helper_internal method receives the following arguments:
- receiver (Handle): A2 instance
- caller_nm (CompiledMethod*): Main::j2
- caller_frame (frame&):
  - _sp: stack pointer right before calling callq in Main::j2’s callsite (*(_sp-0x8) is _pc)
  - _pc: instruction right after callq in Main::j2’s callsite
  - _cb: Main::j2
- callee_method (methodHandle): A2::m
- bc (Bytecodes::Code): Bytecodes::_invokeinterface
- call_info (CallInfo&):
  - _resolved_klass: I
  - _resolved_method: H::m
  - _selected_method: A2::m
- needs_ic_stub_refill (bool&): output parameter
The callsite is no longer monomorphic: the selected method is now A2::m instead of A1::m. Information about the optimized callsite can be obtained from the caller_frame’s _pc, and takes the form of a CompiledIC instance. The field _call in this instance points to the callsite’s callq instruction (0x7f5db4e0b7d7), and the field _value to the instruction that loads the RAX register (0x7f5db4e0b7cd). The method CompiledIC::is_megamorphic gets the current destination of the callsite (the unverified entry of A1::m) and checks if there is a VtableStub instance right before it (see VtableStubs::entry_point). As there is not, execution moves to CompiledIC::set_to_megamorphic, which performs the transformation. The transformation implies patching the callsite, which in its megamorphic form will look like this:
    ...
    0x7f5db4e0b7cd: movabs $0x7ffff01fc670,%rax
    0x7f5db4e0b7d7: callq 0x7f5db4e01fb0
    ...
What is loaded in RAX now is a pointer to the callsite’s CompiledICHolder instance, which in this case holds the following information:
- _holder_metadata: H
- _holder_klass: I
You can tell from these values that a CompiledICHolder in megamorphic callsites has pointers to both the resolved class (I) and the class that declares the resolved method (H). Also remember that there is a pointer to the receiver object in RSI. Both the CompiledICHolder instance and the receiver object are inputs to the megamorphic callsite.
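As a rough mental model of the transition just described, the callsite can be seen as a pair (value loaded into RAX, call destination) that is patched on an inline cache miss. The Java below is a conceptual sketch only: CallSiteModel, IcHolder, CallSite and onInlineCacheMiss are hypothetical names, strings stand in for addresses, and the string-prefix test merely imitates the idea behind CompiledIC::is_megamorphic rather than its implementation.

```java
public class CallSiteModel {
    /** Stand-in for CompiledICHolder: the two values listed in the article. */
    static final class IcHolder {
        final String holderMetadata; // class declaring the resolved method (H)
        final String holderKlass;    // resolved class (I)

        IcHolder(String holderMetadata, String holderKlass) {
            this.holderMetadata = holderMetadata;
            this.holderKlass = holderKlass;
        }
    }

    /** Stand-in for the patchable movabs/callq pair at the callsite. */
    static final class CallSite {
        Object cachedValue; // what movabs loads into RAX
        String destination; // where callq jumps

        CallSite(Object cachedValue, String destination) {
            this.cachedValue = cachedValue;
            this.destination = destination;
        }

        // Hypothetical test: the real check looks for a VtableStub at the destination.
        boolean isMegamorphic() {
            return destination.startsWith("itable stub");
        }
    }

    /** On a miss, patch both the cached value and the destination (only once). */
    static void onInlineCacheMiss(CallSite site, IcHolder holder, int itableIndex) {
        if (site.isMegamorphic()) {
            return; // already patched, nothing to do in this model
        }
        site.cachedValue = holder;
        site.destination = "itable stub #" + itableIndex;
    }

    public static void main(String[] args) {
        CallSite site = new CallSite("Klass* of A1", "unverified entry of A1::m");
        onInlineCacheMiss(site, new IcHolder("H", "I"), 0);
        System.out.println(site.destination);
    }
}
```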
Execution moves to 0x7f5db4e01fb0. This assembly code is generated by VtableStubs::create_itable_stub. There is one of these code blobs, represented by an instance of VtableStub, for each entry number used in a megamorphic itable call. In our case, the itable entry number is 0: H::m is at position 0 in the H itable.
Let’s analyze the code in 0x7f5db4e01fb0 chunk by chunk.
    0x7f5db4e01fb0: mov 0x10(%rax),%rbx
    0x7f5db4e01fb4: mov 0x8(%rax),%rax
The sequence starts by loading the _holder_klass (I interface) into RBX and _holder_metadata (H interface) into RAX, from a CompiledICHolder instance.
    0x7f5db4e01fb8: mov 0x8(%rsi),%r10d
    0x7f5db4e01fbc: movabs $0x800000000,%r11
    0x7f5db4e01fc6: add %r11,%r10
Then, it loads the receiver object’s uncompressed Klass* into R10.
    0x7f5db4e01fc9: mov 0xa8(%r10),%r11d
    0x7f5db4e01fd0: lea 0x1e0(%r10,%r11,8),%r11
The length of the receiver object’s class vtable is loaded into R11. The lea instruction then takes the base address of the class (in R10) and adds the size of InstanceKlass (0x1e0) plus the size of a vtable entry (8) times the number of entries (6 in our example). As a result of this computation, R11 points to the address right after the receiver object’s class vtable. What we have in this position is the first itable of the receiver object’s class, if any. Otherwise, NULL.
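The address arithmetic done by the lea instruction can be written out explicitly. In this sketch the constants 0x1e0 and 8 and the vtable length of 6 come from the text above; the Klass base address and the names ItableStart and firstItableEntry are hypothetical and only serve the illustration.

```java
public class ItableStart {
    static final long INSTANCE_KLASS_SIZE = 0x1e0; // displacement used by the lea
    static final long VTABLE_ENTRY_SIZE = 8;       // scale factor used by the lea

    /** Equivalent of: lea 0x1e0(%r10,%r11,8),%r11 */
    static long firstItableEntry(long klassBase, int vtableLength) {
        return klassBase + INSTANCE_KLASS_SIZE + VTABLE_ENTRY_SIZE * vtableLength;
    }

    public static void main(String[] args) {
        long klassBase = 0x801000800L; // hypothetical Klass* of the receiver
        int vtableLength = 6;          // number of vtable entries in the example

        // 0x801000800 + 0x1e0 + 6 * 8 = 0x801000a10
        System.out.println(Long.toHexString(firstItableEntry(klassBase, vtableLength)));
    }
}
```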
    0x7f5db4e01fd8: mov (%r11),%r10
    0x7f5db4e01fdb: cmp %r10,%rbx
    0x7f5db4e01fde: je 0x7f5db4e01ff5
This code checks if the first itable is for the I interface. If so, execution jumps to 0x7f5db4e01ff5 (we will analyze this path later). Notice that R10 can be either NULL (no more itables available) or point to an interface.
    0x7f5db4e01fe0: test %r10,%r10
    0x7f5db4e01fe3: je 0x7f5db4e02040
This is the beginning of a loop that goes over the receiver object’s itables. The loop stops by jumping to 0x7f5db4e02040 when there are no more itables, that is, when R10 is NULL. Otherwise, we continue below.
    0x7f5db4e01fe9: add $0x10,%r11
    0x7f5db4e01fed: mov (%r11),%r10
At this point, we move to the next itable by adding 0x10 to R11. Notice that we are advancing by a fixed offset to iterate over the itables. Thus, what each entry holds is not the itable itself, which is of variable size, but a 16-byte header that points both to the interface and to the actual itable (as an offset). After this code, R10 has either a pointer to the next interface or NULL.
    0x7f5db4e01ff0: cmp %r10,%rbx
    0x7f5db4e01ff3: jne 0x7f5db4e01fe0
If the itable is not for I, we jump back in the loop to try the next one or stop. If execution does not jump, we know that the receiver object is of a class that implements I. The latter case, which is also the landing site of the first itable check, is handled next:
    0x7f5db4e01ff5: mov 0x8(%rsi),%r10d
    0x7f5db4e01ff9: movabs $0x800000000,%r11
    0x7f5db4e02003: add %r11,%r10
    0x7f5db4e02006: mov 0xa8(%r10),%r11d
    0x7f5db4e0200d: lea 0x1e0(%r10,%r11,8),%r11
    0x7f5db4e02015: lea (%r10),%r10
    0x7f5db4e02018: mov (%r11),%rbx
    0x7f5db4e0201b: cmp %rbx,%rax
    0x7f5db4e0201e: je 0x7f5db4e02035
    0x7f5db4e02020: test %rbx,%rbx
    0x7f5db4e02023: je 0x7f5db4e02040
    0x7f5db4e02029: add $0x10,%r11
    0x7f5db4e0202d: mov (%r11),%rbx
    0x7f5db4e02030: cmp %rbx,%rax
    0x7f5db4e02033: jne 0x7f5db4e02020
This sequence is similar to the one we just described: the receiver object itables are iterated again, with the only difference that we are now checking if the receiver object class implements the interface H. Assuming that the receiver object is of a class that implements both the I and H interfaces, the following sequence executes:
    0x7f5db4e02035: mov 0x8(%r11),%r11d
    0x7f5db4e02039: mov (%r10,%r11,1),%rbx
    0x7f5db4e0203d: jmpq *0x50(%rbx)
The offset to the H itable is loaded into R11. When we add the base address of the receiver object’s class (available in R10), we get the absolute address of the H itable. We then load entry number 0 of the H itable into RBX. Thus, RBX has a Method* to an implementation of m, which in our example is A2::m. At offset 0x50 of the Method instance there is the entry point from compiled code, and that’s where execution continues. The callsite, after deoptimization, is ready to handle both A1 and A2 receiver objects without any other transformation.
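Putting the chunks together, the stub performs two linear scans over the receiver class’s itable headers and then an indexed load. The following Java is a simplified model of that control flow, not the generated code: names such as ItableStubModel, OffsetEntry, scan and dispatch are hypothetical, strings stand in for Klass* and Method* pointers, a String array stands in for the offset to the method table, and the order of the entries is an assumption.

```java
public class ItableStubModel {
    /** Models one 16-byte header: the interface plus a reference to its method table. */
    static final class OffsetEntry {
        final String iface;
        final String[] methods; // stands in for the offset to the actual itable

        OffsetEntry(String iface, String... methods) {
            this.iface = iface;
            this.methods = methods;
        }
    }

    /** Linear scan over a null-terminated array, like each of the two loops in the stub. */
    static OffsetEntry scan(OffsetEntry[] headers, String wanted) {
        for (int i = 0; headers[i] != null; i++) {
            if (headers[i].iface.equals(wanted)) {
                return headers[i];
            }
        }
        return null; // reached the NULL terminator
    }

    /** First scan checks the resolved class, second scan finds the method holder. */
    static String dispatch(OffsetEntry[] headers, String holderKlass,
                           String holderMetadata, int itableIndex) {
        if (scan(headers, holderKlass) == null) {
            return "handle_wrong_method";
        }
        OffsetEntry holder = scan(headers, holderMetadata);
        if (holder == null) {
            return "handle_wrong_method";
        }
        return holder.methods[itableIndex]; // load of the Method* at the given index
    }

    public static void main(String[] args) {
        // Hypothetical layout for A2: headers for I and H, then the terminator.
        OffsetEntry[] a2 = { new OffsetEntry("I"), new OffsetEntry("H", "A2::m"), null };
        // A class that implements neither interface.
        OffsetEntry[] other = { new OffsetEntry("Runnable", "Other::run"), null };

        System.out.println(dispatch(a2, "I", "H", 0));
        System.out.println(dispatch(other, "I", "H", 0));
    }
}
```

Both scans are linear in the number of interfaces implemented by the receiver class, which is where the extra cost of a megamorphic interface call comes from in this model.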
When the class of the receiver object does not implement either the I or H interface, execution lands here:
    0x7f5db4e02040: jmpq 0x7fffe5696920
Execution moves to SharedRuntime::handle_wrong_method and SharedRuntime::reresolve_call_site in the JVM.
Finally, I’d like to draw attention to the computational cost of interface calls when the inline cache optimization cannot be applied: every call goes through the itable scans shown above. Why runtime checking of interfaces is required would be the subject of a different blog post.