Monomorphic to megamorphic callsite in C1

In our previous article, Callsite resolution in the JVM, we discussed what a monomorphic callsite looks like after inline caching:

public static void j2(I i) throws Throwable {
    i.m();
}

Java

...
0x7f5db4e0b7cd: movabs $0x801000400,%rax
0x7f5db4e0b7d7: callq  0x7fffe55f9800
...

x86_64 assembly (generated by C1)
As a reminder, the RAX register is loaded with the Klass* for which the optimization is valid: a receiver object of the A1 type. The callq instruction that follows goes to the unverified entry of the A1::m method, where the Klass* in RAX is checked against the receiver object’s class. If the check succeeds, execution proceeds to the actual method. In this article we will see what happens when the check fails or, in other words, when the receiver object is not an instance of A1 and the optimization is no longer valid.

Recalling the class hierarchy from the previous article, the I interface has two implementors that provide concrete methods m: A1::m and A2::m. For this experiment, we will send an instance of A2 to the callsite optimized for A1:

public static void main(String[] args) throws Throwable {
    j2(new A1()); // Optimization of the callsite for A1 instances
    j2(new A2()); // A2 reaches the A1-optimized callsite
}

This is the unverified entry of A1::m, hit by the receiver object that is an instance of the A2 class:

0x7fffe55f9800: mov    0x8(%rsi),%r10d
0x7fffe55f9804: movabs $0x800000000,%r11
0x7fffe55f980e: add    %r11,%r10
0x7fffe55f9811: cmp    %rax,%r10
0x7fffe55f9814: jne    0x7fffe5699f20

The RSI register holds a pointer to the receiver object. At offset 0x8 we find the object’s compressed Klass*. The instructions that follow expand the Klass* into R10. Finally, the check against the Klass for which the optimization is valid —A1, available in the RAX register as seen before— takes place. In our case, the check fails as A1 != A2, and execution moves to 0x7fffe5699f20.
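
The check performed by the unverified entry can be modeled in plain Java. This is only a sketch under stated assumptions: the class and method names are hypothetical, the base 0x800000000 and the expected value 0x801000400 are taken from the listings above, and the compressed Klass* value used for A2 is made up for illustration.

```java
// Hypothetical model of the unverified-entry check; not HotSpot code.
class UnverifiedEntryCheck {
    // Base added by the movabs/add pair to expand a compressed Klass*.
    static final long KLASS_BASE = 0x800000000L;

    // mov 0x8(%rsi),%r10d zero-extends the 32-bit value; add %r11,%r10 adds the base.
    static long decodeKlass(int narrowKlass) {
        return KLASS_BASE + (narrowKlass & 0xFFFFFFFFL);
    }

    // cmp %rax,%r10 ; jne <miss handler>: true means execution falls through to the method.
    static boolean inlineCacheHit(long expectedKlass, int receiverNarrowKlass) {
        return decodeKlass(receiverNarrowKlass) == expectedKlass;
    }

    public static void main(String[] args) {
        long expectedA1 = 0x801000400L; // value loaded into RAX at the callsite
        int narrowA1 = 0x01000400;      // decodes to expectedA1
        int narrowA2 = 0x01000600;      // made-up value standing in for A2
        System.out.println(inlineCacheHit(expectedA1, narrowA1)); // hit
        System.out.println(inlineCacheHit(expectedA1, narrowA2)); // miss
    }
}
```

The model assumes that decoding is a plain addition of the base, which is what the three instructions in the listing do.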

At 0x7fffe5699f20 we find the same code blob generated by SharedRuntime::generate_resolve_blob that we saw before, but this time the function invoked in the JVM is SharedRuntime::handle_wrong_method_ic_miss(JavaThread*):

0x7fffe5699f20: push   %rbp
0x7fffe5699f21: mov    %rsp,%rbp
0x7fffe5699f24: pushfq 
0x7fffe5699f25: sub    $0x8,%rsp
0x7fffe5699f29: sub    $0x80,%rsp
0x7fffe5699f30: mov    %rax,0x78(%rsp)
...
0x7fffe5699f8c: mov    %rsp,0x2d8(%r15)
0x7fffe5699f93: mov    %r15,%rdi
0x7fffe5699f96: callq  0x7ffff6c48f98 <SharedRuntime::handle_wrong_method_ic_miss(JavaThread*)>

Inside the JVM, the SharedRuntime::handle_ic_miss_helper_internal method receives the following arguments:

receiver (Handle): A2 instance
caller_nm (CompiledMethod*): Main::j2
caller_frame (frame&):
- _sp: stack pointer right before the callq in Main::j2’s callsite (*(_sp-0x8) is _pc)
- _pc: instruction right after the callq in Main::j2’s callsite
- _cb: Main::j2
callee_method (methodHandle): A2::m
bc (Bytecodes::Code): Bytecodes::_invokeinterface
call_info (CallInfo&):
- _resolved_klass: I
- _resolved_method: H::m
- _selected_method: A2::m
needs_ic_stub_refill (bool&): output parameter

The callsite is no longer monomorphic: the selected method is now A2::m instead of A1::m. Information about the optimized callsite can be obtained from the caller_frame’s _pc, and takes the form of a CompiledIC instance. The field _call in this instance points to the callsite’s callq instruction (0x7f5db4e0b7d7), and the field _value to the instruction that loads the RAX register (0x7f5db4e0b7cd). The method CompiledIC::is_megamorphic gets the current destination of the callsite (the unverified entry of A1::m) and checks whether a VtableStub instance sits right before it (see VtableStubs::entry_point). As there is none, execution moves to CompiledIC::set_to_megamorphic, which performs the transformation. The transformation implies patching the callsite, which in its megamorphic form will look like this:
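
The state change described in this paragraph can be pictured with a small Java model. It is a sketch, not HotSpot code: CallSiteModel, its State enum and the returned labels are hypothetical names, and the nested A1 and A2 classes are stand-ins for the implementors of I.

```java
// Hypothetical model of the callsite states described above; not HotSpot code.
class CallSiteModel {
    enum State { MONOMORPHIC, MEGAMORPHIC }

    // Stand-ins for the two implementors of I.
    static class A1 {}
    static class A2 {}

    State state = State.MONOMORPHIC;
    final Class<?> cachedClass; // plays the role of the Klass* loaded into RAX

    CallSiteModel(Class<?> cachedClass) {
        this.cachedClass = cachedClass;
    }

    // Returns a label for the path a call with this receiver would take.
    String dispatch(Object receiver) {
        if (state == State.MONOMORPHIC) {
            if (receiver.getClass() == cachedClass) {
                return "direct"; // class check in the unverified entry succeeds
            }
            // Miss: corresponds to the trip through handle_wrong_method_ic_miss,
            // which ends with the callsite patched to its megamorphic form.
            state = State.MEGAMORPHIC;
        }
        return "itable stub";
    }

    public static void main(String[] args) {
        CallSiteModel site = new CallSiteModel(A1.class);
        System.out.println(site.dispatch(new A1())); // direct
        System.out.println(site.dispatch(new A2())); // itable stub
        System.out.println(site.dispatch(new A1())); // itable stub
    }
}
```

Note that in this model, as in the walkthrough, an A1 receiver also goes through the stub once the callsite has been patched.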

...
0x7f5db4e0b7cd: movabs $0x7ffff01fc670,%rax
0x7f5db4e0b7d7: callq  0x7f5db4e01fb0
...

What is loaded in RAX now is a pointer to the callsite’s CompiledICHolder instance, which in this case holds the following information:

_holder_metadata: H
_holder_klass: I

You can tell from these values that a CompiledICHolder in megamorphic callsites has pointers to both the resolved class and the resolved method class. Also remember that there is a pointer to the receiver object in RSI. Both the CompiledICHolder instance and the receiver object are inputs to the megamorphic callsite.

Execution moves to 0x7f5db4e01fb0. This assembly code is generated by VtableStubs::create_itable_stub. There is one of these code blobs, represented by an instance of VtableStub, for each entry number used in a megamorphic itable call. In our case, the itable entry number is 0: H::m is at position 0 in the H itable.

Let’s analyze the code in 0x7f5db4e01fb0 chunk by chunk.

0x7f5db4e01fb0: mov    0x10(%rax),%rbx
0x7f5db4e01fb4: mov    0x8(%rax),%rax

The sequence starts by loading the _holder_klass (I interface) into RBX and _holder_metadata (H interface) into RAX, from a CompiledICHolder instance.

0x7f5db4e01fb8: mov    0x8(%rsi),%r10d
0x7f5db4e01fbc: movabs $0x800000000,%r11
0x7f5db4e01fc6: add    %r11,%r10

Then, it loads the uncompressed receiver object’s class to R10.

0x7f5db4e01fc9: mov    0xa8(%r10),%r11d
0x7f5db4e01fd0: lea    0x1e0(%r10,%r11,8),%r11

The length of the receiver object’s class vtable is loaded to R11. The size of InstanceKlass (0x1e0) is added to the size of a vtable entry (8) times the number of entries (6). As a result of this computation, R11 points to the address right after the receiver object’s class vtable. What we have in this position is the first itable of the receiver object’s class, if any. Otherwise, NULL.
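
The address computed by the lea instruction can be reproduced with plain arithmetic. The sketch below uses the constants from the listing (0x1e0 and the scale of 8) and the vtable length of 6 mentioned above. The class address is a made-up value, and the names are hypothetical.

```java
// Hypothetical helper reproducing lea 0x1e0(%r10,%r11,8),%r11; not HotSpot code.
class ItableStart {
    static final long INSTANCE_KLASS_SIZE = 0x1e0; // displacement in the lea
    static final long VTABLE_ENTRY_SIZE = 8;       // scale in the lea

    // Address right after the vtable, where the first itable header lives.
    static long firstOffsetEntry(long klassAddress, int vtableLength) {
        return klassAddress + INSTANCE_KLASS_SIZE + VTABLE_ENTRY_SIZE * vtableLength;
    }

    public static void main(String[] args) {
        long klass = 0x801000600L; // made-up class address
        // 0x801000600 + 0x1e0 + 6 * 8 = 0x801000810
        System.out.println(Long.toHexString(firstOffsetEntry(klass, 6))); // prints 801000810
    }
}
```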

0x7f5db4e01fd8: mov    (%r11),%r10
0x7f5db4e01fdb: cmp    %r10,%rbx
0x7f5db4e01fde: je     0x7f5db4e01ff5

This code checks whether the first itable is for the I interface. In that case, execution jumps to 0x7f5db4e01ff5 (we will analyze this path later). Notice that R10 can be either NULL (no more itables available) or point to an interface.

0x7f5db4e01fe0: test   %r10,%r10
0x7f5db4e01fe3: je     0x7f5db4e02040

This is the beginning of a loop that goes over the receiver object’s itables. One of the conditions to stop by jumping to 0x7f5db4e02040 is when there are no more itables and R10 is NULL. Otherwise, we continue below.

0x7f5db4e01fe9: add    $0x10,%r11
0x7f5db4e01fed: mov    (%r11),%r10

At this point, we move to the next itable by adding 0x10 to R11. Notice that we are advancing by a fixed offset to iterate over the itables. Thus, each entry is not the itable itself, which is of variable size, but a 16-byte header that holds both a pointer to the interface and the location of the actual itable (as an offset). After this code, R10 has either a pointer to the next interface or NULL.
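
The list of headers and the scan over it can be modeled in Java as a null-terminated array. This is a sketch under assumptions: ItableScan, OffsetEntry and find are hypothetical names, interfaces are represented by strings, and the offsets and the order of the entries in the example are made up.

```java
// Hypothetical model of the scan over the itable headers; not HotSpot code.
class ItableScan {
    // One 16-byte header: an interface pointer and the offset of its method table.
    static final class OffsetEntry {
        final String iface; // null marks the end of the list
        final int offset;

        OffsetEntry(String iface, int offset) {
            this.iface = iface;
            this.offset = offset;
        }
    }

    // Returns the index of the header for `wanted`, or -1 when the null
    // terminator is reached first (the jump to 0x7f5db4e02040 in the listing).
    static int find(OffsetEntry[] entries, String wanted) {
        int i = 0;
        while (true) {
            String current = entries[i].iface;    // mov (%r11),%r10
            if (wanted.equals(current)) return i; // cmp %r10,%rbx
            if (current == null) return -1;       // test %r10,%r10 ; je
            i++;                                  // add $0x10,%r11
        }
    }

    public static void main(String[] args) {
        OffsetEntry[] entries = {
            new OffsetEntry("H", 0x240),
            new OffsetEntry("I", 0x248),
            new OffsetEntry(null, 0),
        };
        System.out.println(find(entries, "I")); // found
        System.out.println(find(entries, "J")); // not implemented by this class
    }
}
```

The cost of the scan grows with the number of interfaces the receiver class implements, since headers are compared one by one.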

0x7f5db4e01ff0: cmp    %r10,%rbx
0x7f5db4e01ff3: jne    0x7f5db4e01fe0

If the itable is not for I, we jump back in the loop to try the next one or stop. If execution does not jump, we know that the receiver object is of a class that implements I. The latter case, which is also the landing site of the first itable check, is handled next:

0x7f5db4e01ff5: mov    0x8(%rsi),%r10d
0x7f5db4e01ff9: movabs $0x800000000,%r11
0x7f5db4e02003: add    %r11,%r10
0x7f5db4e02006: mov    0xa8(%r10),%r11d
0x7f5db4e0200d: lea    0x1e0(%r10,%r11,8),%r11
0x7f5db4e02015: lea    (%r10),%r10
0x7f5db4e02018: mov    (%r11),%rbx
0x7f5db4e0201b: cmp    %rbx,%rax
0x7f5db4e0201e: je     0x7f5db4e02035
0x7f5db4e02020: test   %rbx,%rbx
0x7f5db4e02023: je     0x7f5db4e02040
0x7f5db4e02029: add    $0x10,%r11
0x7f5db4e0202d: mov    (%r11),%rbx
0x7f5db4e02030: cmp    %rbx,%rax
0x7f5db4e02033: jne    0x7f5db4e02020

This sequence is similar to the one we just described: the receiver object itables are iterated again, with the only difference that we are now checking if the receiver object class implements the interface H. Assuming that the receiver object is of a class that implements both the I and H interfaces, the following sequence executes:

0x7f5db4e02035: mov    0x8(%r11),%r11d
0x7f5db4e02039: mov    (%r10,%r11,1),%rbx
0x7f5db4e0203d: jmpq   *0x50(%rbx)

The offset to the H itable is loaded into R11. When we add the base address of the receiver object’s class (available in R10), we get the absolute address for the H itable. We then load the entry number 0 of the H itable to RBX. Thus, RBX has a Method* to an implementation of m, which in our example is A2::m. At offset 0x50 of the Method instance there is the entry point from compiled code and that’s where execution continues. The callsite, after deoptimization, is ready to handle both A1 and A2 receiver objects without any other transformation.
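
The last step, going from the matching header to a method, can be sketched as follows. The model is hypothetical: KlassModel, select and klassWith are invented names, methods are represented by suppliers that return their own name, and the offsets are made up. It only illustrates why the same stub serves receivers of different classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of the final lookup in the itable stub; not HotSpot code.
class ItableDispatch {
    // A receiver class: its interface method tables keyed by offset from the class base.
    static final class KlassModel {
        final Map<Integer, List<Supplier<String>>> tablesByOffset = new HashMap<>();
    }

    // mov 0x8(%r11),%r11d    -> offset read from the matching header
    // mov (%r10,%r11,1),%rbx -> entry `index` of the table at base + offset
    static Supplier<String> select(KlassModel klass, int offsetFromHeader, int index) {
        return klass.tablesByOffset.get(offsetFromHeader).get(index);
    }

    // Builds a class whose table at `offset` has a single method.
    static KlassModel klassWith(int offset, String methodName) {
        KlassModel klass = new KlassModel();
        Supplier<String> method = () -> methodName;
        klass.tablesByOffset.put(offset, Collections.singletonList(method));
        return klass;
    }

    public static void main(String[] args) {
        KlassModel a1 = klassWith(0x240, "A1::m"); // made-up offsets
        KlassModel a2 = klassWith(0x250, "A2::m");
        // Same code path, same entry number 0, different receiver class.
        System.out.println(select(a1, 0x240, 0).get());
        System.out.println(select(a2, 0x250, 0).get());
    }
}
```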

When the class of the receiver object does not implement either the I or H interface, execution lands here:

0x7f5db4e02040: jmpq   0x7fffe5696920

Execution moves to SharedRuntime::handle_wrong_method and SharedRuntime::reresolve_call_site in the JVM.

Finally, I’d like to draw attention to the computational cost of interface calls when the inline cache optimization cannot be applied. Why run-time checking of interfaces is required would be the subject of a different blog post.