Monomorphic to megamorphic callsite in C1

In our previous article, Callsite resolution in the JVM, we discussed what a monomorphic callsite looks like after inline caching:

Java

x86_64 assembly (generated by C1)
As a reminder, the RAX register is loaded with the Klass* for which the optimization is valid: a receiver object of the A1 type. The callq instruction that follows goes to the unverified entry of the A1::m method, where the Klass* in RAX is checked against the receiver object’s class. If the check succeeds, execution proceeds to the actual method. In this article we will see what happens when the check fails or, in other words, when the receiver object is not an instance of A1 and the optimization is no longer valid.

To recap the class hierarchy from the previous article, the I interface has two implementors that provide concrete m methods: A1::m and A2::m. For this experiment, we will send an instance of A2 to the callsite optimized for A1:
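A minimal sketch of such an experiment follows. It assumes that I extends H and that H declares m, which is consistent with the resolved class and resolved method reported by the JVM later in this article. The outer class name, the method bodies, the return values and the warm-up count are hypothetical, and whether C1 actually compiles j2 with a monomorphic inline cache depends on the JVM flags in use.

```java
// Hypothetical sketch: only the names I, H, A1, A2, m and j2 come from the article.
class CallsiteSketch {
    interface H { int m(); }      // assumption: H declares m
    interface I extends H { }     // assumption: I inherits m from H

    static class A1 implements I { public int m() { return 1; } }
    static class A2 implements I { public int m() { return 2; } }

    // The invokeinterface callsite under study.
    static int j2(I receiver) {
        return receiver.m();
    }

    public static void main(String[] args) {
        I a1 = new A1();
        long sum = 0;
        // Warm up with A1 receivers only, so the callsite can become monomorphic.
        for (int i = 0; i < 20_000; i++) {
            sum += j2(a1);
        }
        // Then send an A2 receiver to the same callsite.
        int fromA2 = j2(new A2());
        System.out.println(sum + " " + fromA2);
    }
}
```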

This is the unverified entry of A1::m, hit by the receiver object that is an instance of the A2 class:

The RSI register holds a pointer to the receiver object. At offset 0x8 we find the object’s compressed Klass*. The instructions that follow expand the Klass* into R10. Finally, the check against the Klass* for which the optimization is valid (A1, available in the RAX register as seen before) takes place. In our case, the check fails as A1 != A2, and execution moves to 0x7fffe5699f20.
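The check can be modeled in a few lines of Java. This is only a sketch: the base address, the shift and the narrow Klass values below are hypothetical, since the real encoding depends on the JVM configuration, and the method names are made up for illustration.

```java
// Models the unverified entry check: expand the compressed Klass*, compare with RAX.
class UnverifiedEntrySketch {
    // Hypothetical encoding parameters, not taken from the article.
    static final long KLASS_BASE = 0x800000000L;
    static final int KLASS_SHIFT = 3;

    // Expands a 32-bit compressed Klass* into a full 64-bit pointer.
    static long decodeKlass(int narrowKlass) {
        return KLASS_BASE + ((narrowKlass & 0xFFFFFFFFL) << KLASS_SHIFT);
    }

    // True when the receiver's class is the one the inline cache was set up for.
    static boolean inlineCacheHit(int receiverNarrowKlass, long expectedKlass) {
        return decodeKlass(receiverNarrowKlass) == expectedKlass;
    }

    public static void main(String[] args) {
        int narrowA1 = 0x1000; // hypothetical compressed Klass* of A1
        int narrowA2 = 0x1040; // hypothetical compressed Klass* of A2
        long expected = decodeKlass(narrowA1); // what RAX would hold
        System.out.println(inlineCacheHit(narrowA1, expected));
        System.out.println(inlineCacheHit(narrowA2, expected));
    }
}
```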

At 0x7fffe5699f20 we find the same code blob generated by SharedRuntime::generate_resolve_blob that we saw before, but this time the function invoked in the JVM is SharedRuntime::handle_wrong_method_ic_miss(JavaThread*):

Inside the JVM, the SharedRuntime::handle_ic_miss_helper_internal method receives the following arguments:

  • receiver (Handle): A2 instance
  • caller_nm (CompiledMethod*): Main::j2
  • caller_frame (frame&):
    • _sp: stack pointer right before the callq in Main::j2’s callsite (*(_sp-0x8) is _pc)
    • _pc: instruction right after the callq in Main::j2’s callsite
    • _cb: Main::j2
  • callee_method (methodHandle): A2::m
  • bc (Bytecodes::Code): Bytecodes::_invokeinterface
  • call_info (CallInfo&):
    • _resolved_klass: I
    • _resolved_method: H::m
    • _selected_method: A2::m
  • needs_ic_stub_refill (bool&): output parameter

The callsite is no longer monomorphic: the selected method is now A2::m instead of A1::m. Information about the optimized callsite can be obtained from the caller_frame’s _pc, and takes the form of a CompiledIC instance. The field _call in this instance points to the callsite’s callq instruction (0x7f5db4e0b7d7), and the field _value to the instruction that loads the RAX register (0x7f5db4e0b7cd). The method CompiledIC::is_megamorphic gets the current destination of the callsite (the unverified entry of A1::m) and checks whether there is a VtableStub instance right before it (see VtableStubs::entry_point). As there is not, execution moves to CompiledIC::set_to_megamorphic, which performs the transformation. The transformation implies patching the callsite, which in its megamorphic form will look like this:

What is loaded in RAX now is a pointer to the callsite’s CompiledICHolder instance, which in this case holds the following information:

  • _holder_metadata: H
  • _holder_klass: I

You can tell from these values that a CompiledICHolder in megamorphic callsites has pointers to both the resolved class and the resolved method class. Also remember that there is a pointer to the receiver object in RSI. Both the CompiledICHolder instance and the receiver object are inputs to the megamorphic callsite.

Execution moves to 0x7f5db4e01fb0. This assembly code is generated by VtableStubs::create_itable_stub. There is one of these code blobs, represented by an instance of VtableStub, for each entry number used in a megamorphic itable call. In our case, the itable entry number is 0: H::m is at position 0 in the H itable.

Let’s analyze the code in 0x7f5db4e01fb0 chunk by chunk.

The sequence starts by loading the _holder_klass (I interface) into RBX and _holder_metadata (H interface) into RAX, from a CompiledICHolder instance.

Then, it loads the receiver object’s uncompressed Klass* into R10.

The length of the receiver object’s class vtable is loaded to R11. The size of InstanceKlass (0x1e0) is added to the size of a vtable entry (8) times the number of entries (6). As a result of this computation, R11 points to the address right after the receiver object’s class vtable. What we have in this position is the first itable of the receiver object’s class, if any. Otherwise, NULL.
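The arithmetic can be checked with a short Java sketch. The InstanceKlass size, the entry size and the entry count are the values quoted above; the class and method names are hypothetical.

```java
// Computes where the first itable header sits, relative to the start of the Klass.
class ItableStartSketch {
    static final long INSTANCE_KLASS_SIZE = 0x1e0; // size of InstanceKlass, from the article
    static final long VTABLE_ENTRY_SIZE = 8;       // size of one vtable entry, from the article

    static long firstItableOffset(int vtableLength) {
        return INSTANCE_KLASS_SIZE + VTABLE_ENTRY_SIZE * vtableLength;
    }

    public static void main(String[] args) {
        // 0x1e0 + 8 * 6 = 0x210
        System.out.println(Long.toHexString(firstItableOffset(6)));
    }
}
```

Adding this offset to the Klass* in R10 gives the address that the stub keeps in R11.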

This code checks whether the first itable is for the I interface. If so, execution jumps to 0x7f5db4e01ff5 (we will analyze this path later). Notice that R10 can be either NULL (no more itables available) or point to an interface.

This is the beginning of a loop that goes over the receiver object’s itables. One exit condition, taken by jumping to 0x7f5db4e02040, is that there are no more itables and R10 is NULL. Otherwise, we continue below.

At this point, we move to the next itable by adding 0x10 to R11. Notice that we advance by a fixed offset to iterate over the itables. Thus, what each entry holds is not the itable itself, which is of variable size, but a 16-byte header that points both to the interface and to the actual itable (as an offset). After this code, R10 has either a pointer to the next interface or NULL.
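The scan over these fixed-size headers can be modeled as a loop over an array terminated by a NULL interface. This is a simplified sketch, not the stub itself: interfaces are represented by their names, and the offsets in the usage example are hypothetical.

```java
// Models the walk over the 16-byte itable headers of the receiver's class.
class ItableScanSketch {
    // One header: the interface it belongs to and the offset of its itable.
    static final class OffsetEntry {
        final String iface; // null models the NULL terminator
        final int offset;   // offset of the itable from the start of the Klass

        OffsetEntry(String iface, int offset) {
            this.iface = iface;
            this.offset = offset;
        }
    }

    // Returns the itable offset for the target interface, or -1 when the
    // terminator is reached, meaning the class does not implement it.
    static int findItableOffset(OffsetEntry[] table, String target) {
        for (int i = 0; ; i++) {            // advancing i models adding 0x10 to R11
            String iface = table[i].iface;  // models loading the interface into R10
            if (iface == null) {
                return -1;
            }
            if (iface.equals(target)) {
                return table[i].offset;
            }
        }
    }

    public static void main(String[] args) {
        OffsetEntry[] table = {
            new OffsetEntry("H", 0x230), // hypothetical offsets
            new OffsetEntry("I", 0x238),
            new OffsetEntry(null, 0)
        };
        System.out.println(findItableOffset(table, "I") != -1);
    }
}
```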

If the itable is not for I, we jump back in the loop to try the next one or stop. If execution does not jump, we know that the receiver object is of a class that implements I. The latter case, which is also the landing site of the first itable check, is handled next:

This sequence is similar to the one we just described: the receiver object’s itables are iterated again, the only difference being that we now check whether the receiver object’s class implements the H interface. Assuming that the receiver object is of a class that implements both the I and H interfaces, the following sequence executes:

The offset to the H itable is loaded into R11. When we add the base address of the receiver object’s class (available in R10), we get the absolute address for the H itable. We then load the entry number 0 of the H itable to RBX. Thus, RBX has a Method* to an implementation of m, which in our example is A2::m. At offset 0x50 of the Method instance there is the entry point from compiled code and that’s where execution continues. The callsite, after deoptimization, is ready to handle both A1 and A2 receiver objects without any other transformation.
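The last two loads can be sketched in Java by modeling memory as a map from addresses to 64-bit values. The 0x50 offset is the one mentioned above; the 8-byte size of an itable method entry is an assumption, and every address in the usage example is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Models the final step of the itable stub: Klass base + itable offset -> Method* -> entry point.
class ItableDispatchSketch {
    static final long METHOD_ENTRY_SIZE = 8;             // assumption: pointer-sized itable entries
    static final long FROM_COMPILED_ENTRY_OFFSET = 0x50; // from the article

    static long resolveTarget(Map<Long, Long> memory, long klassBase,
                              long itableOffset, int itableIndex) {
        long itableAddress = klassBase + itableOffset;                              // R10 + R11
        long method = memory.get(itableAddress + METHOD_ENTRY_SIZE * itableIndex); // Method* in RBX
        return memory.get(method + FROM_COMPILED_ENTRY_OFFSET);                    // jump target
    }

    public static void main(String[] args) {
        Map<Long, Long> memory = new HashMap<>();
        memory.put(0x1000L + 0x230L, 0x5000L); // entry 0 of the H itable holds the Method* of A2::m
        memory.put(0x5000L + 0x50L, 0x9000L);  // entry point of A2::m from compiled code
        System.out.println(Long.toHexString(resolveTarget(memory, 0x1000L, 0x230L, 0)));
    }
}
```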

When the class of the receiver object does not implement either the I or H interface, execution lands here:

Execution moves to SharedRuntime::handle_wrong_method and SharedRuntime::reresolve_call_site in the JVM.

Finally, I’d like to draw attention to the computational cost of interface calls when the inline cache optimization cannot be applied. Why run-time checking of interfaces is required is a subject for a different blog post.
