Printing Generated Assembly Code From The Hotspot JIT Compiler

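Sometimes when profiling a Java application it is necessary to understand the assembly code generated by the Hotspot JIT compiler. This helps in determining which optimisation decisions have been made and how our code changes affect the generated assembly code. It is also useful to know which instructions are emitted, and when, while debugging a concurrent algorithm to ensure visibility rules have been applied as expected. I have found quite a few bugs in various JVMs this way.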

This blog illustrates how to install a Disassembler Plugin and provides command line options for targeting a particular method.

Installation

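Previously it was necessary to obtain a debug build to print the assembly code generated by the Hotspot JIT for the Oracle/SUN JVM. Since Java 7, it has been possible to print the generated assembly code if a Disassembler Plugin is installed in a standard Oracle Hotspot JVM. To install the plugin for 64-bit Linux, follow the steps below: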

1. Download the appropriate binary from https://kenai.com/projects/base-hsdis/downloads or build it from source.
2. On Linux, rename linux-hsdis-amd64.so to libhsdis-amd64.so.
3. Copy the shared library to $JAVA_HOME/jre/lib/amd64/server.

You now have the plugin installed!
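
As a quick sanity check on the steps above, the following sketch resolves where the running JVM would be expected to find the plugin and reports whether the file is present. It assumes the Java 7/8 directory layout used in the steps, where the java.home system property points at the jre directory. The class HsdisLocator and its method are hypothetical names for illustration, not part of the JDK, and other JDK versions or platforms may place the library elsewhere.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of the JDK.
final class HsdisLocator
{
    // Expected plugin location on 64-bit Linux, given the value of java.home
    // (assumed to be $JAVA_HOME/jre, as on Java 7/8).
    static Path expectedPluginPath(final String javaHome)
    {
        return Paths.get(javaHome, "lib", "amd64", "server", "libhsdis-amd64.so");
    }

    public static void main(final String[] args)
    {
        final Path plugin = expectedPluginPath(System.getProperty("java.home"));
        final String status = Files.exists(plugin) ? "found" : "NOT found";

        System.out.println(plugin + " : " + status);
    }
}
```

Running it with the same java binary that will later be used for printing assembly avoids checking a different installation by mistake.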

Test Program

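To test the plugin we need some code that is both interesting to a programmer and executes sufficiently hot to be optimised by the JIT. Some details of when the JIT will optimise can be found here. The code below can be used to measure the average latency between two threads by reading and writing volatile fields. These volatile fields are interesting because they require associated hardware fences to honour the Java Memory Model.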

import static java.lang.System.out;

public class InterThreadLatency
{
    private static final int REPETITIONS = 100 * 1000 * 1000;

    private static volatile int ping = -1;
    private static volatile int pong = -1;

    public static void main(final String[] args)
        throws Exception
    {
        for (int i = 0; i < 5; i++)
        {
            final long duration = runTest();

            out.printf("%d - %dns avg latency - ping=%d pong=%d\n",
                       i,
                       duration / (REPETITIONS * 2),
                       ping,
                       pong);
        }
    }

    private static long runTest() throws InterruptedException
    {
        final Thread pongThread = new Thread(new PongRunner());
        final Thread pingThread = new Thread(new PingRunner());
        pongThread.start();
        pingThread.start();

        final long start = System.nanoTime();
        pongThread.join();

        return System.nanoTime() - start;
    }

    public static class PingRunner implements Runnable
    {
        public void run()
        {
            for (int i = 0; i < REPETITIONS; i++)
            {
                ping = i;

                while (i != pong)
                {
                    // busy spin
                }
            }
        }
    }

    public static class PongRunner implements Runnable
    {
        public void run()
        {
            for (int i = 0; i < REPETITIONS; i++)
            {
                while (i != ping)
                {
                    // busy spin
                }

                pong = i;
            }
        }
    }
}

Printing Assembly Code

The following command prints all generated assembly code.

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly InterThreadLatency

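However, this can leave you unable to see the forest for the trees. It is generally much more useful to target a particular method. For this test, Hotspot optimises the run() method and generates it twice: once for the OSR (on-stack replacement) version, and then once for the standard JIT version. The standard JIT version follows.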

java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*PongRunner.run' InterThreadLatency

Compiled method (c2)   10531    5             InterThreadLatency$PongRunner::run (30 bytes)
 total in heap  [0x00007fed81060850,0x00007fed81060b30] = 736
 relocation     [0x00007fed81060970,0x00007fed81060980] = 16
 main code      [0x00007fed81060980,0x00007fed81060a00] = 128
 stub code      [0x00007fed81060a00,0x00007fed81060a18] = 24
 oops           [0x00007fed81060a18,0x00007fed81060a30] = 24
 scopes data    [0x00007fed81060a30,0x00007fed81060a78] = 72
 scopes pcs     [0x00007fed81060a78,0x00007fed81060b28] = 176
 dependencies   [0x00007fed81060b28,0x00007fed81060b30] = 8
Decoding compiled method 0x00007fed81060850:
Code:
[Entry Point]
[Constants]
  # {method} 'run' '()V' in 'InterThreadLatency$PongRunner'
  #           [sp+0x20]  (sp of caller)
  0x00007fed81060980: mov    0x8(%rsi),%r10d
  0x00007fed81060984: shl    $0x3,%r10
  0x00007fed81060988: cmp    %r10,%rax
  0x00007fed8106098b: jne    0x00007fed81037a60  ;   {runtime_call}
  0x00007fed81060991: xchg   %ax,%ax
  0x00007fed81060994: nopl   0x0(%rax,%rax,1)
  0x00007fed8106099c: xchg   %ax,%ax
[Verified Entry Point]
  0x00007fed810609a0: sub    $0x18,%rsp
  0x00007fed810609a7: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - InterThreadLatency$PongRunner::run@-1 (line 58)
  0x00007fed810609ac: xor    %r11d,%r11d
  0x00007fed810609af: mov    $0x7ad0fcbf0,%r10  ;   {oop(a 'java/lang/Class' = 'InterThreadLatency')}
  0x00007fed810609b9: jmp    0x00007fed810609d0
  0x00007fed810609bb: nopl   0x0(%rax,%rax,1)   ; OopMap{r10=Oop off=64}
                                                ;*goto
                                                ; - InterThreadLatency$PongRunner::run@15 (line 60)
  0x00007fed810609c0: test   %eax,0xaa1663a(%rip)        # 0x00007fed8ba77000
                                                ;*goto
                                                ; - InterThreadLatency$PongRunner::run@15 (line 60)
                                                ;   {poll}
  0x00007fed810609c6: nopw   0x0(%rax,%rax,1)   ;*iload_1
                                                ; - InterThreadLatency$PongRunner::run@8 (line 60)
  0x00007fed810609d0: mov    0x74(%r10),%r9d    ;*getstatic ping
                                                ; - InterThreadLatency::access$000@0 (line 3)
                                                ; - InterThreadLatency$PongRunner::run@9 (line 60)
  0x00007fed810609d4: cmp    %r9d,%r11d
  0x00007fed810609d7: jne    0x00007fed810609c0
  0x00007fed810609d9: mov    %r11d,0x78(%r10)
  0x00007fed810609dd: lock addl $0x0,(%rsp)     ;*putstatic pong
                                                ; - InterThreadLatency::access$102@2 (line 3)
                                                ; - InterThreadLatency$PongRunner::run@19 (line 65)
  0x00007fed810609e2: inc    %r11d              ;*iinc
                                                ; - InterThreadLatency$PongRunner::run@23 (line 58)
  0x00007fed810609e5: cmp    $0x5f5e100,%r11d
  0x00007fed810609ec: jl     0x00007fed810609d0  ;*if_icmpeq
                                                ; - InterThreadLatency$PongRunner::run@12 (line 60)
  0x00007fed810609ee: add    $0x10,%rsp
  0x00007fed810609f2: pop    %rbp
  0x00007fed810609f3: test   %eax,0xaa16607(%rip)        # 0x00007fed8ba77000
                                                ;   {poll_return}
  0x00007fed810609f9: retq                      ;*iload_1
                                                ; - InterThreadLatency$PongRunner::run@8 (line 60)
  0x00007fed810609fa: hlt
  0x00007fed810609fb: hlt
  0x00007fed810609fc: hlt
  0x00007fed810609fd: hlt
  0x00007fed810609fe: hlt
  0x00007fed810609ff: hlt
[Exception Handler]
[Stub Code]
  0x00007fed81060a00: jmpq   0x00007fed8105eaa0  ;   {no_reloc}
[Deopt Handler Code]
  0x00007fed81060a05: callq  0x00007fed81060a0a
  0x00007fed81060a0a: subq   $0x5,(%rsp)
  0x00007fed81060a0f: jmpq   0x00007fed81038c00  ;   {runtime_call}
  0x00007fed81060a14: hlt
  0x00007fed81060a15: hlt
  0x00007fed81060a16: hlt
  0x00007fed81060a17: hlt
OopMapSet contains 1 OopMaps
#0
OopMap{r10=Oop off=64}
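
To make the OSR versus standard JIT distinction more concrete, here is a small sketch that is separate from the test program. HotLoopDemo, sumTo, and the iteration counts are hypothetical choices for illustration. A single long-running call gives the JIT a reason to compile the loop while it is still executing (on-stack replacement), whereas many short calls make the method itself hot so that it can be compiled from its normal entry point. Whether and when either compilation actually happens depends on the JVM version and its thresholds.

```java
// Hypothetical example for illustration only; not part of the test program.
final class HotLoopDemo
{
    // Sums 0 .. limit-1. The loop back-edge becomes hot when limit is large.
    static long sumTo(final int limit)
    {
        long sum = 0;
        for (int i = 0; i < limit; i++)
        {
            sum += i;
        }

        return sum;
    }

    public static void main(final String[] args)
    {
        // One long call: the loop gets hot within a single invocation,
        // which makes it a candidate for an OSR compilation.
        final long big = sumTo(100 * 1000 * 1000);

        // Many short calls: the invocation count gets hot,
        // which makes the method a candidate for a standard compilation.
        long total = 0;
        for (int call = 0; call < 100 * 1000; call++)
        {
            total += sumTo(100);
        }

        System.out.println(big + " " + total);
    }
}
```

Running this with -XX:+PrintCompilation should list the compilations as they occur, with OSR compilations marked by a % character.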

Interesting Observations

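The store to pong and the instruction that follows it in the assembly above (mov %r11d,0x78(%r10) and then lock addl $0x0,(%rsp)) are very interesting. When a volatile field is written, under the Java Memory Model the write must be sequentially consistent, i.e. it must not appear to be reordered by optimisations that are normally applied, such as staging the write in the store buffer. This is achieved by inserting the appropriate memory barriers. In the case above, Hotspot has chosen to enforce the ordering by issuing a MOV instruction (register to memory address, i.e. the write) followed by a LOCK ADD instruction that has ordering semantics (an add of zero at the stack pointer, a no-op used as a fence idiom). This may be less than ideal on an x86 processor. The same action could have been performed more efficiently and correctly with a single LOCK XCHG instruction for the write. This makes me wonder whether there are some significant compromises in the JVM to make it portable across many architectures, rather than being the best it can be on x86.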
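
The ordering requirement described above can be observed from the Java side with a minimal sketch. StoreLoadDemo and its fields are hypothetical names, not part of the test program. Each thread writes its own volatile field and then reads the other thread's field. Because volatile accesses are sequentially consistent under the Java Memory Model, at least one thread must see the other's write, so both reads returning 0 is forbidden. On x86 the hardware could otherwise let the load pass the earlier store while that store sits in the store buffer, which is the reordering a fence after a volatile store prevents.

```java
// Hypothetical example for illustration only; not part of the test program.
final class StoreLoadDemo
{
    private static volatile int x;
    private static volatile int y;

    // Plain fields: safely read by the caller after join().
    private static int seenByA;
    private static int seenByB;

    // Runs one round and returns true if the forbidden outcome
    // (both threads read 0) was observed.
    static boolean bothSawZero()
    {
        x = 0;
        y = 0;

        final Thread a = new Thread(() ->
        {
            x = 1;          // volatile store
            seenByA = y;    // volatile load, must not move ahead of the store
        });
        final Thread b = new Thread(() ->
        {
            y = 1;          // volatile store
            seenByB = x;    // volatile load, must not move ahead of the store
        });

        a.start();
        b.start();

        try
        {
            a.join();
            b.join();
        }
        catch (final InterruptedException ex)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }

        return seenByA == 0 && seenByB == 0;
    }

    public static void main(final String[] args)
    {
        for (int i = 0; i < 1000; i++)
        {
            if (bothSawZero())
            {
                throw new AssertionError("store-load reordering observed");
            }
        }

        System.out.println("no store-load reordering observed");
    }
}
```

Removing the volatile modifier from x and y makes the forbidden outcome legal, although starting fresh threads for every round makes it unlikely to show up in practice with this simple harness.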

Reference: Printing Generated Assembly Code From The Hotspot JIT Compiler, from our JCG partner Martin Thompson at the Mechanical Sympathy blog.