My Notes: Performance tuning

1. General steps for performance optimization.

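   1.1 First, work at the system level. Look for bottlenecks in areas such as network I/O performance, disk performance, and memory usage. Useful analysis tools include Task Manager and perfmon.exe, which ships with Windows. This is usually the approach that takes the least effort and gives the most visible results.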

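   1.2 Next, work from the application's point of view. Pay attention to contention for shared resources such as locks, restructure the program, and improve the degree of parallelism in the computation. How much this step gains depends on the characteristics of the application.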

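   1.3 Finally, work from the point of view of the CPU hardware architecture. Pay attention to cache usage, bus utilization efficiency, and so on. For Intel CPUs, the tool "Intel VTune Performance Analyzer" is available. This step is somewhat more complex, but for certain applications it can readily more than double computational throughput, for two reasons: 1) the CPU accesses cache far faster than main memory (roughly one to two orders of magnitude, depending on the cache level); 2) on multi-core systems, the memory bus often becomes the performance bottleneck of the whole system.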

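2. Using the Intel VTune Performance Analyzer. VTune provides two sampling modes:
   time-based sampling: samples the running software at regular time intervals.
   event-based sampling: uses the processor's internal counters to measure what the software is doing. Different CPU architectures have different events, so you must know the CPU type before using it.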

   Commonly used evaluation metrics

   2.1 Cycles per Retired Instruction (CPI) = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY
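      Measures the number of clock cycles spent per retired instruction; smaller is better. Because the CPU is superscalar (it has multiple execution pipelines), it can retire more than one instruction per clock cycle, so CPI can be below 1. On Core i7, which has no per-core clocktick counter, use CPU_CLK_UNHALTED.THREAD in place of CPU_CLK_UNHALTED.CORE.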
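      As a rough illustration of the CPI formula, the sketch below divides two raw counter totals. The function name and the sample numbers are hypothetical; they are not measured data and not part of VTune.

```python
def cycles_per_instruction(unhalted_cycles: int, instructions_retired: int) -> float:
    """Return CPI = unhalted clock cycles / retired instructions."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return unhalted_cycles / instructions_retired


# Hypothetical counter totals, chosen only to show the arithmetic.
cpi = cycles_per_instruction(3_000_000, 2_000_000)
print(f"CPI = {cpi:.2f}")
```

      Both totals are assumed to come from the same sampling run, so that the ratio describes a single workload.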

   2.2 Execution Stall Rate = (UOPS_EXECUTED.CORE_STALL_CYCLES / (UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES) ) * 100
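      The same formula written out as a small sketch. The function name and the sample totals are hypothetical and serve only to illustrate the calculation.

```python
def execution_stall_rate(stall_cycles: int, active_cycles: int) -> float:
    """Return the percentage of execution-stage cycles that were stalled."""
    total_cycles = stall_cycles + active_cycles
    if total_cycles <= 0:
        raise ValueError("stall_cycles + active_cycles must be positive")
    return stall_cycles / total_cycles * 100


# Hypothetical totals for UOPS_EXECUTED.CORE_STALL_CYCLES and
# UOPS_EXECUTED.CORE_ACTIVE_CYCLES; not measured data.
rate = execution_stall_rate(250, 750)
print(f"Execution stall rate = {rate:.1f}%")
```

      Because both counters count per core rather than per thread, the result describes the core as a whole.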

   2.3 L2 Cache Miss Impact
   2.4 Branch Misprediction Ratio
   2.5 Bus Utilization Ratio

3. Commonly used events on Core i7 and Xeon 5500 series CPUs

   CPU_CLK_UNHALTED.THREAD: measures unhalted clockticks on a per-thread basis. So for each tick of the CPU's clock, the counter will count 2 ticks if Hyper-Threading is enabled, and 1 tick if Hyper-Threading is disabled. There is no per-core clocktick counter.

   CPU_CLK_UNHALTED.REF: counts unhalted clockticks per thread, at the reference frequency of the CPU. In other words, the CPU_CLK_UNHALTED.REF counter should not increase or decrease as a result of frequency changes due to Turbo Mode or SpeedStep Technology.
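   Since CPU_CLK_UNHALTED.REF is unaffected by frequency changes while CPU_CLK_UNHALTED.THREAD follows the actual clock, dividing one by the other gives a rough idea of how far the CPU ran above or below its reference frequency during the run. This derived ratio is an illustration and is not one of the metrics listed in these notes; the function name and sample numbers are hypothetical, and both counters are assumed to cover the same thread and the same interval.

```python
def average_frequency_ratio(thread_cycles: int, ref_cycles: int) -> float:
    """Return CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF.

    A value above 1 suggests the thread ran faster than the reference
    frequency on average (for example with Turbo Mode), a value below 1
    suggests it ran slower (for example with SpeedStep).
    """
    if ref_cycles <= 0:
        raise ValueError("ref_cycles must be positive")
    return thread_cycles / ref_cycles


# Hypothetical counter totals; not measured data.
ratio = average_frequency_ratio(1_200_000, 1_000_000)
print(f"average frequency ratio = {ratio:.2f}")
```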

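   UOPS_EXECUTED.CORE_STALL_CYCLES: measures cycles in which the EXECUTION stage of the pipeline is stalled. This counter counts per core, not per thread.

   UOPS_EXECUTED.CORE_ACTIVE_CYCLES: measures cycles in which the EXECUTION stage of the pipeline is active, that is, at least one uop is being executed. Like CORE_STALL_CYCLES, it counts per core.

   INST_RETIRED.ANY: counts the instructions retired.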

Friday, August 6, 2010
