Commit 352a154e authored by Tobias Tebbi, committed by Commit Bot

[compiler] improve inlining heuristics: call frequency per executed bytecodes

TLDR: Inline less, but more where it matters. ~10% decrease in TurboFan
compile time (including off-thread work), while improving Octane scores by ~2%.

How things used to work:

The flag FLAG_min_inlining_frequency restricts inlining to call sites
that are executed sufficiently often. This call frequency was measured
relative to invocations of the parent (= the function we originally
optimize). At the same time, the limit was very low (0.15), meaning we
mostly relied on the total amount of inlined code
(FLAG_max_inlined_bytecode_size_cumulative) to limit inlining.
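
Concretely, the old per-call-site gate in JSInliningHeuristic::Reduce
(visible among the removed lines further down in this diff) was roughly:

  if (candidate.frequency.IsKnown() &&
      candidate.frequency.value() < FLAG_min_inlining_frequency) {
    return NoChange();
  }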

How things work now:

Instead of measuring call frequency relative to parent invocations, we
want a measure that predicts how often the call site in question will
be executed in the future. An obvious attempt at that would be to use
the absolute number of times the call site was executed in the past.
But depending on how fast feedback stabilizes, it can take more or less
time until we optimize a function; if we just took the absolute call
count up to the point of optimization, we would inline more for
functions that stabilize slowly, which doesn't make sense. So instead,
we measure the absolute call count per KB of bytecode executed by the
parent function.
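
This per-KB frequency is derived from the runtime profiler's tick
counter: one tick corresponds to roughly FLAG_interrupt_budget bytes of
executed bytecode. A standalone sketch of the computation added in
GraphBuilderPhase in this change (the helper name is illustrative; the
flag and the KB constant are the real ones used in the patch):

  double CallsPerKBOfExecutedBytecode(int invocation_count,
                                      int total_profiler_ticks) {
    double ticks = total_profiler_ticks;
    if (ticks == 0) {
      // Only happens when optimization is forced, e.g. in tests; pick a
      // tiny value so that inlining still happens.
      ticks = 1.0 / FLAG_interrupt_budget;
    }
    double executed_bytecode_bytes = ticks * FLAG_interrupt_budget;
    return invocation_count / (executed_bytecode_bytes / KB);
  }
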
Since inlining big functions is more expensive, this threshold is
additionally scaled linearly with the bytecode size of the inlinee.
The resulting condition for inlining is:

  call_frequency >
    FLAG_min_inlining_frequency *
      (bytecode.length() - FLAG_max_inlined_bytecode_size_small) /
      (FLAG_max_inlined_bytecode_size - FLAG_max_inlined_bytecode_size_small)
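
For example, with the defaults set in this change
(FLAG_min_inlining_frequency = 21.7, FLAG_max_inlined_bytecode_size_small = 30,
FLAG_max_inlined_bytecode_size = 500), an inlinee of 265 bytes of
bytecode is only considered if its call site is hit more than
21.7 * (265 - 30) / (500 - 30) ~ 10.9 times per KB of bytecode executed
by the caller; inlinees of at most 30 bytes skip the frequency check
entirely.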

The new threshold is chosen such that it effectively limits inlining,
which allows us to increase FLAG_max_inlined_bytecode_size_cumulative
without increasing inlining overall.

The ~10% reduction in compile time (x64 build) was observed in Octane,
ARES-6, web-tooling-benchmark, and the standalone TypeScript benchmark.
The hope is that this will reduce CPU time in real-world situations too.
The Octane improvements come from inlining more in the places where it
matters.

Bug: v8:6682

Change-Id: I99baa17dec85b71616a3ab3414d7e055beca39a0
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1768366
Commit-Queue: Tobias Tebbi <tebbi@chromium.org>
Reviewed-by: Jakob Gruber <jgruber@chromium.org>
Reviewed-by: Ross McIlroy <rmcilroy@chromium.org>
Reviewed-by: Georg Neis <neis@chromium.org>
Reviewed-by: Maya Lekova <mslekova@chromium.org>
Cr-Commit-Position: refs/heads/master@{#63449}
parent 604ef7bb
......@@ -1387,11 +1387,8 @@ extern class FeedbackVector extends HeapObject {
closure_feedback_cell_array: FixedArray;
length: int32;
invocation_count: int32;
profiler_ticks: int32;
// TODO(v8:9287) The padding is not necessary on platforms with 4 bytes
// tagged pointers, we should make it conditional; however, platform-specific
// code interacts badly with GCMole, so we need to address that first.
padding: uint32;
profiler_ticks_since_last_feedback_change: int32;
total_profiler_ticks: int32;
}
extern class FeedbackCell extends Struct {
......
......@@ -10263,9 +10263,8 @@ void CodeStubAssembler::ReportFeedbackUpdate(
SloppyTNode<FeedbackVector> feedback_vector, SloppyTNode<IntPtrT> slot_id,
const char* reason) {
// Reset profiler ticks.
StoreObjectFieldNoWriteBarrier(
feedback_vector, FeedbackVector::kProfilerTicksOffset, Int32Constant(0),
MachineRepresentation::kWord32);
StoreFeedbackVectorProfilerTicksSinceLastFeedbackChange(feedback_vector,
Int32Constant(0));
#ifdef V8_TRACE_FEEDBACK_UPDATES
// Trace the update.
......
......@@ -830,7 +830,7 @@ MaybeHandle<Code> GetOptimizedCode(Handle<JSFunction> function,
// Reset profiler ticks, function is no longer considered hot.
DCHECK(shared->is_compiled());
function->feedback_vector().set_profiler_ticks(0);
function->feedback_vector().set_profiler_ticks_since_last_feedback_change(0);
VMState<COMPILER> state(isolate);
TimerEventScope<TimerEventOptimizeCode> optimize_code_timer(isolate);
......@@ -2273,7 +2273,9 @@ bool Compiler::FinalizeOptimizedCompilationJob(OptimizedCompilationJob* job,
Handle<SharedFunctionInfo> shared = compilation_info->shared_info();
// Reset profiler ticks, function is no longer considered hot.
compilation_info->closure()->feedback_vector().set_profiler_ticks(0);
compilation_info->closure()
->feedback_vector()
.set_profiler_ticks_since_last_feedback_change(0);
DCHECK(!shared->HasBreakInfo());
......
......@@ -25,6 +25,16 @@ namespace {
bool IsSmall(BytecodeArrayRef bytecode) {
return bytecode.length() <= FLAG_max_inlined_bytecode_size_small;
}
double CallFrequencyLimit(BytecodeArrayRef bytecode) {
if (IsSmall(bytecode)) return 0;
int length = bytecode.length();
DCHECK_GT(length, FLAG_max_inlined_bytecode_size_small);
DCHECK_LE(length, FLAG_max_inlined_bytecode_size);
return FLAG_min_inlining_frequency *
(length - FLAG_max_inlined_bytecode_size_small) /
(FLAG_max_inlined_bytecode_size -
FLAG_max_inlined_bytecode_size_small);
}
} // namespace
JSInliningHeuristic::Candidate JSInliningHeuristic::CollectFunctions(
......@@ -107,6 +117,15 @@ Reduction JSInliningHeuristic::Reduce(Node* node) {
return NoChange();
}
// Gather feedback on how often this call site has been hit before.
if (node->opcode() == IrOpcode::kJSCall) {
CallParameters const p = CallParametersOf(node->op());
candidate.frequency = p.frequency();
} else {
ConstructParameters const p = ConstructParametersOf(node->op());
candidate.frequency = p.frequency();
}
bool can_inline_candidate = false, candidate_is_small = true;
candidate.total_size = 0;
Node* frame_state = NodeProperties::GetFrameStateInput(node);
......@@ -135,7 +154,10 @@ Reduction JSInliningHeuristic::Reduce(Node* node) {
SharedFunctionInfoRef shared = candidate.functions[i].has_value()
? candidate.functions[i].value().shared()
: candidate.shared_info.value();
candidate.can_inline_function[i] = shared.IsInlineable();
if (!shared.IsInlineable()) {
candidate.can_inline_function[i] = false;
continue;
}
// Do not allow direct recursion, i.e. f() -> f(). We still allow indirect
// recursion like f() -> g() -> f(). The indirect recursion is helpful in
// cases where f() is a small dispatch function that calls the appropriate
......@@ -150,27 +172,27 @@ Reduction JSInliningHeuristic::Reduce(Node* node) {
TRACE("Not considering call site #%d:%s, because of recursive inlining\n",
node->id(), node->op()->mnemonic());
candidate.can_inline_function[i] = false;
continue;
}
// A function reaching this point should always have its bytecode
// serialized.
BytecodeArrayRef bytecode = candidate.bytecode[i].value();
if (candidate.can_inline_function[i]) {
// Don't consider a {candidate} whose call frequency is below the threshold.
// The frequency is the estimated call count per KB of executed bytecode of
// the function we're optimizing. The threshold is scaled linearly based on
// the size of the {candidate}.
if (candidate.frequency.IsKnown() &&
candidate.frequency.value() < CallFrequencyLimit(bytecode)) {
candidate.can_inline_function[i] = false;
continue;
}
candidate.can_inline_function[i] = true;
can_inline_candidate = true;
candidate.total_size += bytecode.length();
}
candidate_is_small = candidate_is_small && IsSmall(bytecode);
}
if (!can_inline_candidate) return NoChange();
// Gather feedback on how often this call site has been hit before.
if (node->opcode() == IrOpcode::kJSCall) {
CallParameters const p = CallParametersOf(node->op());
candidate.frequency = p.frequency();
} else {
ConstructParameters const p = ConstructParametersOf(node->op());
candidate.frequency = p.frequency();
}
// Handling of special inlining modes right away:
// - For restricted inlining: stop all handling at this point.
// - For stressing inlining: immediately handle all functions.
......@@ -183,14 +205,6 @@ Reduction JSInliningHeuristic::Reduce(Node* node) {
break;
}
// Don't consider a {candidate} whose frequency is below the
// threshold, i.e. a call site that is only hit once every N
// invocations of the caller.
if (candidate.frequency.IsKnown() &&
candidate.frequency.value() < FLAG_min_inlining_frequency) {
return NoChange();
}
// Forcibly inline small functions here. In the case of polymorphic inlining
// candidate_is_small is set only when all functions are small.
if (candidate_is_small) {
......
......@@ -29,6 +29,8 @@ struct JSOperatorGlobalCache;
// Defines the frequency a given Call/Construct site was executed. For some
// call sites the frequency is not known.
// Call frequency is measured as invocations per KB of executed bytecode of the
// function we're optimizing, based on runtime profiler ticks.
class CallFrequency final {
public:
CallFrequency() : value_(std::numeric_limits<float>::quiet_NaN()) {}
......
......@@ -1188,7 +1188,17 @@ struct GraphBuilderPhase {
if (data->info()->is_bailout_on_uninitialized()) {
flags |= BytecodeGraphBuilderFlag::kBailoutOnUninitialized;
}
CallFrequency frequency(1.0f);
double invocation_count =
data->info()->closure()->feedback_vector().invocation_count();
double total_ticks =
data->info()->closure()->feedback_vector().total_profiler_ticks();
if (total_ticks == 0) {
// This can only happen in tests when forcing optimization.
// Pick a small number so that inlining still happens.
total_ticks = 1.0 / FLAG_interrupt_budget;
}
double executed_bytecode_bytes = total_ticks * FLAG_interrupt_budget;
CallFrequency frequency(invocation_count / (executed_bytecode_bytes / KB));
BuildGraphFromBytecode(
data->broker(), temp_zone, data->info()->bytecode_array(),
data->info()->shared_info(),
......
......@@ -1127,7 +1127,9 @@ void FeedbackVector::FeedbackVectorPrint(std::ostream& os) { // NOLINT
os << optimization_marker();
}
os << "\n - invocation count: " << invocation_count();
os << "\n - profiler ticks: " << profiler_ticks();
os << "\n - profiler ticks since last feedback change: "
<< profiler_ticks_since_last_feedback_change();
os << "\n - total profiler ticks: " << total_profiler_ticks();
FeedbackMetadataIterator iter(metadata());
while (iter.HasNext()) {
......
......@@ -5,6 +5,7 @@
#include "src/execution/runtime-profiler.h"
#include "src/base/platform/platform.h"
#include "src/base/safe_conversions.h"
#include "src/codegen/assembler.h"
#include "src/codegen/compilation-cache.h"
#include "src/codegen/compiler.h"
......@@ -150,7 +151,8 @@ void RuntimeProfiler::MaybeOptimize(JSFunction function,
}
bool RuntimeProfiler::MaybeOSR(JSFunction function, InterpretedFrame* frame) {
int ticks = function.feedback_vector().profiler_ticks();
int ticks =
function.feedback_vector().profiler_ticks_since_last_feedback_change();
// TODO(rmcilroy): Also ensure we only OSR top-level code if it is smaller
// than kMaxToplevelSourceSize.
......@@ -172,7 +174,8 @@ bool RuntimeProfiler::MaybeOSR(JSFunction function, InterpretedFrame* frame) {
OptimizationReason RuntimeProfiler::ShouldOptimize(JSFunction function,
BytecodeArray bytecode) {
int ticks = function.feedback_vector().profiler_ticks();
int ticks =
function.feedback_vector().profiler_ticks_since_last_feedback_change();
int ticks_for_optimization =
kProfilerTicksBeforeOptimization +
(bytecode.length() / kBytecodeSizeAllowancePerTick);
......@@ -227,10 +230,13 @@ void RuntimeProfiler::MarkCandidatesForOptimization() {
// TODO(leszeks): Move this increment to before the maybe optimize checks,
// and update the tests to assume the increment has already happened.
int ticks = function.feedback_vector().profiler_ticks();
if (ticks < Smi::kMaxValue) {
function.feedback_vector().set_profiler_ticks(ticks + 1);
}
int64_t stable_ticks =
function.feedback_vector().profiler_ticks_since_last_feedback_change();
function.feedback_vector().set_profiler_ticks_since_last_feedback_change(
base::saturated_cast<int32_t>(stable_ticks + 1));
int64_t total_ticks = function.feedback_vector().total_profiler_ticks();
function.feedback_vector().set_total_profiler_ticks(
base::saturated_cast<int32_t>(total_ticks + 1));
}
any_ic_changed_ = false;
}
......
......@@ -552,10 +552,12 @@ DEFINE_BOOL(function_context_specialization, false,
DEFINE_BOOL(turbo_inlining, true, "enable inlining in TurboFan")
DEFINE_INT(max_inlined_bytecode_size, 500,
"maximum size of bytecode for a single inlining")
DEFINE_INT(max_inlined_bytecode_size_cumulative, 1000,
"maximum cumulative size of bytecode considered for inlining")
DEFINE_INT(max_inlined_bytecode_size_cumulative, 2500,
"the soft limit for maximum cumulative size of bytecode considered "
"for inlining (can be exceeded by small functions)")
DEFINE_INT(max_inlined_bytecode_size_absolute, 5000,
"maximum cumulative size of bytecode considered for inlining")
"the hard limit for maximum cumulative size of bytecode considered "
"for inlining")
DEFINE_FLOAT(reserve_inline_budget_scale_factor, 1.2,
"maximum cumulative size of bytecode considered for inlining")
DEFINE_INT(max_inlined_bytecode_size_small, 30,
......@@ -564,7 +566,9 @@ DEFINE_INT(max_optimized_bytecode_size, 60 * KB,
"maximum bytecode size to "
"be considered for optimization; too high values may cause "
"the compiler to hit (release) assertions")
DEFINE_FLOAT(min_inlining_frequency, 0.15, "minimum frequency for inlining")
DEFINE_FLOAT(min_inlining_frequency, 21.7,
"minimum call frequency for inlining, measured in invocations per "
"KB of executed bytecode, scaled down with inlinee size")
DEFINE_BOOL(polymorphic_inlining, true, "polymorphic inlining")
DEFINE_BOOL(stress_inline, false,
"set high thresholds for inlining to inline as much as possible")
......
......@@ -525,8 +525,8 @@ Handle<FeedbackVector> Factory::NewFeedbackVector(
: OptimizationMarker::kNone)));
vector->set_length(length);
vector->set_invocation_count(0);
vector->set_profiler_ticks(0);
vector->clear_padding();
vector->set_profiler_ticks_since_last_feedback_change(0);
vector->set_total_profiler_ticks(0);
vector->set_closure_feedback_cell_array(*closure_feedback_cell_array);
// TODO(leszeks): Initialize based on the feedback metadata.
......
......@@ -284,15 +284,15 @@ void IC::OnFeedbackChanged(const char* reason) {
void IC::OnFeedbackChanged(Isolate* isolate, FeedbackVector vector,
FeedbackSlot slot, const char* reason) {
if (FLAG_trace_opt_verbose) {
if (vector.profiler_ticks() != 0) {
if (vector.profiler_ticks_since_last_feedback_change() != 0) {
StdoutStream os;
os << "[resetting ticks for ";
vector.shared_function_info().ShortPrint(os);
os << " from " << vector.profiler_ticks()
os << " from " << vector.profiler_ticks_since_last_feedback_change()
<< " due to IC change: " << reason << "]" << std::endl;
}
}
vector.set_profiler_ticks(0);
vector.set_profiler_ticks_since_last_feedback_change(0);
#ifdef V8_TRACE_FEEDBACK_UPDATES
if (FLAG_trace_feedback_updates) {
......
......@@ -107,14 +107,9 @@ ACCESSORS(FeedbackVector, closure_feedback_cell_array, ClosureFeedbackCellArray,
kClosureFeedbackCellArrayOffset)
INT32_ACCESSORS(FeedbackVector, length, kLengthOffset)
INT32_ACCESSORS(FeedbackVector, invocation_count, kInvocationCountOffset)
INT32_ACCESSORS(FeedbackVector, profiler_ticks, kProfilerTicksOffset)
void FeedbackVector::clear_padding() {
if (FIELD_SIZE(kPaddingOffset) == 0) return;
DCHECK_EQ(4, FIELD_SIZE(kPaddingOffset));
memset(reinterpret_cast<void*>(address() + kPaddingOffset), 0,
FIELD_SIZE(kPaddingOffset));
}
INT32_ACCESSORS(FeedbackVector, profiler_ticks_since_last_feedback_change,
kProfilerTicksSinceLastFeedbackChangeOffset)
INT32_ACCESSORS(FeedbackVector, total_profiler_ticks, kTotalProfilerTicksOffset)
bool FeedbackVector::is_empty() const { return length() == 0; }
......
......@@ -245,7 +245,8 @@ Handle<FeedbackVector> FeedbackVector::New(
FLAG_log_function_events ? OptimizationMarker::kLogFirstExecution
: OptimizationMarker::kNone)));
DCHECK_EQ(vector->invocation_count(), 0);
DCHECK_EQ(vector->profiler_ticks(), 0);
DCHECK_EQ(vector->profiler_ticks_since_last_feedback_change(), 0);
DCHECK_EQ(vector->total_profiler_ticks(), 0);
// Ensure we can skip the write barrier
Handle<Object> uninitialized_sentinel = UninitializedSentinel(isolate);
......
......@@ -204,12 +204,14 @@ class FeedbackVector : public HeapObject {
// [invocation_count]: The number of times this function has been invoked.
DECL_INT32_ACCESSORS(invocation_count)
// [profiler_ticks]: The number of times this function has been seen by the
// runtime profiler.
DECL_INT32_ACCESSORS(profiler_ticks)
// Initialize the padding if necessary.
inline void clear_padding();
// [profiler_ticks_since_last_feedback_change]: The number of times this
// function has been seen by the runtime profiler since the last optimization
// or feedback change.
DECL_INT32_ACCESSORS(profiler_ticks_since_last_feedback_change)
// [total_profiler_ticks]: Total profiler ticks, not reset on feedback changes
// or optimizations.
DECL_INT32_ACCESSORS(total_profiler_ticks)
inline void clear_invocation_count();
......