From: Hillf Danton <hdanton@sina.com>
To: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@kernel.org, vincent.guittot@linaro.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mgorman@suse.de, bristot@redhat.com, corbet@lwn.net,
	kprateek.nayak@amd.com, youssefesmat@chromium.org,
	joel@joelfernandes.org, efault@gmx.de
Subject: Re: [PATCH 00/17] sched: EEVDF using latency-nice
Date: Mon, 10 Apr 2023 16:23:07 +0800
Message-ID: <20230410082307.1327-1-hdanton@sina.com>
In-Reply-To: <20230410031350.GA49280@maniforge>

On 9 Apr 2023 22:13:50 -0500 David Vernet <void@manifault.com> wrote:
> 
> Hi Peter,
> 
> I used the EEVDF scheduler to run workloads on one of Meta's largest
> services (our main HHVM web server), and I wanted to share my
> observations with you.

Thanks for your testing.
> 
> 3. Low latency + long slice are not mutually exclusive for us
> 
> An interesting quality of web workloads running JIT engines is that they
> require both low latency and long slices on the CPU. The reason we need
> the tasks to be low latency is that they're on the critical path for
> servicing web requests (for most of their runtime, at least), and the
> reasons we need them to have long slices are enumerated above -- they
> thrash the icache / DSB / iTLB; more aggressive context switching causes
> us to thrash on paging from disk; and in general, these tasks are on the
> critical path for servicing web requests and we want to encourage them
> to run to completion.
> 
> This causes EEVDF to perform poorly for workloads with these
> characteristics. If we decrease latency nice for our web workers then

Take a look at the diff below.

> they'll have lower latency, but only because their slices are smaller.
> This in turn causes the increase in context switches, which causes the
> thrashing described above.
> 
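To make that coupling explicit: in the series, an entity's virtual deadline is
roughly its vruntime plus its slice scaled by the inverse of its weight, and a
shorter slice therefore means an earlier deadline (picked sooner) but also an
earlier switch-out. A toy userspace sketch of that relation, not the kernel
code -- the weights and numbers below are made up:

/* Toy model of the slice/deadline coupling -- a userspace sketch,
 * not the kernel code; weights and numbers are made up.
 */
#include <stdio.h>

struct toy_se {
	const char *name;
	double vruntime;	/* virtual service received so far */
	double slice;		/* requested slice (what latency nice shrinks) */
	double weight;		/* load weight from the nice level */
};

/* Earlier virtual deadline -> picked sooner. */
static double vdeadline(const struct toy_se *se)
{
	return se->vruntime + se->slice / se->weight;
}

int main(void)
{
	struct toy_se web   = { "web",   0.0,  3.0, 1.0 };	/* short slice */
	struct toy_se batch = { "batch", 0.0, 35.0, 1.0 };	/* long slice */

	printf("%s deadline %.1f, %s deadline %.1f\n",
	       web.name, vdeadline(&web), batch.name, vdeadline(&batch));
	/*
	 * The short-slice task wins the pick, but after ~3 units of runtime
	 * it gets a new, later deadline and can be preempted -- which is the
	 * extra context switching described above.
	 */
	return 0;
}
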
> Worth noting -- I did try and increase the default base slice length by
> setting sysctl_sched_base_slice to 35ms, and these were the results:
> 
> With EEVDF slice 35ms and latency_nice 0
> ----------------------------------------
> - .5 - 2.25% drop in throughput
> - 2.5 - 4.5% increase in p95 latencies
> - 2.5 - 5.25% increase in p99 latencies
> - Context switch per minute increase: 9.5 - 12.4%
> - Involuntary context switch increase: ~320 - 330%
> - Major fault delta: -3.6% to 37.6%
> - IPC decrease .5 - .9%
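(For anyone reproducing this run: a minimal sketch of bumping the base slice to
35ms, assuming debugfs is mounted at /sys/kernel/debug and the series exposes
the knob as sched/base_slice_ns -- the exact path is an assumption, not
something confirmed in this thread; adjust to your tree.)

/* Sketch: set the base slice to 35ms for a test run. */
#include <stdio.h>

int main(void)
{
	/* Assumed debugfs path; adjust if your tree differs. */
	const char *path = "/sys/kernel/debug/sched/base_slice_ns";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%llu\n", 35ULL * 1000 * 1000);	/* 35ms in nanoseconds */
	return fclose(f) ? 1 : 0;
}
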
> 
> With EEVDF slice 35ms and latency_nice -8 for web workers
> ---------------------------------------------------------
> - .5 - 2.5% drop in throughput
> - 1.7 - 4.75% increase in p95 latencies
> - 2.5 - 5% increase in p99 latencies
> - Context switch per minute increase: 10.5 - 15%
> - Involuntary context switch increase: ~327 - 350%
> - Major fault delta: -1% to 45%
> - IPC decrease .4 - 1.1%
> 
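Incidentally, the per-task knob is what drives the -8 runs; a minimal sketch of
setting it via sched_setattr(), assuming the uapi from the earlier latency-nice
postings this series builds on (a sched_latency_nice field appended to struct
sched_attr, gated by a SCHED_FLAG_LATENCY_NICE flag) -- the field name, layout
and flag value are assumptions, not confirmed in this thread:

/* Sketch: set latency_nice = -8 on the calling task.
 * Struct layout, field name and flag value are assumptions based on the
 * latency-nice postings; on a kernel without them sched_setattr() just
 * fails with EINVAL/E2BIG.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr_ln {			/* local copy to avoid header clashes */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* assumed field from the latency-nice series */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed flag value */

int main(void)
{
	struct sched_attr_ln attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;			/* SCHED_OTHER */
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -8;

	/* Note: without SCHED_FLAG_KEEP_* this also (re)applies SCHED_OTHER/nice 0. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}
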
> I was expecting the increase in context switches and involuntary context
> switches to be lower than what they ended up being with the increased
> default slice length. Regardless, it still seems to tell a relatively
> consistent story with the numbers from above. The improvement in IPC is
> expected, though also smaller than I was anticipating (presumably due to
> the still-high context switch rate). There were also fewer major faults
> per minute compared to runs with a shorter default slice.
> 
> Note that even if increasing the slice length did cause fewer context
> switches and major faults, I still expect that it would hurt throughput
> and latency for HHVM: once the latency-nicer tasks are eventually given
> the CPU, the web workers have to wait longer than we'd like while those
> tasks burn through their longer slices.
> 
> In summary, I must admit that this patch set makes me a bit nervous.
> Speaking for Meta at least, the patch set in its current form causes
> performance regressions that exceed what we're able to tolerate in
> production (generally < .5% at the very most). More broadly, it will
> certainly force us to carefully consider how it affects our model for
> server capacity.
> 
> Thanks,
> David
> 

Just to narrow down the poor performance reported, rather than to fix it, make
a tradeoff between runtime and latency by restoring a sysctl_sched_min_granularity
style floor at tick preemption, given the known deadline order on the runqueue.

--- x/kernel/sched/fair.c
+++ y/kernel/sched/fair.c
@@ -5172,6 +5172,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
+	u64 min_gran = 1000000ULL;	/* 1ms floor, roughly the old min granularity */
+	u64 delta_exec;
+
+	/* Let curr run at least min_gran since it was last switched in. */
+	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+	if (delta_exec < min_gran)
+		return;
 	if (pick_eevdf(cfs_rq) != curr) {
 		resched_curr(rq_of(cfs_rq));
 		/*
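
The floor only gates the tick path (check_preempt_tick()); wakeup preemption is
untouched. With it, curr is guaranteed at least 1ms on the CPU after being
switched in before the tick can hand the CPU to another entity, which is
roughly the old minimum-granularity behaviour. The 1ms value is only a guess
meant to separate the slice-length effect from the deadline ordering, not a
tuned number.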

