Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups
From: Peter Oskolkov <posk@google.com>
Date: Wed, 15 Dec 2021 15:26:19 -0800
References: <20211214204445.665580974@infradead.org> <20211215222524.GH16608@worktop.programming.kicks-ass.net>
In-Reply-To: <20211215222524.GH16608@worktop.programming.kicks-ass.net>
To: Peter Zijlstra
Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli@redhat.com,
    Vincent Guittot, dietmar.eggemann@arm.com, Steven Rostedt, Ben Segall,
    mgorman@suse.de, bristot@redhat.com, Linux Kernel Mailing List,
    Linux Memory Management List, linux-api@vger.kernel.org,
    x86@kernel.org, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle

On Wed, Dec 15, 2021 at 2:25 PM Peter Zijlstra wrote:
>
> On Wed, Dec 15, 2021 at 11:49:51AM -0800, Peter Oskolkov wrote:
>
> > TL;DR: our models are different here. In your model a single server
> > can have a bunch of workers interacting with it; in my model only a
> > single RUNNING worker is assigned to a server, which it wakes when it
> > blocks.
>
> So part of the problem is that none of that was evident from the code.
> It is also completely different from the scheduler code it lives in,
> making it double confusing.
>
> After having read the code, I still had no clue what so ever how it was
> supposed to be used. Which is where my reverse engineering started :/

I posted a doc patch:

https://lore.kernel.org/lkml/20211122211327.5931-6-posk@google.com/

a lib patch with userspace code:

https://lore.kernel.org/lkml/20211122211327.5931-5-posk@google.com/

and a doc patch for the lib/userspace code:

https://lore.kernel.org/lkml/20211122211327.5931-7-posk@google.com/

I spent at least two weeks polishing the lib patch and the docs, much
more if previous patchsets are to be taken into account. Yes, they are
confusing, and most likely answer all of the wrong questions, but I did
try to make my approach as clear as possible... I apologize if that was
not very successful...

> > More details:
> >
> > "Working servers" cannot get wakeups, because a "working server" has a
> > single RUNNING worker attached to it. When a worker blocks, it wakes
> > its attached server and becomes a detached blocked worker (same is
> > true if the worker is "preempted": it blocks and wakes its assigned
> > server).
>
> But who would do the preemption if the server isn't allowed to run?
>
> > Blocked workers upon wakeup do this, in order:
> >
> > - always add themselves to the runnable worker list (the list is
> >   shared among ALL servers, it is NOT per server);
>
> That seems like a scalability issue. And, as said, it is completely
> alien when compared to the way Linux itself does scheduling.
>
> > - wake a server pointed to by idle_server_ptr, if not NULL;
> > - sleep, waiting for a wakeup from a server;
> >
> > Server S, upon becoming IDLE (no worker to run, or woken on idle
> > server list) does this, in order, in userspace (simplified, see
> > umcg_get_idle_worker() in
> > https://lore.kernel.org/lkml/20211122211327.5931-5-posk@google.com/):
> > - take a userspace (spin) lock (so the steps below are all within a
> >   single critical section):
>
> Don't ever suggest userspace spinlocks, they're horrible crap.

This can easily be a mutex, not really important (although for very
short critical sections with only memory reads/writes, like here, spin
locks often perform better, in our experience).
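Before getting to the rest of the server-side protocol quoted below:
since the worker-side steps above are easy to lose track of in prose,
here is a rough sketch in C of what a woken worker conceptually does,
in that order. Purely illustrative -- the types and helper names are
made up for this mail, this is not the actual kernel or lib patch code:

/*
 * Illustrative sketch only -- NOT the actual UMCG kernel or lib code.
 * struct worker, struct server, wake_server() and worker_sleep() are
 * all made up here.
 */
#include <stdatomic.h>
#include <stddef.h>

struct server;
struct worker {
        struct worker *next_runnable;
        /* ... */
};

/* Shared among ALL servers -- this is the list flagged above. */
extern _Atomic(struct worker *) runnable_list_head;
extern _Atomic(struct server *) idle_server_ptr;

void wake_server(struct server *s);     /* hypothetical, e.g. futex wake */
void worker_sleep(struct worker *w);    /* hypothetical, e.g. futex wait */

/* Push a woken worker onto the shared runnable list (lock-free). */
void runnable_list_push(struct worker *w)
{
        struct worker *head = atomic_load(&runnable_list_head);

        do {
                w->next_runnable = head;
        } while (!atomic_compare_exchange_weak(&runnable_list_head, &head, w));
}

/* What a blocked worker does when it becomes runnable again, in order. */
void worker_woken(struct worker *w)
{
        /* 1. Always publish ourselves on the shared runnable worker list. */
        runnable_list_push(w);

        /* 2. If a server has advertised itself as idle, wake it. */
        struct server *s = atomic_exchange(&idle_server_ptr, NULL);
        if (s)
                wake_server(s);

        /* 3. Sleep, waiting for a server to pick us up and run us. */
        worker_sleep(w);
}

The real wakeup path of course runs in the kernel; the only point of
the sketch is the ordering: publish on the shared runnable list first,
then poke idle_server_ptr, then sleep.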
> > - compare_xchg(idle_server_ptr, NULL, S);
> >   - if failed, there is another server in idle_server_ptr, so S adds
> >     itself to the userspace idle server list, releases the lock, goes to
> >     sleep;
> >   - if succeeded:
> >     - check the runnable worker list;
> >       - if empty, release the lock, sleep;
> >       - if not empty:
> >         - get the list
> >         - xchg(idle_server_ptr, NULL) (either S removes itself, or
> >           a worker in the kernel does it first, does not matter);
> >         - release the lock;
> >         - wake server S1 on idle server list. S1 goes through all
> >           of these steps.
> >
> > The protocol above serializes the userspace dealing with the idle
> > server ptr/list. Wakeups in the kernel will be caught if there are
> > idle servers. Yes, the protocol in the userspace is complicated (more
> > complicated than outlined above, as the reaped idle/runnable worker
> > list from the kernel is added to the userspace idle/runnable worker
> > list), but the kernel side is very simple. I've tested this
> > interaction extensively, I'm reasonably sure that no worker wakeups
> > are lost.
>
> Sure, but also seems somewhat congestion prone :/

The whole critical section under the lock is just several memory
read/write operations, so very short. And workers are removed from the
kernel's list of runnable/woken workers all at once; and the server
processing the runnable worker list knows how many of them are now
available to run, so the appropriate number of idle servers can be
woken (not yet implemented in my lib patch). So yes, this can be a
bottleneck, but there are ways to make it less and less likely (by
making the userspace more complicated; but this is not a concern).
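To make the server-side steps quoted above easier to follow, here is a
matching sketch, again purely illustrative: the real code is
umcg_get_idle_worker() in the lib patch, the helpers and names below
are made up, and I use a pthread mutex since, as said, the lock does
not have to be a spinlock:

/*
 * Illustrative sketch of the server-side idle path -- NOT the actual
 * umcg_get_idle_worker(); idle_servers_push/pop(), server_sleep() and
 * wake_server() are made up here.
 */
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

struct worker;
struct server {
        struct server *next_idle;
        /* ... */
};

extern _Atomic(struct worker *) runnable_list_head;
extern _Atomic(struct server *) idle_server_ptr;

/* "Take a userspace (spin) lock" -- a plain mutex works just as well. */
static pthread_mutex_t idle_lock = PTHREAD_MUTEX_INITIALIZER;

void idle_servers_push(struct server *s);  /* hypothetical idle server list */
struct server *idle_servers_pop(void);
void server_sleep(struct server *s);
void wake_server(struct server *s);

/*
 * Server S becomes idle: returns a batch of runnable workers to run,
 * or NULL after having slept and been woken (caller then retries).
 */
struct worker *server_idle(struct server *s)
{
        struct server *expected = NULL;
        struct worker *batch;

        pthread_mutex_lock(&idle_lock);

        if (!atomic_compare_exchange_strong(&idle_server_ptr, &expected, s)) {
                /* Another server already sits in idle_server_ptr: queue, sleep. */
                idle_servers_push(s);
                pthread_mutex_unlock(&idle_lock);
                server_sleep(s);
                return NULL;
        }

        if (!atomic_load(&runnable_list_head)) {
                /* Nothing runnable; stay advertised in idle_server_ptr, sleep. */
                pthread_mutex_unlock(&idle_lock);
                server_sleep(s);
                return NULL;
        }

        /* Grab the whole runnable worker list at once. */
        batch = atomic_exchange(&runnable_list_head, NULL);

        /* Either we clear idle_server_ptr here, or a waking worker already did. */
        atomic_exchange(&idle_server_ptr, NULL);

        /* Pick the next idle server S1, if any, to take over the idle slot. */
        struct server *s1 = idle_servers_pop();

        pthread_mutex_unlock(&idle_lock);

        /* S1 goes through all of these steps itself. */
        if (s1)
                wake_server(s1);

        return batch;
}

The critical section is just the few loads/stores between the
lock/unlock above, which is why it is so short; the sleeping and the
wakeup of S1 happen outside the lock.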