From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id DB437C433F5
	for <linux-mm@archiver.kernel.org>; Mon,  6 Dec 2021 11:32:49 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 2683F6B007B; Mon,  6 Dec 2021 06:32:39 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 218146B007D; Mon,  6 Dec 2021 06:32:39 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 0E0136B007E; Mon,  6 Dec 2021 06:32:39 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0056.hostedemail.com [216.40.44.56])
	by kanga.kvack.org (Postfix) with ESMTP id F25526B007B
	for <linux-mm@kvack.org>; Mon,  6 Dec 2021 06:32:38 -0500 (EST)
Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id B8721884B9
	for <linux-mm@kvack.org>; Mon,  6 Dec 2021 11:32:28 +0000 (UTC)
X-FDA: 78887156376.22.A5D41E0
Received: from casper.infradead.org (casper.infradead.org [90.155.50.34])
	by imf14.hostedemail.com (Postfix) with ESMTP id 0B7ED6001992
	for <linux-mm@kvack.org>; Mon,  6 Dec 2021 11:32:27 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=bbPon1x1Cb+zVlWyG5Z9rwl6b0rpya4JWPKDzhX3GGU=; b=slQSxZZ0VZCUUEAGC0vtDR2sDH
	U+NTHEXwuB+ok1uaEzNUl3w52lNSWZCB1uQhfYnK/RuUHzvOiqDxEW1TnGsVEk/88b5qXl3yANpEs
	BWIK+ZnfRY/xehsgUWEpuSYF2qtj7vuI0ld/9IXEGANRvm/lALukgKS4IKUj7bAels1mXx2R28gi0
	fmsoqlYhgmVCq7UNfT+R7pZdGZKTepKyKmte8oiYifBFLBFg+oFe11FLtFtAzcaKFVMX6Gu4JPI/0
	R4c4VIubeVXOiQGIOyFsb0U6PSF/uM7GH6tlipCscYbok8t9VTglxIdkNpnEWfxxYlE04NZKfRh9A
	Ck/GVhLQ==;
Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net)
	by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux))
	id 1muCEJ-004GN4-0N; Mon, 06 Dec 2021 11:32:23 +0000
Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 508133002DB;
	Mon,  6 Dec 2021 12:32:22 +0100 (CET)
Received: by hirez.programming.kicks-ass.net (Postfix, from userid 1000)
	id 35C57202CABA2; Mon,  6 Dec 2021 12:32:22 +0100 (CET)
Date: Mon, 6 Dec 2021 12:32:22 +0100
From: Peter Zijlstra <peterz@infradead.org>
To: Peter Oskolkov <posk@posk.io>
Cc: Ingo Molnar <mingo@redhat.com>, Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>, Peter Oskolkov <posk@google.com>,
	Andrei Vagin <avagin@google.com>, Jann Horn <jannh@google.com>,
	Thierry Delisle <tdelisle@uwaterloo.ca>
Subject: Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
Message-ID: <Ya30xsrQnwyT/R92@hirez.programming.kicks-ass.net>
References: <20211122211327.5931-1-posk@google.com>
 <20211122211327.5931-4-posk@google.com>
 <20211124200822.GF721624@worktop.programming.kicks-ass.net>
 <CAFTs51Uka8VRCHuGidw7mRwATufp87U6S8SWUVod_kU-h6T3ew@mail.gmail.com>
 <YaEUts3RbOLyvAjl@hirez.programming.kicks-ass.net>
 <CAFTs51XnN+N74i1XHvRUAUWd04-Fs9uV6ouXo=CQSQs8MaEM5A@mail.gmail.com>
 <YaUCoe07Wl9Stlch@hirez.programming.kicks-ass.net>
 <CAFTs51UzR=m6+vcjTCNOGwGu3ZwB5GMrg+cSQy2ecvCWxhZvEQ@mail.gmail.com>
 <20211129210841.GO721624@worktop.programming.kicks-ass.net>
 <CAFTs51XyGDNj89+FCn4HZqMHuenjQu2wqTOW8ow4hSUbdGrGhw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAFTs51XyGDNj89+FCn4HZqMHuenjQu2wqTOW8ow4hSUbdGrGhw@mail.gmail.com>
X-Rspamd-Server: rspam07
X-Rspamd-Queue-Id: 0B7ED6001992
X-Stat-Signature: 7dj7pfom4pb9wbbcg1nk8o4rznud6t11
Authentication-Results: imf14.hostedemail.com;
	dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=slQSxZZ0;
	spf=none (imf14.hostedemail.com: domain of peterz@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=peterz@infradead.org;
	dmarc=none
X-HE-Tag: 1638790347-16972
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


Sorry, I haven't been feeling too well and as such procastinated on this
because thinking is required :/ Trying to pick up the bits.

On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > > >
> > > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > > supposed to do.
> [...]
> >
> > So then A does:
> >
> >         A::next_tid = C.tid;
> >         sys_umcg_wait();
> >
> > Which will:
> >
> >         pin(A);
> >         pin(S0);
> >
> >         cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> Hmm.... That's another difference between your patch and mine: my
> approach was "the side that initiates the change updates the state".
> So in my code the userspace changes the current task's state RUNNING
> => RUNNABLE and the next task's state,

I couldn't make that work for wakeups; when a thread blocks in a
random syscall there is no userspace to wake the next thread. And since
it seems required in this case, it's easier and more consistent to
always do it.

> or the server's state, RUNNABLE
> => RUNNING before calling sys_umcg_wait().

Yes, this is indeed required; I've found the same when trying to build
the userspace server loop. And yes, I'm starting to see where you're
coming from.

> I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
> A transfers its server to worker B that A intends to context switch

	S0 running A

Therefore:

	S0::state == RUNNABLE
	A::server_tid = S0.tid
	A::state == RUNNING

you want A to switch to B, therefore:

	B::state == RUNNABLE

if B is not yet on S0 then:

	B::server_tid = S0.tid;

finally:

0:
	A::next_tid = B.tid;
1:
	A::state = RUNNABLE:
2:
	sys_umcg_wait();
3:

> into, and then worker A pagefaults or gets interrupted before calling
> sys_umcg_wait()?

So the problem is tripping umcg_notify_resume() on the labels 1 and 2,
right? tripping it on 0 and 3 is trivially correct.

If we trip it on 1 and !(A::state & TG_PREEMPT), then nothing, since
::state == RUNNING we'll just continue onwards and all is well. That is,
nothing has happened yet.

However, if we trip it on 2: we're screwed. Because at that point
::state is scribbled.

> The server will be woken up and will see that it is
> assigned to worker B; now what? If worker A is "locked" before the
> whole thing starts, the pagefault/interrupt will not trigger
> block/wake detection, worker A will keep RUNNING for all intended
> purposes, and eventually will call sys_umcg_wait() as it had
> intended...

No, the failure case is different; umcg_notify_resume() will simply
block A until someone sets A::state == RUNNING and kicks it, which will
be no-one.

Now, the above situation is actually simple to fix, but it gets more
interesting when we're using sys_umcg_wait() to build wait primitives.
Because in that case we get stuff like:

	for (;;) {
		self->state = RUNNABLE;
		smp_mb();
		if (cond)
			break;
		sys_umcg_wait();
	}
	self->state = RUNNING;

And we really need to not block and also not do sys_umcg_wait() early.

So yes, I agree that we need a special case here that ensures
umcg_notify_resume() doesn't block. Let me ponder naming and comments.
Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Now, obviously if you do a random syscall anywhere around here, you get
to keep the pieces :-)


I've also added ::next_tid to the whole umcg_pin_pages() thing, and made
it so that ::next_tid gets cleared when it's been used. That way things
like:

	self->next_tid = pick_from_runqueue();
	sys_that_is_expected_to_sleep();
	if (self->next_tid) {
		return_to_runqueue(self->next_tid);
		self->next_tid = 0;
	}

Are much simpler to manage. Either it did sleep and ::next_tid is
consumed, or it didn't sleep and it needs to be returned to the
runqueue.