From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB437C433F5 for ; Mon, 6 Dec 2021 11:32:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2683F6B007B; Mon, 6 Dec 2021 06:32:39 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 218146B007D; Mon, 6 Dec 2021 06:32:39 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0E0136B007E; Mon, 6 Dec 2021 06:32:39 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0056.hostedemail.com [216.40.44.56]) by kanga.kvack.org (Postfix) with ESMTP id F25526B007B for ; Mon, 6 Dec 2021 06:32:38 -0500 (EST) Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id B8721884B9 for ; Mon, 6 Dec 2021 11:32:28 +0000 (UTC) X-FDA: 78887156376.22.A5D41E0 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) by imf14.hostedemail.com (Postfix) with ESMTP id 0B7ED6001992 for ; Mon, 6 Dec 2021 11:32:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=bbPon1x1Cb+zVlWyG5Z9rwl6b0rpya4JWPKDzhX3GGU=; b=slQSxZZ0VZCUUEAGC0vtDR2sDH U+NTHEXwuB+ok1uaEzNUl3w52lNSWZCB1uQhfYnK/RuUHzvOiqDxEW1TnGsVEk/88b5qXl3yANpEs BWIK+ZnfRY/xehsgUWEpuSYF2qtj7vuI0ld/9IXEGANRvm/lALukgKS4IKUj7bAels1mXx2R28gi0 fmsoqlYhgmVCq7UNfT+R7pZdGZKTepKyKmte8oiYifBFLBFg+oFe11FLtFtAzcaKFVMX6Gu4JPI/0 R4c4VIubeVXOiQGIOyFsb0U6PSF/uM7GH6tlipCscYbok8t9VTglxIdkNpnEWfxxYlE04NZKfRh9A Ck/GVhLQ==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1muCEJ-004GN4-0N; Mon, 06 Dec 2021 11:32:23 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 508133002DB; Mon, 6 Dec 2021 12:32:22 +0100 (CET) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 1000) id 35C57202CABA2; Mon, 6 Dec 2021 12:32:22 +0100 (CET) Date: Mon, 6 Dec 2021 12:32:22 +0100 From: Peter Zijlstra To: Peter Oskolkov Cc: Ingo Molnar , Thomas Gleixner , Andrew Morton , Dave Hansen , Andy Lutomirski , Linux Memory Management List , Linux Kernel Mailing List , linux-api@vger.kernel.org, Paul Turner , Ben Segall , Peter Oskolkov , Andrei Vagin , Jann Horn , Thierry Delisle Subject: Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Message-ID: References: <20211122211327.5931-1-posk@google.com> <20211122211327.5931-4-posk@google.com> <20211124200822.GF721624@worktop.programming.kicks-ass.net> <20211129210841.GO721624@worktop.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 0B7ED6001992 X-Stat-Signature: 7dj7pfom4pb9wbbcg1nk8o4rznud6t11 Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=slQSxZZ0; spf=none (imf14.hostedemail.com: domain of peterz@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=peterz@infradead.org; dmarc=none X-HE-Tag: 1638790347-16972 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Sorry, I haven't been feeling too well and as such procastinated on this because thinking is required :/ Trying to pick up the bits. On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote: > On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra wrote: > [...] > > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I > > > > > > > > OOh yes, I forgot to mention that. I couldn't figure out what it was > > > > supposed to do. > [...] > > > > So then A does: > > > > A::next_tid = C.tid; > > sys_umcg_wait(); > > > > Which will: > > > > pin(A); > > pin(S0); > > > > cmpxchg(A::state, RUNNING, RUNNABLE); > > Hmm.... That's another difference between your patch and mine: my > approach was "the side that initiates the change updates the state". > So in my code the userspace changes the current task's state RUNNING > => RUNNABLE and the next task's state, I couldn't make that work for wakeups; when a thread blocks in a random syscall there is no userspace to wake the next thread. And since it seems required in this case, it's easier and more consistent to always do it. > or the server's state, RUNNABLE > => RUNNING before calling sys_umcg_wait(). Yes, this is indeed required; I've found the same when trying to build the userspace server loop. And yes, I'm starting to see where you're coming from. > I'm still not sure we can live without UMCG_TF_LOCKED. What if worker > A transfers its server to worker B that A intends to context switch S0 running A Therefore: S0::state == RUNNABLE A::server_tid = S0.tid A::state == RUNNING you want A to switch to B, therefore: B::state == RUNNABLE if B is not yet on S0 then: B::server_tid = S0.tid; finally: 0: A::next_tid = B.tid; 1: A::state = RUNNABLE: 2: sys_umcg_wait(); 3: > into, and then worker A pagefaults or gets interrupted before calling > sys_umcg_wait()? So the problem is tripping umcg_notify_resume() on the labels 1 and 2, right? tripping it on 0 and 3 is trivially correct. If we trip it on 1 and !(A::state & TG_PREEMPT), then nothing, since ::state == RUNNING we'll just continue onwards and all is well. That is, nothing has happened yet. However, if we trip it on 2: we're screwed. Because at that point ::state is scribbled. > The server will be woken up and will see that it is > assigned to worker B; now what? If worker A is "locked" before the > whole thing starts, the pagefault/interrupt will not trigger > block/wake detection, worker A will keep RUNNING for all intended > purposes, and eventually will call sys_umcg_wait() as it had > intended... No, the failure case is different; umcg_notify_resume() will simply block A until someone sets A::state == RUNNING and kicks it, which will be no-one. Now, the above situation is actually simple to fix, but it gets more interesting when we're using sys_umcg_wait() to build wait primitives. Because in that case we get stuff like: for (;;) { self->state = RUNNABLE; smp_mb(); if (cond) break; sys_umcg_wait(); } self->state = RUNNING; And we really need to not block and also not do sys_umcg_wait() early. So yes, I agree that we need a special case here that ensures umcg_notify_resume() doesn't block. Let me ponder naming and comments. Either a TF_COND_WAIT or a whole new state. I can't decide yet. Now, obviously if you do a random syscall anywhere around here, you get to keep the pieces :-) I've also added ::next_tid to the whole umcg_pin_pages() thing, and made it so that ::next_tid gets cleared when it's been used. That way things like: self->next_tid = pick_from_runqueue(); sys_that_is_expected_to_sleep(); if (self->next_tid) { return_to_runqueue(self->next_tid); self->next_tid = 0; } Are much simpler to manage. Either it did sleep and ::next_tid is consumed, or it didn't sleep and it needs to be returned to the runqueue.