Date: Wed, 12 Oct 2022 11:49:20 -0700
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Johannes Weiner
Cc: Alexander Fedorov, Michal Hocko, Shakeel Butt, Vladimir Davydov,
    Muchun Song, Sebastian Andrzej Siewior, cgroups@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: Possible race in obj_stock_flush_required() vs drain_obj_stock()
References: <821923d8-17c3-f1c2-4d6a-5653c88db3e8@gmail.com>
            <2f9bdffd-062e-a364-90c4-da7f09c95619@gmail.com>
On Wed, Oct 12, 2022 at 01:23:11PM -0400, Johannes Weiner wrote:
> On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> > On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > > On 03.10.2022 17:27, Michal Hocko wrote:
> > > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > > >>>>  		stock->nr_bytes = 0;
> > > >>>>  	}
> > > >>>>
> > > >>>> -	obj_cgroup_put(old);
> > > >>>> +	/*
> > > >>>> +	 * Clear pointer before freeing memory so that
> > > >>>> +	 * drain_all_stock() -> obj_stock_flush_required()
> > > >>>> +	 * does not see a freed pointer.
> > > >>>> +	 */
> > > >>>>  	stock->cached_objcg = NULL;
> > > >>>> +	obj_cgroup_put(old);
> > > >>>
> > > >>> Do we need barrier() or something else to ensure there is no reordering?
> > > >>> I am not really sure what kind of barriers are implied by the pcp ref
> > > >>> counting.
> > > >>
> > > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > > >> of this:
> > > >
> > > > This is a very subtle guarantee. Also it would only apply if this is the
> > > > last reference, right?
> > >
> > > Hmm, yes, for the last reference only, also not sure about pcp ref
> > > counter ordering rules for previous references.
> > >
> > > > Is there any reason to not use
> > > > 	WRITE_ONCE(stock->cached_objcg, NULL);
> > > > 	obj_cgroup_put(old);
> > > >
> > > > IIRC this should prevent any reordering.
> > >
> > > Now that I think about it we actually must use WRITE_ONCE everywhere
> > > when writing cached_objcg because otherwise the compiler might split the
> > > pointer-sized store into several smaller-sized ones (store tearing),
> > > and obj_stock_flush_required() would read garbage instead of a pointer.
> > >
> > > And thinking about memory barriers, maybe we need them too alongside
> > > WRITE_ONCE when setting the pointer to a non-null value? Otherwise
> > > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > > Since that's exactly what rcu_assign_pointer() does, it seems
> > > that we are going back to using rcu_*() primitives everywhere?
> >
> > Hm, Idk, I'm still somewhat resistant to the idea of putting in rcu
> > primitives, but maybe it's the right thing. Maybe instead we should always
> > schedule draining on all cpus and perform a cpu-local check, bailing out
> > if a flush is not required? Michal, Johannes, what do you think?
>
> I agree it's overkill.
>
> This is a speculative check, and we don't need any state coherency,
> just basic lifetime. READ_ONCE should fully address this problem. That
> said, I think the code could be a bit clearer and better documented.
>
> How about the below?

I'm fine with using READ_ONCE() to fix this immediate issue (I suggested it
in the thread above), please feel free to add my ack:

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

We might need a barrier() between zeroing stock->cached and dropping the
last reference, as discussed above, but I don't think this issue can
realistically be triggered in real life.

That said, I think our overall approach to flushing is questionable:
1) we often don't flush when it's necessary: if there is a concurrent
   flushing we just bail out, even if that flushing is related to a
   completely different part of the cgroup tree (e.g. a leaf node
   belonging to a distant branch).
2) we can race and flush when it's not necessary: if another cpu is busy,
   by the time the work gets executed another memcg will likely already be
   cached.

So IMO we need to move this check into the flushing thread. I'm working on
a different approach, but it will take time and will also likely be too
invasive for @stable, so fixing the crash discovered by Alexander with
READ_ONCE() is a good idea.

Thanks!