From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 25 Aug 2023 09:05:46 +0200
From: Michal Hocko
To: Yosry Ahmed
Cc: Andrew Morton , Johannes Weiner , Roman Gushchin , Shakeel Butt , Muchun Song , Ivan Babrou , Tejun Heo , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
Message-ID:
References: <20230821205458.1764662-1-yosryahmed@google.com> <20230821205458.1764662-4-yosryahmed@google.com>
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Thu 24-08-23 11:50:51, Yosry Ahmed wrote:
> On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed wrote:
> >
> > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko wrote:
> > >
> > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko wrote:
> > > > >
> > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko wrote:
> > > > > > >
> > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > So to answer your question, I don't think a random user can really
> > > > > > affect the system in a significant way by constantly flushing. In
> > > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > > interested), there are hundreds of threads that are reading stats of
> > > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > > in-kernel flushers in this case (reclaimers).
> > > > >
> > > > > I suspect you have missed my point.
> > > >
> > > > I suspect you are right :)
> > > >
> > > > > Maybe I am just misunderstanding
> > > > > the code but it seems to me that the lock dropping inside
> > > > > cgroup_rstat_flush_locked effectively allows an unbounded number of
> > > > > contenders, which is really dangerous when it is triggerable from
> > > > > userspace. The number of spinners at any moment is always bounded by
> > > > > the number of CPUs, but depending on timing many potential spinners
> > > > > might be cond_resched'ed and the worst-case latency to complete can
> > > > > be really high.
> > > > > Makes more sense?
> > > >
> > > > I think I understand better now. So basically because we might drop
> > > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > > that are currently scheduled away, so these will need to wait to be
> > > > scheduled and then start spinning on the lock. This may happen for one
> > > > reader multiple times during its read, which is what can cause a high
> > > > worst-case latency.
> > > >
> > > > I hope I understood you correctly this time. Did I?
> > >
> > > Yes. I would just add that this could also influence the worst-case
> > > latency for a different reader - so an adversarial user can stall others.
> >
> > I can add that for v2 to the commit log, thanks.
> >
> > > Exposing a shared global lock in an uncontrollable way over a generally
> > > available user interface is not really a great idea IMHO.
> >
> > I think that's how it was always meant to be when it was designed. The
> > global rstat lock has always existed and was always available to
> > userspace readers. The memory controller took a different path at some
> > point with unified flushing, but that was mainly because of high
> > concurrency from in-kernel flushers, not because userspace readers
> > caused a problem. Outside of memcg, the core cgroup code has always
> > exercised this global lock when reading cpu.stat since rstat's
> > introduction. I assume there haven't been any problems since it's still
> > there.

I suspect nobody has just considered malfunctioning or adversarial
workloads so far.

> I was hoping Tejun would confirm/deny this.

Yes, that would be interesting to hear.

> One thing we can do to remedy this situation is to replace the global
> rstat lock with a mutex, and drop the resched/lock dropping condition.
> Tejun suggested this in the previous thread. This effectively reverts
> 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> since now all the flushing contexts are sleepable.
I would have a very daring question. Do we really need a global lock in
the first place? AFAIU this lock serializes (kind of, as the lock can
be dropped midway) flushers and cgroup_rstat_flush_hold/release callers
(a single one ATM). I can see that cgroup_base_stat_cputime_show would
like to have a consistent view of multiple stats, but can we live with
a weaker guarantee, or replace the lock with a seqlock instead?

> My synthetic stress test does not show any regressions with mutexes,
> and there is a small boost to reading latency (probably because we
> stop dropping the lock / rescheduling). Not sure if we may start
> seeing need_resched warnings on big flushes though.

Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a
spinlock") it seems the point of moving away from the mutex was to
have a more usable API.

> One other concern that Shakeel pointed out to me is preemption. If
> someone holding the mutex gets preempted this may starve other
> waiters. We can disable preemption while we hold the mutex, not sure
> if that's a common pattern though.

No, not really. It is expected that the holder of a mutex can sleep
and can be preempted as well.

I might be wrong but the whole discussion so far suggests that the
global rstat lock should be reconsidered. From my personal experience,
global locks easily triggerable from userspace are just a recipe for
problems. Stats reading should interfere with the system runtime as
little as possible, and it should be deterministic wrt runtime as well.
-- 
Michal Hocko
SUSE Labs