From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 29 Aug 2023 12:13:31 -0700
Subject: Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
To: Tejun Heo
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Aug 29, 2023 at 11:44 AM Tejun Heo wrote:
>
> Hello,
>
> On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote:
> > > > I think that's how it was always meant to be when it was
> > > > designed. The global rstat lock has always existed and was always
> > > > available to userspace readers. The memory controller took a
> > > > different path at some point with unified flushing, but that was
> > > > mainly because of high concurrency from in-kernel flushers, not
> > > > because userspace readers caused a problem. Outside of memcg, the
> > > > core cgroup code has always exercised this global lock when
> > > > reading cpu.stat since rstat's introduction. I assume there
> > > > hasn't been any problems since it's still there.
> >
> > I suspect nobody has just considered a malfunctioning or adversary
> > workloads so far.
> >
> > > > I was hoping Tejun would confirm/deny this.
> >
> > Yes, that would be interesting to hear.
>
> So, the assumptions in the original design were:
>
> * Writers are high freq but readers are lower freq and can block.
>
> * The global lock is mutex.
>
> * Back-to-back reads won't have too much to do because it only has to
>   flush what's been accumulated since the last flush which took place
>   just before.
>
> It's likely that the userspace side is gonna be just fine if we
> restore the global lock to be a mutex and let them be. Most of the
> problems are caused by trying to allow flushing from non-sleepable
> and kernel contexts.

So basically restore the flush without disabling preemption, and if a
userspace reader gets preempted while holding the mutex it's probably
okay because we won't have high concurrency among userspace readers?

I think Shakeel was worried that this may cause a priority inversion
where a low-priority task is preempted while holding the mutex,
preventing high-priority tasks from acquiring it to read the stats and
take actions (e.g. userspace OOMs). A rough sketch of that scenario is
at the end of this mail.

> Would it make sense to distinguish what can and can't wait and make
> the latter group always use cached value? e.g. even in kernel, during
> oom kill, waiting doesn't really matter and it can just wait to obtain
> the up-to-date numbers.

The problem with waiting for in-kernel flushers is that high
concurrency leads to terrible serialization. Running a stress test with
hundreds of reclaimers where everyone waits shows a ~3x slowdown.

This patch series is indeed trying to formalize a distinction between
waiters who can wait and those who can't on the memcg side (roughly
sketched in code at the end of this mail):

- Unified flushers always flush the entire tree, and only flush if no
one else is flushing (no waiting); otherwise they use cached data and
hope the concurrent flushing helps. This is what we currently do for
most memcg contexts. This patch series opts userspace reads out of it.

- Non-unified flushers only flush the subtree they care about, and they
wait if there are other flushers. This is what we currently do for some
zswap accounting code. This patch series opts userspace readers into
this.

The problem Michal is raising is that dropping the lock can lead to an
unbounded number of waiters and a longer worst-case latency, especially
since this is directly influenced by userspace. Reintroducing the mutex
and removing the lock-dropping code fixes that problem, but then if the
mutex holder gets preempted we face the priority-inversion problem
above.

Personally, I think there is a good chance there won't be userspace
latency problems due to the lock, since there usually isn't high
concurrency among userspace readers, and we can deal with that problem
if/when it happens. So far no problems have shown up for cpu.stat,
which has the same potential issue.
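To make the inversion scenario concrete, here is a minimal sketch
assuming the global rstat lock is restored to a sleeping mutex. The
names (cgroup_rstat_mutex, flush_for_read) are hypothetical, not actual
kernel symbols:

/* Hypothetical sleeping lock around the global flush (illustrative only). */
static DEFINE_MUTEX(cgroup_rstat_mutex);

static void flush_for_read(struct cgroup *cgrp)
{
	mutex_lock(&cgroup_rstat_mutex);
	/*
	 * If a low-priority reader is preempted here, a high-priority
	 * reader (e.g. a userspace OOM killer sampling memory.stat)
	 * blocks behind it until the low-priority task runs again.
	 */
	cgroup_rstat_flush(cgrp);
	mutex_unlock(&cgroup_rstat_mutex);
}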
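And a rough sketch of the unified vs. non-unified distinction above.
Again the function names are illustrative, and details such as the
flush threshold and periodic flushing are omitted:

/* Unified: flush the whole tree, but never wait on a concurrent flusher. */
static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);

static void flush_stats_unified(void)
{
	/* Someone else is flushing the whole tree; rely on cached data. */
	if (atomic_xchg(&stats_flush_ongoing, 1))
		return;
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_ongoing, 0);
}

/* Non-unified: flush only the subtree of interest, waiting if necessary. */
static void flush_stats_non_unified(struct mem_cgroup *memcg)
{
	/* May contend with other flushers on the global rstat lock. */
	cgroup_rstat_flush(memcg->css.cgroup);
}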