From: Yosry Ahmed
Date: Mon, 14 Aug 2023 17:39:15 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Tejun Heo
Cc: Michal Hocko, Shakeel Butt, Johannes Weiner, Roman Gushchin,
 Andrew Morton, Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org

On Mon, Aug 14, 2023 at 5:35 PM Tejun Heo wrote:
>
> Hello,
>
> On Mon, Aug 14, 2023 at 05:28:22PM -0700, Yosry Ahmed wrote:
> > > So, the original design used a mutex to synchronize flushing, with the
> > > idea being that updates are high freq but reads are low freq and can be
> > > relatively slow.
> > > Using rstat for mm-internal operations changed this
> > > assumption quite a bit, and we ended up switching that mutex to a
> > > spinlock.
> >
> > Naive question, do mutexes handle thundering herd problems better than
> > spinlocks? I would assume so, but I am not sure.
>
> I don't know. We can ask Waiman if that becomes a problem.
>
> > > * Flush-side, maybe we can break flushing into per-cpu or whatnot, but
> > >   there's no avoiding the fact that flushing can take quite a while if
> > >   there is a lot to flush, whether locks are split or not. I wonder
> > >   whether it'd be possible to go back to a mutex for flushing and update
> > >   the users to either consume the cached values or operate in a
> > >   sleepable context if a synchronous read is necessary, which is the
> > >   right thing to do anyway given how long flushes can take.
> >
> > Unfortunately it cannot be broken down into per-cpu, as all flushers
> > update the same per-cgroup counters, so we need a bigger locking
> > scope. Switching to atomics really hurts performance. Breaking the
> > lock down to be per-cgroup is doable, but since we need to lock both
> > the cgroup and its parent, flushing top-level cgroups (which I assume
> > is most common) will take the root's lock anyway.
>
> Plus, there's not much point in flushing in parallel, so I don't feel too
> enthusiastic about splitting flush locking.
>
> > All flushers right now operate in sleepable context, so we can go
> > back to the mutex if you think this will make things better. The
>
> Yes, I think that'd be more sane.
>
> > slowness problem reported recently is in a sleepable context; it's
> > just too slow for userspace, if I understand correctly.
>
> I mean, there's a certain amount of work to do. There's no way around it
> if you want to read the counters synchronously. The only solution there
> would be using a cached value or having some sort of auto-flushing
> mechanism so that the amount to flush doesn't build up too much - e.g.
> keep a count of the number of entries to flush and trigger a flush if it
> goes over some threshold.

I really hoped you'd continue reading past this point :)

My proposed solution was to only flush the needed subtree rather than
flushing the entire tree all the time, which is what we do now on the
memcg side. We already have an asynchronous flusher on the memcg side
that runs every 2s to try to keep the tree size bounded, and we already
keep track of the magnitude of updates and only flush if it's
significant.

The problems in this thread and the other one are:

(a) Sometimes reading from userspace is slow, because we needlessly
    flush the entire tree.

(b) Sometimes reading from userspace is inaccurate, because we skip
    flushing if someone else is flushing, even though we don't know
    whether they have flushed the subtree we care about yet.

I believe dropping unified flushing, if possible of course, may fix
both problems.

> Thanks.
>
> --
> tejun
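
For concreteness, the update-magnitude tracking we already have on the
memcg side (the same shape as the threshold mechanism described above)
looks roughly like this. This is a simplified sketch; the names and the
batch constant are illustrative, not the exact mm/memcontrol.c code:

/* Illustrative batch size, not the real kernel constant. */
#define STATS_FLUSH_BATCH	64

static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);

/* Called from the stat update path with the magnitude of the update. */
static inline void memcg_note_update(int val)
{
	unsigned int x;

	x = __this_cpu_add_return(stats_updates, abs(val));
	if (x > STATS_FLUSH_BATCH) {
		/*
		 * Fold the per-cpu count into a global estimate of how
		 * much unflushed work has accumulated.
		 */
		atomic_add(x / STATS_FLUSH_BATCH, &stats_flush_threshold);
		__this_cpu_write(stats_updates, 0);
	}
}

/* Readers and the periodic 2s worker only pay for a flush when enough
 * updates have built up since the last one. */
static void memcg_flush_if_significant(void)
{
	if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
		atomic_set(&stats_flush_threshold, 0);
	}
}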
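
Dropping unified flushing would then amount to something like the
following (hypothetical helper, for illustration only):

/*
 * Flush only the subtree a reader actually asked about, instead of
 * always flushing the whole tree from the root.
 */
static void mem_cgroup_flush_stats_subtree(struct mem_cgroup *memcg)
{
	/*
	 * Unified flushing today: everyone flushes from the root,
	 *
	 *	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	 *
	 * and skips entirely if another flusher is running, which is
	 * where the inaccuracy in (b) comes from.
	 */
	cgroup_rstat_flush(memcg->css.cgroup);
}

The cost becomes proportional to the subtree being read, and a reader
can no longer return before its own subtree has been flushed, which is
why I think it addresses both (a) and (b).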