From: Yosry Ahmed
Date: Mon, 10 Oct 2022 17:15:33 -0700
Subject: Re: [RFC] memcg rstat flushing optimization
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt, Roman Gushchin, Michal Koutný, Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 11:38 AM Yosry Ahmed wrote:
>
> On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo wrote:
> >
> > Hello,
> >
> > On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > > I was thinking more that being done inside the flush function.
> > >
> > > I think the flush function already does that in some sense if
> > > might_sleep is true, right? The problem here is that we are using
> >
> > Oh I forgot about that. Right.
> >
> > ...
> > > I took a couple of crashed machines kdumps and ran a script to
> > > traverse updated memcgs and check how many cpus have updates and how
> > > many updates are there on each cpu.
> > > I found that on average only a
> > > couple of stats are updated per-cpu per-cgroup, and on fewer than 25% of
> > > cpus (but this is on a large machine, I expect the number to go higher
> > > on smaller machines). Which is why I suggested a bitmask. I understand
> > > though that this depends on whatever workloads were running on those
> > > machines, and that in cases where most stats are updated the bitmask
> > > will actually make things slightly worse.
> >
> > One worry I have about selective flushing is that it's only gonna improve
> > things by some multiples while we can reasonably increase the problem size
> > by orders of magnitude.
>
> I think we would usually want to flush a few stats (< 5?) in irqsafe
> contexts out of over 100, so I would say the improvement would be
> good, but yeah, the problem size can reasonably increase more than
> that. It also depends on which stats we selectively flush. If they are
> not in the same cache line we might end up bringing in a lot of stats
> anyway into the cpu cache.
>
> >
> > The only real ways out I can think of are:
> >
> > * Implement a periodic flusher which keeps the stats needed in irqsafe path
> >   acceptably up to date to avoid flushing with irqs disabled. We can make
> >   this adaptive too - no reason to do all this if the number to flush
> >   isn't huge.
>
> We do have a periodic flusher today for memcg stats (see
> flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stats(), which
> only flushes if the total number of updates is over a certain
> threshold.
> mem_cgroup_flush_stats_delayed(), which is called in the page fault
> path, only does a flush if the last flush was more than a certain
> interval ago. We don't use the delayed version in all irqsafe contexts
> though, and I am not the right person to tell if we can.
>
> But I think this is not what you meant.
> I think you meant only
> flushing the specific stats needed in irqsafe contexts more frequently
> and not invoking a flush at all in irqsafe contexts (or using
> mem_cgroup_flush_stats_delayed()?). Right?
>
> I am not the right person to judge what is acceptably up-to-date to be
> honest, so I would wait for other memcg folks to chime in on this.
>
> >
> > * Shift some work to the updaters. E.g., in many cases, propagating per-cpu
> >   updates a couple of levels up from the update path will significantly
> >   reduce the fanout and thus the number of entries which need to be
> >   flushed later. It does add on-going overhead, so it probably should be
> >   adaptive or configurable, hopefully the former.
>
> If we are adding overhead to the updaters, would it be better to
> maintain a bitmask of updated stats, or do you think it would be more
> effective to propagate updates a couple of levels up? I think to
> propagate updates up in the updaters' context we would need percpu
> versions of the "pending" stats, which would also add memory consumption.
>

Any thoughts here, Tejun or anyone?

> >
> > Thanks.
> >
> > --
> > tejun
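
To make the bitmask option above concrete, here is a rough userspace
sketch. It is not kernel code and not a proposed patch: every name in it
(stat_update, stat_flush, struct percpu_stats, the *_SKETCH constants) is
hypothetical, there is no locking or real per-cpu storage, and the sizes
are made up. It only shows the trade-off discussed in the thread: the
updater pays one extra OR per update, and the flusher visits only the
stats whose bit is set instead of walking all of them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_SKETCH   4    /* hypothetical, tiny for illustration */
#define NR_STATS_SKETCH  128  /* stands in for "over 100" stats */
#define BITS_PER_WORD    64
#define NR_WORDS_SKETCH  (NR_STATS_SKETCH / BITS_PER_WORD)

/* Per-cpu pending deltas plus one dirty bit per stat (hypothetical layout). */
struct percpu_stats {
	long delta[NR_STATS_SKETCH];
	uint64_t dirty[NR_WORDS_SKETCH];
};

struct cgroup_stats {
	long total[NR_STATS_SKETCH];
	struct percpu_stats cpu[NR_CPUS_SKETCH];
};

static void stats_init(struct cgroup_stats *cg)
{
	memset(cg, 0, sizeof(*cg));
}

/* Updater side: the added cost over a plain counter is a single OR. */
static void stat_update(struct cgroup_stats *cg, int cpu, int idx, long val)
{
	struct percpu_stats *pc;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	assert(idx >= 0 && idx < NR_STATS_SKETCH);

	pc = &cg->cpu[cpu];
	pc->delta[idx] += val;
	pc->dirty[idx / BITS_PER_WORD] |= UINT64_C(1) << (idx % BITS_PER_WORD);
}

/* Portable "find lowest set bit"; the caller guarantees x != 0. */
static int lowest_set_bit(uint64_t x)
{
	int n = 0;

	while (!(x & 1)) {
		x >>= 1;
		n++;
	}
	return n;
}

/*
 * Flusher side: fold only the dirty per-cpu entries into the totals.
 * Returns how many per-cpu entries were visited, so the saving compared
 * to NR_CPUS_SKETCH * NR_STATS_SKETCH is visible.
 */
static int stat_flush(struct cgroup_stats *cg)
{
	int visited = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		struct percpu_stats *pc = &cg->cpu[cpu];

		for (int w = 0; w < NR_WORDS_SKETCH; w++) {
			uint64_t bits = pc->dirty[w];

			while (bits) {
				int idx = w * BITS_PER_WORD + lowest_set_bit(bits);

				cg->total[idx] += pc->delta[idx];
				pc->delta[idx] = 0;
				bits &= bits - 1; /* clear the lowest set bit */
				visited++;
			}
			pc->dirty[w] = 0;
		}
	}
	return visited;
}
```

With a couple of stats touched on a minority of cpus, as in the kdump
numbers quoted above, stat_flush() visits a handful of entries rather than
every stat on every cpu. If most stats are dirty, the bitmask only adds
work, which is the "slightly worse" case already mentioned. The sketch
also ignores the cache-line concern: dirty stats that are scattered across
the delta array still pull in many cache lines.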