From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 7 Sep 2023 18:11:22 -0700
Subject: Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
To: Ivan Babrou
Cc: Wei Xu, Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
 Shakeel Butt, Muchun Song, Tejun Heo, Michal Koutný, Waiman Long,
 linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 Greg Thelen
References: <20230831165611.2610118-1-yosryahmed@google.com>
 <20230831165611.2610118-5-yosryahmed@google.com>

On Thu, Sep 7, 2023 at 6:03 PM Ivan Babrou wrote:
>
> On Thu, Sep 7, 2023 at 5:52 PM Wei Xu wrote:
> >
> > On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko wrote:
> > >
> > > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > > Unified flushing allows for great concurrency for paths that attempt to
> > > > flush the stats, at the expense of potential staleness and a single
> > > > flusher paying the extra cost of flushing the full tree.
> > > >
> > > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > > may be unexpected and problematic, especially when such stats are used
> > > > for critical paths such as userspace OOM handling. Additionally, a
> > > > userspace reader will occasionally pay the cost of flushing the entire
> > > > hierarchy, which also causes problems in some cases [1].
> > > >
> > > > Opt userspace reads out of unified flushing. This makes both the cost of
> > > > reading the stats (now proportional to the size of the subtree) and
> > > > their freshness more predictable. Userspace readers are not expected to
> > > > have concurrency similar to in-kernel flushers, so serializing them
> > > > among themselves and among in-kernel flushers should be okay.
> > > > Nonetheless, for extra safety, introduce a mutex when flushing for
> > > > userspace readers to make sure only a single userspace reader can
> > > > compete with in-kernel flushers at a time. This takes away userspace's
> > > > ability to directly influence or hurt in-kernel lock contention.
> > >
> > > I think it would be helpful to note that the primary reason this is a
> > > concern is that the spinlock is dropped during flushing under
> > > contention.
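(For reference, the userspace-reader serialization described in the quoted
text amounts to something like the sketch below. This is illustrative only,
not the literal patch: mem_cgroup_user_flush_stats() is the name Wei refers
to further down, but the mutex name and the exact flush call are
placeholders.)

/*
 * Serialize userspace readers behind a dedicated mutex so that at most
 * one of them competes with in-kernel flushers on the rstat lock.
 */
static DEFINE_MUTEX(stats_user_flush_mutex);

static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
{
        mutex_lock(&stats_user_flush_mutex);
        /* Non-unified: flush only this memcg's subtree, not the whole tree. */
        cgroup_rstat_flush(memcg->css.cgroup);
        mutex_unlock(&stats_user_flush_mutex);
}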
> > >
> > > > An alternative is to remove flushing from the stats reading path
> > > > completely, and rely on the periodic flusher. This should be accompanied
> > > > by making the periodic flushing period tunable, and providing an
> > > > interface for userspace to force a flush, following a similar model to
> > > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > > implementation needs to be changed, because:
> > > > - The cost of reading stats will be very cheap and we won't be able to
> > > >   take that back easily.
> > > > - There are user-visible interfaces involved.
> > > >
> > > > Hence, let's go with the change that's most reversible first and revisit
> > > > as needed.
> > > >
> > > > This was tested on a machine with 256 CPUs by running a synthetic test
> > > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > > unified flusher). Concurrently, one thread is spawned per cgroup to read
> > > > the stats every second (including root, top-level, and leaf cgroups --
> > > > 251 threads in total). No significant regressions were observed in the
> > > > total run time, which means that userspace readers are not significantly
> > > > affecting in-kernel flushers:
> > > >
> > > > Base (mm-unstable):
> > > >
> > > > real    0m22.500s
> > > > user    0m9.399s
> > > > sys     73m41.381s
> > > >
> > > > real    0m22.749s
> > > > user    0m15.648s
> > > > sys     73m13.113s
> > > >
> > > > real    0m22.466s
> > > > user    0m10.000s
> > > > sys     73m11.933s
> > > >
> > > > With this patch:
> > > >
> > > > real    0m23.092s
> > > > user    0m10.110s
> > > > sys     75m42.774s
> > > >
> > > > real    0m22.277s
> > > > user    0m10.443s
> > > > sys     72m7.182s
> > > >
> > > > real    0m24.127s
> > > > user    0m12.617s
> > > > sys     78m52.765s
> > > >
> > > > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > > >
> > > > Signed-off-by: Yosry Ahmed
> > >
> > > OK, I can live with that, but I still believe that locking involved in
> > > the user interface only begs for issues later on, as there is no control
> > > over that lock contention other than the number of processes involved.
> > > As it seems we cannot reach a consensus on this concern now, and this
> > > should already be helping existing workloads, let's just buy some
> > > more time ;)
> >
> > Indeed, even though the new global mutex protects the kernel from
> > userspace contention on the rstats spinlock, its current implementation
> > doesn't have any protection against lock contention among the userspace
> > threads themselves and can cause significant delays to memcg stats
> > reads.
> >
> > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > that spawns 10K threads, each reading its memory.stat every 100
> > milliseconds. Most memory.stat reads take 5ms-10ms in the kernel, with
> > ~5% of reads even exceeding 1 second. This is a significant regression.
> > In comparison, without contention, each memory.stat read only takes
> > 20us-50us in the kernel. Almost all of the extra read time is spent
> > waiting for the new mutex. The time to flush rstats only accounts
> > for 10us-50us (this test creates only 1K memory cgroups and doesn't
> > generate any load other than these stat reader threads).
> >
> > Here are some ideas to control the lock contention on the mutex and
> > reduce both the median and tail latencies of concurrent memcg stat
> > reads:

Thanks for the analysis, Wei! I will update the patch series based on
your ideas to limit the contention on the userspace read mutex.

> >
> > - Bring back the stats_flush_threshold check from
> > mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats(). This
> > check provides a reasonable bound on the stats staleness while being
> > able to filter out a large number of rstats flush requests, which
> > reduces the contention on the mutex.
> >
> > - When contended, upgrade the per-memcg rstats flush in
> > mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
> > contended flushes together. We can wait for the ongoing flush to
> > complete and eliminate repeated flush requests.
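For concreteness, Wei's first suggestion could look roughly like the sketch
below. stats_flush_threshold and mem_cgroup_user_flush_stats() are the names
used in the discussion above; the exact threshold condition is only an
approximation of the one the unified path uses.

static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
{
        /*
         * Skip the flush (and the mutex) entirely unless enough stat
         * updates have accumulated; this bounds staleness while filtering
         * out most flush requests from concurrent readers.
         */
        if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
                return;

        mutex_lock(&stats_user_flush_mutex);
        cgroup_rstat_flush(memcg->css.cgroup);
        mutex_unlock(&stats_user_flush_mutex);
}

The second suggestion would then kick in on contention (e.g. a failed
mutex_trylock()): upgrade to a single coalesced root flush and have waiters
simply wait for that flush to finish instead of issuing their own.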
>
> Full root memcg flush being slow is one of the issues that prompted this patch:
>
> * https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
>
> I don't want us to regress in this regard.

It will only be a fallback if there is high concurrency among userspace
reads, which would cause high contention on the mutex. In that case, the
userspace reads would be slowed down by contention anyway, which can be
even worse than flush slowness, as it is theoretically unbounded. I am
working on a v5 now to incorporate Wei's suggestions. Would you be able
to test it and verify that there are no regressions?

> >
> > - Wait for the mutex and the ongoing flush with a timeout. We should
> > not use busy-wait, though. We can bail out and read the stats without
> > a flush after waiting long enough. A long-stalled system call is much
> > worse than somewhat stale stats in the corner cases, in my opinion.
> >
> > Wei Xu
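(For completeness, the bounded wait described in that last point could look
roughly like the sketch below -- again purely illustrative; the 2ms budget,
the sleep interval, and the names are placeholders.)

static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
{
        /* Bound the wait instead of stalling the read indefinitely. */
        unsigned long deadline = jiffies + usecs_to_jiffies(2000);

        do {
                if (mutex_trylock(&stats_user_flush_mutex)) {
                        cgroup_rstat_flush(memcg->css.cgroup);
                        mutex_unlock(&stats_user_flush_mutex);
                        return;
                }
                usleep_range(50, 100);  /* sleep between attempts, no busy-wait */
        } while (time_before(jiffies, deadline));

        /* Timed out: return slightly stale stats rather than block longer. */
}

The tradeoff is explicit: a read never blocks for more than the budget, at
the cost of occasionally serving stats that are somewhat stale.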