From: Nhat Pham <nphamcs@gmail.com>
Date: Thu, 12 Jun 2025 14:32:53 -0700
Subject: Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
To: Kairui Song
Cc: youngjun.park@lge.com, linux-mm@kvack.org, akpm@linux-foundation.org,
 hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 shikemeng@huaweicloud.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
 muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
 gunho.lee@lge.com
References: <20250612103743.3385842-1-youngjun.park@lge.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Jun 12, 2025 at 5:24 AM Kairui Song wrote:
>
> On Thu, Jun 12, 2025 at 6:38 PM wrote:
> >
> > From: Youngjun Park
> >
> > Introduction
> > ============
> > I am a kernel developer working on platforms deployed on commercial
> > consumer devices. Due to real-world product requirements, I needed to
> > modify the Linux kernel to support a new swap management mechanism.
> > The proposed mechanism allows assigning different swap priorities to
> > swap devices per cgroup.
> > I believe this mechanism can be generally useful for similar
> > constrained-device scenarios, and I would like to propose it for
> > upstream inclusion and solicit feedback from the community.

We're mostly just using zswap and disk swap for now, so I don't have too
much input on this. Kairui, would this design satisfy your zram use case
as well?

> >
> > Motivation
> > ==========
> > The core requirement was to improve application responsiveness and
> > loading time, especially for latency-critical applications, without
> > increasing RAM or storage hardware resources.
> > Device constraints:
> > - Linux-based embedded platform
> > - Limited system RAM
> > - Small local swap
> > - No option to expand RAM or local swap
> > To mitigate this, we explored utilizing idle RAM and storage from
> > nearby devices as remote swap space. To maximize its effectiveness,
> > we needed the ability to control which swap devices were used by
> > different cgroups:
> > - Assign faster local swap devices to latency-critical apps
> > - Assign remote swap devices to background apps
> > However, the current Linux kernel swap infrastructure does not support
> > per-cgroup swap device assignment.
> > To solve this, I propose a mechanism that allows each cgroup to
> > specify its own swap device priorities.
> >
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> >    - Previously proposed upstream [1]
> >    - Challenges in managing global vs. per-cgroup swap state
> >    - Difficult to integrate with existing memory.limit / swap.max
> >      semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered something of a layering violation (block device
> >      cgroup awareness)
> >    - Swap devices are commonly meant to be physical block devices
> >    - A similar idea was mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage control**
> >    - Expand swap.max with zswap.writeback usage
> >    - Discussed in the context of zswap writeback [3]
> >    - Cannot express arbitrary priority orderings
> >      (e.g. swap priority A-B-C with cgroup order C-A-B is impossible)
> >    - Less flexible than the per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> >    - In short, a swap namespace for swap device priority
> >    - Overly complex for our use case
> >    - Cgroups are the natural scope for this mechanism
> >
> > Based on these findings, we chose to prototype per-cgroup swap
> > priority configuration as the most natural, least invasive extension
> > of the existing kernel mechanisms.
> >
> > Design and Semantics
> > ====================
> > - Each swap device gets a unique ID at `swapon` time
> > - Each cgroup has a `memory.swap.priority` interface:
> >   - Reading the interface shows each device's unique ID
> >   - Write format: `unique_id:priority,unique_id:priority,...`
> >   - All currently-active swap devices must be listed
> >   - Priorities follow existing swap infrastructure semantics
> > - The interface is writeable and updatable at runtime
> > - A priority configuration can be reset via
> >   `echo "" > memory.swap.priority`
> > - Swap on/off events propagate to all cgroups with priority
> >   configurations
> >
> > Example Usage
> > -------------
> > # swap devices on
> > $ swapon
> > NAME      TYPE      SIZE USED PRIO
> > /dev/sdb  partition 300M   0B   10
> > /dev/sdc  partition 300M   0B    5
> >
> > # assign custom priorities in a cgroup
> > $ echo "1:5,2:10" > memory.swap.priority
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb unique:1 prio:5
> > /dev/sdc unique:2 prio:10
> >
> > # add a new swap device later
> > $ swapon /dev/sdd --priority -1
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb unique:1 prio:5
> > /dev/sdc unique:2 prio:10
> > /dev/sdd unique:3 prio:-2
> >
> > # reset the cgroup priority
> > $ echo "" > memory.swap.priority
> > $ cat memory.swap.priority
> > Inactive
> > /dev/sdb unique:1 prio:10
> > /dev/sdc unique:2 prio:5
> > /dev/sdd unique:3 prio:-2
> >
> > Implementation Notes
> > ====================
> > The items below are to be addressed during the next round of patch
> > work:
> >
> > - Workaround using the per-CPU swap cluster, as before
> > - Priority propagation to child cgroups
> > - The remaining TODO and XXX items
> > - Refactoring for reviewability and maintainability, plus
> >   comprehensive testing and performance evaluation
>
> Hi Youngjun,
>
> Interesting idea. For your current approach, I think all we need is
> per-cgroup swap meta info structures (and infrastructure for
> maintaining and manipulating them).

Agreed.

>
> So we have a global version and a cgroup version of "plist, next
> cluster list, and maybe something else", right? And then, once the
> allocator is folio-aware, it can just prefer the cgroup ones (as I
> mentioned in another reply), reusing all the same other routines.
> Changes are minimal; the cgroup swap meta info and control plane are
> maintained separately.
>
> It seems to align quite well with what I wanted to do, and can be done
> in a clean and easy-to-maintain way.
>
> Meanwhile, with virtual swap, things could be even more flexible: not
> only changing the priority at swapout time, it would also provide the
> capability to migrate and balance devices adaptively, and solve
> long-term issues like mTHP fragmentation and min-order swapout, etc.

Agreed.

>
> Maybe they can be combined; for example, a cgroup could be limited to
> using the virtual device or the physical ones depending on priority.
> It all seems solvable. Just some ideas here.

100%

>
> Vswap can cover the priority part too. I think we might want to avoid
> duplicated interfaces.

Yeah, as long as we have a reasonable cgroup interface, we can always
change the implementation later. We can move things to virtual swap,
etc. at a later time.

>
> So I'm just imagining things now: would it be good if we had something
> like the following (based on your design)?
>
> $ cat memcg1/memory.swap.priority
> Active
> /dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5
>
> $ cat memcg2/memory.swap.priority
> Active
> /dev/vswap:/dev/nvme1 unique:1 prio:5
> /dev/vswap:/dev/nvme2 unique:2 prio:10
> /dev/vswap:/dev/vda unique:3 prio:15
> /dev/sda unique:4 prio:20
>
> $ cat memcg3/memory.swap.priority
> Active
> /dev/vda unique:3 prio:5
> /dev/sda unique:4 prio:15
>
> Meaning memcg1 (high priority) is allowed to use compressed memory
> only, through vswap; memcg2 (mid priority) uses disks through vswap
> with fallback to HDD; and memcg3 (low priority) is only allowed to use
> slow devices.
>
> Global fallback just uses everything the system has. It might be
> overly complex, though?

Sounds good to me.

> >
> > Future Work
> > ===========
> > These are items that would benefit from further consideration and
> > potential implementation:
> >
> > - Support for per-process (or other scopes of) swap prioritization

This might be too granular.

> > - Optional usage limits per swap device (e.g., ratio, max bytes)
> > - Generalizing the interface beyond cgroups
> >
> > References
> > ==========
> > [1] https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
> > [2] https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
> > [3] https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com
> >
> > All comments and feedback are greatly appreciated.
> > A patch will follow.
> >
> > Sincerely,
> > Youngjun Park
> >
> > youngjun.park (2):
> >   mm/swap, memcg: basic structure and logic for per cgroup swap
> >     priority control
> >   mm: swap: apply per cgroup swap priority mechanism on swap layer
> >
> >  include/linux/memcontrol.h |   3 +
> >  include/linux/swap.h       |  11 ++
> >  mm/Kconfig                 |   7 +
> >  mm/memcontrol.c            |  55 ++++++
> >  mm/swap.h                  |  18 ++
> >  mm/swap_cgroup_priority.c  | 335 +++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c              | 129 ++++++++++----
> >  7 files changed, 523 insertions(+), 35 deletions(-)
> >  create mode 100644 mm/swap_cgroup_priority.c
> >
> > base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
> > --
> > 2.34.1