From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 1 Jul 2025 22:08:46 +0900
From: YoungJun Park <youngjun.park@lge.com>
To: Michal Koutný
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	shikemeng@huaweicloud.com, kasong@tencent.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
	gunho.lee@lge.com
Subject: Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
References: <20250612103743.3385842-1-youngjun.park@lge.com> <20250612103743.3385842-2-youngjun.park@lge.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
On Mon, Jun 30, 2025 at 07:39:47PM +0200, Michal Koutný wrote:
> On Wed, Jun 18, 2025 at 09:07:51PM +0900, YoungJun Park wrote:
> > This is because cgroups can still restrict swap device usage and control
> > device order without requiring explicit priorities for all devices.
> > In this view, the cgroup interface serves more as a limit or preference
> > mechanism across the full set of available swap devices, rather than
> > requiring full enumeration and configuration.
Hello Michal,

Thank you very much for your thoughtful review and for sharing your
insights. I'd like to share my thoughts and the reasoning behind my
current direction, including some points I considered in relation to
your suggestions.

> I was wondering whether your use cases would be catered by having
> memory.swap.max limit per device (essentially disable swap to undesired
> device(s) for given group). The disadvantage is that memory.swap.max is
> already existing as scalar. Alternatively, remapping priorities to

I did consider implementing this kind of control. In that design, it
would work similarly to memory.swap.max, but per device: the
implementation would iterate through the swap devices in priority order
and maintain per-cgroup counters for each device's usage. It would also
need to handle proper counter cleanup after use, and to perform the
usage checks on the fast path as well, where the per-CPU caches for
swap device clusters come into play.

From a runtime behavior perspective, the priority-based approach seemed
preferable, as it allows more flexible control: the configured cgroup
can strongly prefer the desired device and benefit from faster selection
at allocation time. I also considered how such a per-device limit would
coexist with the existing swap.max interface, but given the additional
implementation and runtime overhead it would introduce, I decided to
hold it back and chose the priority-based approach instead.

> already existing as scalar. Alternatively, remapping priorities to
> memory.swap.weight -- with sibling vs sibling competition and children
> treated with weight of parent when approached from the top. I find this
> weight semantics little weird as it'd clash with other .weight which are
> dual to this (cgroups compete over one device vs cgroup is choosing
> between multiple devices).

Your point about the semantic mismatch is very valid.
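Coming back to the per-device memory.swap.max idea for a moment: not
from the patch itself, but just to make the comparison concrete, here is
a rough userspace model (plain Python; the names and data shapes are my
own illustration, not kernel code) of that per-device limit design,
where allocation walks devices in priority order and charges a
per-cgroup, per-device counter:

```python
from collections import defaultdict

class CgroupSwapLimits:
    """Toy model of a per-device memory.swap.max style control."""
    def __init__(self, limits):
        # limits: device -> max pages this cgroup may place there
        # (None means unlimited)
        self.limits = limits
        self.usage = defaultdict(int)  # per-cgroup, per-device counter

def try_alloc(cg, devices_by_prio):
    """Walk devices in global priority order; charge the first one
    whose per-cgroup limit is not yet exhausted."""
    for dev in devices_by_prio:
        limit = cg.limits.get(dev)
        if limit is None or cg.usage[dev] < limit:
            cg.usage[dev] += 1  # charge one page to this device
            return dev
    return None  # every permitted device is exhausted

def free_page(cg, dev):
    """Counter cleanup when the swap entry is released."""
    cg.usage[dev] -= 1
```

In this model a limit of 0 disables a device for the cgroup entirely,
which roughly gives the "disable swap to undesired device(s)" behavior
you describe; what it cannot express is a per-cgroup reordering of the
devices, which is what led me toward priorities.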
I agree that reusing .weight semantics here could be confusing: .weight
usually expresses competition among siblings for a shared resource,
whereas here the goal is to steer selection among multiple devices
within a single cgroup's scope. The swap priority concept already exists
as an independent mechanism, so mapping it into a .weight field might
not align well in practice.

> Please try to take the existing distribution models into account not to
> make something overly unidiomatic,

I also thought about possible alignment with existing mechanisms such as
zswap.writeback. One alternative could be to adopt an on/off style
mechanism similar to zswap.writeback, including its propagation
strategy. Implementation-wise, this could be handled by including or
excluding devices from the cgroup's swap device priority list (the
direction I suggested earlier).

However, this approach also has limitations in certain use cases. For
example, if we want to enforce a different ordering than the global
system swap priority, an on/off switch alone is not sufficient. One
possible example: some cgroup should prefer the slowest available swap
device, which has a larger capacity, to avoid swap allocation failure.

  Global swap: A (fast) -> B (slower) -> C (slowest)
  Cgroup swap: C (slowest) -> B (slower) -> A (fast)

This kind of configuration cannot be achieved with an on/off switch
alone. The priority approach might not map perfectly onto the existing
major distribution models (limit, weight, etc.), so I cautiously see it
as an extension of the resource control interfaces, building on the
solid foundation that the cgroup mechanism already provides. I am
working to ensure that the proposed interface and its propagation
behavior integrate properly with parent cgroups and follow the same
interface style.

Here is the current version I am working on now. (It turned out a bit
long, but I felt it might be useful to share it with you.)
  memory.swap.priority
	A read-write flat-keyed file which exists on non-root cgroups.

	Example: (after swapon)

	  $ swapon
	  NAME       TYPE       SIZE  USED  PRIO
	  /dev/sdb   partition  300M  0B    10
	  /dev/sdc   partition  300M  0B    5
	  /dev/sdd   partition  300M  0B    -2

	To assign priorities to swap devices in the current cgroup, write
	one or more "<unique_id> <priority>" lines:

	Example: (writing priorities)

	  $ echo "1 4" > memory.swap.priority
	  $ echo "2 -2" > memory.swap.priority
	  $ echo "3 -1" > memory.swap.priority

	Example: (reading after write)

	  $ cat memory.swap.priority
	  1 4
	  2 -2
	  3 -1

	The priority semantics are consistent with the global swap system:

	- Higher values indicate higher preference.
	- See Documentation/admin-guide/mm/swap_numa.rst for swap NUMA
	  autobinding.

	Note: A special value of -1 means the swap device is completely
	excluded from use by this cgroup. Unlike the global swap priority,
	where negative values simply lower the priority, setting -1 here
	disables allocation from that device for the current cgroup only.

	If any ancestor cgroup has set a swap priority configuration, it
	is inherited by all descendants. In that case, the child's own
	configuration is ignored and the topmost configured ancestor
	determines the effective priority ordering.

  memory.swap.priority.effective
	A read-only file showing the effective swap priority ordering
	actually applied to this cgroup, after resolving inheritance from
	ancestors.

	If neither the current cgroup nor any of its ancestors has a
	configuration, this file shows the global swap device priority
	from `swapon`, in the form of "<unique_id> <priority>" pairs.
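The inheritance and resolution rule described above could be modeled,
purely as an illustration in userspace Python (not the kernel
implementation; the function name and data shapes are mine), like this:

```python
def effective_priority(global_prio, conf_chain):
    """Resolve the effective swap device ordering for a cgroup.

    global_prio: {unique_id: priority} from swapon.
    conf_chain:  each cgroup's own configuration from the root down to
                 this cgroup; an entry is {unique_id: priority} or None
                 when that cgroup has no configuration.
    """
    # The topmost configured ancestor wins; with no configuration
    # anywhere, the global swapon priorities apply.
    conf = next((c for c in conf_chain if c is not None), None)
    if conf is None:
        conf = global_prio
    # A configured value of -1 excludes the device for this cgroup;
    # otherwise higher priority values are preferred.
    usable = {dev: prio for dev, prio in conf.items() if prio != -1}
    return sorted(usable, key=lambda dev: -usable[dev])
```

Here `conf_chain` is ordered root-first, so the first configured entry
is the topmost ancestor, matching the "child's own configuration is
ignored" rule.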
	Example: (global only)

	  $ swapon
	  NAME       TYPE       SIZE  USED  PRIO
	  /dev/sdb   partition  300M  0B    10
	  /dev/sdc   partition  300M  0B    5
	  /dev/sdd   partition  300M  0B    -2

	  $ cat /sys/fs/cgroup/parent/child/memory.swap.priority.effective
	  1 10
	  2 5
	  3 -2

	Example: (with parent override)

	  # Parent cgroup configuration
	  $ cat /sys/fs/cgroup/parent/memory.swap.priority
	  1 4
	  2 -2

	  # Child cgroup configuration (ignored because the parent overrides it)
	  $ cat /sys/fs/cgroup/parent/child/memory.swap.priority
	  1 8
	  2 5

	  # Effective priority seen by the child
	  $ cat /sys/fs/cgroup/parent/child/memory.swap.priority.effective
	  1 4
	  2 -2

	In this case:

	- If no cgroup sets any configuration, the output matches the
	  global `swapon` priority.
	- If an ancestor has a configuration, the child inherits it and
	  ignores its own setting.

I hope my explanation clarifies my intention, and I would truly
appreciate your consideration and any further thoughts you might have.

Best regards,
Youngjun Park