Date: Mon, 7 Jul 2025 23:45:25 +0900
From: YoungJun Park
To: Michal Koutný
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
    mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    shikemeng@huaweicloud.com, kasong@tencent.com, nphamcs@gmail.com,
    bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
    muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
    gunho.lee@lge.com
Subject: Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for
    per cgroup swap priority control
References: <20250612103743.3385842-1-youngjun.park@lge.com>
    <20250612103743.3385842-2-youngjun.park@lge.com>

On Mon, Jul 07, 2025 at 11:59:49AM +0200, Michal Koutný wrote:
> Hello.
>
> On Tue, Jul 01, 2025 at 10:08:46PM +0900, YoungJun Park wrote:
> > memory.swap.priority
> ...
>
> > To assign priorities to swap devices in the current cgroup,
> > write one or more lines in the following format:
> >
> >
>
> How would the user know this unique_id? (I don't see it in /proc/swaps.)

The unique_id is a new concept I introduced to refer to assigned swap
devices. It's allocated whenever a swap device is turned on.
I did explore other key identifiers like the swap device path, but I
determined that providing a separate unique_id is more suitable for this
context.

Initially, I proposed printing it directly from memory.swap.priority to
facilitate usage like:

$ swapon
NAME     TYPE      SIZE USED PRIO
/dev/sdb partition 300M   0B   10
/dev/sdc partition 300M   0B    5

$ cat memory.swap.priority
Active
/dev/sdb unique:1 prio:10
/dev/sdc unique:2 prio:5

Following your suggestion, I've deprecated this initial proposal and
considered four alternatives. I'm currently leaning towards options 2 and
4, and I plan to propose option 4 as the primary approach:

1. /proc/swaps with ID:
   We've rejected this due to potential ABI changes.

2. New /proc interface:
   This could be /proc/swaps with the ID, or a dedicated swapdevice file
   with the ID. While viable, I prefer not to add new /proc interfaces if
   we can avoid it.

3. /sys/kernel/mm/swap/ location (similar to vma_ra_enabled):
   This was rejected because sysfs typically shows configured values, not
   dynamic identifiers, which would be inconsistent with existing
   conventions.

4. Align memory.swap.priority.effective with /proc/swaps:
   Aligning the order of id prio pairs in memory.swap.priority.effective
   with the output order of /proc/swaps would allow users to infer which
   swap device corresponds to which ID. For example:

   $ swapon
   NAME     TYPE      SIZE USED PRIO
   /dev/sdb partition 300M   0B   10
   /dev/sdc partition 300M   0B    5

   $ cat memory.swap.priority.effective
   Active
   1 10    // this is /dev/sdb
   2 5     // this is /dev/sdc

> > Note:
> > A special value of -1 means the swap device is completely
> > excluded from use by this cgroup. Unlike the global swap
> > priority, where negative values simply lower the priority,
> > setting -1 here disables allocation from that device for the
> > current cgroup only.
>
> The divergence from the global semantics is little bit confusing.
> There should better be a special value (like 'disabled') in the interface.
> And possible second special value like 'none' that denotes the default
> (for new (unconfigured) cgroups or when a new swap device is activated).
>

Thank you for your insightful comments and suggestions regarding the
default values. I was initially focused on providing numerical values for
these settings. However, using keywords like "none" and "disabled" for
default values makes the semantics much more natural and user-friendly.

Based on your feedback and the cgroup-v2.html documentation on default
values, I propose the following semantics:

none:
  This applies priority based on the global swap priority. It's important
  to note that for negative priorities, this implies following NUMA
  auto-binding rules, rather than a direct application of the negative
  value itself.

disabled:
  This keyword explicitly excludes the swap device from use by this
  cgroup.

Here's how these semantics would translate into usage:

echo "default none" > memory.swap.priority or
echo "none" > memory.swap.priority:
  * When swapon is active, the cgroup's swap device priority will follow
    the global swap priority.

echo "default disabled" > memory.swap.priority or
echo "default" > memory.swap.priority:
  * When swapon is active, the swap device will be excluded from
    allocation within this cgroup.

echo " none" > memory.swap.priority:
  * The specified swap device will follow its global swap priority.

echo " disabled" > memory.swap.priority:
  * The specified swap device will be excluded from allocation for this
    cgroup.

echo " " > memory.swap.priority:
  * This sets a specific priority for the specified swap device.

> ...
> > In this case:
> > - If no cgroup sets any configuration, the output matches the
> >   global `swapon` priority.
> > - If an ancestor has a configuration, the child inherits it
> >   and ignores its own setting.
>
> The child's priority could be capped by ancestors' instead of wholy
> overwritten? (So that remains some effect both.)

Regarding the child's priority being capped or refined by ancestors'
settings, I've considered allowing the child to resolve priorities with
its own settings when the sorted priority order is consistent and the
child's swap devices are a subset of the parent's.

Here's a visual representation of how that might work:

      +---------------------------------------+
      | Parent cgroup                         |
      | (Swaps: A, B, C)                      |
      +-------------------+-------------------+
                          |
                          | (Child applies settings to its own children)
                          v
      +-------------------+-------------------+
      | Child cgroup                          |
      | (Swaps: B, C)                         |
      | (B & C resolved by child's settings)  |
      +-------------------+-------------------+
                          |
            +-------------+-------------+
            |                           |
            v                           v
+-----------+-----------+   +-----------+-----------+
| Grandchild cgroup     |   | Grandchild 2 cgroup   |
| (Swaps: C)            |   | (Swaps: A)            |
| (C resolved by        |   | (A not in B,C;        |
|  grandchild's         |   |  resolved by          |
|  child's settings)    |   |  child's settings)    |
+-----------------------+   +-----------------------+

However, this feature isn't required for our immediate use case, and it
adds notable complexity to the implementation. I suggest we treat it as a
next step, once the current feature has been merged into the kernel and
either sees wider adoption or further use cases and requirements emerge.

Best regards,
Youngjun Park
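
For illustration, the "none" / "disabled" / numeric semantics proposed
above can be sketched as a small resolver. This is a minimal Python model
under stated assumptions, not the patch's implementation: the function name
resolve_priorities, the use of integer IDs as dictionary keys, and the
representation of an excluded device as None are all hypothetical. It also
does not model the NUMA auto-binding behaviour that "none" implies for
negative global priorities.

```python
from typing import Dict, Optional, Union

# A per-device setting is an explicit integer priority or one of the two
# keywords discussed in this thread.
Setting = Union[int, str]


def resolve_priorities(
    global_prio: Dict[int, int],
    default: str,
    overrides: Dict[int, Setting],
) -> Dict[int, Optional[int]]:
    """Return the effective priority per swap device ID for one cgroup.

    global_prio: device ID -> global swap priority (hypothetical IDs).
    default:     "none" or "disabled", used for devices with no override.
    overrides:   device ID -> "none", "disabled", or an integer priority.
    A value of None in the result means the device is excluded.
    """
    if default not in ("none", "disabled"):
        raise ValueError("default must be 'none' or 'disabled'")

    effective: Dict[int, Optional[int]] = {}
    for dev_id, gprio in global_prio.items():
        setting = overrides.get(dev_id, default)
        if setting == "none":
            # Follow the global swap priority. NUMA auto-binding for
            # negative global priorities is NOT modelled here.
            effective[dev_id] = gprio
        elif setting == "disabled":
            # Exclude this device for this cgroup only.
            effective[dev_id] = None
        elif isinstance(setting, int) and not isinstance(setting, bool):
            # Explicit per-cgroup priority.
            effective[dev_id] = setting
        else:
            raise ValueError(f"invalid setting for device {dev_id}: {setting!r}")
    return effective
```

With global priorities {1: 10, 2: 5}, a default of "disabled" and an
override of {1: "none"}, device 1 keeps priority 10 and device 2 is
excluded for this cgroup.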