Date: Fri, 13 Jun 2025 15:56:12 +0900
From: YoungJun Park
To: Kairui Song
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	baohua@kernel.org, chrisl@kernel.org, muchun.song@linux.dev,
	iamjoonsoo.kim@lge.com, taejoon.song@lge.com, gunho.lee@lge.com
Subject: Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
References: <20250612103743.3385842-1-youngjun.park@lge.com>

On Thu, Jun 12, 2025 at 08:24:08PM +0800, Kairui Song wrote:
> On Thu, Jun 12, 2025 at 6:38 PM wrote:
> >
> > From: Youngjun Park
> >
> > Introduction
> > ============
> > I am a kernel developer working on platforms deployed on commercial consumer devices.
> > Due to real-world product requirements, I needed to modify the Linux kernel to support
> > a new swap management mechanism. The proposed mechanism allows assigning different swap
> > priorities to swap devices per cgroup.
> > I believe this mechanism can be generally useful for similar constrained-device scenarios
> > and would like to propose it for upstream inclusion and solicit feedback from the community.
> >
> > Motivation
> > ==========
> > The core requirement was to improve application responsiveness and loading time, especially
> > for latency critical applications, without increasing RAM or storage hardware resources.
> > Device constraints:
> >   - Linux-based embedded platform
> >   - Limited system RAM
> >   - Small local swap
> >   - No option to expand RAM or local swap
> > To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
> > swap space. To maximize its effectiveness, we needed the ability to control which swap devices
> > were used by different cgroups:
> >   - Assign faster local swap devices to latency critical apps
> >   - Assign remote swap devices to background apps
> > However, the current Linux kernel swap infrastructure does not support per-cgroup swap device
> > assignment.
> > To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
> > priorities.
> >
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> >    - Previously proposed upstream [1]
> >    - Challenges in managing global vs per-cgroup swap state
> >    - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered a sort of layering violation (block device cgroup awareness)
> >    - Swap devices are commonly meant to be physical block devices.
> >    - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage control**
> >    - Expand swap.max with zswap.writeback usage
> >    - Discussed in context of zswap writeback [3]
> >    - Cannot express arbitrary priority orderings
> >      (e.g. swap priority A-B-C on cgroup C-A-B impossible)
> >    - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> >    - In short, make a swap namespace for swap device priority
> >    - Overly complex for our use case
> >    - Cgroups are the natural scope for this mechanism
> >
> > Based on these findings, we chose to prototype per-cgroup swap priority configuration
> > as the most natural, least invasive extension of the existing kernel mechanisms.
> >
> > Design and Semantics
> > ====================
> > - Each swap device gets a unique ID at `swapon` time
> > - Each cgroup has a `memory.swap.priority` interface:
> >   - Reading `memory.swap.priority` shows each device's unique ID
> >   - Format: `unique_id:priority,unique_id:priority,...`
> >   - All currently-active swap devices must be listed
> >   - Priorities follow existing swap infrastructure semantics
> > - The interface is writeable and updatable at runtime
> > - A priority configuration can be reset via `echo "" > memory.swap.priority`
> > - Swap on/off events propagate to all cgroups with priority configurations
> >
> > Example Usage
> > -------------
> > # swap device on
> > $ swapon
> > NAME     TYPE      SIZE USED PRIO
> > /dev/sdb partition 300M   0B   10
> > /dev/sdc partition 300M   0B    5
> >
> > # assign custom priorities in a cgroup
> > $ echo "1:5,2:10" > memory.swap.priority
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb unique:1 prio:5
> > /dev/sdc unique:2 prio:10
> >
> > # adding new swap device later
> > $ swapon /dev/sdd --priority -1
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb unique:1 prio:5
> > /dev/sdc unique:2 prio:10
> > /dev/sdd unique:3 prio:-2
> >
> > # reset cgroup priority
> > $ echo "" > memory.swap.priority
> > $ cat memory.swap.priority
> > Inactive
> > /dev/sdb unique:1 prio:10
> > /dev/sdc unique:2 prio:5
> > /dev/sdd unique:3 prio:-2
> >
> > Implementation Notes
> > ====================
> > The items mentioned below are to be considered during the next patch work.
> >
> > - Workaround using per swap cpu cluster as before
> > - Priority propagation of child cgroup
> > - And other TODO, XXX
> > - Refactoring for reviewability and maintainability, comprehensive testing
> >   and performance evaluation
>
> Hi Youngjun,
>
> Interesting idea. For your current approach, I think all we need is
> per-cgroup swap meta info structures (and infrastructures for maintaining
> and manipulating them).
>
> So we have a global version and a cgroup version of "plist, next
> cluster list, and maybe something else", right? And then
> once the allocator is folio aware it can just prefer the cgroup ones
> (as I mentioned in another reply) reusing all the same other
> routines. Changes are minimal, the cgroup swap meta infos
> and control plane are separately maintained.
>
> It seems aligned quite well with what I wanted to do, and can be done
> in a clean and easy to maintain way.
>
> Meanwhile with virtual swap, things could be even more flexible, not
> only changing the priority at swapout time, it will also provide
> capabilities to migrate and balance devices adaptively, and solve long
> term issues like mTHP fragmentation and min-order swapout etc.
>
> Maybe they can be combined, like maybe cgroup can be limited to use
> the virtual device or physical ones depending on priority. Seems all
> solvable. Just some ideas here.

I had been thinking about how this work relates to and aligns with vswap,
so I'm glad to hear that the two can be harmonized.

> Vswap can cover the priority part too. I think we might want to avoid
> duplicated interfaces.
>
> So I'm just imagining things now, will it be good if we have something
> like (following your design):
>
> $ cat memcg1/memory.swap.priority
> Active
> /dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5
>
> $ cat memcg2/memory.swap.priority
> Active
> /dev/vswap:/dev/nvme1  unique:1 prio:5
> /dev/vswap:/dev/nvme2  unique:2 prio:10
> /dev/vswap:/dev/vda    unique:3 prio:15
> /dev/sda               unique:4 prio:20
>
> $ cat memcg3/memory.swap.priority
> Active
> /dev/vda unique:3 prio:5
> /dev/sda unique:4 prio:15
>
> Meaning memcg1 (high priority) is allowed to use compressed memory
> only through vswap, and memcg2 (mid priority) uses disks through vswap
> and fallback to HDD. memcg3 (low prio) is only allowed to use slow
> devices.
>
> Global fallback just uses everything the system has. It might be over
> complex though?

Judging from the example usage you describe, it looks flexible and good.
I will think more about how it relates to this work.
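
To make the semantics discussed above easier to reason about, here is a
minimal user-space sketch in Python. It is not code from the patch series
and does not reflect the kernel implementation. It only models two rules
stated in the cover letter: a write of `unique_id:priority,...` must list
every currently-active swap device, and an empty write resets the cgroup to
the global priorities. The names `parse_swap_priority` and
`allocation_order` are hypothetical, and round-robin between devices of
equal priority is deliberately not modeled.

```python
def parse_swap_priority(text, active_ids):
    """Parse 'id:prio,id:prio,...' into {id: prio}.

    Returns None for an empty write, which the cover letter describes
    as resetting the cgroup configuration.
    """
    text = text.strip()
    if not text:
        return None

    table = {}
    for entry in text.split(","):
        uid_str, sep, prio_str = entry.partition(":")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        uid = int(uid_str)
        if uid in table:
            raise ValueError(f"duplicate unique id: {uid}")
        table[uid] = int(prio_str)

    # Rule from the proposal: all currently-active devices must be listed.
    if set(table) != set(active_ids):
        raise ValueError("all currently-active swap devices must be listed")
    return table


def allocation_order(global_prio, cgroup_prio=None):
    """Return device ids from highest to lowest priority.

    Uses the cgroup table when one is configured, otherwise falls back
    to the global priorities (assumption: higher value is tried first,
    as with existing swap priorities).
    """
    prio = cgroup_prio if cgroup_prio is not None else global_prio
    return sorted(prio, key=lambda uid: prio[uid], reverse=True)


# Mirrors the first example in the cover letter:
# /dev/sdb is unique:1 with global prio 10, /dev/sdc is unique:2 with prio 5.
global_prio = {1: 10, 2: 5}
cgroup_prio = parse_swap_priority("1:5,2:10", global_prio)
print(allocation_order(global_prio, cgroup_prio))  # cgroup prefers id 2
print(allocation_order(global_prio, parse_swap_priority("", global_prio)))
```

With a cgroup table in place the order is inverted relative to the global
one, and after the empty write the global order applies again, which matches
the Active/Inactive states shown in the example usage.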