From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 13 Jun 2025 19:45:16 +0900
From: YoungJun Park <youngjun.park@lge.com>
To: Kairui Song
Cc: Nhat Pham, linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, shikemeng@huaweicloud.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org, muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com, gunho.lee@lge.com
Subject: Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
References: <20250612103743.3385842-1-youngjun.park@lge.com> <20250612103743.3385842-3-youngjun.park@lge.com>

On Fri, Jun 13, 2025 at 03:38:37PM +0800, Kairui Song wrote:
> On Fri, Jun 13, 2025 at 3:36 PM Kairui Song wrote:
> >
> > On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park wrote:
> > >
> > > On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > > > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song wrote:
> > > > >
> > > > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham wrote:
> > > > > >
> > > > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song wrote:
> > > > > > >
> > > > > > > On Thu, Jun 12, 2025 at 6:43 PM wrote:
> > > > > > > >
> > > > > > > > From: "youngjun.park"
> > > > > > >
> > > > > > > Hi, Youngjun,
> > > > > > >
> > > > > > > Thanks for sharing this series.
> > > > > > >
> > > > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > > > when a cgroup-specific swap priority is set.
> > > > > > > >
> > > > > > > > There is one workaround to this implementation as follows.
> > > > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > > > >
> > > > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > > > next cluster selector, the problem with current code is that swap
> > > > > >
> > > > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > > > devices is gonna be smaller than number of cgroups, right?
> > > > >
> > > > > Hi Nhat,
> > > > >
> > > > > The problem is per cgroup makes more sense (I was suggested to use
> > > > > cgroup level locality at the very beginning of the implementation of
> > > > > the allocator in the mail list, but it was hard to do so at that
> > > > > time), for container environments, a cgroup is a container that runs
> > > > > one type of workload, so it has its own locality. Things like systemd
> > > > > also organize different desktop workloads into cgroups. The whole
> > > > > point is about cgroup.
> > > >
> > > > Yeah I know what cgroup represents. Which is why I mentioned in the
> > > > next paragraph that we are still making decisions per-cgroup - we
> > > > just organize the per-cpu cache based on swap devices. This way, two
> > > > cgroups with similar/same priority list can share the clusters, for
> > > > each swapfile, in each CPU. There will be a lot less duplication and
> > > > overhead.
> > > > And two cgroups with different priority lists won't
> > > > interfere with each other, since they'll target different swapfiles.
> > > >
> > > > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > > > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > > > single cgroup (with some stragglers mixed in). I suppose that will be
> > > > very useful for swap on rotational drives where read contiguity is
> > > > imperative, but not sure about other backends :-?
> > > > Anyway, no strong opinions to be completely honest :) Was just
> > > > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > > > me too, if it's easy to do.
> > >
> > > Good point!
> > > I agree with the points about self-partitioned clusters and duplicated
> > > priority lists. One concern is the cost of synchronization,
> > > specifically the one incurred when accessing the prioritized swap device.
> > > From a simple performance perspective, a per-cgroup-per-CPU implementation
> > > seems favorable - in line with the current swap allocation fastpath.
> > >
> > > It seems most reasonable to carefully compare the pros and cons of the
> > > two approaches.
> > >
> > > To summarize:
> > >
> > > Option 1. per-cgroup-per-cpu
> > > Pros: upstream fit, performance.
> > > Cons: duplicated priority lists (some memory consumption cost),
> > > self-partitioned clusters.
> > >
> > > Option 2. per-cpu-per-order (per-device)
> > > Pros: avoids the cons of Option 1.
> > > Cons: gives up the pros of Option 1.
> > >
> > > It's not easy to draw a definitive conclusion right away, and I should
> > > also evaluate other pros and cons that may arise during actual
> > > implementation, so I'd like to take some time to review things in more
> > > detail and share my thoughts and conclusions in the next patch series.
> > >
> > > What do you think, Nhat and Kairui?
> >
> > Ah, I think what might fit best here is: each cgroup has a
> > pcp device list, and each device has a pcp cluster list:
> >
> > folio -> mem_cgroup -> swap_priority (maybe a more generic name is
> > better?) -> swap_device_pcp (recording only the *si per order)
> > swap_device_info -> swap_cluster_pcp (cluster offset per order)
>
> Sorry, the truncation made this hard to read, let me try again:
>
> folio ->
> mem_cgroup ->
> swap_priority (maybe a more generic name is better?) ->
> swap_device_pcp (recording only the *si per order)
>
> And:
> swap_device_info ->
> swap_cluster_pcp (cluster offset per order)
>
> And if mem_cgroup -> swap_priority is NULL,
> fall back to a global swap_device_pcp.

Thank you for the quick and kind feedback. This is a really good idea :)

With my workaround proposal, I just need to add the swap_device_pcp part
along with some refactoring.

As for the naming: I adopted the term "swap_cgroup_priority" based on the
functionality I'm aiming to implement. Here are some alternatives that
immediately come to mind (like I said, just off the top of my head):

* swap_tier, swap_order, swap_selection, swap_cgroup_tier,
  swap_cgroup_order, swap_cgroup_selection...

I'll try to come up with a more suitable conceptual name as I continue
working on the patch. In the meantime, I'd appreciate any suggestions or
feedback you may have.

Thanks again for your feedback and suggestions.