From: Kairui Song <ryncsn@gmail.com>
Date: Fri, 13 Jun 2025 15:38:37 +0800
Subject: Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
To: YoungJun Park
Cc: Nhat Pham, linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, shikemeng@huaweicloud.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org, muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com, gunho.lee@lge.com
References: <20250612103743.3385842-1-youngjun.park@lge.com> <20250612103743.3385842-3-youngjun.park@lge.com>

On Fri, Jun 13, 2025 at 3:36 PM Kairui Song wrote:
>
> On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park wrote:
> >
> > On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song wrote:
> > > >
> > > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham wrote:
> > > > >
> > > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song wrote:
> > > > > >
> > > > > > On Thu, Jun 12, 2025 at 6:43 PM wrote:
> > > > > > >
> > > > > > > From: "youngjun.park"
> > > > > > >
> > > > > > Hi, Youngjun,
> > > > > >
> > > > > > Thanks for sharing this series.
> > > > > >
> > > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > > when a cgroup-specific swap priority is set.
> > > > > > >
> > > > > > > There is one workaround to this implementation as follows.
> > > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > > >
> > > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > > next cluster selector, the problem with current code is that swap
> > > > >
> > > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > > devices is gonna be smaller than number of cgroups, right?
> > > >
> > > > Hi Nhat,
> > > >
> > > > The problem is per cgroup makes more sense (I was suggested to use
> > > > cgroup level locality at the very beginning of the implementation of
> > > > the allocator in the mail list, but it was hard to do so at that
> > > > time), for container environments, a cgroup is a container that runs
> > > > one type of workload, so it has its own locality. Things like systemd
> > > > also organize different desktop workloads into cgroups. The whole
> > > > point is about cgroup.
> > >
> > > Yeah I know what cgroup represents. Which is why I mentioned in the
> > > next paragraph that we are still making decisions per-cgroup - we
> > > just organize the per-cpu cache based on swap devices. This way, two
> > > cgroups with similar/same priority lists can share the clusters, for
> > > each swapfile, in each CPU. There will be a lot less duplication and
> > > overhead. And two cgroups with different priority lists won't
> > > interfere with each other, since they'll target different swapfiles.
> > >
> > > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > > single cgroup (with some stragglers mixed in). I suppose that will be
> > > very useful for swap on rotational drives where read contiguity is
> > > imperative, but not sure about other backends :-?
> > > Anyway, no strong opinions to be completely honest :) Was just
> > > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > > me too, if it's easy to do.
> >
> > Good point!
> > I agree with the points about self-partitioned clusters and duplicated
> > priority lists.
> > One concern is the cost of synchronization, specifically the one
> > incurred when accessing the prioritized swap device.
> > From a simple performance perspective, a per-cgroup-per-CPU
> > implementation seems favorable - in line with the current swap
> > allocation fastpath.
> >
> > It seems most reasonable to carefully compare the pros and cons of the
> > two approaches.
> >
> > To summarize,
> >
> > Option 1. per-cgroup-per-cpu
> > Pros: upstream fit, performance.
> > Cons: duplicated priority state (some memory/structure consumption cost),
> > self-partitioned clusters.
> >
> > Option 2. per-cpu-per-order (per-device)
> > Pros: avoids Option 1's cons.
> > Cons: lacks Option 1's pros.
> >
> > It's not easy to draw a definitive conclusion right away, and I should
> > also evaluate other pros and cons that may arise during the actual
> > implementation, so I'd like to take some time to review things in more
> > detail and share my thoughts and conclusions in the next patch series.
> >
> > What do you think, Nhat and Kairui?
>
> Ah, I think what might fit best here is: each cgroup has a pcp
> device list, and each device has a pcp cluster list:
>
> folio -> mem_cgroup -> swap_priority (maybe a more generic name is
> better?) -> swap_device_pcp (recording only the *si per order)
> swap_device_info -> swap_cluster_pcp (cluster offset per order)

Sorry, the truncation made this hard to read, let me try again:

folio -> mem_cgroup -> swap_priority (maybe a more generic name is
better?) -> swap_device_pcp (recording only the *si per order)

And:

swap_device_info -> swap_cluster_pcp (cluster offset per order)

And if mem_cgroup -> swap_priority is NULL, fall back to a global
swap_device_pcp.
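
To make that layering a bit more concrete, here is a rough pseudo-C sketch.
Everything in it is illustrative only: swap_priority, swap_device_pcp,
swap_cluster_pcp, the memcg->swap_priority member, global_swap_device_pcp,
NR_SWAP_ORDERS and swap_pick_device() are made-up names, not existing code,
and I'm reading "swap_device_info" above as today's swap_info_struct:

/*
 * Rough sketch only -- all names below are hypothetical and just
 * illustrate the proposed layering, not an actual implementation.
 */
#include <linux/memcontrol.h>
#include <linux/percpu.h>
#include <linux/plist.h>
#include <linux/swap.h>

/* Assumption: one per-CPU slot per allocation order up to PMD order. */
#define NR_SWAP_ORDERS	(PMD_ORDER + 1)

/* Per-cgroup, per-CPU: which device to try next for each order. */
struct swap_device_pcp {
	struct swap_info_struct *si[NR_SWAP_ORDERS];
};

/* Hangs off mem_cgroup; NULL means no per-cgroup priority is set. */
struct swap_priority {
	struct plist_head plist;		/* cgroup-local device priority list */
	struct swap_device_pcp __percpu *pcp;	/* pcp device hint per order */
};

/* Per-device, per-CPU: next cluster offset per order, shared by all cgroups. */
struct swap_cluster_pcp {
	unsigned int offset[NR_SWAP_ORDERS];
};

static DEFINE_PER_CPU(struct swap_device_pcp, global_swap_device_pcp);

static struct swap_info_struct *swap_pick_device(struct folio *folio, int order)
{
	struct mem_cgroup *memcg = folio_memcg(folio);
	struct swap_device_pcp *dpcp;

	if (memcg && memcg->swap_priority)	/* hypothetical member */
		dpcp = this_cpu_ptr(memcg->swap_priority->pcp);
	else
		dpcp = this_cpu_ptr(&global_swap_device_pcp);	/* global fallback */

	/*
	 * The per-order cluster offset is then taken from the chosen
	 * device's own swap_cluster_pcp, so cgroups that end up on the
	 * same device still share its per-CPU cluster hints.
	 */
	return dpcp->si[order];
}

The idea being that the cgroup level only caches a device choice per order,
while the cluster-level pcp stays with the device itself, which keeps the
duplication limited to a few pointers per cgroup per CPU.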