From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 3 Jun 2024 14:18:19 -0700
Subject: Re: [PATCH v7 0/7] Swap-out mTHP without splitting
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
 Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Chris Li, Lance Yang,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Zi Yan
In-Reply-To: <20240408183946.2991168-1-ryan.roberts@arm.com>
References: <20240408183946.2991168-1-ryan.roberts@arm.com>
Content-Type: text/plain; charset="UTF-8"

On Mon, Apr 8, 2024 at 11:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This series adds support for swapping out multi-size THP (mTHP) without
> needing to first split the large folio via
> split_huge_page_to_list_to_order(). It closely follows the approach
> already used to swap out PMD-sized THP.
>
> There are a few reasons for swapping out mTHP without splitting:
>
>   - Performance: It is expensive to split a large folio, and under
>     extreme memory pressure some workloads regressed in performance when
>     using 64K mTHP vs 4K small folios because of this extra cost in the
>     swap-out path. This series not only eliminates the regression but
>     makes it faster to swap out 64K mTHP than 4K small folios.
>
>   - Memory fragmentation avoidance: If we can avoid splitting a large
>     folio, memory is less likely to become fragmented, making it easier
>     to re-allocate a large folio in future.
>
>   - Performance: Enables a separate series [7] to swap in whole mTHPs,
>     which means we won't lose the TLB-efficiency benefits of mTHP once
>     the memory has been through a swap cycle.
>
> I've done what I thought was the smallest change possible, and as a
> result, this approach is only employed when the swap is backed by a
> non-rotating block device (just as PMD-sized THP is supported today).
> Discussion against the RFC concluded that this is sufficient.
>
>
> Performance Testing
> ===================
>
> I've run some swap performance tests on an Ampere Altra VM (arm64) with
> 8 CPUs. The VM is set up with a 35G block ram device as the swap device
> and the test is run from inside a memcg limited to 40G memory. I've then
> run `usemem` from vm-scalability with 70 processes, each allocating and
> writing 1G of memory. I've repeated everything 6 times and taken the
> mean performance improvement relative to the 4K page baseline:
>
> | alloc size | baseline                | + this series           |
> |            | mm-unstable (~v6.9-rc1) |                         |
> |:-----------|------------------------:|------------------------:|
> | 4K Page    |                    0.0% |                    1.3% |
> | 64K THP    |                  -13.6% |                   46.3% |
> | 2M THP     |                   91.4% |                   89.6% |
>
> So with this change, the 64K swap performance goes from a 14% regression
> to a 46% improvement. While 2M shows a small regression, I'm confident
> that this is just noise.
>
> ---
> The series applies against mm-unstable (as of 2024-04-08) after dropping
> v6 of this series from it. The performance numbers are from v5. Since
> the delta is very small I don't anticipate any performance changes. I'm
> optimistically hoping this is the final version.
>
>
> Changes since v6 [6]
> ====================
>
>   - patch #1
>       - swap_page_trans_huge_swapped() takes order instead of nr_pages
>         (per Chris)
>   - patch #2
>       - Fix bug in swap_pte_batch() to consider swp pte bits (per David)
>       - Improved docs for clear_not_present_full_ptes() (per David)
>       - Improved docs for free_swap_and_cache_nr() (per David)
>   - patch #5
>       - Split out change to get_swap_pages() interface into own patch
>         (per David)
>   - patch #6 (was patch #5)
>       - Improved readability of shrink_folio_list() with longer lines
>         (per David)
>
>
> Changes since v5 [5]
> ====================
>
>   - patch #2
>       - Don't bother trying to reclaim swap if none of the entries' refs
>         have gone to 0 in free_swap_and_cache_nr() (per Huang, Ying)
>   - patch #5
>       - Only update THP_SWPOUT_FALLBACK counters for pmd-mappable folios
>         (per Barry Song)
>   - patch #6
>       - Fix bug in madvise_cold_or_pageout_pte_range(): don't continue
>         without ptl (reported by Barry [8], syzbot [9])
>
>
> Changes since v4 [4]
> ====================
>
>   - patch #3:
>       - Added R-B from Huang, Ying - thanks!
>   - patch #4:
>       - get_swap_pages() now takes order instead of nr_pages (per Huang,
>         Ying)
>       - Removed WARN_ON_ONCE() from get_swap_pages()
>       - Reworded comment for scan_swap_map_try_ssd_cluster() (per Huang,
>         Ying)
>       - Unified VM_WARN_ON()s in scan_swap_map_slots() to scan: (per
>         Huang, Ying)
>       - Removed redundant "order == 0" check (per Huang, Ying)
>   - patch #5:
>       - Marked list_empty() check with data_race() (per David)
>       - Added R-B from Barry and David - thanks!
>   - patch #6:
>       - Implemented mkold_ptes() generic helper (per David)
>       - Enhanced folio_pte_batch() to report any_young (per David)
>       - madvise_cold_or_pageout_pte_range() sets old in batch (per David)
>       - Added R-B from Barry - thanks!
>
>
> Changes since v3 [3]
> ====================
>
>   - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
>   - Simplified max offset calculation (per Huang, Ying)
>   - Reinstated struct percpu_cluster to contain per-cluster, per-order
>     `next` offset (per Huang, Ying)
>   - Removed swap_alloc_large() and merged its functionality into
>     scan_swap_map_slots() (per Huang, Ying)
>   - Avoid extra cost of folio ref and lock due to removal of
>     CLUSTER_FLAG_HUGE by freeing swap entries in batches (see patch 2)
>     (per DavidH)
>   - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
>   - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
>   - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry"
>     patch since it's not actually a problem for THP as I first thought.
>
>
> Changes since v2 [2]
> ====================
>
>   - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>     allocation. This required some refactoring to make everything work
>     nicely (new patches 2 and 3).
>   - Fix bug where nr_swap_pages would say there are pages available but
>     the scanner would not be able to allocate them because they were
>     reserved for the per-cpu allocator. We now allow stealing of order-0
>     entries from the high order per-cpu clusters (in addition to existing
>     stealing from order-0 per-cpu clusters).
>
>
> Changes since v1 [1]
> ====================
>
>   - patch 1:
>       - Use cluster_set_count() instead of cluster_set_count_flag() in
>         swap_alloc_cluster() since we no longer have any flag to set. I
>         was unable to kill cluster_set_count_flag() as proposed against
>         v1 as other call sites depend on explicitly setting flags to 0.
>   - patch 2:
>       - Moved large_next[] array into percpu_cluster to make it per-cpu
>         (recommended by Huang, Ying).
>       - large_next[] array is dynamically allocated because PMD_ORDER is
>         not a compile-time constant for powerpc (fixes build error).
>
>
> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/
> [5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/
> [6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.com/
> [7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
> [8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/
> [9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (7):
>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>   mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
>   mm: swap: Simplify struct percpu_cluster
>   mm: swap: Update get_swap_pages() to take folio order
>   mm: swap: Allow storage of all mTHP orders
>   mm: vmscan: Avoid split during shrink_folio_list()
>   mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

+Zi Yan

While looking at the page splitting code, I noticed that
split_huge_page_to_list_to_order() will refuse to split a folio in the
swapcache to any order higher than 0. It has the following check:

        if (new_order) {
                /* Only swapping a whole PMD-mapped folio is supported */
                if (folio_test_swapcache(folio))
                        return -EINVAL;
                ...
        }

I am guessing that with this series this may no longer be applicable?

>
>  include/linux/pgtable.h |  59 ++++++++
>  include/linux/swap.h    |  35 +++--
>  mm/huge_memory.c        |   3 -
>  mm/internal.h           |  75 +++++++++-
>  mm/madvise.c            |  99 +++++++-----
>  mm/memory.c             |  17 ++-
>  mm/swap_slots.c         |   6 +-
>  mm/swapfile.c           | 325 +++++++++++++++++++++++-----------------
>  mm/vmscan.c             |  20 +--
>  9 files changed, 422 insertions(+), 217 deletions(-)
>
> --
> 2.25.1
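
To make the control flow of the guard quoted above concrete, here is a
tiny standalone C model. It is only a sketch of the quoted check, not
the actual mm/huge_memory.c code; the struct and the helpers below are
made-up stand-ins for the kernel's folio API:

        #include <stdbool.h>
        #include <stdio.h>

        #define EINVAL 22

        /* Stand-in for the kernel's struct folio. */
        struct folio {
                bool in_swapcache;
        };

        /* Stand-in for the kernel's folio_test_swapcache(). */
        static bool folio_test_swapcache(const struct folio *folio)
        {
                return folio->in_swapcache;
        }

        /*
         * Models the quoted guard: a swapcache folio may only be split
         * uniformly down to order 0; any new_order > 0 is rejected with
         * -EINVAL.
         */
        static int split_to_order(struct folio *folio, unsigned int new_order)
        {
                if (new_order) {
                        if (folio_test_swapcache(folio))
                                return -EINVAL;
                }
                return 0; /* the real function would perform the split here */
        }

        int main(void)
        {
                struct folio f = { .in_swapcache = true };

                printf("new_order=4 -> %d\n", split_to_order(&f, 4)); /* -22 */
                printf("new_order=0 -> %d\n", split_to_order(&f, 0)); /*   0 */
                return 0;
        }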