From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 3 Jun 2024 14:18:19 -0700
Subject: Re: [PATCH v7 0/7] Swap-out mTHP without splitting
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
 Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Chris Li, Lance Yang,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Zi Yan
In-Reply-To: <20240408183946.2991168-1-ryan.roberts@arm.com>
References: <20240408183946.2991168-1-ryan.roberts@arm.com>
Content-Type: text/plain; charset="UTF-8"

On Mon, Apr 8, 2024 at 11:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This series adds support for swapping out multi-size THP (mTHP) without
> needing to first split the large folio via
> split_huge_page_to_list_to_order(). It closely follows the approach
> already used to swap out PMD-sized THP.
>
> There are a few reasons for swapping out mTHP without splitting:
>
>   - Performance: It is expensive to split a large folio, and under
>     extreme memory pressure some workloads regressed in performance when
>     using 64K mTHP vs 4K small folios because of this extra cost in the
>     swap-out path. This series not only eliminates the regression but
>     makes it faster to swap out 64K mTHP than 4K small folios.
>
>   - Memory fragmentation avoidance: If we can avoid splitting a large
>     folio, memory is less likely to become fragmented, making it easier
>     to re-allocate a large folio in future.
>
>   - Performance: Enables a separate series [7] to swap in whole mTHPs,
>     which means we won't lose the TLB-efficiency benefits of mTHP once
>     the memory has been through a swap cycle.
>
> I've done what I thought was the smallest change possible, and as a
> result, this approach is only employed when the swap is backed by a
> non-rotating block device (just as PMD-sized THP is supported today).
> Discussion against the RFC concluded that this is sufficient.
>
>
> Performance Testing
> ===================
>
> I've run some swap performance tests on an Ampere Altra VM (arm64) with
> 8 CPUs. The VM is set up with a 35G block ram device as the swap device
> and the test is run from inside a memcg limited to 40G memory. I've then
> run `usemem` from vm-scalability with 70 processes, each allocating and
> writing 1G of memory. I've repeated everything 6 times and taken the
> mean performance improvement relative to the 4K page baseline:
>
> | alloc size | baseline                | + this series           |
> |            | mm-unstable (~v6.9-rc1) |                         |
> |:-----------|------------------------:|------------------------:|
> | 4K Page    |                    0.0% |                    1.3% |
> | 64K THP    |                  -13.6% |                   46.3% |
> | 2M THP     |                   91.4% |                   89.6% |
>
> So with this change, the 64K swap performance goes from a 14% regression
> to a 46% improvement. While 2M shows a small regression, I'm confident
> that this is just noise.
>
> ---
> The series applies against mm-unstable (as of 2024-04-08) after dropping
> v6 of this series from it. The performance numbers are from v5. Since
> the delta is very small I don't anticipate any performance changes. I'm
> optimistically hoping this is the final version.
>
>
> Changes since v6 [6]
> ====================
>
>   - patch #1
>       - swap_page_trans_huge_swapped() takes order instead of nr_pages
>         (per Chris)
>   - patch #2
>       - Fix bug in swap_pte_batch() to consider swp pte bits (per David)
>       - Improved docs for clear_not_present_full_ptes() (per David)
>       - Improved docs for free_swap_and_cache_nr() (per David)
>   - patch #5
>       - Split out change to get_swap_pages() interface into own patch
>         (per David)
>   - patch #6 (was patch #5)
>       - Improved readability of shrink_folio_list() with longer lines
>         (per David)
>
>
> Changes since v5 [5]
> ====================
>
>   - patch #2
>       - Don't bother trying to reclaim swap if none of the entries' refs
>         have gone to 0 in free_swap_and_cache_nr() (per Huang, Ying)
>   - patch #5
>       - Only update THP_SWPOUT_FALLBACK counters for pmd-mappable folios
>         (per Barry Song)
>   - patch #6
>       - Fix bug in madvise_cold_or_pageout_pte_range(): don't continue
>         without ptl (reported by Barry [8], syzbot [9])
>
>
> Changes since v4 [4]
> ====================
>
>   - patch #3:
>       - Added R-B from Huang, Ying - thanks!
>   - patch #4:
>       - get_swap_pages() now takes order instead of nr_pages (per Huang,
>         Ying)
>       - Removed WARN_ON_ONCE() from get_swap_pages()
>       - Reworded comment for scan_swap_map_try_ssd_cluster() (per Huang,
>         Ying)
>       - Unified VM_WARN_ON()s in scan_swap_map_slots() to scan: (per
>         Huang, Ying)
>       - Removed redundant "order == 0" check (per Huang, Ying)
>   - patch #5:
>       - Marked list_empty() check with data_race() (per David)
>       - Added R-B from Barry and David - thanks!
>   - patch #6:
>       - Implemented mkold_ptes() generic helper (per David)
>       - Enhanced folio_pte_batch() to report any_young (per David)
>       - madvise_cold_or_pageout_pte_range() sets old in batch (per David)
>       - Added R-B from Barry - thanks!
>
>
> Changes since v3 [3]
> ====================
>
>   - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
>   - Simplified max offset calculation (per Huang, Ying)
>   - Reinstated struct percpu_cluster to contain per-cluster, per-order
>     `next` offset (per Huang, Ying)
>   - Removed swap_alloc_large() and merged its functionality into
>     scan_swap_map_slots() (per Huang, Ying)
>   - Avoid extra cost of folio ref and lock due to removal of
>     CLUSTER_FLAG_HUGE by freeing swap entries in batches (see patch 2)
>     (per DavidH)
>   - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
>   - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
>   - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry"
>     patch since it's not actually a problem for THP as I first thought.
>
>
> Changes since v2 [2]
> ====================
>
>   - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>     allocation. This required some refactoring to make everything work
>     nicely (new patches 2 and 3).
>   - Fix bug where nr_swap_pages would say there are pages available but
>     the scanner would not be able to allocate them because they were
>     reserved for the per-cpu allocator. We now allow stealing of order-0
>     entries from the high order per-cpu clusters (in addition to existing
>     stealing from order-0 per-cpu clusters).
>
>
> Changes since v1 [1]
> ====================
>
>   - patch 1:
>       - Use cluster_set_count() instead of cluster_set_count_flag() in
>         swap_alloc_cluster() since we no longer have any flag to set. I
>         was unable to kill cluster_set_count_flag() as proposed against
>         v1 as other call sites depend on explicitly setting flags to 0.
>   - patch 2:
>       - Moved large_next[] array into percpu_cluster to make it per-cpu
>         (recommended by Huang, Ying).
>       - large_next[] array is dynamically allocated because PMD_ORDER is
>         not a compile-time constant for powerpc (fixes build error).
>
>
> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/
> [5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/
> [6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.com/
> [7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
> [8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/
> [9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (7):
>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>   mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
>   mm: swap: Simplify struct percpu_cluster
>   mm: swap: Update get_swap_pages() to take folio order
>   mm: swap: Allow storage of all mTHP orders
>   mm: vmscan: Avoid split during shrink_folio_list()
>   mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

+Zi Yan

While looking at the page splitting code, I noticed that
split_huge_page_to_list_to_order() will refuse to split a folio in the
swapcache to any order higher than 0. It has the following check:

        if (new_order) {
                /* Only swapping a whole PMD-mapped folio is supported */
                if (folio_test_swapcache(folio))
                        return -EINVAL;
                ...
        }

I am guessing that with this series this may no longer be applicable?

>
>  include/linux/pgtable.h |  59 ++++++++
>  include/linux/swap.h    |  35 +++--
>  mm/huge_memory.c        |   3 -
>  mm/internal.h           |  75 +++++++++-
>  mm/madvise.c            |  99 +++++++-----
>  mm/memory.c             |  17 ++-
>  mm/swap_slots.c         |   6 +-
>  mm/swapfile.c           | 325 +++++++++++++++++++++++-----------------
>  mm/vmscan.c             |  20 +--
>  9 files changed, 422 insertions(+), 217 deletions(-)
>
> --
> 2.25.1
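
To make the control flow of the guard quoted above concrete, here is a
tiny standalone C model. It is only a sketch of the quoted check, not
the actual mm/huge_memory.c code; the struct and the helpers below are
made-up stand-ins for the kernel's folio API:

        #include <stdbool.h>
        #include <stdio.h>

        #define EINVAL 22

        /* Stand-in for the kernel's struct folio. */
        struct folio {
                bool in_swapcache;
        };

        /* Stand-in for the kernel's folio_test_swapcache(). */
        static bool folio_test_swapcache(const struct folio *folio)
        {
                return folio->in_swapcache;
        }

        /*
         * Models the quoted guard: a swapcache folio may only be split
         * uniformly down to order 0; any new_order > 0 is rejected with
         * -EINVAL.
         */
        static int split_to_order(struct folio *folio, unsigned int new_order)
        {
                if (new_order) {
                        if (folio_test_swapcache(folio))
                                return -EINVAL;
                }
                return 0; /* the real function would perform the split here */
        }

        int main(void)
        {
                struct folio f = { .in_swapcache = true };

                printf("new_order=4 -> %d\n", split_to_order(&f, 4)); /* -22 */
                printf("new_order=0 -> %d\n", split_to_order(&f, 0)); /*   0 */
                return 0;
        }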