From: Ryan Roberts <ryan.roberts@arm.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Gao Xiang <xiang@kernel.org>, Yu Zhao <yuzhao@google.com>,
Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
Date: Wed, 11 Oct 2023 08:42:52 +0100
Message-ID: <45130848-2a12-4e9f-b325-e7da87ff3151@arm.com>
In-Reply-To: <87zg0pfyux.fsf@yhuang6-desk2.ccr.corp.intel.com>
On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> Hi All,
>>
>> This is an RFC for a small series to add support for swapping out small-sized
>> THP without needing to first split the large folio via __split_huge_page(). It
>> closely follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [1].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). However, I have a few questions on
>> whether we should consider relaxing those requirements in certain circumstances:
>>
>>
>> 1) block-backed vs file-backed
>> ==============================
>>
>> The code only attempts to allocate a contiguous set of entries if swap is backed
>> by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
>> ("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
>> stated "It's hard to write a whole transparent huge page (THP) to a file backed
>> swap device", but didn't state why. Does this imply there is a size limit at
>> which it becomes hard? And does that therefore imply that for "small enough"
>> sizes we should now allow use with file-backed swap?
>>
>> This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
>> swap: fix allocating cluster for swapfile by mistake"), which said the original
>> commit was using the wrong flag to determine if it was a block device and
>> therefore in some cases was actually doing large allocations for a file-backed
>> swap device, and this was causing file-system corruption. But that implies some
>> sort of correctness issue to me, rather than the performance issue I inferred
>> from the original commit.
>>
>> If anyone can offer an explanation, that would be helpful in determining if we
>> should allow some large sizes for file-backed swap.
>
> Swap uses 'swap extents' (swap_info_struct.swap_extent_root) to map from
> a swap offset to a storage block number. For block-backed swap, the mapping
> is purely linear, so you can use an arbitrarily large page size. But for
> file-backed swap, only PAGE_SIZE alignment is guaranteed.
Ahh, I see, so it's a correctness issue then. Thanks!
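
To make the point above concrete, here is a minimal user-space sketch of an extent lookup. It is only an illustration: `struct extent_model`, `lookup_extent()` and `run_in_single_extent()` are hypothetical names, not the kernel's code, and the model assumes that physically adjacent extents have already been merged, so a run of swap offsets is contiguous on storage only if it fits inside one extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of one swap extent: a run of swap offsets that maps
 * linearly onto a run of storage blocks. Not the kernel's data structure.
 */
struct extent_model {
	uint64_t start_page;   /* first swap offset covered by this extent */
	uint64_t nr_pages;     /* number of swap offsets covered */
	uint64_t start_block;  /* storage block backing start_page */
};

/* Return the extent covering 'offset', or NULL if no extent covers it. */
static const struct extent_model *
lookup_extent(const struct extent_model *ext, size_t n, uint64_t offset)
{
	for (size_t i = 0; i < n; i++) {
		if (offset >= ext[i].start_page &&
		    offset - ext[i].start_page < ext[i].nr_pages)
			return &ext[i];
	}
	return NULL;
}

/*
 * True if swap offsets [offset, offset + nr) all fall inside one extent,
 * i.e. the run can be written to storage as a single contiguous range.
 */
static bool run_in_single_extent(const struct extent_model *ext, size_t n,
				 uint64_t offset, uint64_t nr)
{
	assert(nr > 0);

	const struct extent_model *e = lookup_extent(ext, n, offset);

	if (!e)
		return false;
	return nr <= e->nr_pages - (offset - e->start_page);
}
```

With a single extent covering the whole device (the block-backed case), any run inside the device passes the check. With a swap file split into small extents, a 16-page run starting at offset 0 fails as soon as the first extent is shorter than 16 pages.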
>
>> 2) rotating vs non-rotating
>> ===========================
>>
>> I notice that the clustered approach is only used for non-rotating swap. That
>> implies that for rotating media, we will always fail a large allocation, and
>> fall back to splitting THPs to single pages. Which implies that the regression
>> I'm fixing here may still be present on rotating media? Or perhaps rotating disk
>> is so slow that the cost of writing the data out dominates the cost of
>> splitting?
>>
>> I considered that potentially the free swap entry search algorithm that is used
>> in this case could be modified to look for (small) contiguous runs of entries;
>> up to ~16 pages (order-4) could be done by doing 2x 64-bit reads from the map
>> instead of a single byte read.
>>
>> I haven't looked into this idea in detail, but wonder if anybody thinks it is
>> worth the effort? Or perhaps it would end up causing bad fragmentation.
>
> I doubt anybody will use rotating storage to back swap now.
I often use a QEMU VM for testing with an Ubuntu install. The disk enumerates
as rotating storage and the swap device is file-backed. But I guess the former
issue, at least, is me setting up QEMU with the incorrect options.
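
For reference, the "2x 64-bit reads" idea floated in the quoted text could look roughly like the sketch below. It is a user-space illustration under stated assumptions, not the kernel's allocator: it assumes a usage map with one byte per swap entry where zero means free, it only considers runs aligned to 16 entries, and `find_free_run16()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RUN_LEN 16	/* order-4: 16 contiguous swap entries */

/*
 * Scan a byte-per-entry usage map (assumption: 0 == free) for the first
 * RUN_LEN-aligned run of RUN_LEN free entries. Each candidate run is
 * checked with two 64-bit loads instead of sixteen single-byte loads.
 * Returns the index of the first entry in the run, or -1 if none exists.
 */
static long find_free_run16(const unsigned char *map, size_t nr)
{
	assert(map != NULL);

	for (size_t i = 0; i + RUN_LEN <= nr; i += RUN_LEN) {
		uint64_t lo, hi;

		/* memcpy avoids unaligned-access and aliasing problems. */
		memcpy(&lo, map + i, sizeof(lo));
		memcpy(&hi, map + i + 8, sizeof(hi));

		if ((lo | hi) == 0)
			return (long)i;
	}
	return -1;
}
```

Restricting the search to aligned runs keeps the check to exactly two loads per candidate, at the cost of missing free runs that straddle a 16-entry boundary; whether that trade-off would worsen fragmentation is the open question raised above.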
>
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
>
> I have used the swap sub-test case of vm-scalability to test.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/
Great - I shall take a look!
>
> --
> Best Regards,
> Huang, Ying